Re: [DISCUSS] Adding "byteWidth" field to Decimal Flatbuffers type for forward compatibility

2020-06-02 Thread Micah Kornfield
Hi Wes,
I'm +1 on this.  As part of the PR it might be good to beef up the
documentation in general:
1.  The encoding/representation expected for the bytes.
2.  Add a clarification that the only accepted length is 16 for the time
being.
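3.  Spell out the expected validation.  As a rough Python sketch (made-up
names, not actual library code) of the assertion Wes describes below:

    def validate_decimal_byte_width(byte_width):
        # Only 16-byte (128-bit) decimals are accepted for now; readers
        # should raise or fall back to FixedSizeBinary for other widths.
        if byte_width != 16:
            raise NotImplementedError(
                "Decimal byteWidth=%d is not supported" % byte_width)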

Thanks,
Micah

On Mon, Jun 1, 2020 at 3:48 PM Wes McKinney  wrote:

> I mentioned this on the recent sync call and opened
>
> https://issues.apache.org/jira/browse/ARROW-8985
>
> I believe at some point that Arrow may need to be used to transport
> decimal widths different from 128 bits. For example systems like
> Apache Kudu have 32-bit and 64-bit decimals. Computational code may
> immediately promote small decimals, but it's valuable to be able to
> transfer and represent the data as is rather than forcing an
> up-promotion even for low-precision decimal data.
>
> In order to allow for this work to possibly happen in the future
> without requiring a new value be added to the "Type" Flatbuffers
> union, I propose to add a "byteWidth" field with default value 16 to
> the existing Decimal type. Here is a patch with this change:
>
> https://github.com/apache/arrow/pull/7321
>
> To make the forward compatibility issue clear: if this field is not
> added now, then current library versions would not be able to perceive
> the absence of the field later, thus making it unsafe for future
> library versions to annotate anything other than 16-byte decimals with
> this metadata.
>
> As part of adopting this change, we would want to add assertions to
> the existing libraries to check that the byteWidth is indeed 16, and
> either throw an exception or pass the data through as FixedSizeBinary
> otherwise.
>
> Thanks,
> Wes
>


[jira] [Created] (ARROW-9020) read_json won't respect explicit_schema in parse_options

2020-06-02 Thread Felipe Santos (Jira)
Felipe Santos created ARROW-9020:


 Summary: read_json won't respect explicit_schema in parse_options
 Key: ARROW-9020
 URL: https://issues.apache.org/jira/browse/ARROW-9020
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
 Environment: CPython 3.8.2, MacOS Mojave 10.14.6
Reporter: Felipe Santos
 Fix For: 0.17.1


I am trying to read a json file using an explicit schema, but it looks like the 
schema is ignored. Moreover, if my schema contains a field not present in 
the json file, then the output table contains all the fields in the json file 
plus the fields of my schema not found in the file.

A minimal example:
{code:python}
import pyarrow as pa
from pyarrow import json

# allowing for type inference
print(json.read_json('tmp.json'))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "foo"
schema = pa.schema([('foo', pa.string())])
print(json.read_json('tmp.json',
                     parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "not_a_field",
# which is not present in the json file
schema = pa.schema([('not_a_field', pa.string())])
print(json.read_json('tmp.json',
                     parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# not_a_field: string
# foo: string
# baz: string
{code}

And the tmp.json file looks like:
{code:json}
{"foo": "bar", "baz": "1"}

{code}
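
Until this is fixed, a possible workaround (a sketch, assuming the desired 
fields do exist in the file) is to rebuild the table from only the requested 
columns after reading:

{code:python}
import pyarrow as pa
from pyarrow import json

schema = pa.schema([('foo', pa.string())])
table = json.read_json('tmp.json',
                       parse_options=json.ParseOptions(explicit_schema=schema))
# drop the extra inferred columns, keeping only the fields of the schema
arrays = [table.column(f.name) for f in schema]
table = pa.Table.from_arrays(arrays, schema=schema)
print(table.schema)  # foo: string
{code}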



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9019) pyarrow hdfs fails to connect to HDFS 3.x cluster

2020-06-02 Thread Thomas Graves (Jira)
Thomas Graves created ARROW-9019:


 Summary: pyarrow hdfs fails to connect to HDFS 3.x cluster
 Key: ARROW-9019
 URL: https://issues.apache.org/jira/browse/ARROW-9019
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Thomas Graves


I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
error that looks like a protobuf or jar mismatch problem with Hadoop. The same 
code works on a Hadoop 2.9 cluster.
 
I'm wondering if there is something special I need to do or if pyarrow doesn't 
support Hadoop 3.x yet?
Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
 
    import pyarrow as pa
    hdfs_kwargs = dict(host="namenodehost",
                      port=9000,
                      user="tgraves",
                      driver='libhdfs',
                      kerb_ticket=None,
                      extra_conf=None)
    fs = pa.hdfs.connect(**hdfs_kwargs)
    res = fs.exists("/user/tgraves")
 
Error that I get on Hadoop 3.x is:
 
dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
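
Not sure whether it is related, but pyarrow's libhdfs driver locates the 
Hadoop jars through the CLASSPATH environment variable, so one thing worth 
double-checking (a guess, not a confirmed fix) is that CLASSPATH is generated 
by the Hadoop 3 install itself before connecting:

    import os, subprocess
    # build CLASSPATH from the Hadoop 3 distribution so that jars from an
    # older Hadoop install aren't mixed in
    os.environ["CLASSPATH"] = subprocess.check_output(
        ["hadoop", "classpath", "--glob"]).decode().strip()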



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior

2020-06-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9018:
---

 Summary: [C++] Remove APIs that were deprecated in 0.17.x and prior
 Key: ARROW-9018
 URL: https://issues.apache.org/jira/browse/ARROW-9018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9017) [Python] Refactor the Scalar classes

2020-06-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9017:


 Summary: [Python] Refactor the Scalar classes
 Key: ARROW-9017
 URL: https://issues.apache.org/jira/browse/ARROW-9017
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The situation regarding scalars in Python is currently not optimal.

We have two different "types" of scalars:

- {{ArrayValue(Scalar)}} (and subclasses of that for all types): this is used 
when you access a single element of an array (e.g. {{arr[0]}}; see the snippet 
below)
- {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is 
used when wrapping a C++ scalar into a Python scalar, e.g. when you get back a 
scalar from a reduction like {{arr.sum()}}.
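
A quick illustration of the first kind:

{code:python}
import pyarrow as pa

arr = pa.array([1, 2, 3])
elem = arr[0]        # an ArrayValue subclass (Int64Value), tied to the array
print(type(elem))    # <class 'pyarrow.lib.Int64Value'>
print(elem.as_py())  # 1
{code}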

And while we have two versions of scalars, neither of them is actually easy to 
use as a scalar, since neither can be constructed from a Python scalar (there 
is no {{scalar(1)}} function to use when calling a kernel, for example).

I think we should try to unify those scalar classes (which probably means 
getting rid of the ArrayValue scalars).

In addition, there is the question of re-using Python scalar <-> Arrow 
conversion code, as there is also logic for this in the {{python_to_arrow.cc}} 
code. But this is probably a bigger change. cc [~kszucs] 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9016) [Java] Remove direct references to Netty/Unsafe Allocators

2020-06-02 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9016:
--

 Summary: [Java] Remove direct references to Netty/Unsafe Allocators
 Key: ARROW-9016
 URL: https://issues.apache.org/jira/browse/ARROW-9016
 Project: Apache Arrow
  Issue Type: Task
Reporter: Ryan Murray


As part of ARROW-8230, this removes direct references to the Netty and Unsafe 
allocation managers in `DefaultAllocationManagerOption`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9015) [Java] Make BaseBuffer package private

2020-06-02 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9015:
--

 Summary: [Java] Make BaseBuffer package private
 Key: ARROW-9015
 URL: https://issues.apache.org/jira/browse/ARROW-9015
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


As part of the Netty work in ARROW-8230 it became clear that BaseAllocator 
should be package private.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9014) [Packaging] Bump the minor part of the automatically generated version in crossbow

2020-06-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9014:
--

 Summary: [Packaging] Bump the minor part of the automatically 
generated version in crossbow
 Key: ARROW-9014
 URL: https://issues.apache.org/jira/browse/ARROW-9014
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Crossbow uses setuptools_scm to generate a development version number using the 
git describe command. This means that it finds the latest {{reachable}} tag from 
the current commit on master.

Minor releases are created from the master branch, whereas patch release tags 
point to commits on maintenance branches (like 0.17.x). This means that if we 
have already released a patch version like 0.17.1, the latest tag reachable from 
master is still 0.17.0, so crossbow generates a version number like 
0.17.0.dev{number-of-commits-since-0.17.0} and bumps its patch component, 
eventually creating binary packages with versions like 0.17.1.dev123.

The main problem with this is that the produced nightly Python wheels are not 
picked up by pip, because that patch release is already available on PyPI and 
pip doesn't consider 0.17.1.dev123 newer than 0.17.1 (even with the --pre 
option passed). 

So to force pip to install the newer nightly packages we need to bump the minor 
version instead. 
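
Conceptually something like the following (a sketch with a hypothetical 
helper, not the actual crossbow code):

{code:python}
def bump_minor(version):
    # '0.17.0.dev123' -> '0.18.0.dev123', so that pip considers the nightly
    # newer than any released 0.17.x patch version
    numbers, _, dev = version.partition(".dev")
    major, minor, _patch = map(int, numbers.split("."))
    return "{}.{}.0.dev{}".format(major, minor + 1, dev)

assert bump_minor("0.17.0.dev123") == "0.18.0.dev123"
{code}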



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9013) [C++] Validate enum-style CMake options

2020-06-02 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9013:
-

 Summary: [C++] Validate enum-style CMake options
 Key: ARROW-9013
 URL: https://issues.apache.org/jira/browse/ARROW-9013
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Reporter: Antoine Pitrou


It seems that some CMake options silently allow invalid values, such as 
{{-DARROW_SIMD_LEVEL=foobar}}. We should validate inputs to avoid typos (such 
as "SSE42" instead of "SSE4_2").



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9012) [Packaging] Add the anaconda cleanup task to the list of nightlies

2020-06-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9012:
--

 Summary: [Packaging] Add the anaconda cleanup task to the list 
of nightlies
 Key: ARROW-9012
 URL: https://issues.apache.org/jira/browse/ARROW-9012
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


Follow-up of https://github.com/apache/arrow/pull/7305



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9011) [Python][Packaging] Move the anaconda cleanup script to crossbow

2020-06-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9011:
--

 Summary: [Python][Packaging] Move the anaconda cleanup script to 
crossbow
 Key: ARROW-9011
 URL: https://issues.apache.org/jira/browse/ARROW-9011
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Krisztian Szucs


Right now it is a standalone script, but it would be better to have it 
integrated into crossbow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-06-02-0

2020-06-02 Thread Crossbow


Arrow Build Report for Job nightly-2020-06-02-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0

Failed Tasks:
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py38
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-spark-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.8-jpype
- test-r-rstudio-r-base-3.6-centos6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-test-r-rstudio-r-base-3.6-centos6

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-8-amd64
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-gandiva-jar-xenial
- nuget:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-nuget
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.6
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
htt

[jira] [Created] (ARROW-9010) [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-9010:
---

 Summary: [Java] Framework and interface changes for RecordBatch 
IPC buffer compression
 Key: ARROW-9010
 URL: https://issues.apache.org/jira/browse/ARROW-9010
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is the first sub-work item of ARROW-8672 ([Java] Implement RecordBatch 
IPC buffer compression from ARROW-300). However, it does not involve any 
concrete compression algorithms. The purpose of this PR is to establish basic 
interfaces for data compression, and to make changes to the IPC framework so 
that different compression algorithms can be plugged in smoothly.
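
The rough shape of the interfaces, sketched in Python for brevity (the actual 
work is in Java, and all names here are hypothetical):

{code:python}
import abc
import zlib

class CompressionCodec(abc.ABC):
    # concrete compression algorithms implement this interface
    @abc.abstractmethod
    def compress(self, buf: bytes) -> bytes: ...

    @abc.abstractmethod
    def decompress(self, buf: bytes) -> bytes: ...

class ZlibCodec(CompressionCodec):
    def compress(self, buf):
        return zlib.compress(buf)

    def decompress(self, buf):
        return zlib.decompress(buf)

# the IPC framework looks codecs up by name instead of hard-coding one
CODECS = {"zlib": ZlibCodec()}
{code}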



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9009) [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files

2020-06-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9009:


 Summary: [C++][Dataset] ARROW:schema should be removed from 
schema's metadata when reading Parquet files
 Key: ARROW-9009
 URL: https://issues.apache.org/jira/browse/ARROW-9009
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


When reading a parquet file (which was written by Arrow) with the datasets API, 
it preserves the "ARROW:schema" field in the metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")

dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114
{code}

whereas when reading with the `parquet` module reader, we do not preserve this 
metadata:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

Since the "ARROW:schema" information is used to properly reconstruct the Arrow 
schema from the ParquetSchema, it is no longer needed once you already have the 
arrow schema, so it's probably not needed to keep it as metadata in the arrow 
schema.
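
A sketch of what the change could look like (hypothetical Python equivalent; 
the actual fix would live in the C++ dataset code):

{code:python}
# after the Arrow schema has been reconstructed, drop the redundant key
metadata = dict(schema.metadata or {})
metadata.pop(b'ARROW:schema', None)
schema = schema.with_metadata(metadata)
{code}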



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9008) jemalloc_set_decay_ms precedence

2020-06-02 Thread Remi Dettai (Jira)
Remi Dettai created ARROW-9008:
--

 Summary: jemalloc_set_decay_ms precedence
 Key: ARROW-9008
 URL: https://issues.apache.org/jira/browse/ARROW-9008
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Remi Dettai


I've noticed that the jemalloc const configuration [je_arrow_malloc_conf 
|https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/memory_pool.h#L169]
 overrides the Arrow public function 
[jemalloc_set_decay_ms()|https://github.com/apache/arrow/blob/e4bf4297585e1d0723957833d012aaf5c119f6b0/cpp/src/arrow/memory_pool.cc#L69].

Is there a way to call jemalloc_set_decay_ms so that it takes the right 
precedence?
-> if yes, I believe this should be specified in the comments
-> if no, the function should be deprecated
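
For context, this is the runtime knob in question as exposed in Python (per 
this report, the compile-time je_arrow_malloc_conf default wins over it):

{code:python}
import pyarrow as pa

# ask jemalloc to return dirty pages to the OS immediately
pa.jemalloc_set_decay_ms(0)
{code}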



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9007) [Rust] Support appending arrays by merging array data

2020-06-02 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9007:
-

 Summary: [Rust] Support appending arrays by merging array data
 Key: ARROW-9007
 URL: https://issues.apache.org/jira/browse/ARROW-9007
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Affects Versions: 0.17.0
Reporter: Neville Dipale


ARROW-9005 introduces a concat kernel which allows for concatenating multiple 
arrays of the same type into a single array. This is useful for sorting on 
multiple arrays, among other things.

The concat kernel is implemented for most array types, but not yet for nested 
arrays (lists, structs, etc.).

This Jira is for creating a way of appending/merging arrays of any type, so 
that concat (and the functionality that depends on it) can support all array 
types.
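
For comparison, the same operation as already exposed in pyarrow (illustrative 
only; this issue tracks the Rust implementation):

{code:python}
import pyarrow as pa

combined = pa.concat_arrays([pa.array([1, 2]), pa.array([3])])
assert combined.to_pylist() == [1, 2, 3]
{code}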



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Why downloading sources of pyarrow and its requirements takes several minutes?

2020-06-02 Thread Joris Van den Bossche
I think this is due to numpy starting to ship a pyproject.toml file since
1.18 (https://github.com/numpy/numpy/pull/14053).
Apparently, when a package includes a pyproject.toml, pip will create a
build environment just to get the metadata (in the case of numpy, this
means creating an environment with the setuptools, wheel and cython
packages installed). This is what takes more time compared to older
versions of numpy.

On Fri, 29 May 2020 at 20:02, Valentyn Tymofieiev wrote:

> Thanks for the input. Opened
> https://issues.apache.org/jira/browse/ARROW-8983, we can continue the
> conversation there.
>
> On Thu, May 28, 2020 at 2:46 PM Valentyn Tymofieiev 
> wrote:
>
> > Hi Arrow dev community,
> >
> > Do you have any insight why
> >
> >   python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary
> > :all:
> >
> > takes several minutes to execute? From the output we can see that pip get
> > stuck on:
> >
> >   File was already downloaded /tmp/pyarrow-0.16.0.tar.gz
> >   Installing build dependencies ... |
> >
> > There is a significant increase in runtime between 0.15.1 and 0.16.0. I
> > suspect some build dependencies need to be installed before pip
> > understands the dependencies of pyarrow. Is there some inefficiency in
> > pyarrow's setup.py that is causing this?
> >
> > Thanks,
> > Valentyn
> >
>