Re: [DISCUSS] Adding "byteWidth" field to Decimal Flatbuffers type for forward compatibility
Hi Wes,

I'm +1 on this. As part of the PR it might be good to beef up the documentation in general:

1. Document the encoding representation expected for the bytes.
2. Clarify that the only accepted length is 16 for the time being.

Thanks,
Micah

On Mon, Jun 1, 2020 at 3:48 PM Wes McKinney wrote:
> I mentioned this on the recent sync call and opened
>
> https://issues.apache.org/jira/browse/ARROW-8985
>
> I believe that at some point Arrow may need to be used to transport
> decimal widths other than 128 bits. For example, systems like
> Apache Kudu have 32-bit and 64-bit decimals. Computational code may
> immediately promote small decimals, but it's valuable to be able to
> transfer and represent the data as is rather than forcing an
> up-promotion even for low-precision decimal data.
>
> In order to allow this work to possibly happen in the future
> without requiring a new value to be added to the "Type" Flatbuffers
> union, I propose adding a "byteWidth" field with default value 16 to
> the existing Decimal type. Here is a patch with this change:
>
> https://github.com/apache/arrow/pull/7321
>
> To make the forward-compatibility issue clear: if this field is not
> added now, current library versions would have no way to perceive its
> absence, making it unsafe for future library versions to annotate
> anything other than 16-byte decimals with this metadata.
>
> As part of adopting this change, we would want to add assertions to
> the existing libraries to check that the byteWidth is indeed 16, and
> either throw an exception or pass the data through as
> FixedSizeBinary otherwise.
>
> Thanks,
> Wes
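A minimal sketch of the reader-side check Wes describes, for a reader that only supports 16-byte decimals (the function name and byte_width parameter are illustrative, not part of the actual patch):

{code:python}
import pyarrow as pa

def resolve_decimal_type(precision: int, scale: int,
                         byte_width: int = 16) -> pa.DataType:
    # A reader that only understands 128-bit decimals validates the
    # (proposed) byteWidth field, and for any other width surfaces the
    # raw bytes as FixedSizeBinary rather than misinterpreting them.
    if byte_width == 16:
        return pa.decimal128(precision, scale)
    return pa.binary(byte_width)  # pa.binary(n) with n >= 0 is fixed-size binary
{code}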
[jira] [Created] (ARROW-9020) read_json won't respect explicit_schema in parse_options
Felipe Santos created ARROW-9020:
------------------------------------

Summary: read_json won't respect explicit_schema in parse_options
Key: ARROW-9020
URL: https://issues.apache.org/jira/browse/ARROW-9020
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.17.1
Environment: CPython 3.8.2, MacOS Mojave 10.14.6
Reporter: Felipe Santos
Fix For: 0.17.1

I am trying to read a json file using an explicit schema, but it looks like the schema is ignored. Moreover, if my schema contains a field not present in the json file, then the output table contains all the fields in the json file plus the fields of my schema not found in the file.

A minimal example:

{code:python}
import pyarrow as pa
from pyarrow import json

# allowing for type inference
print(json.read_json('tmp.json'))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "foo"
schema = pa.schema([('foo', pa.string())])
print(json.read_json('tmp.json', parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "not_a_field",
# which is not present in the json file
schema = pa.schema([('not_a_field', pa.string())])
print(json.read_json('tmp.json', parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# not_a_field: string
# foo: string
# baz: string
{code}

And the tmp.json file looks like:

{code:json}
{"foo": "bar", "baz": "1"}
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9019) pyarrow hdfs fails to connect to HDFS 3.x cluster
Thomas Graves created ARROW-9019:
------------------------------------

Summary: pyarrow hdfs fails to connect to HDFS 3.x cluster
Key: ARROW-9019
URL: https://issues.apache.org/jira/browse/ARROW-9019
Project: Apache Arrow
Issue Type: Bug
Reporter: Thomas Graves

I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an error that looks like a protobuf or jar mismatch problem with Hadoop. The same code works on a Hadoop 2.9 cluster. I'm wondering if there is something special I need to do, or if pyarrow doesn't support Hadoop 3.x yet? Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.

import pyarrow as pa
hdfs_kwargs = dict(host="namenodehost",
                   port=9000,
                   user="tgraves",
                   driver='libhdfs',
                   kerb_ticket=None,
                   extra_conf=None)
fs = pa.hdfs.connect(**hdfs_kwargs)
res = fs.exists("/user/tgraves")

Error that I get on Hadoop 3.x is:

dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior
Wes McKinney created ARROW-9018:
-----------------------------------

Summary: [C++] Remove APIs that were deprecated in 0.17.x and prior
Key: ARROW-9018
URL: https://issues.apache.org/jira/browse/ARROW-9018
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 1.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9017) [Python] Refactor the Scalar classes
Joris Van den Bossche created ARROW-9017:
--------------------------------------------

Summary: [Python] Refactor the Scalar classes
Key: ARROW-9017
URL: https://issues.apache.org/jira/browse/ARROW-9017
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

The situation regarding scalars in Python is currently not optimal. We have two different "types" of scalars:

- {{ArrayValue(Scalar)}} (and subclasses of that for all types): this is used when you access a single element of an array (eg {{arr[0]}})
- {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is used when wrapping a C++ scalar into a python scalar, eg when you get back a scalar from a reduction like {{arr.sum()}}.

And while we have two versions of scalars, neither of them can easily be used as a scalar, since neither can be constructed from a python scalar (there is no {{scalar(1)}} function to use when calling a kernel, for example).

I think we should try to unify those scalar classes (which probably means getting rid of the ArrayValue scalar).

In addition, there is the issue of trying to re-use python scalar <-> arrow conversion code, as there is also logic for this in the {{python_to_arrow.cc}} code. But this is probably a bigger change.

cc [~kszucs]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
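For reference, a minimal illustration of the two scalar hierarchies the issue describes, as of pyarrow 0.17 (using {{pyarrow.compute.sum}} for the reduction, since that is the compute-kernel entry point):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3])

element = arr[0]       # an ArrayValue subclass (e.g. Int64Value)
reduced = pc.sum(arr)  # a ScalarValue subclass wrapping a C++ scalar

# Two distinct wrapper classes for what is conceptually the same thing:
print(type(element))
print(type(reduced))
{code}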
[jira] [Created] (ARROW-9016) [Java] Remove direct references to Netty/Unsafe Allocators
Ryan Murray created ARROW-9016:
----------------------------------

Summary: [Java] Remove direct references to Netty/Unsafe Allocators
Key: ARROW-9016
URL: https://issues.apache.org/jira/browse/ARROW-9016
Project: Apache Arrow
Issue Type: Task
Reporter: Ryan Murray

As part of ARROW-8230, this removes direct references to the Netty and Unsafe allocation managers in `DefaultAllocationManagerOption`.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9015) [Java] Make BaseBuffer package private
Ryan Murray created ARROW-9015:
----------------------------------

Summary: [Java] Make BaseBuffer package private
Key: ARROW-9015
URL: https://issues.apache.org/jira/browse/ARROW-9015
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray

As part of the Netty work in ARROW-8230 it became clear that BaseAllocator should be package private.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9014) [Packaging] Bump the minor part of the automatically generated version in crossbow
Krisztian Szucs created ARROW-9014:
--------------------------------------

Summary: [Packaging] Bump the minor part of the automatically generated version in crossbow
Key: ARROW-9014
URL: https://issues.apache.org/jira/browse/ARROW-9014
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
Fix For: 1.0.0

Crossbow uses setuptools_scm to generate a development version number based on the git describe command, which finds the latest {{reachable}} tag from the current commit on master. Minor releases are created from the master branch, whereas patch-release tags point to commits on maintenance branches (like 0.17.x). So once a patch version like 0.17.1 has been released, the latest tag reachable from master is still 0.17.0; crossbow then generates a version like 0.17.0.dev{number-of-commits-from-0.17.0}, bumps its patch component, and eventually creates binary packages with version 0.17.1.dev123.

The main problem with this is that the produced nightly python wheels are not picked up by pip, because that patch release is already available on pypi and pip doesn't consider 0.17.1.dev123 newer than 0.17.1 (even with the --pre option passed). So to force pip to install the newer nightly packages we need to bump the minor version instead.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
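The ordering claim above can be verified directly with the {{packaging}} library, which implements the PEP 440 version ordering that pip applies:

{code:python}
from packaging.version import Version

# Dev releases sort *before* the corresponding final release,
# so the nightly wheel loses to the published 0.17.1:
assert Version("0.17.1.dev123") < Version("0.17.1")

# Bumping the minor component puts the nightly ahead again:
assert Version("0.18.0.dev123") > Version("0.17.1")
{code}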
[jira] [Created] (ARROW-9013) [C++] Validate enum-style CMake options
Antoine Pitrou created ARROW-9013:
-------------------------------------

Summary: [C++] Validate enum-style CMake options
Key: ARROW-9013
URL: https://issues.apache.org/jira/browse/ARROW-9013
Project: Apache Arrow
Issue Type: Bug
Components: C++, Developer Tools
Reporter: Antoine Pitrou

It seems that some CMake options silently allow invalid values, such as {{-DARROW_SIMD_LEVEL=foobar}}. We should validate inputs to avoid typos (such as "SSE42" instead of "SSE4_2").

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9012) [Packaging] Add the anaconda cleanup task to the list of nightlies
Krisztian Szucs created ARROW-9012:
--------------------------------------

Summary: [Packaging] Add the anaconda cleanup task to the list of nightlies
Key: ARROW-9012
URL: https://issues.apache.org/jira/browse/ARROW-9012
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Krisztian Szucs

Follow-up of https://github.com/apache/arrow/pull/7305

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9011) [Python][Packaging] Move the anaconda cleanup script to crossbow
Krisztian Szucs created ARROW-9011:
--------------------------------------

Summary: [Python][Packaging] Move the anaconda cleanup script to crossbow
Key: ARROW-9011
URL: https://issues.apache.org/jira/browse/ARROW-9011
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Krisztian Szucs

Right now it is a standalone script, but it would be better to have it integrated into crossbow.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2020-06-02-0
Arrow Build Report for Job nightly-2020-06-02-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0

Failed Tasks:
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-conda-win-vs2015-py38
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-spark-master
- test-conda-python-3.8-dask-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.8-jpype
- test-r-rstudio-r-base-3.6-centos6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-azure-test-r-rstudio-r-base-3.6-centos6

Succeeded Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-centos-8-amd64
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-travis-gandiva-jar-xenial
- nuget:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-nuget
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.6
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-02-0-github-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: htt
[jira] [Created] (ARROW-9010) [Java] Framework and interface changes for RecordBatch IPC buffer compression
Liya Fan created ARROW-9010:
-------------------------------

Summary: [Java] Framework and interface changes for RecordBatch IPC buffer compression
Key: ARROW-9010
URL: https://issues.apache.org/jira/browse/ARROW-9010
Project: Apache Arrow
Issue Type: New Feature
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

This is the first sub-work item of ARROW-8672 ([Java] Implement RecordBatch IPC buffer compression from ARROW-300). However, it does not involve any concrete compression algorithms. The purpose of this PR is to establish basic interfaces for data compression, and to make changes to the IPC framework so that different compression algorithms can be plugged in smoothly.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
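The issue doesn't spell out the interface, but the shape of a pluggable buffer-compression codec might look roughly like this (a hypothetical sketch in Python for brevity; the real work is in Java, and these names are illustrative, not the actual API):

{code:python}
from abc import ABC, abstractmethod

class CompressionCodec(ABC):
    """Interface the IPC framework would code against."""

    @abstractmethod
    def compress(self, buffer: bytes) -> bytes: ...

    @abstractmethod
    def decompress(self, buffer: bytes, uncompressed_length: int) -> bytes: ...

class NoCompressionCodec(CompressionCodec):
    """Pass-through codec, so 'no compression' needs no special-casing."""

    def compress(self, buffer: bytes) -> bytes:
        return buffer

    def decompress(self, buffer: bytes, uncompressed_length: int) -> bytes:
        return buffer
{code}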
[jira] [Created] (ARROW-9009) [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files
Joris Van den Bossche created ARROW-9009:
--------------------------------------------

Summary: [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files
Key: ARROW-9009
URL: https://issues.apache.org/jira/browse/ARROW-9009
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

When reading a parquet file (which was written by Arrow) with the datasets API, it preserves the "ARROW:schema" field in the metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")
dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114
{code}

while when reading with the `parquet` module reader, we do not preserve this metadata:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
{code}

Since the "ARROW:schema" information is used to properly reconstruct the Arrow schema from the ParquetSchema, it is no longer needed once you already have the arrow schema, so it's probably not needed to keep it as metadata in the arrow schema.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9008) jemalloc_set_decay_ms precedence
Remi Dettai created ARROW-9008:
----------------------------------

Summary: jemalloc_set_decay_ms precedence
Key: ARROW-9008
URL: https://issues.apache.org/jira/browse/ARROW-9008
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Remi Dettai

I've noticed that the jemalloc const configuration [je_arrow_malloc_conf|https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/memory_pool.h#L169] overrides the arrow public function [jemalloc_set_decay_ms()|https://github.com/apache/arrow/blob/e4bf4297585e1d0723957833d012aaf5c119f6b0/cpp/src/arrow/memory_pool.cc#L69].

Is there a way to call jemalloc_set_decay_ms so that it takes the right precedence?
-> if yes, I believe this should be specified in the comments
-> if no, the function should be deprecated

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
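For context, this is how the function in question is typically invoked through its Python binding (the binding does exist in pyarrow; whether the call takes effect is exactly what this issue questions):

{code:python}
import pyarrow as pa

# Ask jemalloc to release dirty pages to the OS immediately (decay time 0 ms).
# Per the report above, the compile-time je_arrow_malloc_conf setting may
# silently override this runtime request.
pa.jemalloc_set_decay_ms(0)
{code}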
[jira] [Created] (ARROW-9007) [Rust] Support appending arrays by merging array data
Neville Dipale created ARROW-9007:
-------------------------------------

Summary: [Rust] Support appending arrays by merging array data
Key: ARROW-9007
URL: https://issues.apache.org/jira/browse/ARROW-9007
Project: Apache Arrow
Issue Type: New Feature
Components: Rust
Affects Versions: 0.17.0
Reporter: Neville Dipale

ARROW-9005 introduces a concat kernel which allows concatenating multiple arrays of the same type into a single array. This is useful for sorting on multiple arrays, among other things. The concat kernel is implemented for most array types, but not yet for nested arrays (lists, structs, etc). This Jira is for creating a way of appending/merging all array types, so that concat (and functionality that depends on it) can support all array types.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Re: Why downloading sources of pyarrow and its requirements takes several minutes?
I think this is due to numpy starting to ship a pyproject.toml file since 1.18 (https://github.com/numpy/numpy/pull/14053).

When a package includes a pyproject.toml, pip creates an isolated build environment just to get the metadata (in numpy's case, this means creating an environment with the setuptools, wheel, and cython packages installed). This is what takes the extra time compared to older versions of numpy.

On Fri, 29 May 2020 at 20:02, Valentyn Tymofieiev wrote:
> Thanks for the input. Opened
> https://issues.apache.org/jira/browse/ARROW-8983, we can continue the
> conversation there.
>
> On Thu, May 28, 2020 at 2:46 PM Valentyn Tymofieiev
> wrote:
>
> > Hi Arrow dev community,
> >
> > Do you have any insight why
> >
> >     python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all:
> >
> > takes several minutes to execute? From the output we can see that pip gets
> > stuck on:
> >
> >     File was already downloaded /tmp/pyarrow-0.16.0.tar.gz
> >     Installing build dependencies ... |
> >
> > There is a significant increase in runtime between 0.15.1 and 0.16.0. I
> > suspect some build dependencies need to be installed before pip
> > understands the dependencies of pyarrow. Is there some inefficiency in
> > pyarrow's setup.py that is causing this?
> >
> > Thanks,
> > Valentyn
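One way to confirm the explanation above is to check whether a downloaded sdist declares a pyproject.toml, which is what triggers pip's isolated build environment (the tarball path below is illustrative):

{code:python}
import tarfile

# Inspect an sdist that pip downloaded; a top-level pyproject.toml means
# pip will set up an isolated build environment to query its metadata.
with tarfile.open("/tmp/numpy-1.18.4.tar.gz") as sdist:
    has_pyproject = any(
        name.endswith("/pyproject.toml") for name in sdist.getnames()
    )
print(has_pyproject)
{code}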