[jira] [Created] (ARROW-7659) [Rust] Reduce Rc usage
Gurwinder Singh created ARROW-7659:
--

Summary: [Rust] Reduce Rc usage
Key: ARROW-7659
URL: https://issues.apache.org/jira/browse/ARROW-7659
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Gurwinder Singh
Assignee: Gurwinder Singh

Follow up of ARROW-7560

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7658) [R] Support dplyr filtering on date/time
Neal Richardson created ARROW-7658:
--

Summary: [R] Support dplyr filtering on date/time
Key: ARROW-7658
URL: https://issues.apache.org/jira/browse/ARROW-7658
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 0.16.0

Plus some NSE refactoring suggested by Hadley.
[jira] [Created] (ARROW-7657) [R] Add option to preserve dictionary logical type rather than coerce to factor
Neal Richardson created ARROW-7657:
--

Summary: [R] Add option to preserve dictionary logical type rather than coerce to factor
Key: ARROW-7657
URL: https://issues.apache.org/jira/browse/ARROW-7657
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Neal Richardson
Fix For: 1.0.0

See ARROW-7639. R factor "levels" must be strings, but dictionary "values" aren't restricted like that. Provide an option to govern how dictionary arrays with non-string "values" are converted to R: either keep the dictionary encoding by making the R vector a factor and coercing the dictionary values to strings, or keep the dictionary values as their original type and generate an R vector of that type, dropping the dictionary encoding.
Re: new to Arrow / integration with Kudu
On Wed, Jan 22, 2020 at 12:28 PM Shazz wrote:
>
> Thanks Wes,
>
> I will follow what is happening between Arrow and Kudu.
> In the short term, if you had to define storage for Arrow with good
> (enough) performance that is not too costly to operate, what would you
> choose? I saw there is an example of storing Parquet files on Azure
> Blob Storage; would that be OK to start with, or is there a better choice?

Many people are doing that. Note that you'll need to do some tuning
(e.g. read buffering) to obtain acceptable performance against things
like ABS.

> ---
> sh...@metaverse.fr
> GPG public key ID : B517C4C8
>
> On 21/01/2020 17:54, Wes McKinney wrote:
> > I'm interested to see an Arrow adapter for Apache Kudu developed. My
> > gut feeling is that this work should be undertaken in Kudu itself,
> > potentially having the tablet servers produce Arrow Record Batches
> > locally and send them to the client, rather than converting to
> > Kudu's own on-the-wire record format and then deserializing into
> > Arrow on the receiver side. It might be worth a conversation with
> > the Kudu community to see what they think.
> >
> > Of course, one can build an Arrow deserializer for the current Kudu
> > C++ client API and probably get pretty good performance. See also
> > ARROW-814:
> >
> > https://issues.apache.org/jira/browse/ARROW-814
> >
> > On Tue, Jan 21, 2020 at 12:32 PM Shazz wrote:
> >>
> >> Hi,
> >>
> >> I'm thinking of an architecture to store and access tabular data
> >> efficiently, and I was told to look at Arrow and Kudu.
> >> I saw on the front page a diagram where Arrow can be integrated
> >> with Kudu, but nothing in the documentation. Is there an example
> >> available somewhere?
> >>
> >> Thanks!
> >>
> >> --
> >> sh...@metaverse.fr
> >> GPG public key ID : B517C4C8
[jira] [Created] (ARROW-7656) [Python] csv.ConvertOptions Documentation Is Unclear Around Disabling Type Inference
Tim Lantz created ARROW-7656:

Summary: [Python] csv.ConvertOptions Documentation Is Unclear Around Disabling Type Inference
Key: ARROW-7656
URL: https://issues.apache.org/jira/browse/ARROW-7656
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: Documentation, N/A.
Reporter: Tim Lantz

High level description:
* The documentation [here|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions] says that setting column_types disables type inference.
* Under the hood I can see why you also need to set ReadOptions.column_names to support all current use cases, but this is unclear to new users reading the docs, especially since you can supply a Schema object to column_types in the Python bindings.
* Suggested change: update csv.ConvertOptions to note that you must also set csv.ReadOptions.column_names in order to disable type inference.
[jira] [Created] (ARROW-7655) [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
Tim Lantz created ARROW-7655:

Summary: [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
Key: ARROW-7655
URL: https://issues.apache.org/jira/browse/ARROW-7655
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: Reproduced on Ubuntu 18.04 and OSX Catalina in Python 3.7.4.
Reporter: Tim Lantz

Originally mentioned in: [https://github.com/apache/arrow/issues/6243]

*High level description of the issue:*
* It is possible ([though not documented|https://issues.apache.org/jira/browse/ARROW-7654]) to assign the column_types field of ConvertOptions a Schema object instead of a Dict[str, DataType].
* Expected result: the nullable attribute, in addition to the type, of the Fields in the supplied Schema is present on the Schema used when reading CSV data.
* Actual result: the Field type information is present, but nullable is lost. All fields are nullable.

*Minimal reproduction case:*
* Use case notes: this is especially noticeable when using pyarrow as a means to save data with a known schema to Parquet, as the ParquetWriter will check that the schema of a table being written matches the schema supplied to the writer. If that same schema is used to read the CSV data and contains a non-nullable field, a mismatch will be detected, resulting in the error demonstrated below.
{code:java}
$ cat test.csv
0
1
$ python
>>> import pyarrow
>>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)])
>>> from pyarrow import csv
>>> read_options = csv.ReadOptions(column_names=["foo"])
>>> convert_options = csv.ConvertOptions(column_types=schema)
>>> table = csv.read_csv("test.csv", convert_options=convert_options, read_options=read_options)
>>> schema
foo: bool not null
>>> table.schema
foo: bool
>>> from pyarrow import parquet as pq
>>> writer = pq.ParquetWriter("test.parquet", schema)
>>> writer.write_table(table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table: foo: bool vs. file: foo: bool not null
>>> pyarrow.__version__
'0.15.1'
>>> exit()
$ python --version
Python 3.7.4
{code}

* As a side note: if I don't set column_names in read_options when calling read_csv, but I do set convert_options with column_types, type inference is still performed, which seems like a bug versus what the docs state. That seems like a related but independent bug; I haven't searched yet to see if it is a known issue, but if someone reading this believes it should be filed with a repro case, I am happy to help! I only noticed this when minimizing the repro case, as my original code was setting column_names.

*Potential source of issue:*
* I did not yet look at how hard it is to fix, but I note that [here|https://github.com/apache/arrow/blob/ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf/python/pyarrow/_csv.pyx#L411] only the name and type are passed down from a Field.
[jira] [Created] (ARROW-7654) [Python] Ability to Set column_types to a Schema in csv.ConvertOptions is Undocumented
Tim Lantz created ARROW-7654:

Summary: [Python] Ability to Set column_types to a Schema in csv.ConvertOptions is Undocumented
Key: ARROW-7654
URL: https://issues.apache.org/jira/browse/ARROW-7654
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1, 0.12.0
Environment: N/A, documentation issue.
Reporter: Tim Lantz

Originally mentioned in: [https://github.com/apache/arrow/issues/6243]

High level description:
* As of [this commit|https://github.com/apache/arrow/commit/df54da211448b5202aa08ed2b245eb78cfd1e50c], support for supplying a Schema to ConvertOptions in the csv module was added (extremely useful, I'll add!).
* As of 0.15.1, the [published documentation|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions] only explains that a dictionary from field name to DataType can be supplied.

Minimal reproduction: N/A, see link.
[jira] [Created] (ARROW-7653) [C++][Dataset] Handle DictType index mismatch better
Francois Saint-Jacques created ARROW-7653:
-

Summary: [C++][Dataset] Handle DictType index mismatch better
Key: ARROW-7653
URL: https://issues.apache.org/jira/browse/ARROW-7653
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Dataset
Reporter: Francois Saint-Jacques

A schema incompatibility will be raised if the index width doesn't match across fragments/sources.
[jira] [Created] (ARROW-7652) [Python] Insert implicit cast in ScannerBuilder.filter
Joris Van den Bossche created ARROW-7652:

Summary: [Python] Insert implicit cast in ScannerBuilder.filter
Key: ARROW-7652
URL: https://issues.apache.org/jira/browse/ARROW-7652
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
[jira] [Created] (ARROW-7651) [CI][Crossbow] Nightly macOS wheel builds fail
Neal Richardson created ARROW-7651:
--

Summary: [CI][Crossbow] Nightly macOS wheel builds fail
Key: ARROW-7651
URL: https://issues.apache.org/jira/browse/ARROW-7651
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration, Packaging, Python
Reporter: Neal Richardson
Fix For: 0.16.0

See https://travis-ci.org/ursa-labs/crossbow/builds/640350008 for example:

{code}
$ install_wheel arrow
~/build/ursa-labs/crossbow/arrow ~/build/ursa-labs/crossbow
ERROR: You must give at least one requirement to install (see "pip help install")
{code}

cc [~kszucs] [~apitrou]
[jira] [Created] (ARROW-7650) [C++] Dataset tests not built on Windows
Antoine Pitrou created ARROW-7650:
-

Summary: [C++] Dataset tests not built on Windows
Key: ARROW-7650
URL: https://issues.apache.org/jira/browse/ARROW-7650
Project: Apache Arrow
Issue Type: Bug
Components: C++, C++ - Dataset
Reporter: Antoine Pitrou

They are explicitly disabled in {{cpp/src/arrow/dataset/CMakeLists.txt}}. Also, if we re-enable them, there are many compile errors (on VS 2017).
[jira] [Created] (ARROW-7649) [Python] Expose dataset PartitioningFactory.inspect ?
Joris Van den Bossche created ARROW-7649:

Summary: [Python] Expose dataset PartitioningFactory.inspect ?
Key: ARROW-7649
URL: https://issues.apache.org/jira/browse/ARROW-7649
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

In C++, PartitioningFactory has an {{Inspect}} method which, given a path, will infer the schema. We could expose this in Python as well; it could e.g. be used to easily explore or illustrate what types are inferred from a path (int32, string).
[NIGHTLY] Arrow Build Report for Job nightly-2020-01-22-0
Arrow Build Report for Job nightly-2020-01-22-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0

Failed Tasks:
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-win-vs2015-py38
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-gandiva-jar-osx
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-3.7-spark-master
- test-ubuntu-fuzzit-fuzzing:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-ubuntu-fuzzit-fuzzing
- test-ubuntu-fuzzit-regression:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-ubuntu-fuzzit-regression
- wheel-osx-cp27m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-wheel-osx-cp27m
- wheel-osx-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-centos-7
- centos-8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-conda-osx-clang-py38
- debian-buster:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-debian-buster
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-22-0-circle-test-conda-python-3.7-pandas-latest
[jira] [Created] (ARROW-7648) [C++] Sanitize local paths on Windows
Antoine Pitrou created ARROW-7648:
-

Summary: [C++] Sanitize local paths on Windows
Key: ARROW-7648
URL: https://issues.apache.org/jira/browse/ARROW-7648
Project: Apache Arrow
Issue Type: Wish
Components: C++
Reporter: Antoine Pitrou

One way or the other, we should try to sanitize local filesystem paths on Windows by converting backslashes into regular slashes. One place to do it is {{FileSystemFromUri}}. One complication is that backslash-separated paths can fail parsing as a URI, but we only want to sanitize a path if we detect it's a local path (by parsing the URI). Perhaps trying the sanitization on parse error would work.
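A minimal sketch of the proposed sanitization, written in Python for illustration (the real change would live in the C++ filesystem layer, and the helper name is made up):

```python
def sanitize_local_path(path: str) -> str:
    # Convert Windows-style backslash separators into the forward slashes
    # that the URI parser expects; per the issue, this should only be
    # applied once the path is known to be a local filesystem path.
    return path.replace("\\", "/")

assert sanitize_local_path("C:\\data\\file.parquet") == "C:/data/file.parquet"
```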
[jira] [Created] (ARROW-7647) Problem with read_json and arrays
Johan Forsberg created ARROW-7647:
-

Summary: Problem with read_json and arrays
Key: ARROW-7647
URL: https://issues.apache.org/jira/browse/ARROW-7647
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: Ubuntu Linux 18.04, Python 3.7.5
Reporter: Johan Forsberg

Hi! I'm trying to load some nested JSON data and am running into a problem with arrays. I can reproduce it with a slightly modified example from the documentation:

{code:python}
from pyarrow import json
import pyarrow as pa

with open("test.json", "w") as f:
    test_json = """{"a": [1], "b": {"c": true, "d": "1991-02-03"}}
{"a": [], "b": {"c": false, "d": "2019-04-01"}}
"""
    f.write(test_json)

json.read_json("test.json")
{code}

Running this code with pyarrow 0.15.1 (I also tried 0.14) gives the following error:

{code:java}
Traceback (most recent call last):
  File "issue.py", line 11, in <module>
    ccs = json.read_json("test.json")
  File "pyarrow/_json.pyx", line 195, in pyarrow._json.read_json
  File "pyarrow/public-api.pxi", line 285, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0 named a expected length 2 but got length 1
{code}

I've tried various combinations, and it seems the error only appears when the *total* number of elements across all the "a" arrays is less than the number of *rows* in the file. I did not expect any relationship between those and have found nothing in the documentation about it. Is this intentional? If not, I suspect there's a problem in the validation step.
[jira] [Created] (ARROW-7646) [C++][Dataset] Ability to restrict Hive partitioning to certain fields
Krisztian Szucs created ARROW-7646:
--

Summary: [C++][Dataset] Ability to restrict Hive partitioning to certain fields
Key: ARROW-7646
URL: https://issues.apache.org/jira/browse/ARROW-7646
Project: Apache Arrow
Issue Type: New Feature
Components: C++ - Dataset
Reporter: Krisztian Szucs

I can imagine use cases where the user wants only a subset of the fields discovered by the HivePartitioningFactory. It would look like the following at the Python user-level API:

{code:python}
partitioning(field_names=[...], flavor='hive')
{code}
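What restricting discovery to certain fields would mean can be sketched with a hypothetical helper that keeps only the requested key=value path segments (this is an illustration of the semantics, not the dataset API):

```python
def parse_hive_segments(path, field_names=None):
    # Hive-style partition paths encode fields as key=value segments,
    # e.g. /year=2020/month=01/part-0.parquet. With field_names given,
    # only the requested keys are kept.
    fields = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            if field_names is None or key in field_names:
                fields[key] = value
    return fields

assert parse_hive_segments("/year=2020/month=01/part-0.parquet",
                           field_names=["year"]) == {"year": "2020"}
```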
[jira] [Created] (ARROW-7645) [Packaging][deb][RPM] arm64 build by crossbow is broken
Kouhei Sutou created ARROW-7645:
---

Summary: [Packaging][deb][RPM] arm64 build by crossbow is broken
Key: ARROW-7645
URL: https://issues.apache.org/jira/browse/ARROW-7645
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou