[jira] [Created] (ARROW-7644) Add vcpkg installation instructions
JackBoosY created ARROW-7644: Summary: Add vcpkg installation instructions Key: ARROW-7644 URL: https://issues.apache.org/jira/browse/ARROW-7644 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 0.15.1 Environment: All platforms Reporter: JackBoosY arrow is available as a port in vcpkg, a C++ library manager that simplifies installation for arrow and other project dependencies. Documenting the install process here will help users get started by providing a single set of commands to build arrow, ready to be included in their projects. We also test whether our library ports build in various configurations (dynamic, static) on various platforms (OSX, Linux, Windows: x86, x64, UWP, ARM) to keep a wide coverage for users. I'm a maintainer for vcpkg, and [here is what the port script looks like|https://github.com/microsoft/vcpkg/blob/master/ports/arrow/portfile.cmake]. We try to keep the library maintained as close as possible to the original library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7643) Add ToList method to all Array
Takashi Hashida created ARROW-7643: Summary: Add ToList method to all Array Key: ARROW-7643 URL: https://issues.apache.org/jira/browse/ARROW-7643 Project: Apache Arrow Issue Type: Improvement Components: C# Reporter: Takashi Hashida Converting an (Arrow)Array to a List would be useful for users. However, some arrays have no method for doing so. We should add a ToList method to such arrays. See these discussions: https://github.com/apache/arrow/pull/6102#discussion_r368347992 https://github.com/apache/arrow/pull/6102#discussion_r368349401
[jira] [Created] (ARROW-7642) [Rust] Create build.rs to generate flatbuffers files
Andy Grove created ARROW-7642: Summary: [Rust] Create build.rs to generate flatbuffers files Key: ARROW-7642 URL: https://issues.apache.org/jira/browse/ARROW-7642 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Fix For: 1.0.0 We should take the logic from the regen.sh [1] bash script and convert it into a Rust build.rs script that can run in CI. This would require flatc to be installed to be able to build the project. [1] https://github.com/apache/arrow/blob/master/rust/arrow/regen.sh
[jira] [Created] (ARROW-7641) [R] Make dataset vignette have executable code
Neal Richardson created ARROW-7641: Summary: [R] Make dataset vignette have executable code Key: ARROW-7641 URL: https://issues.apache.org/jira/browse/ARROW-7641 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson
Arrow sync call January 22 at 12:00 US/Eastern, 17:00 UTC
Hi all, Reminder that our biweekly call is tomorrow (or much later today, depending on your time zone) at https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will be sent out to the mailing list afterwards. Neal
Re: [DISCUSS] C Data Interface, take 2
Thanks Jacques. I agree that none of the ways forward on this problem are wholly satisfactory. We should encourage users of this C API to prefer emitting byte-aligned / 0-offset in line with the IPC spec wherever possible. It will be interesting to see after a period of time how downstream projects are able to leverage this interface as part of their overall Arrow adoption. On Tue, Jan 21, 2020 at 4:05 PM Jacques Nadeau wrote: > > Upon further reflection (and as I've noted on the PR), I think merging the > ABI as a general feature of Arrow is preferable to making this be a > subinterface of the C++ part of the project. While the offset field is > awkward given its absence from the IPC spec, it's better to avoid > fragmenting the community based on that fields absence or existence. > > Thanks for the lively discussion Antoine, Wes and others! > > J > > On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney wrote: > > > Independent of the particulars of the discussion, the C++ project > > needs to be free to create a C API for itself. If you want to try to > > block the C++ contributors from doing this we may be barreling toward > > a governance crisis in the project. I'm stepping back from this > > discussion for a time now to allow others to catch up on the > > discussion and to weigh in as needed > > > > On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau wrote: > > > > > > I don't see this as an endogenous concern of the C++ project. I > > appreciate > > > your goal with saying so but I think this has broader ramifications > > around > > > fragmentation of the project. > > > > > > The core challenge that we're dealing with is we introduced foundational > > > concepts in some implementations that go beyond the spec and then > > provided > > > useful features based on them (in this case, the offset concept). 
> > Ideally, > > > those concepts are first introduced at the specification level so there > > > aren't inconsistent viewpoints of what Arrow is (which I believe is what > > is > > > happening here). Having a cross-language specification for in-memory > > > processing is a new concept so it isn't surprising that we're going to > > > learn these things along the way. > > > > > > Without this, we create a slippery slope of fragmentation between the > > > specifications and the implementations. I understand that the toothpaste > > is > > > out of the tube in this particular case. We can respond in two ways: stop > > > the slip or continue to slide down the slope. I'm inclined to stop the > > slip. > > > > > > As I said on the GitHub, I'm struggling with how much of this should be > > > solved in the project. I'm going to pause a bit on responding to reflect > > > further about this as well to reduce the likelihood that this devolves > > into > > > a flame war (which is always a risk with complex issues such as these). > > > > > > > > > > > > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney > > wrote: > > > > > > > hi Jacques, > > > > > > > > Taking a step back from the discussion, the original problem statement > > > > was to enable third party projects to produce the data structure used > > > > by C++ Array classes in C without depending on the C++ code > > > > > > > > That's the ArrayData class here > > > > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232 > > > > > > > > It is important for us simplify the programming interface with the C++ > > > > library, so I think that we should address this as an endogenous > > > > concern of the C++ project, namely providing a "C API for the C++ > > > > project". The C API for the C++ library needs to mirror what's in the > > > > C++ project (i.e. the ArrayData data structure). We should not > > > > advertise this as being a part of the project specification. 
> > > > > > > > - Wes > > > > > > > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau > > > > wrote: > > > > > > > > > > As I noted on the pull request, I think fundamentally this work is at > > > > odds > > > > > with the Arrow specification and being used to introduce a shadow > > > > > specification. > > > > > > > > > > I don't think our intentions about how people should use something > > really > > > > > influence how people will actually use or perceive it. They'll just > > find > > > > > supported Arrow code and expose things based on it and call it "Arrow > > > > > compatible". In other words, I don't think people in the outside > > world > > > > will > > > > > be able to perceive the distinction between "Arrow C++ compatible" > > and > > > > > "Arrow compatible". > > > > > > > > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney > > > > wrote: > > > > > > > > > > > hi folks, > > > > > > > > > > > > I just made a comment in https://github.com/apache/arrow/pull/6026 > > > > > > that I wanted to surface here on the mailing list. > > > > > > > > > > > > It seems that to reach consensus for a C interface that is > > intended to > > > > > > be broadly used by multiple programming
Re: new to Arrow / integration with Kudu
I'm interested to see an Arrow adapter for Apache Kudu developed. My gut feeling is that this work should be undertaken in Kudu itself, potentially having the tablet servers producing Arrow Record Batches locally and sending them to the client rather than converting to Kudu's own on-the-wire record format and then deserializing into Arrow on the receiver side. It might be worth a conversation with the Kudu community to see what they think. Of course one can build an Arrow deserializer for the current Kudu C++ client API and probably get pretty good performance. see also ARROW-814 https://issues.apache.org/jira/browse/ARROW-814 On Tue, Jan 21, 2020 at 12:32 PM Shazz wrote: > > Hi, > > I'm thinking of an architecture to store and access efficiently tabular > data and I was told to look at Arrow and Kudu. > I saw on the frontpage a diagram where Arrow can be integrated with Kudu > but nothing in the documentation. Is there an example available > somewhere ? > > Thanks ! > > -- > sh...@metaverse.fr > GPG public key ID : B517C4C8
Re: [DISCUSS] C Data Interface, take 2
Upon further reflection (and as I've noted on the PR), I think merging the ABI as a general feature of Arrow is preferable to making this be a subinterface of the C++ part of the project. While the offset field is awkward given its absence from the IPC spec, it's better to avoid fragmenting the community based on that fields absence or existence. Thanks for the lively discussion Antoine, Wes and others! J On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney wrote: > Independent of the particulars of the discussion, the C++ project > needs to be free to create a C API for itself. If you want to try to > block the C++ contributors from doing this we may be barreling toward > a governance crisis in the project. I'm stepping back from this > discussion for a time now to allow others to catch up on the > discussion and to weigh in as needed > > On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau wrote: > > > > I don't see this as an endogenous concern of the C++ project. I > appreciate > > your goal with saying so but I think this has broader ramifications > around > > fragmentation of the project. > > > > The core challenge that we're dealing with is we introduced foundational > > concepts in some implementations that go beyond the spec and then > provided > > useful features based on them (in this case, the offset concept). > Ideally, > > those concepts are first introduced at the specification level so there > > aren't inconsistent viewpoints of what Arrow is (which I believe is what > is > > happening here). Having a cross-language specification for in-memory > > processing is a new concept so it isn't surprising that we're going to > > learn these things along the way. > > > > Without this, we create a slippery slope of fragmentation between the > > specifications and the implementations. I understand that the toothpaste > is > > out of the tube in this particular case. We can respond in two ways: stop > > the slip or continue to slide down the slope. 
I'm inclined to stop the > slip. > > > > As I said on the GitHub, I'm struggling with how much of this should be > > solved in the project. I'm going to pause a bit on responding to reflect > > further about this as well to reduce the likelihood that this devolves > into > > a flame war (which is always a risk with complex issues such as these). > > > > > > > > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney > wrote: > > > > > hi Jacques, > > > > > > Taking a step back from the discussion, the original problem statement > > > was to enable third party projects to produce the data structure used > > > by C++ Array classes in C without depending on the C++ code > > > > > > That's the ArrayData class here > > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232 > > > > > > It is important for us simplify the programming interface with the C++ > > > library, so I think that we should address this as an endogenous > > > concern of the C++ project, namely providing a "C API for the C++ > > > project". The C API for the C++ library needs to mirror what's in the > > > C++ project (i.e. the ArrayData data structure). We should not > > > advertise this as being a part of the project specification. > > > > > > - Wes > > > > > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau > > > wrote: > > > > > > > > As I noted on the pull request, I think fundamentally this work is at > > > odds > > > > with the Arrow specification and being used to introduce a shadow > > > > specification. > > > > > > > > I don't think our intentions about how people should use something > really > > > > influence how people will actually use or perceive it. They'll just > find > > > > supported Arrow code and expose things based on it and call it "Arrow > > > > compatible". In other words, I don't think people in the outside > world > > > will > > > > be able to perceive the distinction between "Arrow C++ compatible" > and > > > > "Arrow compatible". 
> > > > > > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney > > > wrote: > > > > > > > > > hi folks, > > > > > > > > > > I just made a comment in https://github.com/apache/arrow/pull/6026 > > > > > that I wanted to surface here on the mailing list. > > > > > > > > > > It seems that to reach consensus for a C interface that is > intended to > > > > > be broadly used by multiple programming languages, we may make some > > > > > compromises that harm or outright undermine some of the use cases > that > > > > > motivated the creation of the C interface in the first place. That > > > > > does not seem good. I wonder if it would be more productive to > reduce > > > > > the scope of the project to merely providing a C-header-based data > > > > > interface to the C++ project only. That was the original problem > > > > > statement and it seems in attempting to make it useful beyond C++ > has > > > > > made it difficult to reach consensus. > > > > > > > > > > Thanks > > > > > Wes > > > > > > > > > > On Sat, Dec 21, 2019 at 4:38 PM Jacques
new to Arrow / integration with Kudu
Hi, I'm thinking of an architecture to store and access efficiently tabular data and I was told to look at Arrow and Kudu. I saw on the frontpage a diagram where Arrow can be integrated with Kudu but nothing in the documentation. Is there an example available somewhere ? Thanks ! -- sh...@metaverse.fr GPG public key ID : B517C4C8
[jira] [Created] (ARROW-7637) [GLib] Check components installed when building
Yosuke Shiro created ARROW-7637: Summary: [GLib] Check components installed when building Key: ARROW-7637 URL: https://issues.apache.org/jira/browse/ARROW-7637 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Yosuke Shiro Assignee: Yosuke Shiro
[jira] [Created] (ARROW-7636) [Python] Clean-up the pyarrow.dataset.partitioning() API
Joris Van den Bossche created ARROW-7636: Summary: [Python] Clean-up the pyarrow.dataset.partitioning() API Key: ARROW-7636 URL: https://issues.apache.org/jira/browse/ARROW-7636 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.0 A left-over review comment at https://github.com/apache/arrow/pull/6022#discussion_r367016454 on the API of {{partitioning()}}
[jira] [Created] (ARROW-7635) [C++] Add pkg-config support for each components
Yosuke Shiro created ARROW-7635: Summary: [C++] Add pkg-config support for each components Key: ARROW-7635 URL: https://issues.apache.org/jira/browse/ARROW-7635 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yosuke Shiro Assignee: Yosuke Shiro
[jira] [Created] (ARROW-7634) [Python] Dataset tests failing on Windows to parse file path
Joris Van den Bossche created ARROW-7634: Summary: [Python] Dataset tests failing on Windows to parse file path Key: ARROW-7634 URL: https://issues.apache.org/jira/browse/ARROW-7634 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.0 See e.g. https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=5217=logs=4c86bc1b-1091-5192-4404-c74dfaad23e7=ec99a26b-0264-5e86-36fb-9cfd0ca0f9f3=4066 The tests fail on the backslashes in the pathlib file paths, and they are clearly not run in CI, since this was not caught.
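For readers unfamiliar with the failure mode, here is a minimal sketch of how a pathlib path ends up with backslashes on Windows and how it can be turned into a forward-slash string. The helper name `normalize_path` and the sample path are hypothetical; this is not the fix that was applied in Arrow, only an illustration using the standard library.

```python
from pathlib import PureWindowsPath


def normalize_path(path) -> str:
    """Return `path` as a string with forward slashes only.

    Hypothetical helper: PureWindowsPath accepts both '\\' and '/' as
    separators, so it can be used on any platform to parse the input.
    """
    return PureWindowsPath(path).as_posix()


# A Windows-style path as str() would render it on that platform.
win_path = PureWindowsPath(r"C:\tmp\dataset\part=1\data.parquet")
print(str(win_path))             # backslash-separated
print(normalize_path(win_path))  # forward-slash-separated
```

Passing the normalized string (rather than `str(path)`) to code that only understands `/` separators is one way to avoid this class of failure.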
[jira] [Created] (ARROW-7633) [C++][CI] Create fuzz targets for tensors and sparse tensors
Antoine Pitrou created ARROW-7633: Summary: [C++][CI] Create fuzz targets for tensors and sparse tensors Key: ARROW-7633 URL: https://issues.apache.org/jira/browse/ARROW-7633 Project: Apache Arrow Issue Type: Task Reporter: Antoine Pitrou These use separate API calls, disjoint from RecordBatchFileReader and RecordBatchStreamReader, so it is probably more natural to expose them as separate fuzz targets.
[NIGHTLY] Arrow Build Report for Job nightly-2020-01-21-0
Arrow Build Report for Job nightly-2020-01-21-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0 Failed Tasks: - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py38 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-debian-buster - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-gandiva-jar-osx - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.7-spark-master - test-r-rstudio-r-base-3.6-bionic: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-test-r-rstudio-r-base-3.6-bionic - test-ubuntu-fuzzit-fuzzing: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-ubuntu-fuzzit-fuzzing - test-ubuntu-fuzzit-regression: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-ubuntu-fuzzit-regression - wheel-manylinux2010-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-wheel-manylinux2010-cp35m - wheel-manylinux2010-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-wheel-manylinux2010-cp37m - wheel-osx-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-wheel-osx-cp37m - wheel-win-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp36m - wheel-win-cp37m: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp37m - wheel-win-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-8 - conda-linux-gcc-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py27 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py27 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py38 - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-debian-stretch - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-gandiva-jar-trusty - homebrew-cpp: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-homebrew-cpp - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-macos-r-autobrew - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-cpp - test-conda-python-2.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-2.7-pandas-latest - test-conda-python-2.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-2.7 - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.7-dask-latest -
[jira] [Created] (ARROW-7632) [C++] [CI] Improve fuzzing seed corpus
Antoine Pitrou created ARROW-7632: Summary: [C++] [CI] Improve fuzzing seed corpus Key: ARROW-7632 URL: https://issues.apache.org/jira/browse/ARROW-7632 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou The coverage stats produced by OSS-Fuzz instruct us to guide the fuzzing process towards the following areas: - extension arrays - tensors - sparse tensors
[jira] [Created] (ARROW-7631) [C++][Gandiva] return zero if there is an overflow while converting a decimal to a lower precision/scale
Prudhvi Porandla created ARROW-7631: Summary: [C++][Gandiva] return zero if there is an overflow while converting a decimal to a lower precision/scale Key: ARROW-7631 URL: https://issues.apache.org/jira/browse/ARROW-7631 Project: Apache Arrow Issue Type: Bug Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla
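The issue has no description, so the following is only a sketch of the semantics its title suggests, assuming the title states the intended behaviour: cast a decimal to a smaller precision/scale and yield zero when the result no longer fits. The function name `cast_decimal` and the half-up rounding mode are assumptions for illustration; this is plain Python, not Gandiva code.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Wide context so that quantize() itself never fails for 38-digit decimals.
_WIDE = Context(prec=76)


def cast_decimal(value: Decimal, precision: int, scale: int) -> Decimal:
    """Cast `value` to decimal(precision, scale); return zero on overflow.

    Hypothetical helper. Rounding mode (half up) is an assumption.
    """
    quantum = Decimal(1).scaleb(-scale)  # 10 ** -scale
    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP, context=_WIDE)
    # Overflow: the unscaled integer needs more digits than `precision`.
    if len(rounded.as_tuple().digits) > precision:
        return Decimal(0).quantize(quantum)
    return rounded


print(cast_decimal(Decimal("123.456"), 5, 2))    # fits after rounding
print(cast_decimal(Decimal("12345.678"), 5, 2))  # overflows, so zero
```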
[jira] [Created] (ARROW-7629) [C++][CI] Add fuzz regression files to arrow-testing
Antoine Pitrou created ARROW-7629: Summary: [C++][CI] Add fuzz regression files to arrow-testing Key: ARROW-7629 URL: https://issues.apache.org/jira/browse/ARROW-7629 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou
[jira] [Created] (ARROW-7630) [C++][CI] Check fuzz crash regressions in CI
Antoine Pitrou created ARROW-7630: Summary: [C++][CI] Check fuzz crash regressions in CI Key: ARROW-7630 URL: https://issues.apache.org/jira/browse/ARROW-7630 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou
[jira] [Created] (ARROW-7628) PyArrow read_csv problematic cases
Athanassios Hatzis created ARROW-7628: Summary: PyArrow read_csv problematic cases Key: ARROW-7628 URL: https://issues.apache.org/jira/browse/ARROW-7628 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1 Environment: Ubuntu bionic Reporter: Athanassios Hatzis Attachments: spc_catalog.tsv

Hi, I have found two problematic cases, possibly bugs, in the pyarrow read_csv module. I have written the following piece of code and run a test on the attached CSV file. The code compares pandas read_csv with pyarrow csv to show that the latter does not behave correctly with the following sets of parameters:

1. Change the parameter skip_rows = 10:

{code:python}
Traceback (most recent call last):
  File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in
    read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
{code}

2. skip_rows = 12, columns = None

In this case you don't get the error above, because the projection is None, but compare the two dataframes: the one from pyarrow with to_pandas() and the one from the output of pandas read_csv(). You will notice that the first one has not correctly parsed the null values ('\\N') in the last column, catname. In contrast, pandas read_csv managed to parse all the null values correctly.

{code:python}
Out[28]:
   1082  991   16.5    200  2014-09-10  1  bar
0  1082  997   0.55  100.0  2014-09-10  1  bar
1  1082  998   7.95  200.0  2014-03-03  0   \N
2  1083  998  12.50    NaN         NaT  0  bar
3  1083  999   1.00    NaN         NaT  0  foo
4  1084  994  57.30  100.0  2014-12-20  1   \N
5  1084  995  22.20    NaN         NaT  0  foo
6  1084  998  48.60  200.0  2014-12-20  1  foo
{code}

Python code to test the attached CSV file for the bugs reported above:

{code:python}
from pyarrow import csv
import pyarrow as pa
import pandas as pd

file_location = 'spc_catalog.tsv'
sep = '\t'
nulls = ['\\N']
columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
column_names = None
column_types = None
skip_rows = None
nrecords = None

# Read with pyarrow and convert to a pandas DataFrame
csv.read_csv(file_location,
             parse_options=csv.ParseOptions(delimiter=sep),
             convert_options=csv.ConvertOptions(include_columns=columns, column_types=column_types, null_values=nulls),
             read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
             ).to_pandas()

# Read the same file with pandas for comparison
pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns, nrows=nrecords, names=column_names, dtype=column_types)
{code}
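A possible reading of case 1, offered as a sketch rather than a diagnosis: the pyarrow documentation describes skip_rows as the number of rows skipped before the column names, and says that with autogenerate_column_names=False and no column_names the names are read from the first row after the skipped ones. Under that reading, skipping rows turns a data row into the header, so 'catcost' is no longer a column name. The snippet below mimics that rule with the standard library only; `read_with_skip` and the three-line sample are hypothetical and do not use the attached file or pyarrow itself.

```python
import csv
import io


def read_with_skip(text: str, skip_rows: int, delimiter: str = "\t"):
    """Skip `skip_rows` rows, then treat the next row as the header.

    Hypothetical helper mimicking the documented skip_rows semantics.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    remaining = rows[skip_rows:]
    return remaining[0], remaining[1:]


# Tiny made-up sample: a header row followed by two data rows.
sample = "catcost\tcatname\n16.5\tbar\n0.55\t\\N\n"

header, data = read_with_skip(sample, skip_rows=0)
print(header)  # the real column names

header, data = read_with_skip(sample, skip_rows=1)
print(header)  # a data row has become the header
print("catcost" in header)
```

If that reading is right, passing explicit column_names together with skip_rows would be one way to keep the original names available to include_columns.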
[jira] [Created] (ARROW-7627) [C++][Gandiva] Optimize string truncate function
Projjal Chanda created ARROW-7627: Summary: [C++][Gandiva] Optimize string truncate function Key: ARROW-7627 URL: https://issues.apache.org/jira/browse/ARROW-7627 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda The current string truncate function unnecessarily traverses the string twice. This can be done in a single pass.
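The issue does not say what the two traversals are. As an illustration of the single-pass idea, here is a sketch that truncates a UTF-8 byte string to a given number of characters while scanning it once, assuming the input is valid UTF-8. The function name `utf8_truncate` is hypothetical and this is plain Python, not the Gandiva implementation.

```python
def utf8_truncate(data: bytes, max_chars: int) -> bytes:
    """Return the prefix of `data` holding at most `max_chars` characters.

    Hypothetical helper; assumes `data` is valid UTF-8. The scan stops as
    soon as the cut point is found, so the input is traversed at most once.
    """
    if max_chars <= 0:
        return b""
    chars_seen = 0
    for index, byte in enumerate(data):
        # Any byte that is not a continuation byte (0b10xxxxxx) starts
        # a new character.
        if (byte & 0xC0) != 0x80:
            if chars_seen == max_chars:
                return data[:index]
            chars_seen += 1
    return data  # fewer than or exactly max_chars characters


print(utf8_truncate("héllo".encode("utf-8"), 2).decode("utf-8"))
```

Counting characters and locating the byte offset happen in the same loop, which is the point of the one-pass formulation.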
[jira] [Created] (ARROW-7626) [Parquet][GLib] Add support for version macros
Kouhei Sutou created ARROW-7626: Summary: [Parquet][GLib] Add support for version macros Key: ARROW-7626 URL: https://issues.apache.org/jira/browse/ARROW-7626 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou