date:20200121

[jira] [Created] (ARROW-7644) Add vcpkg installation instructions

2020-01-21 Thread JackBoosY (Jira)

JackBoosY created ARROW-7644:


 Summary: Add vcpkg installation instructions
 Key: ARROW-7644
 URL: https://issues.apache.org/jira/browse/ARROW-7644
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.15.1
 Environment: All platforms
Reporter: JackBoosY


arrow is available as a port in vcpkg, a C++ library manager that simplifies 
installation for arrow and other project dependencies. Documenting the install 
process here will help users get started by providing a single set of commands 
to build arrow, ready to be included in their projects.

We also test whether our library ports build in various configurations 
(dynamic, static) on various platforms (OSX, Linux, Windows: x86, x64, UWP, 
ARM) to keep a wide coverage for users.

I'm a maintainer for vcpkg, and 
[here is what the port script looks like|https://github.com/microsoft/vcpkg/blob/master/ports/arrow/portfile.cmake].
We try to keep the library maintained as close as possible to the original 
library.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-7643) Add ToList method to all Array

2020-01-21 Thread Takashi Hashida (Jira)

Takashi Hashida created ARROW-7643:
--

 Summary: Add ToList method to all Array
 Key: ARROW-7643
 URL: https://issues.apache.org/jira/browse/ARROW-7643
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Takashi Hashida


Converting an (Arrow)Array to a List would be useful for users.
However, some arrays have no method to do this.
We should add a ToList method to such arrays.

See these discussions.
https://github.com/apache/arrow/pull/6102#discussion_r368347992
https://github.com/apache/arrow/pull/6102#discussion_r368349401




[jira] [Created] (ARROW-7642) [Rust] Create build.rs to generate flatbuffers files

2020-01-21 Thread Andy Grove (Jira)

Andy Grove created ARROW-7642:
-

 Summary: [Rust] Create build.rs to generate flatbuffers files
 Key: ARROW-7642
 URL: https://issues.apache.org/jira/browse/ARROW-7642
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
 Fix For: 1.0.0


We should take the logic from the regen.sh [1] bash script and convert it into 
a Rust build.rs script that can run in CI. This would require flatc to be 
installed to be able to build the project.

 

[1] https://github.com/apache/arrow/blob/master/rust/arrow/regen.sh




[jira] [Created] (ARROW-7641) [R] Make dataset vignette have executable code

2020-01-21 Thread Neal Richardson (Jira)

Neal Richardson created ARROW-7641:
--

 Summary: [R] Make dataset vignette have executable code
 Key: ARROW-7641
 URL: https://issues.apache.org/jira/browse/ARROW-7641
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson







Arrow sync call January 22 at 12:00 US/Eastern, 17:00 UTC

2020-01-21 Thread Neal Richardson

Hi all,
Reminder that our biweekly call is tomorrow (or much later today, depending
on your time zone) at https://meet.google.com/vtm-teks-phx. All are welcome
to join. Notes will be sent out to the mailing list afterwards.

Neal

Re: [DISCUSS] C Data Interface, take 2

2020-01-21 Thread Wes McKinney

Thanks Jacques. I agree that none of the ways forward on this problem
are wholly satisfactory. We should encourage users of this C API to
prefer emitting byte-aligned / 0-offset in line with the IPC spec
wherever possible. It will be interesting to see after a period of
time how downstream projects are able to leverage this interface as
part of their overall Arrow adoption.
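
To make the byte-aligned / 0-offset point concrete, here is a minimal sketch in plain Python. It is not the proposed C struct and not the C++ ArrayData class; `ArraySketch`, `slice_view` and `to_zero_offset` are hypothetical names used only for illustration, and the only assumption about the format is that validity bitmaps are least-significant-bit first. It shows that a slice can share the parent's buffer by carrying an offset, while producing a 0-offset array means repacking the bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArraySketch:
    """Hypothetical stand-in for an array: a validity bitmap plus a logical window."""
    validity: bytes  # LSB-first bitmap, 1 = valid
    length: int
    offset: int = 0

    def is_valid(self, i: int) -> bool:
        # Logical element i lives at physical bit (offset + i).
        bit = self.offset + i
        return bool((self.validity[bit >> 3] >> (bit & 7)) & 1)

    def slice_view(self, start: int, length: int) -> "ArraySketch":
        # Zero-copy: reuse the same buffer and only move the offset.
        return ArraySketch(self.validity, length, self.offset + start)

    def to_zero_offset(self) -> "ArraySketch":
        # Repack the bits so that logical element 0 sits at physical bit 0.
        out = bytearray((self.length + 7) // 8)
        for i in range(self.length):
            if self.is_valid(i):
                out[i >> 3] |= 1 << (i & 7)
        return ArraySketch(bytes(out), self.length, 0)


arr = ArraySketch(bytes([0b10110101]), 8)
part = arr.slice_view(3, 4)      # shares arr.validity, offset == 3
packed = part.to_zero_offset()   # new buffer, offset == 0
print(part.offset, packed.offset, packed.validity)
```

A producer that hands out `packed` rather than `part` pays for a copy but needs no offset field; one that hands out `part` avoids the copy but needs the consumer to understand the offset.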

On Tue, Jan 21, 2020 at 4:05 PM Jacques Nadeau  wrote:
>
> Upon further reflection (and as I've noted on the PR), I think merging the
> ABI as a general feature of Arrow is preferable to making this be a
> subinterface of the C++ part of the project. While the offset field is
> awkward given its absence from the IPC spec, it's better to avoid
> fragmenting the community based on that field's absence or existence.
>
> Thanks for the lively discussion Antoine, Wes and others!
>
> J
>
> On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney  wrote:
>
> > Independent of the particulars of the discussion, the C++ project
> > needs to be free to create a C API for itself. If you want to try to
> > block the C++ contributors from doing this we may be barreling toward
> > a governance crisis in the project. I'm stepping back from this
> > discussion for a time now to allow others to catch up on the
> > discussion and to weigh in as needed
> >
> > On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau  wrote:
> > >
> > > I don't see this as an endogenous concern of the C++ project. I
> > > appreciate your goal with saying so but I think this has broader
> > > ramifications around fragmentation of the project.
> > >
> > > The core challenge that we're dealing with is we introduced foundational
> > > concepts in some implementations that go beyond the spec and then
> > > provided useful features based on them (in this case, the offset
> > > concept). Ideally, those concepts are first introduced at the
> > > specification level so there aren't inconsistent viewpoints of what
> > > Arrow is (which I believe is what is happening here). Having a
> > > cross-language specification for in-memory processing is a new concept
> > > so it isn't surprising that we're going to learn these things along the
> > > way.
> > >
> > > Without this, we create a slippery slope of fragmentation between the
> > > specifications and the implementations. I understand that the toothpaste
> > > is out of the tube in this particular case. We can respond in two ways:
> > > stop the slip or continue to slide down the slope. I'm inclined to stop
> > > the slip.
> > >
> > > As I said on the GitHub, I'm struggling with how much of this should be
> > > solved in the project. I'm going to pause a bit on responding to reflect
> > > further about this as well to reduce the likelihood that this devolves
> > > into a flame war (which is always a risk with complex issues such as
> > > these).
> > >
> > >
> > >
> > > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney wrote:
> > >
> > > > hi Jacques,
> > > >
> > > > Taking a step back from the discussion, the original problem statement
> > > > was to enable third party projects to produce the data structure used
> > > > by C++ Array classes in C without depending on the C++ code
> > > >
> > > > That's the ArrayData class here
> > > >
> > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
> > > >
> > > > It is important for us to simplify the programming interface with the C++
> > > > library, so I think that we should address this as an endogenous
> > > > concern of the C++ project, namely providing a "C API for the C++
> > > > project". The C API for the C++ library needs to mirror what's in the
> > > > C++ project (i.e. the ArrayData data structure). We should not
> > > > advertise this as being a part of the project specification.
> > > >
> > > > - Wes
> > > >
> > > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau wrote:
> > > > >
> > > > > As I noted on the pull request, I think fundamentally this work is at
> > > > > odds with the Arrow specification and being used to introduce a shadow
> > > > > specification.
> > > > >
> > > > > I don't think our intentions about how people should use something
> > > > > really influence how people will actually use or perceive it. They'll
> > > > > just find supported Arrow code and expose things based on it and call
> > > > > it "Arrow compatible". In other words, I don't think people in the
> > > > > outside world will be able to perceive the distinction between "Arrow
> > > > > C++ compatible" and "Arrow compatible".
> > > > >
> > > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney wrote:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > I just made a comment in https://github.com/apache/arrow/pull/6026
> > > > > > that I wanted to surface here on the mailing list.
> > > > > >
> > > > > > It seems that to reach consensus for a C interface that is intended
> > > > > > to be broadly used by multiple programming

Re: new to Arrow / integration with Kudu

2020-01-21 Thread Wes McKinney

I'm interested to see an Arrow adapter for Apache Kudu developed. My
gut feeling is that this work should be undertaken in Kudu itself,
potentially having the tablet servers producing Arrow Record Batches
locally and sending them to the client rather than converting to
Kudu's own on-the-wire record format and then deserializing into Arrow
on the receiver side. It might be worth a conversation with the Kudu
community to see what they think.

Of course one can build an Arrow deserializer for the current Kudu C++
client API and probably get pretty good performance. See also
ARROW-814:

https://issues.apache.org/jira/browse/ARROW-814

On Tue, Jan 21, 2020 at 12:32 PM Shazz  wrote:
>
> Hi,
>
> I'm thinking of an architecture to store and access tabular data
> efficiently, and I was told to look at Arrow and Kudu.
> I saw a diagram on the front page where Arrow can be integrated with Kudu,
> but found nothing in the documentation. Is there an example available
> somewhere?
>
> Thanks !
>
> --
> sh...@metaverse.fr
> GPG public key ID : B517C4C8

Re: [DISCUSS] C Data Interface, take 2

2020-01-21 Thread Jacques Nadeau

Upon further reflection (and as I've noted on the PR), I think merging the
ABI as a general feature of Arrow is preferable to making this be a
subinterface of the C++ part of the project. While the offset field is
awkward given its absence from the IPC spec, it's better to avoid
fragmenting the community based on that field's absence or existence.

Thanks for the lively discussion Antoine, Wes and others!

J

On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney  wrote:

> Independent of the particulars of the discussion, the C++ project
> needs to be free to create a C API for itself. If you want to try to
> block the C++ contributors from doing this we may be barreling toward
> a governance crisis in the project. I'm stepping back from this
> discussion for a time now to allow others to catch up on the
> discussion and to weigh in as needed
>
> On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau  wrote:
> >
> > I don't see this as an endogenous concern of the C++ project. I
> > appreciate your goal with saying so but I think this has broader
> > ramifications around fragmentation of the project.
> >
> > The core challenge that we're dealing with is we introduced foundational
> > concepts in some implementations that go beyond the spec and then
> > provided useful features based on them (in this case, the offset
> > concept). Ideally, those concepts are first introduced at the
> > specification level so there aren't inconsistent viewpoints of what
> > Arrow is (which I believe is what is happening here). Having a
> > cross-language specification for in-memory processing is a new concept
> > so it isn't surprising that we're going to learn these things along the
> > way.
> >
> > Without this, we create a slippery slope of fragmentation between the
> > specifications and the implementations. I understand that the toothpaste
> > is out of the tube in this particular case. We can respond in two ways:
> > stop the slip or continue to slide down the slope. I'm inclined to stop
> > the slip.
> >
> > As I said on the GitHub, I'm struggling with how much of this should be
> > solved in the project. I'm going to pause a bit on responding to reflect
> > further about this as well to reduce the likelihood that this devolves
> > into a flame war (which is always a risk with complex issues such as
> > these).
> >
> >
> >
> > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney wrote:
> >
> > > hi Jacques,
> > >
> > > Taking a step back from the discussion, the original problem statement
> > > was to enable third party projects to produce the data structure used
> > > by C++ Array classes in C without depending on the C++ code
> > >
> > > That's the ArrayData class here
> > >
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
> > >
> > > It is important for us to simplify the programming interface with the C++
> > > library, so I think that we should address this as an endogenous
> > > concern of the C++ project, namely providing a "C API for the C++
> > > project". The C API for the C++ library needs to mirror what's in the
> > > C++ project (i.e. the ArrayData data structure). We should not
> > > advertise this as being a part of the project specification.
> > >
> > > - Wes
> > >
> > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau wrote:
> > > >
> > > > As I noted on the pull request, I think fundamentally this work is at
> > > > odds with the Arrow specification and being used to introduce a shadow
> > > > specification.
> > > >
> > > > I don't think our intentions about how people should use something
> > > > really influence how people will actually use or perceive it. They'll
> > > > just find supported Arrow code and expose things based on it and call
> > > > it "Arrow compatible". In other words, I don't think people in the
> > > > outside world will be able to perceive the distinction between "Arrow
> > > > C++ compatible" and "Arrow compatible".
> > > >
> > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney wrote:
> > > >
> > > > > hi folks,
> > > > >
> > > > > I just made a comment in https://github.com/apache/arrow/pull/6026
> > > > > that I wanted to surface here on the mailing list.
> > > > >
> > > > > It seems that to reach consensus for a C interface that is intended
> > > > > to be broadly used by multiple programming languages, we may make
> > > > > some compromises that harm or outright undermine some of the use
> > > > > cases that motivated the creation of the C interface in the first
> > > > > place. That does not seem good. I wonder if it would be more
> > > > > productive to reduce the scope of the project to merely providing a
> > > > > C-header-based data interface to the C++ project only. That was the
> > > > > original problem statement, and it seems that attempting to make it
> > > > > useful beyond C++ has made it difficult to reach consensus.
> > > > >
> > > > > Thanks
> > > > > Wes
> > > > >
> > > > > On Sat, Dec 21, 2019 at 4:38 PM Jacques

new to Arrow / integration with Kudu

2020-01-21 Thread Shazz


Hi,

I'm thinking of an architecture to store and access tabular data 
efficiently, and I was told to look at Arrow and Kudu.
I saw a diagram on the front page where Arrow can be integrated with Kudu, 
but found nothing in the documentation. Is there an example available 
somewhere?


Thanks !

--
sh...@metaverse.fr
GPG public key ID : B517C4C8

[jira] [Created] (ARROW-7637) [GLib] Check components installed when building

2020-01-21 Thread Yosuke Shiro (Jira)

Yosuke Shiro created ARROW-7637:
---

 Summary: [GLib] Check components installed when building
 Key: ARROW-7637
 URL: https://issues.apache.org/jira/browse/ARROW-7637
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro







[jira] [Created] (ARROW-7636) [Python] Clean-up the pyarrow.dataset.partitioning() API

2020-01-21 Thread Joris Van den Bossche (Jira)

Joris Van den Bossche created ARROW-7636:


 Summary: [Python] Clean-up the pyarrow.dataset.partitioning() API
 Key: ARROW-7636
 URL: https://issues.apache.org/jira/browse/ARROW-7636
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.0


A left-over review comment at 
https://github.com/apache/arrow/pull/6022#discussion_r367016454 on the API of 
{{partitioning()}}




[jira] [Created] (ARROW-7635) [C++] Add pkg-config support for each components

2020-01-21 Thread Yosuke Shiro (Jira)

Yosuke Shiro created ARROW-7635:
---

 Summary: [C++] Add pkg-config support for each components
 Key: ARROW-7635
 URL: https://issues.apache.org/jira/browse/ARROW-7635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro







[jira] [Created] (ARROW-7634) [Python] Dataset tests failing on Windows to parse file path

2020-01-21 Thread Joris Van den Bossche (Jira)

Joris Van den Bossche created ARROW-7634:


 Summary: [Python] Dataset tests failing on Windows to parse file 
path
 Key: ARROW-7634
 URL: https://issues.apache.org/jira/browse/ARROW-7634
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.0


See e.g. 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=5217=logs=4c86bc1b-1091-5192-4404-c74dfaad23e7=ec99a26b-0264-5e86-36fb-9cfd0ca0f9f3=4066

The tests fail on the backslashes in the pathlib file paths, and they are 
clearly not run in CI, since this was not caught.
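
The two string forms involved can be shown on any platform with `pathlib.PureWindowsPath`, which applies Windows path rules without needing a Windows machine. This is only a minimal sketch, not the failing test; the path below is made up, and whether the dataset code should accept backslashes or callers should normalize their paths is not settled in this issue.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only to show the two string forms.
path = PureWindowsPath("C:/data/dataset/part=1/file.parquet")

native = str(path)          # backslash-separated, as str() yields on Windows
portable = path.as_posix()  # forward-slash-separated

print(native)
print(portable)
```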




[jira] [Created] (ARROW-7633) [C++][CI] Create fuzz targets for tensors and sparse tensors

2020-01-21 Thread Antoine Pitrou (Jira)

Antoine Pitrou created ARROW-7633:
-

 Summary: [C++][CI] Create fuzz targets for tensors and sparse 
tensors
 Key: ARROW-7633
 URL: https://issues.apache.org/jira/browse/ARROW-7633
 Project: Apache Arrow
  Issue Type: Task
Reporter: Antoine Pitrou


These use separate API calls, disjoint from RecordBatchFileReader and 
RecordBatchStreamReader, so it is probably more natural to expose them as 
separate fuzz targets.




[NIGHTLY] Arrow Build Report for Job nightly-2020-01-21-0

2020-01-21 Thread Crossbow



Arrow Build Report for Job nightly-2020-01-21-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0

Failed Tasks:
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-debian-buster
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-gandiva-jar-osx
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.7-spark-master
- test-r-rstudio-r-base-3.6-bionic:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-test-r-rstudio-r-base-3.6-bionic
- test-ubuntu-fuzzit-fuzzing:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-ubuntu-fuzzit-fuzzing
- test-ubuntu-fuzzit-regression:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-ubuntu-fuzzit-regression
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-wheel-manylinux2010-cp37m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-wheel-osx-cp37m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-conda-osx-clang-py38
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-21-0-circle-test-conda-python-3.7-dask-latest
-

[jira] [Created] (ARROW-7632) [C++] [CI] Improve fuzzing seed corpus

2020-01-21 Thread Antoine Pitrou (Jira)

Antoine Pitrou created ARROW-7632:
-

 Summary: [C++] [CI] Improve fuzzing seed corpus
 Key: ARROW-7632
 URL: https://issues.apache.org/jira/browse/ARROW-7632
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The coverage stats produced by OSS-Fuzz instruct us to guide the fuzzing 
process towards the following areas:
- extension arrays
- tensors
- sparse tensors






[jira] [Created] (ARROW-7631) [C++][Gandiva] return zero if there is an overflow while converting a decimal to a lower precision/scale

2020-01-21 Thread Prudhvi Porandla (Jira)

Prudhvi Porandla created ARROW-7631:
---

 Summary: [C++][Gandiva] return zero if there is an overflow while 
converting a decimal to a lower precision/scale
 Key: ARROW-7631
 URL: https://issues.apache.org/jira/browse/ARROW-7631
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla







[jira] [Created] (ARROW-7629) [C++][CI] Add fuzz regression files to arrow-testing

2020-01-21 Thread Antoine Pitrou (Jira)

Antoine Pitrou created ARROW-7629:
-

 Summary: [C++][CI] Add fuzz regression files to arrow-testing
 Key: ARROW-7629
 URL: https://issues.apache.org/jira/browse/ARROW-7629
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou







[jira] [Created] (ARROW-7630) [C++][CI] Check fuzz crash regressions in CI

2020-01-21 Thread Antoine Pitrou (Jira)

Antoine Pitrou created ARROW-7630:
-

 Summary: [C++][CI] Check fuzz crash regressions in CI
 Key: ARROW-7630
 URL: https://issues.apache.org/jira/browse/ARROW-7630
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou







[jira] [Created] (ARROW-7628) PyArrow read_csv problematic cases

2020-01-21 Thread Athanassios Hatzis (Jira)

Athanassios Hatzis created ARROW-7628:
-

 Summary: PyArrow read_csv problematic cases
 Key: ARROW-7628
 URL: https://issues.apache.org/jira/browse/ARROW-7628
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
 Environment: Ubuntu bionic
Reporter: Athanassios Hatzis
 Attachments: spc_catalog.tsv

Hi, I have found two problematic cases, possibly bugs, in the pyarrow read_csv 
function. I have written the following piece of code and run a test on the 
attached CSV file. 

The code compares pandas read_csv with pyarrow csv to show that the latter does 
not behave correctly with the following sets of parameters:

1. Change the parameter to skip_rows = 10:
{code:python}
Traceback (most recent call last):
  File 
"/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py",
 line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in 
read_options=csv.ReadOptions(skip_rows=skip_rows, 
autogenerate_column_names=False, use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist 
in CSV file
{code}

2. Set skip_rows = 12 and columns = None.
In this case you don't get the error above, because the projection is None. But 
compare the two dataframes: the one from pyarrow with to_pandas() and the one 
from the output of pandas read_csv(). You will notice that the first one has 
not correctly parsed the null values ('\\N') in the last column, catname. In 
contrast, pandas read_csv parsed all the null values correctly.

{code:python}
Out[28]: 
   1082  991   16.5    200  2014-09-10  1  bar
0  1082  997   0.55  100.0  2014-09-10  1  bar
1  1082  998   7.95  200.0  2014-03-03  0   \N
2  1083  998  12.50    NaN         NaT  0  bar
3  1083  999   1.00    NaN         NaT  0  foo
4  1084  994  57.30  100.0  2014-12-20  1   \N
5  1084  995  22.20    NaN         NaT  0  foo
6  1084  998  48.60  200.0  2014-12-20  1  foo

{code}

Python code to test the attached CSV file for the bugs reported above:


{code:python}
from pyarrow import csv
import pyarrow as pa
import pandas as pd

file_location = 'spc_catalog.tsv'

sep = '\t'
nulls = ['\\N']

columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
column_names = None
column_types = None

skip_rows = None
nrecords = None

# Read with pyarrow and convert the result to a pandas DataFrame.
csv.read_csv(
    file_location,
    parse_options=csv.ParseOptions(delimiter=sep),
    convert_options=csv.ConvertOptions(
        include_columns=columns, column_types=column_types, null_values=nulls),
    read_options=csv.ReadOptions(
        skip_rows=skip_rows, autogenerate_column_names=False,
        use_threads=True, column_names=column_names),
).to_pandas()

# Read the same file with pandas for comparison.
pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns,
            nrows=nrecords, names=column_names, dtype=column_types)
{code}
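
Case 1 is consistent with the header line being among the skipped rows: pyarrow documents `skip_rows` as the number of rows to skip before the column names, so the row that follows the skipped ones is taken as the header. The sketch below only imitates that behaviour with the standard-library `csv` module; it does not call pyarrow, it uses made-up data with a shortened column list rather than the attached file, and `read_header_after_skip` is a hypothetical helper. This reading of the error is an assumption, not something confirmed in the report.

```python
import csv
import io

# Made-up, tab-separated sample; the real spc_catalog.tsv is not reproduced here.
SAMPLE = (
    "catcost\tcatname\n"
    "0.55\tbar\n"
    "7.95\t\\N\n"
    "12.50\tfoo\n"
)


def read_header_after_skip(text, skip_rows):
    """Skip `skip_rows` physical rows, then treat the next row as the header."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    for _ in range(skip_rows):
        next(rows)
    return next(rows)


print(read_header_after_skip(SAMPLE, 0))  # header line is still present
print(read_header_after_skip(SAMPLE, 1))  # header skipped; a data row stands in for it
```

If this is what happens, a name such as 'catcost' no longer appears among the column names once any rows are skipped. Passing `column_names` explicitly together with `skip_rows` would be one thing to try.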




[jira] [Created] (ARROW-7627) [C++][Gandiva] Optimize string truncate function

2020-01-21 Thread Projjal Chanda (Jira)

Projjal Chanda created ARROW-7627:
-

 Summary: [C++][Gandiva] Optimize string truncate function
 Key: ARROW-7627
 URL: https://issues.apache.org/jira/browse/ARROW-7627
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


The current string truncate function unnecessarily traverses the string twice. 
This can be done in one pass.
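
The issue does not show the current Gandiva code, so the following is only a generic illustration in Python of what a single-pass truncate over UTF-8 bytes can look like, assuming the goal is to keep at most a given number of characters. `truncate_utf8` is a hypothetical name, not a Gandiva function.

```python
def truncate_utf8(data: bytes, max_chars: int) -> bytes:
    """Return the prefix of `data` that holds at most `max_chars` code points."""
    if max_chars <= 0:
        return b""
    seen = 0
    for i, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == max_chars:
                # Start of the first code point past the limit: cut here.
                return data[:i]
            seen += 1
    # Fewer than or exactly max_chars code points: nothing to cut.
    return data


print(truncate_utf8("héllo".encode("utf-8"), 2).decode("utf-8"))
```

The loop counts code points and finds the cut position in the same traversal, so no separate length computation is needed beforehand.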




[jira] [Created] (ARROW-7626) [Parquet][GLib] Add support for version macros

2020-01-21 Thread Kouhei Sutou (Jira)

Kouhei Sutou created ARROW-7626:
---

 Summary: [Parquet][GLib] Add support for version macros
 Key: ARROW-7626
 URL: https://issues.apache.org/jira/browse/ARROW-7626
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






