[jira] [Commented] (ARROW-6168) [C++] IWYU docker-compose job is broken

2019-08-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902662#comment-16902662
 ] 

Wes McKinney commented on ARROW-6168:
-

I fixed this temporarily by adding an explicit build type:

https://github.com/apache/arrow/pull/5036/files#diff-60422e0e36ec191f5e2687ffb18b5796R25
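
For reference, the temporary fix amounts to passing an explicit `-DCMAKE_BUILD_TYPE` so that SetupCxxFlags.cmake no longer sees an empty build type. A sketch of the configure step from the log with the flag added (the exact value chosen in the PR may differ):

```shell
# Same configure step as in the failing log, with an explicit build type
# added so SetupCxxFlags.cmake does not fail on an empty CMAKE_BUILD_TYPE.
cmake -GNinja -DCMAKE_BUILD_TYPE=debug \
      -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 \
      -DCMAKE_EXPORT_COMPILE_COMMANDS=ON /arrow/cpp
```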


> [C++] IWYU docker-compose job is broken
> ---
>
> Key: ARROW-6168
> URL: https://issues.apache.org/jira/browse/ARROW-6168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Not sure what happened in the last week or so:
> {code}
> $ docker-compose run iwyu
> WARNING: The CI_ARROW_SHA variable is not set. Defaulting to a blank string.
> WARNING: The CI_ARROW_BRANCH variable is not set. Defaulting to a blank 
> string.
> + mkdir -p /build/lint
> + pushd /build/lint
> /build/lint /
> + cmake -GNinja -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON 
> -DARROW_PYTHON=ON -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 
> -DCMAKE_EXPORT_COMPILE_COMMANDS=ON /arrow/cpp
> -- Building using CMake version: 3.14.5
> -- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT')
> -- Arrow SO version: 100 (full: 100.0.0)
> -- clang-tidy found at /usr/bin/clang-tidy-7
> -- clang-format found at /usr/bin/clang-format-7
> -- infer not found
> -- Using ccache: /opt/conda/bin/ccache
> -- Found cpplint executable at /arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C /usr/bin/g++ -v
> -- Compiler version: Using built-in specs.
> COLLECT_GCC=/usr/bin/g++
> COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
> OFFLOAD_TARGET_NAMES=nvptx-none
> OFFLOAD_TARGET_DEFAULT=1
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu 
> 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs 
> --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr 
> --with-gcc-major-version-only --program-suffix=-7 
> --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id 
> --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix 
> --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
> --enable-libstdcxx-debug --enable-libstdcxx-time=yes 
> --with-default-libstdcxx-abi=new --enable-gnu-unique-object 
> --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie 
> --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto 
> --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 
> --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic 
> --enable-offload-targets=nvptx-none --without-cuda-driver 
> --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu 
> --target=x86_64-linux-gnu
> Thread model: posix
> gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) 
> -- Compiler id: GNU
> Selected compiler gcc 7.4.0
> CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):
>   string no output variable specified
> Call Stack (most recent call first):
>   CMakeLists.txt:357 (include)
> -- Arrow build warning level: CHECKIN
> Using ld linker
> Configured for  build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
> CMake Error at cmake_modules/SetupCxxFlags.cmake:429 (message):
>   Unknown build type:
> Call Stack (most recent call first):
>   CMakeLists.txt:357 (include)
> -- Configuring incomplete, errors occurred!
> See also "/build/lint/CMakeFiles/CMakeOutput.log".
> See also "/build/lint/CMakeFiles/CMakeError.log".
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902652#comment-16902652
 ] 

Wes McKinney commented on ARROW-3246:
-

OK, I was able to get the initial refactor done today. Now we need the plumbing 
to be able to write dictionary values and indices separately to 
{{DictEncoder}}.
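
For readers following along: dictionary encoding splits a column into its distinct values plus integer indices into them, which is the same shape as a pandas categorical. A minimal pure-Python sketch of the idea (illustrative only, not Arrow's actual {{DictEncoder}} API):

```python
def dictionary_encode(column):
    """Split a column into (dictionary, indices), like a pandas categorical."""
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

dictionary, indices = dictionary_encode(["a", "b", "a", "c", "b"])
# dictionary == ["a", "b", "c"], indices == [0, 1, 0, 2, 1]
```

Writing the two parts separately means the (small) dictionary is emitted once and only the integer indices are streamed per value.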

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)





[jira] [Updated] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6152:
--
Labels: pull-request-available  (was: )

> [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
> -
>
> Key: ARROW-6152
> URL: https://issues.apache.org/jira/browse/ARROW-6152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> This is an initial refactoring task to enable the Arrow write layer to access 
> some of the internal implementation details of 
> {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246





[jira] [Resolved] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-07 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6142.
-
Resolution: Fixed

Issue resolved by pull request 5027
[https://github.com/apache/arrow/pull/5027]

> [R] Install instructions on linux could be clearer
> --
>
> Key: ARROW-6142
> URL: https://issues.apache.org/jira/browse/ARROW-6142
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 19.04
>Reporter: Karl Dunkle Werner
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Installing R packages on Linux is almost always from source, which means 
> Arrow needs some system dependencies. The existing help message (from 
> arrow::install_arrow()) is very helpful in pointing that out, but it's still 
> a heavy lift for users who install R packages from source but don't plan to 
> develop Arrow itself.
> Here are a couple of things that could make things slightly smoother:
>  # I would be very grateful if the install_arrow() message or installation 
> page told me which libraries were essential to make the R package work.
>  # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on 
> launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" 
> instead of just "PPA" would have caused me less confusion. (Others may differ)
>  # A snap package would be easier than installing a new apt address, but I 
> understand that building for snap would be more packaging work and only 
> benefits Ubuntu users.
>  
> Thanks for making R bindings, and congratulations on the CRAN release!





[jira] [Updated] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6166:
--
Labels: pull-request-available  (was: )

> [Go] Slice of slice causes index out of range panic
> ---
>
> Key: ARROW-6166
> URL: https://issues.apache.org/jira/browse/ARROW-6166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Roshan Kumaraswamy
>Priority: Major
>  Labels: pull-request-available
>
> When slicing a slice, the offset of the underlying data will cause an index 
> out of range panic if the offset is greater than the slice length. See 
> [https://github.com/apache/arrow/issues/5033]





[jira] [Created] (ARROW-6168) [C++] IWYU docker-compose job is broken

2019-08-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6168:
---

 Summary: [C++] IWYU docker-compose job is broken
 Key: ARROW-6168
 URL: https://issues.apache.org/jira/browse/ARROW-6168
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


Not sure what happened in the last week or so:

{code}
$ docker-compose run iwyu
WARNING: The CI_ARROW_SHA variable is not set. Defaulting to a blank string.
WARNING: The CI_ARROW_BRANCH variable is not set. Defaulting to a blank string.
+ mkdir -p /build/lint
+ pushd /build/lint
/build/lint /
+ cmake -GNinja -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON 
-DARROW_PYTHON=ON -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON /arrow/cpp
-- Building using CMake version: 3.14.5
-- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT')
-- Arrow SO version: 100 (full: 100.0.0)
-- clang-tidy found at /usr/bin/clang-tidy-7
-- clang-format found at /usr/bin/clang-format-7
-- infer not found
-- Using ccache: /opt/conda/bin/ccache
-- Found cpplint executable at /arrow/cpp/build-support/cpplint.py
-- Compiler command: env LANG=C /usr/bin/g++ -v
-- Compiler version: Using built-in specs.
COLLECT_GCC=/usr/bin/g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 
7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs 
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr 
--with-gcc-major-version-only --program-suffix=-7 
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id 
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix 
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes 
--with-default-libstdcxx-abi=new --enable-gnu-unique-object 
--disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie 
--with-system-zlib --with-target-system-zlib --enable-objc-gc=auto 
--enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic 
--enable-offload-targets=nvptx-none --without-cuda-driver 
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu 
--target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) 

-- Compiler id: GNU
Selected compiler gcc 7.4.0
CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):
  string no output variable specified
Call Stack (most recent call first):
  CMakeLists.txt:357 (include)


-- Arrow build warning level: CHECKIN
Using ld linker
Configured for  build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
CMake Error at cmake_modules/SetupCxxFlags.cmake:429 (message):
  Unknown build type:
Call Stack (most recent call first):
  CMakeLists.txt:357 (include)


-- Configuring incomplete, errors occurred!
See also "/build/lint/CMakeFiles/CMakeOutput.log".
See also "/build/lint/CMakeFiles/CMakeError.log".
{code}





[jira] [Updated] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6167:
--
Labels: pull-request-available  (was: )

> [R] macOS binary R packages on CRAN don't have arrow_available
> --
>
> Key: ARROW-6167
> URL: https://issues.apache.org/jira/browse/ARROW-6167
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: pull-request-available
>
> The {{configure}} script in the R package has some 
> [magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that 
> should ensure that on macOS, you're guaranteed a successful library 
> installation even (especially) if you don't have libarrow installed on your 
> system. This magic also is designed so that when CRAN builds a binary package 
> for macOS, the C++ libraries are bundled and "just work" when a user installs 
> it, no compilation required. 
> However, the magic appeared to fail on CRAN this time, as the binaries linked 
> on [https://cran.r-project.org/web/packages/arrow/index.html] were built 
> without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). 
> I've identified three vectors by which you can get an arrow package 
> installation on macOS in this state:
>  # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to 
> see if you've already installed {{apache-arrow}} via Homebrew always passes, 
> so if you have Homebrew installed but haven't done {{brew install 
> apache-arrow}}, the script won't do it for you as it appears to intend. 
> (This is not suspected to be the problem on CRAN because they don't have 
> Homebrew installed.)
>  # If the 
> "[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]" 
> installation fails, then the [test on 
> L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will 
> correctly fail. I managed to trigger this (by luck?) on the [R-hub testing 
> service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533].
>  This is possibly what happened on CRAN, though the only [build 
> logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html]
>  we have from CRAN are terse because it believes the build was successful. 
>  # Some idiosyncrasy in the compiler on the CRAN macOS system such that the 
> autobrew script would successfully download the arrow libraries but the L102 
> check would error. I've been unable to reproduce this using the [version of 
> clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/].
> I have a fix for the first one and will provide workaround documentation for 
> the README and announcement blog post. Unfortunately, I don't know that 
> there's anything we can do about the useless binaries on CRAN at this time, 
> particularly since CRAN is going down for maintenance August 9-18.
> cc [~jeroenooms] [~romainfrancois] [~wesmckinn]





[jira] [Created] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available

2019-08-07 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6167:
--

 Summary: [R] macOS binary R packages on CRAN don't have 
arrow_available
 Key: ARROW-6167
 URL: https://issues.apache.org/jira/browse/ARROW-6167
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.1
Reporter: Neal Richardson
Assignee: Neal Richardson


The {{configure}} script in the R package has some 
[magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that 
should ensure that on macOS, you're guaranteed a successful library 
installation even (especially) if you don't have libarrow installed on your 
system. This magic also is designed so that when CRAN builds a binary package 
for macOS, the C++ libraries are bundled and "just work" when a user installs 
it, no compilation required. 

However, the magic appeared to fail on CRAN this time, as the binaries linked 
on [https://cran.r-project.org/web/packages/arrow/index.html] were built 
without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). 

I've identified three vectors by which you can get an arrow package 
installation on macOS in this state:
 # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to 
see if you've already installed {{apache-arrow}} via Homebrew always passes, so 
if you have Homebrew installed but haven't done {{brew install apache-arrow}}, 
the script won't do it for you as it appears to intend. (This is not 
suspected to be the problem on CRAN because they don't have Homebrew installed.)
 # If the 
"[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]" 
installation fails, then the [test on 
L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will 
correctly fail. I managed to trigger this (by luck?) on the [R-hub testing 
service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533].
 This is possibly what happened on CRAN, though the only [build 
logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html]
 we have from CRAN are terse because it believes the build was successful. 
 # Some idiosyncrasy in the compiler on the CRAN macOS system such that the 
autobrew script would successfully download the arrow libraries but the L102 
check would error. I've been unable to reproduce this using the [version of 
clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/].

I have a fix for the first one and will provide workaround documentation for 
the README and announcement blog post. Unfortunately, I don't know that there's 
anything we can do about the useless binaries on CRAN at this time, 
particularly since CRAN is going down for maintenance August 9-18.

cc [~jeroenooms] [~romainfrancois] [~wesmckinn]





[jira] [Updated] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-07 Thread Roshan Kumaraswamy (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roshan Kumaraswamy updated ARROW-6166:
--
External issue URL: https://github.com/apache/arrow/issues/5033

> [Go] Slice of slice causes index out of range panic
> ---
>
> Key: ARROW-6166
> URL: https://issues.apache.org/jira/browse/ARROW-6166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Roshan Kumaraswamy
>Priority: Major
>
> When slicing a slice, the offset of the underlying data will cause an index 
> out of range panic if the offset is greater than the slice length. See 
> [https://github.com/apache/arrow/issues/5033]





[jira] [Created] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-07 Thread Roshan Kumaraswamy (JIRA)
Roshan Kumaraswamy created ARROW-6166:
-

 Summary: [Go] Slice of slice causes index out of range panic
 Key: ARROW-6166
 URL: https://issues.apache.org/jira/browse/ARROW-6166
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Roshan Kumaraswamy


When slicing a slice, the offset of the underlying data will cause an index out 
of range panic if the offset is greater than the slice length. See 
[https://github.com/apache/arrow/issues/5033]
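
The arithmetic involved is easier to see schematically. An Arrow slice is a view carrying an offset into shared underlying data, and a slice of a slice must accumulate the parent's offset; bounds-checking against the slice length alone, without that accumulated offset, is what lets an index run out of range. A hypothetical toy sketch of the correct offset handling (not the Go implementation):

```python
class Slice:
    """Toy view over shared data: (data, offset, length), like an Arrow array slice."""
    def __init__(self, data, offset, length):
        self.data, self.offset, self.length = data, offset, length

    def slice(self, start, length):
        # A slice of a slice must add the parent's offset into the shared buffer.
        return Slice(self.data, self.offset + start, length)

    def value(self, i):
        # Indexing goes through the accumulated offset, not the local index alone.
        return self.data[self.offset + i]

data = list(range(10))
outer = Slice(data, offset=4, length=6)   # views elements 4..9
inner = outer.slice(2, 3)                 # views elements 6..8
# inner.value(0) == 6
```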





[jira] [Assigned] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter

2019-08-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6152:
---

Assignee: Wes McKinney

> [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
> -
>
> Key: ARROW-6152
> URL: https://issues.apache.org/jira/browse/ARROW-6152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This is an initial refactoring task to enable the Arrow write layer to access 
> some of the internal implementation details of 
> {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246





[jira] [Resolved] (ARROW-6039) [GLib] Add garrow_array_filter()

2019-08-07 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6039.
-
Resolution: Fixed

Issue resolved by pull request 5025
[https://github.com/apache/arrow/pull/5025]

> [GLib] Add garrow_array_filter()
> 
>
> Key: ARROW-6039
> URL: https://issues.apache.org/jira/browse/ARROW-6039
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Add bindings of a boolean selection filter.





[jira] [Commented] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores

2019-08-07 Thread lidavidm (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902399#comment-16902399
 ] 

lidavidm commented on ARROW-6165:
-

We'll also have to find free ports for the Flight tests, as right now they 
assume a hardcoded port. (Not hard to do, fortunately.)
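
Finding a free port is indeed straightforward: bind to port 0 and let the OS pick one. A sketch with the standard library (not the actual integration-test harness):

```python
import socket

def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]

port = find_free_port()
```

Note the small race: the port is released when the socket closes, so another process could grab it before the Flight server does; in practice each worker picking its own port this way is usually good enough.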

> [Integration] Use multiprocessing to run integration tests on multiple CPU 
> cores
> 
>
> Key: ARROW-6165
> URL: https://issues.apache.org/jira/browse/ARROW-6165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Wes McKinney
>Priority: Major
>
> The stdout/stderr will have to be captured appropriate so that the console 
> output when run in parallel is still readable





[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902398#comment-16902398
 ] 

Wes McKinney commented on ARROW-5559:
-

I agree it's weird
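
For context, the options-struct pattern being proposed: rather than adding a new parameter to every public API each time a knob appears, functions take a single options object with defaults, so new fields can be added without breaking signatures. A hypothetical sketch of the shape (not the actual C++ IpcOptions):

```python
from dataclasses import dataclass

@dataclass
class IpcWriteOptions:
    """All IPC knobs in one place; new fields can be added later
    without changing any function signature."""
    allow_64bit: bool = False
    max_recursion_depth: int = 64

def write_message(payload: bytes,
                  options: IpcWriteOptions = IpcWriteOptions()) -> int:
    # Hypothetical writer: validates against the options, returns bytes written.
    if not options.allow_64bit and len(payload) >= 2**31:
        raise ValueError("payload too large without allow_64bit")
    return len(payload)

n = write_message(b"abc")                                      # defaults
n64 = write_message(b"abc", IpcWriteOptions(allow_64bit=True))
```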

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Created] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores

2019-08-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6165:
---

 Summary: [Integration] Use multiprocessing to run integration 
tests on multiple CPU cores
 Key: ARROW-6165
 URL: https://issues.apache.org/jira/browse/ARROW-6165
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Wes McKinney


The stdout/stderr will have to be captured appropriately so that the console 
output when run in parallel is still readable
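
One way to keep parallel output readable is to run each test case in a worker process, capture its output as a string, and print it from the parent once the case finishes. A sketch with the standard library (hypothetical case runner, not the actual integration harness):

```python
import io
import multiprocessing
from contextlib import redirect_stdout

def run_case(name):
    """Run one test case, returning (name, captured console output)."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        print(f"running {name}")
        print(f"{name}: OK")
    return name, buf.getvalue()

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        for name, output in pool.imap_unordered(run_case, ["cpp-java", "cpp-js"]):
            # Print each case's output as one block so parallel runs don't interleave.
            print(output, end="")
```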





[jira] [Closed] (ARROW-6059) [Python] Regression memory issue when calling pandas.read_parquet

2019-08-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6059.
---
   Resolution: Duplicate
Fix Version/s: 0.15.0

This should be resolved with the fix for ARROW-6060. If you can verify from 
master, that would be helpful. If you run into more issues, please reopen an issue.

> [Python] Regression memory issue when calling pandas.read_parquet
> -
>
> Key: ARROW-6059
> URL: https://issues.apache.org/jira/browse/ARROW-6059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Francisco Sanchez
>Priority: Major
> Fix For: 0.15.0
>
> Attachments: Memory_profile_0.13.png, Memory_profile_0.13_rs.png, 
> Memory_profile_0.14.1_use_thread_FALSE.png, 
> Memory_profile_0.14.1_use_thread_false_rs.png, 
> Memory_profile_0.14.1_use_thread_true.png
>
>
> I have a ~3MB parquet file with the next schema:
> {code:java}
> bag_stamp: timestamp[ns]
> transforms_[]_.header.seq: list<item: int64>
>   child 0, item: int64
> transforms_[]_.header.stamp: list<item: timestamp[ns]>
>   child 0, item: timestamp[ns]
> transforms_[]_.header.frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.child_frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.transform.translation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.w: list<item: double>
>   child 0, item: double
>  If I read it with *pandas.read_parquet()* using pyarrow 0.13.0 all seems 
> fine and it takes no time to load. If I try the same with 0.14.0 or 0.14.1 it 
> takes a lot of time and uses ~10GB of RAM. Many times if I don't have enough 
> available memory it will just be killed OOM. Now, if I use the next code 
> snippet instead it works perfectly with all the versions:
> {code}
> parquet_file = pq.ParquetFile(input_file)
> tables = []
> for row_group in range(parquet_file.num_row_groups):
>     tables.append(parquet_file.read_row_group(row_group, columns=columns,
>                                               use_pandas_metadata=True))
> df = pa.concat_tables(tables).to_pandas()
> {code}





[jira] [Updated] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5559:
--
Labels: pull-request-available  (was: )

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Created] (ARROW-6164) [Docs][Format] Document project versioning schema and forward/backward compatibility policies

2019-08-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6164:
---

 Summary: [Docs][Format] Document project versioning schema and 
forward/backward compatibility policies
 Key: ARROW-6164
 URL: https://issues.apache.org/jira/browse/ARROW-6164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


Based on policy adopted via vote on mailing list





[jira] [Created] (ARROW-6163) [C++] Misnamed test

2019-08-07 Thread Dmitry Kalinkin (JIRA)
Dmitry Kalinkin created ARROW-6163:
--

 Summary: [C++] Misnamed test
 Key: ARROW-6163
 URL: https://issues.apache.org/jira/browse/ARROW-6163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.14.1
Reporter: Dmitry Kalinkin


"arrow-dataset-file_test" is defined in
https://github.com/apache/arrow/blob/49badd25804af85dfe9019ab1390c649a02c89fa/cpp/src/arrow/dataset/CMakeLists.txt#L49
but the existing naming convention seems to be "foo-bar-test", not 
"foo-bar_test". The test needs to be renamed.





[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902341#comment-16902341
 ] 

Wes McKinney commented on ARROW-3246:
-

Writing BYTE_ARRAY can also definitely be made more efficient. See logic at

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L858

The dictionary page size issue is usually handled through the WriterProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L178

If the dictionary is written all at once, then this property can be 
circumvented; that would be my plan.
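
To illustrate the interaction: a streaming dictionary encoder grows its dictionary as values arrive and falls back once the dictionary page exceeds the configured size limit, whereas a column that arrives already dictionary-encoded can hand its dictionary over whole, skipping that mid-stream check. A toy model of the control flow (hypothetical names, not the parquet-cpp API):

```python
class ToyDictEncoder:
    """Toy model of a dictionary encoder with a dictionary page-size limit."""
    def __init__(self, dict_pagesize_limit=1024 * 1024):
        self.dict_pagesize_limit = dict_pagesize_limit
        self.dictionary = []
        self.fell_back = False

    def put(self, value):
        # Streaming path: grow the dictionary, fall back when it gets too big.
        if value not in self.dictionary:
            self.dictionary.append(value)
        if sum(len(v) for v in self.dictionary) > self.dict_pagesize_limit:
            self.fell_back = True

    def put_dictionary(self, dictionary):
        # Preset path: the full dictionary is supplied up front, so the
        # size limit never triggers a mid-stream fallback.
        self.dictionary = list(dictionary)

streaming = ToyDictEncoder(dict_pagesize_limit=3)
streaming.put("ab")
streaming.put("cd")       # dictionary now 4 bytes > limit: falls back

preset = ToyDictEncoder(dict_pagesize_limit=3)
preset.put_dictionary(["ab", "cd"])   # no fallback
```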

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)





[jira] [Created] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-07 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6162:
---

 Summary: [C++][Gandiva] Do not truncate string in 
castVARCHAR_varchar when out_len parameter is zero
 Key: ARROW-6162
 URL: https://issues.apache.org/jira/browse/ARROW-6162
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla








[jira] [Updated] (ARROW-6144) [C++][Gandiva] Implement random function in Gandiva

2019-08-07 Thread Prudhvi Porandla (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-6144:

Summary: [C++][Gandiva] Implement random function in Gandiva  (was: 
Implement random function in Gandiva)

> [C++][Gandiva] Implement random function in Gandiva
> ---
>
> Key: ARROW-6144
> URL: https://issues.apache.org/jira/browse/ARROW-6144
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement random(), random(int seed) functions





[jira] [Assigned] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5559:
-

Assignee: Antoine Pitrou

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6161:
--
Labels: datasets  (was: )

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: datasets
>
> This is the first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial: no parallelism, no schema adaptation.





[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6161:
--
Component/s: C++

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> This is the first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial: no parallelism, no schema adaptation.





[jira] [Assigned] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6161:
-

Assignee: Francois Saint-Jacques

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> This is the first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial: no parallelism, no schema adaptation.





[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6161:
--
Description: This is the first baby step in supporting datasets. The initial 
implementation will be minimal and trivial: no parallelism, no schema adaptation.

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This is the first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial: no parallelism, no schema adaptation.





[jira] [Created] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6161:
-

 Summary: [C++] Implements dataset::ParquetFile and associated Scan 
structures
 Key: ARROW-6161
 URL: https://issues.apache.org/jira/browse/ARROW-6161
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques








[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902176#comment-16902176
 ] 

Joris Van den Bossche commented on ARROW-5610:
--

> But which also means that you lose all information about the extension type 
> defined elsewhere (as discussed above).

Correcting myself: this is not fully true. The _type_ is no longer an extension 
type (but the storage type), but the _field_ in the schema still has the 
metadata. 

For example, reading an IPC file with Python for a non-registered type 
(created from C++ where the 'ext' column was a Uuid type as defined in the 
tests):

{code}
In [31]: f_ext = pa.ipc.open_stream("repos/arrow/cpp/build/examples/arrow/arrow-example-ipc-extension.arrow")

In [32]: table = f_ext.read_all()

In [33]: table
Out[33]:
pyarrow.Table
int: int64
ext: int64

In [35]: table.schema.field_by_name('ext')
Out[35]: pyarrow.Field<ext: int64>

In [36]: table.schema.field_by_name('ext').metadata
Out[36]:
{b'ARROW:extension:metadata': b'uuid-type-unique-code',
 b'ARROW:extension:name': b'uuid'}
{code}
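The two `ARROW:extension:*` metadata keys shown above are what a reader could use to reconstruct an extension type. A hypothetical pure-Python sketch of such a lookup (the registry and function names here are illustrative, not the pyarrow API):

```python
# Illustrative registry mapping extension names to deserializers.
registry = {}

def register_extension(name, deserializer):
    registry[name] = deserializer

def resolve_field(metadata, storage_type):
    # If the extension name is registered, let its deserializer rebuild
    # the type from the serialized metadata; otherwise fall back to the
    # storage type (the current pyarrow behaviour described above).
    name = metadata.get(b"ARROW:extension:name")
    if name in registry:
        serialized = metadata.get(b"ARROW:extension:metadata")
        return registry[name](serialized, storage_type)
    return storage_type

register_extension(b"uuid", lambda meta, storage: ("uuid", meta, storage))

metadata = {b"ARROW:extension:metadata": b"uuid-type-unique-code",
            b"ARROW:extension:name": b"uuid"}
print(resolve_field(metadata, "int64"))  # ('uuid', b'uuid-type-unique-code', 'int64')
print(resolve_field({}, "int64"))        # int64
```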



> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 





[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902129#comment-16902129
 ] 

Joris Van den Bossche commented on ARROW-5610:
--

[~lidavidm] no apologies needed, it was just a question to check the status :)

> > You want to transfer a table containing a column of that type to and from 
> > Python. Right now, you can read that data from Python, but you can't create 
> > a table with that type
>
> I'm curious, which error do you get when trying to do so?

I tried this out: if you have an IPC message that contains an extension 
type unknown to Python / C++ and you read it into a (pyarrow) Table, you 
don't get an error at the moment; it falls back to the storage type. But 
that also means that you lose all information about the extension type 
defined elsewhere (as discussed above).

To me, it seems that we would need some way to have an "unknown extension type" 
in C++ that can have arbitrary name and metadata to be able to receive such 
data.

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 





[jira] [Updated] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6160:
--
Labels: pull-request-available  (was: )

> [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex 
> child vectors
> 
>
> Key: ARROW-6160
> URL: https://issues.apache.org/jira/browse/ARROW-6160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>
> Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct type 
> child vectors will recursively get primitive vectors, other complex type like 
> {{ListVector}}, {{UnionVector}} was treated as primitive type and return 
> directly.
> For example, Struct(List(Int), Struct(Int, Varchar)) {{getPrimitiveVectors}} 
> should return {{[IntVector, IntVector, VarCharVector]}} instead of 
> [ListVector, IntVector, VarCharVector]





[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-07 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6160:
-

 Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to 
work with complex child vectors
 Key: ARROW-6160
 URL: https://issues.apache.org/jira/browse/ARROW-6160
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct type 
child vectors will recursively get primitive vectors, other complex type like 
{{ListVector}}, {{UnionVector}} was treated as primitive type and return 
directly.

For example, Struct(List(Int), Struct(Int, Varchar)) {{getPrimitiveVectors}} 
should return {{[IntVector, IntVector, VarCharVector]}} instead of [ListVector, 
IntVector, VarCharVector]
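The fix being described is to recurse into *all* nested container vectors (struct, list, union), not only struct, collecting the primitive leaves. A language-agnostic sketch, using plain tuples as stand-ins for Arrow Java vectors:

```python
def get_primitive_vectors(vector):
    # vector is (kind, children); children is None for primitives.
    kind, children = vector
    if kind in ("struct", "list", "union"):  # any container type: recurse
        leaves = []
        for child in children:
            leaves.extend(get_primitive_vectors(child))
        return leaves
    return [vector]                          # primitive leaf: return as-is

int_v = ("int", None)
varchar_v = ("varchar", None)
# Struct(List(Int), Struct(Int, Varchar)), the example from the issue:
nested = ("struct", [("list", [int_v]), ("struct", [int_v, varchar_v])])
print([v[0] for v in get_primitive_vectors(nested)])  # ['int', 'int', 'varchar']
```

Without the `"list"`/`"union"` cases in the container check, the list vector would be returned directly, reproducing the reported bug.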





[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902122#comment-16902122
 ] 

Antoine Pitrou commented on ARROW-5559:
---

Also {{RecordBatchWriter::WriteTable}} hardcodes {{allow_64bit = true}} when 
calling {{WriteRecordBatch}}, which is weird.

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902118#comment-16902118
 ] 

Antoine Pitrou commented on ARROW-5559:
---

Question: why is {{allow_64bit}} passed to 
{{RecordBatchWriter::WriteRecordBatch}} rather than at construction time?

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Comment Edited] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902118#comment-16902118
 ] 

Antoine Pitrou edited comment on ARROW-5559 at 8/7/19 2:46 PM:
---

Question: why is {{allow_64bit}} passed to 
{{RecordBatchWriter::WriteRecordBatch}} rather than at construction time?

Since that method is supposed to be overridden by implementors, it makes it a 
bit delicate to change its signature... Though better to do it before 1.0.0.
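The options-struct pattern proposed in this issue can be sketched as follows: bundle the IPC options into one object passed at construction, so that adding a new option later does not change any method signature. Names here are illustrative, not the C++ API:

```python
from dataclasses import dataclass

@dataclass
class IpcOptions:
    # New options can be added here without touching any write method.
    allow_64bit: bool = False

class RecordBatchWriter:
    def __init__(self, options=None):
        self.options = options or IpcOptions()

    def write_record_batch(self, batch):
        # No per-call allow_64bit flag: the option travels with the writer.
        if len(batch) > 2**31 - 1 and not self.options.allow_64bit:
            raise ValueError("batch too large; enable allow_64bit")
        return len(batch)

writer = RecordBatchWriter(IpcOptions(allow_64bit=True))
print(writer.write_record_batch([1, 2, 3]))  # 3
```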


was (Author: pitrou):
Question: why is {{allow_64bit}} passed to 
{{RecordBatchWriter::WriteRecordBatch}} rather than at construction time?

> [C++] Introduce IpcOptions struct object for better API-stability when adding 
> new options
> -
>
> Key: ARROW-5559
> URL: https://issues.apache.org/jira/browse/ARROW-5559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Related to ARROW-2006. There are various IPC-related options like allowing 
> 64-bit lengths that might be better encapsulated in an options struct rather 
> than littered around different public APIs





[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6159:
-
Labels: beginner  (was: )

> [C++] PrettyPrint of arrow::Schema missing indentation for first line
> 
>
> Key: ARROW-6159
> URL: https://issues.apache.org/jira/browse/ARROW-6159
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: beginner
>
> Minor issue, but I noticed when printing a Schema with indentation, like:
> {code}
>   std::shared_ptr<arrow::Field> field1 = arrow::field("column1", 
> arrow::int32());
>   std::shared_ptr<arrow::Field> field2 = arrow::field("column2", 
> arrow::utf8());
>   std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});
>   arrow::PrettyPrintOptions options{4};
>   arrow::PrettyPrint(*schema, options, &std::cout);
> {code}
> you get 
> {code}
> column1: int32
> column2: string
> {code}
> so not applying the indent for the first line.





[jira] [Created] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6159:


 Summary: [C++] PrettyPrint of arrow::Schema missing indentation for 
first line
 Key: ARROW-6159
 URL: https://issues.apache.org/jira/browse/ARROW-6159
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1
Reporter: Joris Van den Bossche


Minor issue, but I noticed when printing a Schema with indentation, like:

{code}
  std::shared_ptr<arrow::Field> field1 = arrow::field("column1", arrow::int32());
  std::shared_ptr<arrow::Field> field2 = arrow::field("column2", arrow::utf8());

  std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});

  arrow::PrettyPrintOptions options{4};
  arrow::PrettyPrint(*schema, options, &std::cout);
{code}

you get 

{code}
column1: int32
column2: string
{code}

so not applying the indent for the first line.
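The bug amounts to applying the indent before every line except the first. A small Python sketch of the buggy behaviour next to the expected one (using `textwrap.indent`, which prefixes every line):

```python
import textwrap

def pretty_print_buggy(schema_lines, indent):
    # Reproduces the reported bug: indent skipped for the first line.
    out = schema_lines[0] + "\n"
    for line in schema_lines[1:]:
        out += " " * indent + line + "\n"
    return out

def pretty_print_fixed(schema_lines, indent):
    # Expected behaviour: every line, including the first, is indented.
    return textwrap.indent("\n".join(schema_lines) + "\n", " " * indent)

lines = ["column1: int32", "column2: string"]
print(pretty_print_buggy(lines, 4))
print(pretty_print_fixed(lines, 4))
```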





[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6159:
-
Labels:   (was: first)

> [C++] PrettyPrint of arrow::Schema missing indentation for first line
> 
>
> Key: ARROW-6159
> URL: https://issues.apache.org/jira/browse/ARROW-6159
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Minor issue, but I noticed when printing a Schema with indentation, like:
> {code}
>   std::shared_ptr<arrow::Field> field1 = arrow::field("column1", 
> arrow::int32());
>   std::shared_ptr<arrow::Field> field2 = arrow::field("column2", 
> arrow::utf8());
>   std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});
>   arrow::PrettyPrintOptions options{4};
>   arrow::PrettyPrint(*schema, options, &std::cout);
> {code}
> you get 
> {code}
> column1: int32
> column2: string
> {code}
> so not applying the indent for the first line.





[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6159:
-
Labels: first  (was: )

> [C++] PrettyPrint of arrow::Schema missing indentation for first line
> 
>
> Key: ARROW-6159
> URL: https://issues.apache.org/jira/browse/ARROW-6159
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: first
>
> Minor issue, but I noticed when printing a Schema with indentation, like:
> {code}
>   std::shared_ptr<arrow::Field> field1 = arrow::field("column1", 
> arrow::int32());
>   std::shared_ptr<arrow::Field> field2 = arrow::field("column2", 
> arrow::utf8());
>   std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});
>   arrow::PrettyPrintOptions options{4};
>   arrow::PrettyPrint(*schema, options, &std::cout);
> {code}
> you get 
> {code}
> column1: int32
> column2: string
> {code}
> so not applying the indent for the first line.





[jira] [Comment Edited] (ARROW-6154) [Rust] Too many open files (os error 24)

2019-08-07 Thread Yesh (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902001#comment-16902001
 ] 

Yesh edited comment on ARROW-6154 at 8/7/19 11:36 AM:
--

Thanks for the ack. Below is the error message. An additional data point: it 
is able to dump the schema via parquet-schema.
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: 
General("underlying IO error: Too many open files (os error 24)")', 
src/libcore/result.rs:1084:5{code}


was (Author: madras):
Thanks for ack. Here is the error message.  
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: 
General("underlying IO error: Too many open files (os error 24)")', 
src/libcore/result.rs:1084:5{code}

> [Rust] Too many open files (os error 24)
> 
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used the [Rust] parquet-read binary to read a deeply nested parquet file and 
> saw the below stack trace. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}





[jira] [Commented] (ARROW-6154) [Rust] Too many open files (os error 24)

2019-08-07 Thread Yesh (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902001#comment-16902001
 ] 

Yesh commented on ARROW-6154:
-

Thanks for the ack. Here is the error message.
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: 
General("underlying IO error: Too many open files (os error 24)")', 
src/libcore/result.rs:1084:5{code}
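For context, "Too many open files (os error 24)" is EMFILE: the process hit its open-file-descriptor limit. On Unix that limit can be inspected (and raised up to the hard limit) with Python's `resource` module; a reader that opens one descriptor per column chunk of a deeply nested file can exhaust it quickly:

```python
import resource

# Soft limit is what the process is currently held to; hard limit is
# the ceiling the soft limit may be raised to without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Possible workaround while the reader holds many files open:
# resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```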

> [Rust] Too many open files (os error 24)
> 
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used the [Rust] parquet-read binary to read a deeply nested parquet file and 
> saw the below stack trace. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}





[jira] [Commented] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901993#comment-16901993
 ] 

Antoine Pitrou commented on ARROW-6158:
---

Validation should really catch this.

> [Python] possible to create StructArray with type that conflicts with child 
> array's types
> -
>
> Key: ARROW-6158
> URL: https://issues.apache.org/jira/browse/ARROW-6158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Using the Python interface as example. This creates a {{StructArray}} where 
> the field types don't match the child array types:
> {code}
> a = pa.array([1, 2, 3], type=pa.int64())
> b = pa.array(['a', 'b', 'c'], type=pa.string())
> inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]
> a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) 
> {code}
> The above works fine. I didn't find anything that errors (eg conversion to 
> pandas, slicing), also validation passes, but the type actually has the 
> inconsistent child types:
> {code}
> In [2]: a
> Out[2]: 
> 
> -- is_valid: all not null
> -- child 0 type: int64
>   [
> 1,
> 2,
> 3
>   ]
> -- child 1 type: string
>   [
> "a",
> "b",
> "c"
>   ]
> In [3]: a.type
> Out[3]: StructType(struct<a: int32, b: double>)
> In [4]: a.to_pandas()
> Out[4]: 
> array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
>   dtype=object)
> In [5]: a.validate() 
> {code}
> Shouldn't this be disallowed somehow? (it could be checked in the Python 
> {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already 
> checks for the number of fields vs arrays and a consistent array length). 
> Similarly to the discussion in ARROW-6132, I would also expect that 
> {{ValidateArray}} catches this.





[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-07 Thread Hatem Helal (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901988#comment-16901988
 ] 

Hatem Helal commented on ARROW-3246:


Adding {{TypedColumnWriter::WriteArrow(const ::arrow::Array&)}} makes a lot 
of sense to me. [~wesmckinn] do you have a list of cases that you know can be 
optimized? The main one I'm aware of is the [dictionary 
array|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1079]
 case, but I'm curious whether there are other Arrow types that could be 
handled more efficiently.

As an aside, has it ever been considered to automatically tune the size of the 
dictionary page? I think for the limited case of writing 
{{arrow::DictionaryArray}} we might want to ensure that the encoder doesn't 
fall back to plain encoding. That could be handled as a separate feature.

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)





[jira] [Commented] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901978#comment-16901978
 ] 

Joris Van den Bossche commented on ARROW-6158:
--

Found an example where it actually goes wrong: after taking a subset with 
{{Take}}, the child values are reinterpreted with the declared (wrong) types.

{code}
In [6]: subset = a.take(pa.array([0, 2]))  

In [7]: subset 
Out[7]: 

-- is_valid: all not null
-- child 0 type: int32
  [
1,
2
  ]
-- child 1 type: double
  [
2.122e-314,
0
  ]

In [8]: subset.validate() 

In [9]: subset.to_pandas()
Out[9]: array([{'a': 1, 'b': 2.121995791e-314}, {'a': 2, 'b': 0.0}], 
dtype=object)
{code}


> [Python] possible to create StructArray with type that conflicts with child 
> array's types
> -
>
> Key: ARROW-6158
> URL: https://issues.apache.org/jira/browse/ARROW-6158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Using the Python interface as example. This creates a {{StructArray}} where 
> the field types don't match the child array types:
> {code}
> a = pa.array([1, 2, 3], type=pa.int64())
> b = pa.array(['a', 'b', 'c'], type=pa.string())
> inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]
> a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) 
> {code}
> The above works fine. I didn't find anything that errors (eg conversion to 
> pandas, slicing), also validation passes, but the type actually has the 
> inconsistent child types:
> {code}
> In [2]: a
> Out[2]: 
> 
> -- is_valid: all not null
> -- child 0 type: int64
>   [
> 1,
> 2,
> 3
>   ]
> -- child 1 type: string
>   [
> "a",
> "b",
> "c"
>   ]
> In [3]: a.type
> Out[3]: StructType(struct<a: int32, b: double>)
> In [4]: a.to_pandas()
> Out[4]: 
> array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
>   dtype=object)
> In [5]: a.validate() 
> {code}
> Shouldn't this be disallowed somehow? (It could be checked in the Python 
> {{from_arrays}} method, but maybe also in {{StructArray::Make}}, which already 
> checks the number of fields vs arrays and a consistent array length.) 
> Similar to the discussion in ARROW-6132, I would also expect 
> {{ValidateArray}} to catch this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6158:


 Summary: [Python] possible to create StructArray with type that 
conflicts with child array's types
 Key: ARROW-6158
 URL: https://issues.apache.org/jira/browse/ARROW-6158
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Using the Python interface as an example. This creates a {{StructArray}} where the 
field types don't match the child array types:

{code}
a = pa.array([1, 2, 3], type=pa.int64())
b = pa.array(['a', 'b', 'c'], type=pa.string())
inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]

a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) 
{code}

The above works fine. I didn't find anything that errors (e.g. conversion to 
pandas, slicing), and validation passes as well, but the type actually has the 
inconsistent child types:

{code}
In [2]: a
Out[2]: 

-- is_valid: all not null
-- child 0 type: int64
  [
1,
2,
3
  ]
-- child 1 type: string
  [
"a",
"b",
"c"
  ]

In [3]: a.type
Out[3]: StructType(struct<a: int32, b: double>)

In [4]: a.to_pandas()
Out[4]: 
array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
  dtype=object)

In [5]: a.validate() 
{code}

Shouldn't this be disallowed somehow? (It could be checked in the Python 
{{from_arrays}} method, but maybe also in {{StructArray::Make}}, which already 
checks the number of fields vs arrays and a consistent array length.) 

Similar to the discussion in ARROW-6132, I would also expect 
{{ValidateArray}} to catch this.
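The consistency check suggested above can be sketched in plain Python. This is a hypothetical helper, not pyarrow's API; type strings stand in for Arrow DataType objects, and the real check would belong in {{from_arrays}} or {{StructArray::Make}}:

```python
def check_struct_types(field_types, child_types):
    """Reject a struct whose declared field types don't match the
    actual child array types (sketch of the proposed check)."""
    if len(field_types) != len(child_types):
        raise ValueError("number of fields and child arrays differ")
    for i, (field, child) in enumerate(zip(field_types, child_types)):
        if field != child:
            raise ValueError(
                "field %d declared as %s but child array is %s"
                % (i, field, child))

# A consistent declaration passes silently:
check_struct_types(["int64", "string"], ["int64", "string"])
```

With the fields from the report ({{int32}}/{{double}} declared against {{int64}}/{{string}} children) this raises instead of silently producing an inconsistent type.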





[jira] [Comment Edited] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968
 ] 

Antoine Pitrou edited comment on ARROW-6132 at 8/7/19 10:46 AM:


My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

Edit: oh, you're right about {{ValidateArray(ListArray)}}...


was (Author: pitrou):
My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 
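The offset invariants quoted above (first offset 0, last offset equal to the values length, offsets non-decreasing) can be sketched in plain Python. This models the check only; it is not pyarrow's actual implementation:

```python
def validate_list_offsets(offsets, values_length):
    # "The first value in the offsets array is 0" (layout docs).
    if not offsets or offsets[0] != 0:
        raise ValueError("first offset must be 0, got %r" % offsets[:1])
    # "the last element is the length of the values array."
    if offsets[-1] != values_length:
        raise ValueError("Final offset invariant not equal to values "
                         "length: %d!=%d" % (offsets[-1], values_length))
    # Offsets must be non-decreasing so every slice is well-formed.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets must be non-decreasing")

# A well-formed offsets array for 4 values passes silently:
validate_list_offsets([0, 2, 4], 4)
```

The offsets from the report, {{[1, 3, 10]}} against 5 values, violate both the first and the last invariant and would raise here instead of producing garbage on {{to_pylist}}.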





[jira] [Comment Edited] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968
 ] 

Antoine Pitrou edited comment on ARROW-6132 at 8/7/19 10:45 AM:


My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...


was (Author: pitrou):
My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968
 ] 

Antoine Pitrou commented on ARROW-6132:
---

My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901966#comment-16901966
 ] 

Joris Van den Bossche commented on ARROW-6132:
--

I was actually just planning to open an issue for that: should 
{{ValidateArray}} check the indices of a DictionaryArray? 

Not knowing in detail how {{ValidateArray}} is used internally and what its 
purpose is, I would expect that from the name of the function, but from your 
response it sounds like it might not?

{{ValidateArray}} for a ListArray does walk all the offsets and check that 
they are consistent.

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Updated] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-08-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6157:
-
Description: 
From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
int64 = pa.array([1, 2, 3], type='int64') 
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- type id 2 is out 
of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

E.g. on conversion to Python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, doing an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} 
for UnionArray does nothing) 

(so that this can be called from the Python API to avoid creating invalid 
arrays / segfaults there)
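The missing check can be sketched in plain Python (a hypothetical helper, not the C++ {{ValidateVisitor}}): every type id must refer to an existing child array.

```python
def validate_union_types(types, num_children):
    """Sketch of union validation: each type id must index one of the
    child arrays. The report above uses type id 2 with only two
    children (valid ids 0 and 1)."""
    out_of_bounds = [t for t in types if not (0 <= t < num_children)]
    if out_of_bounds:
        raise ValueError("type ids out of bounds: %r" % out_of_bounds)

# All ids refer to one of the two children, so this passes silently:
validate_union_types([0, 1, 0, 0, 1, 0], 2)
```

A full check for dense unions would presumably also verify that each value offset is within the bounds of the corresponding child array.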


  was:
From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
int64 = pa.array([1, 2, 3], type='int64') 
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- type id 2 is out 
of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

E.g. on conversion to Python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, doing an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} 
for UnionArray does nothing)



> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
> int64 = pa.array([1, 2, 3], type='int64') 
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- type id 2 is out 
> of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ 
> {{ValidateVisitor}} for UnionArray does nothing) 
> (so that this can be called from the Python API to avoid creating invalid 
> arrays / segfaults there)





[jira] [Resolved] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

2019-08-07 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6060.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5016
[https://github.com/apache/arrow/pull/5016]

> [Python] too large memory cost using pyarrow.parquet.read_table with 
> use_threads=True
> -
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Kun Liu
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> I tried to load a Parquet file of about 1.8 GB using the following code. It 
> crashed due to an out-of-memory issue.
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
>  However, it worked well with use_threads=False, as follows:
> {code:java}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.





[jira] [Created] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6157:


 Summary: [Python][C++] UnionArray with invalid data passes 
validation / leads to segfaults
 Key: ARROW-6157
 URL: https://issues.apache.org/jira/browse/ARROW-6157
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
int64 = pa.array([1, 2, 3], type='int64') 
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- type id 2 is out 
of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

E.g. on conversion to Python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, doing an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} 
for UnionArray does nothing)






[jira] [Updated] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6156:
--
Labels: pull-request-available  (was: )

> [Java] Support compare semantics for ArrowBufPointer
> 
>
> Key: ARROW-6156
> URL: https://issues.apache.org/jira/browse/ARROW-6156
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Compare two Arrow buffer pointers by their content in lexicographic order: 
> null is smaller, and a shorter buffer is smaller.





[jira] [Created] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer

2019-08-07 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6156:
---

 Summary: [Java] Support compare semantics for ArrowBufPointer
 Key: ARROW-6156
 URL: https://issues.apache.org/jira/browse/ARROW-6156
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Compare two Arrow buffer pointers by their content in lexicographic order:

null is smaller, and a shorter buffer is smaller.
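The comparison semantics described above can be sketched in Python (the actual feature is Java's {{ArrowBufPointer}}; this is only a model). Here {{bytes}} objects stand in for buffer contents and {{None}} for a null pointer:

```python
def compare_buffers(a, b):
    """Return -1, 0 or 1: a null pointer sorts before any buffer;
    otherwise compare bytes lexicographically, where a strict prefix
    (i.e. a shorter buffer with equal leading bytes) is smaller."""
    if a is None or b is None:
        return (a is not None) - (b is not None)
    # Python already compares bytes lexicographically, prefix-first.
    return (a > b) - (a < b)

print(compare_buffers(None, b"x"))     # -1: null is smaller
print(compare_buffers(b"ab", b"abc"))  # -1: shorter buffer is smaller
```

Note the null-vs-null case returns 0, so the ordering stays a total order over pointers including nulls.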





[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901954#comment-16901954
 ] 

Antoine Pitrou commented on ARROW-6132:
---

Well, {{DictionaryArray.from_arrays(safe=True)}} is really expensive: it will 
walk all indices and check they are within bounds. {{ValidateArray}} doesn't do 
such a thing.
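The expensive bounds walk described above amounts to the following, sketched in plain Python (a hypothetical helper modeling what {{DictionaryArray.from_arrays(safe=True)}} checks, not pyarrow's implementation):

```python
def check_dictionary_indices(indices, dictionary_size):
    """O(n) walk over the indices, verifying each one points into
    the dictionary. None stands in for a null entry."""
    for idx in indices:
        if idx is None:          # null entries are allowed
            continue
        if not 0 <= idx < dictionary_size:
            raise IndexError(
                "index %d out of bounds for dictionary of length %d"
                % (idx, dictionary_size))

# All indices fall inside a 3-element dictionary, so this passes:
check_dictionary_indices([0, 2, None, 1], 3)
```

The cost is linear in the number of indices, which is exactly why a cheap {{ValidateArray}} would not want to do it unconditionally.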

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Commented] (ARROW-5151) [C++] Support take from UnionArray, ListArray, StructArray

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901953#comment-16901953
 ] 

Joris Van den Bossche commented on ARROW-5151:
--

[~bkietz] Based on your comment in ARROW-772, this is resolved?

> [C++] Support take from UnionArray, ListArray, StructArray
> --
>
> Key: ARROW-5151
> URL: https://issues.apache.org/jira/browse/ARROW-5151
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> {{arrow::compute::Take}} cannot take from UnionArray, ListArray, or 
> StructArray.





[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901945#comment-16901945
 ] 

Joris Van den Bossche commented on ARROW-6132:
--

{{DictionaryArray.from_arrays}} has a {{safe=True/False}} argument ({{safe=True}} 
by default) that allows disabling the validity checking. 
Although it is not exactly the same under the hood (DictionaryArray does not 
use the {{ValidateArray}} method), for users it is similar functionality, so I 
could add such a keyword to {{ListArray.from_arrays}} as well (I didn't do 
that yet in the PR https://github.com/apache/arrow/pull/5029).



> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Updated] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6132:
--
Labels: pull-request-available  (was: )

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Assigned] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-6132:


Assignee: Joris Van den Bossche

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (judging from the repr), but on conversion to 
> Python or flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes 
> extra garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want the unsafe / fast constructors without validation to be the default 
> in Python as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly. 


