[jira] [Commented] (ARROW-6168) [C++] IWYU docker-compose job is broken
[ https://issues.apache.org/jira/browse/ARROW-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902662#comment-16902662 ] Wes McKinney commented on ARROW-6168: - I fixed this temporarily by adding an explicit build type https://github.com/apache/arrow/pull/5036/files#diff-60422e0e36ec191f5e2687ffb18b5796R25 > [C++] IWYU docker-compose job is broken > --- > > Key: ARROW-6168 > URL: https://issues.apache.org/jira/browse/ARROW-6168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > Not sure what happened in the last week or so: > {code} > $ docker-compose run iwyu > WARNING: The CI_ARROW_SHA variable is not set. Defaulting to a blank string. > WARNING: The CI_ARROW_BRANCH variable is not set. Defaulting to a blank > string. > + mkdir -p /build/lint > + pushd /build/lint > /build/lint / > + cmake -GNinja -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON > -DARROW_PYTHON=ON -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 > -DCMAKE_EXPORT_COMPILE_COMMANDS=ON /arrow/cpp > -- Building using CMake version: 3.14.5 > -- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT') > -- Arrow SO version: 100 (full: 100.0.0) > -- clang-tidy found at /usr/bin/clang-tidy-7 > -- clang-format found at /usr/bin/clang-format-7 > -- infer not found > -- Using ccache: /opt/conda/bin/ccache > -- Found cpplint executable at /arrow/cpp/build-support/cpplint.py > -- Compiler command: env LANG=C /usr/bin/g++ -v > -- Compiler version: Using built-in specs. 
> COLLECT_GCC=/usr/bin/g++ > COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper > OFFLOAD_TARGET_NAMES=nvptx-none > OFFLOAD_TARGET_DEFAULT=1 > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs > --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr > --with-gcc-major-version-only --program-suffix=-7 > --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id > --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix > --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu > --enable-libstdcxx-debug --enable-libstdcxx-time=yes > --with-default-libstdcxx-abi=new --enable-gnu-unique-object > --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie > --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto > --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 > --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic > --enable-offload-targets=nvptx-none --without-cuda-driver > --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu > --target=x86_64-linux-gnu > Thread model: posix > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > -- Compiler id: GNU > Selected compiler gcc 7.4.0 > CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string): > string no output variable specified > Call Stack (most recent call first): > CMakeLists.txt:357 (include) > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) > CMake Error at cmake_modules/SetupCxxFlags.cmake:429 (message): > Unknown build type: > Call Stack (most recent call first): > CMakeLists.txt:357 (include) > -- Configuring incomplete, errors occurred! > See also "/build/lint/CMakeFiles/CMakeOutput.log". > See also "/build/lint/CMakeFiles/CMakeError.log". 
> {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902652#comment-16902652 ] Wes McKinney commented on ARROW-3246: - OK, I was able to get the initial refactor done today. Now we need the plumbing to be able to write dictionary values and indices separately to {{DictEncoder}} > [Python][Parquet] direct reading/writing of pandas categoricals in parquet > -- > > Key: ARROW-3246 > URL: https://issues.apache.org/jira/browse/ARROW-3246 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Assignee: Wes McKinney >Priority: Minor > Labels: parquet > Fix For: 1.0.0 > > > Parquet supports "dictionary encoding" of column data in a manner very > similar to the concept of Categoricals in pandas. It is natural to use this > encoding for a column which originated as a categorical. Conversely, when > loading, if the file metadata says that a given column came from a pandas (or > arrow) categorical, then we can trust that the whole of the column is > dictionary-encoded and load the data directly into a categorical column, > rather than expanding the labels upon load and recategorising later. > If the data does not have the pandas metadata, then the guarantee cannot > hold, and we cannot assume either that the whole column is dictionary encoded > or that the labels are the same throughout. In this case, the current > behaviour is fine. > > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
[ https://issues.apache.org/jira/browse/ARROW-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6152: -- Labels: pull-request-available (was: ) > [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter > - > > Key: ARROW-6152 > URL: https://issues.apache.org/jira/browse/ARROW-6152 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > This is an initial refactoring task to enable the Arrow write layer to access > some of the internal implementation details of > {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6142) [R] Install instructions on linux could be clearer
[ https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6142. - Resolution: Fixed Issue resolved by pull request 5027 [https://github.com/apache/arrow/pull/5027] > [R] Install instructions on linux could be clearer > -- > > Key: ARROW-6142 > URL: https://issues.apache.org/jira/browse/ARROW-6142 > Project: Apache Arrow > Issue Type: Wish > Components: R >Affects Versions: 0.14.1 > Environment: Ubuntu 19.04 >Reporter: Karl Dunkle Werner >Assignee: Neal Richardson >Priority: Minor > Labels: documentation, pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Installing R packages on Linux is almost always from source, which means > Arrow needs some system dependencies. The existing help message (from > arrow::install_arrow()) is very helpful in pointing that out, but it's still > a heavy lift for users who install R packages from source but don't plan to > develop Arrow itself. > Here are a couple of things that could make things slightly smoother: > # I would be very grateful if the install_arrow() message or installation > page told me which libraries were essential to make the R package work. > # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on > launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" > instead of just "PPA" would have caused me less confusion. (Others may differ) > # A snap package would be easier than installing a new apt address, but I > understand that building for snap would be more packaging work and only > benefits Ubuntu users. > > Thanks for making R bindings, and congratulations on the CRAN release! -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6166) [Go] Slice of slice causes index out of range panic
[ https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6166: -- Labels: pull-request-available (was: ) > [Go] Slice of slice causes index out of range panic > --- > > Key: ARROW-6166 > URL: https://issues.apache.org/jira/browse/ARROW-6166 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Roshan Kumaraswamy >Priority: Major > Labels: pull-request-available > > When slicing a slice, the offset of the underlying data will cause an index > out of range panic if the offset is greater than the slice length. See > [https://github.com/apache/arrow/issues/5033] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6168) [C++] IWYU docker-compose job is broken
Wes McKinney created ARROW-6168: --- Summary: [C++] IWYU docker-compose job is broken Key: ARROW-6168 URL: https://issues.apache.org/jira/browse/ARROW-6168 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.15.0 Not sure what happened in the last week or so: {code} $ docker-compose run iwyu WARNING: The CI_ARROW_SHA variable is not set. Defaulting to a blank string. WARNING: The CI_ARROW_BRANCH variable is not set. Defaulting to a blank string. + mkdir -p /build/lint + pushd /build/lint /build/lint / + cmake -GNinja -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 -DCMAKE_EXPORT_COMPILE_COMMANDS=ON /arrow/cpp -- Building using CMake version: 3.14.5 -- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT') -- Arrow SO version: 100 (full: 100.0.0) -- clang-tidy found at /usr/bin/clang-tidy-7 -- clang-format found at /usr/bin/clang-format-7 -- infer not found -- Using ccache: /opt/conda/bin/ccache -- Found cpplint executable at /arrow/cpp/build-support/cpplint.py -- Compiler command: env LANG=C /usr/bin/g++ -v -- Compiler version: Using built-in specs. 
COLLECT_GCC=/usr/bin/g++ COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) -- Compiler id: GNU Selected compiler gcc 7.4.0 CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string): string no output variable specified Call Stack (most recent call first): CMakeLists.txt:357 (include) -- Arrow build warning level: CHECKIN Using ld linker Configured for build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) CMake Error at cmake_modules/SetupCxxFlags.cmake:429 (message): Unknown build type: Call Stack (most recent call first): CMakeLists.txt:357 (include) -- Configuring incomplete, errors occurred! See also "/build/lint/CMakeFiles/CMakeOutput.log". See also "/build/lint/CMakeFiles/CMakeError.log". {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available
[ https://issues.apache.org/jira/browse/ARROW-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6167: -- Labels: pull-request-available (was: ) > [R] macOS binary R packages on CRAN don't have arrow_available > -- > > Key: ARROW-6167 > URL: https://issues.apache.org/jira/browse/ARROW-6167 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > > The {{configure}} script in the R package has some > [magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that > should ensure that on macOS, you're guaranteed a successful library > installation even (especially) if you don't have libarrow installed on your > system. This magic also is designed so that when CRAN builds a binary package > for macOS, the C++ libraries are bundled and "just work" when a user installs > it, no compilation required. > However, the magic appeared to fail on CRAN this time, as the binaries linked > on [https://cran.r-project.org/web/packages/arrow/index.html] were built > without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). > I've identified three vectors by which you can get an arrow package > installation on macOS in this state: > # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to > see if you've already installed {{apache-arrow}} via Homebrew always passes, > so if you have Homebrew installed but haven't done {{brew install > apache-arrow}}, the script won't do it for you like it looks like it intends. > (This is not suspected to be the problem on CRAN because they don't have > Homebrew installed.) > # If the > "[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]"; > installation fails, then the [test on > L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will > correctly fail. I managed to trigger this (by luck?) 
on the [R-hub testing > service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533]. > This is possibly what happened on CRAN, though the only [build > logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html] > we have from CRAN are terse because it believes the build was successful. > # Some idiosyncrasy in the compiler on the CRAN macOS system such that the > autobrew script would successfully download the arrow libraries but the L102 > check would error. I've been unable to reproduce this using the [version of > clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/]. > I have a fix for the first one and will provide workaround documentation for > the README and announcement blog post. Unfortunately, I don't know that > there's anything we can do about the useless binaries on CRAN at this time, > particularly since CRAN is going down for maintenance August 9-18. > cc [~jeroenooms] [~romainfrancois] [~wesmckinn] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available
Neal Richardson created ARROW-6167: -- Summary: [R] macOS binary R packages on CRAN don't have arrow_available Key: ARROW-6167 URL: https://issues.apache.org/jira/browse/ARROW-6167 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.14.1 Reporter: Neal Richardson Assignee: Neal Richardson The {{configure}} script in the R package has some [magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that should ensure that on macOS, you're guaranteed a successful library installation even (especially) if you don't have libarrow installed on your system. This magic also is designed so that when CRAN builds a binary package for macOS, the C++ libraries are bundled and "just work" when a user installs it, no compilation required. However, the magic appeared to fail on CRAN this time, as the binaries linked on [https://cran.r-project.org/web/packages/arrow/index.html] were built without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). I've identified three vectors by which you can get an arrow package installation on macOS in this state: # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to see if you've already installed {{apache-arrow}} via Homebrew always passes, so if you have Homebrew installed but haven't done {{brew install apache-arrow}}, the script won't do it for you like it looks like it intends. (This is not suspected to be the problem on CRAN because they don't have Homebrew installed.) # If the "[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]"; installation fails, then the [test on L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will correctly fail. I managed to trigger this (by luck?) on the [R-hub testing service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533]. 
This is possibly what happened on CRAN, though the only [build logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html] we have from CRAN are terse because it believes the build was successful. # Some idiosyncrasy in the compiler on the CRAN macOS system such that the autobrew script would successfully download the arrow libraries but the L102 check would error. I've been unable to reproduce this using the [version of clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/]. I have a fix for the first one and will provide workaround documentation for the README and announcement blog post. Unfortunately, I don't know that there's anything we can do about the useless binaries on CRAN at this time, particularly since CRAN is going down for maintenance August 9-18. cc [~jeroenooms] [~romainfrancois] [~wesmckinn] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6166) [Go] Slice of slice causes index out of range panic
[ https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Kumaraswamy updated ARROW-6166: -- External issue URL: https://github.com/apache/arrow/issues/5033 > [Go] Slice of slice causes index out of range panic > --- > > Key: ARROW-6166 > URL: https://issues.apache.org/jira/browse/ARROW-6166 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Roshan Kumaraswamy >Priority: Major > > When slicing a slice, the offset of the underlying data will cause an index > out of range panic if the offset is greater than the slice length. See > [https://github.com/apache/arrow/issues/5033] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6166) [Go] Slice of slice causes index out of range panic
Roshan Kumaraswamy created ARROW-6166: - Summary: [Go] Slice of slice causes index out of range panic Key: ARROW-6166 URL: https://issues.apache.org/jira/browse/ARROW-6166 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Roshan Kumaraswamy When slicing a slice, the offset of the underlying data will cause an index out of range panic if the offset is greater than the slice length. See [https://github.com/apache/arrow/issues/5033] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
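[Editorial note] The bug class is easiest to see as an offset-bookkeeping problem. Below is an illustrative Python sketch of the invariant a slice of a slice must maintain; it is not the apache/arrow Go implementation, just the general pattern:

```python
# Illustrative sketch (not the apache/arrow Go code): a sliced view
# stores an offset into a shared underlying buffer. Re-slicing must
# compound offsets, and bounds checks must be against the view's own
# length, never by indexing the buffer with an unadjusted offset.
class View:
    def __init__(self, data, offset, length):
        assert 0 <= offset and offset + length <= len(data)
        self.data, self.offset, self.length = data, offset, length

    def slice(self, i, j):
        assert 0 <= i <= j <= self.length, "slice bounds out of range"
        # Compound the parent's offset into the new view.
        return View(self.data, self.offset + i, j - i)

    def value(self, k):
        assert 0 <= k < self.length, "index out of range"
        return self.data[self.offset + k]

buf = list(range(10))
outer = View(buf, 2, 6)      # views buf[2:8]
inner = outer.slice(3, 5)    # views buf[5:7]; offsets compound: 2 + 3
print(inner.value(0))        # 5
```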
[jira] [Assigned] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
[ https://issues.apache.org/jira/browse/ARROW-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6152: --- Assignee: Wes McKinney > [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter > - > > Key: ARROW-6152 > URL: https://issues.apache.org/jira/browse/ARROW-6152 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This is an initial refactoring task to enable the Arrow write layer to access > some of the internal implementation details of > {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6039) [GLib] Add garrow_array_filter()
[ https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6039. - Resolution: Fixed Issue resolved by pull request 5025 [https://github.com/apache/arrow/pull/5025] > [GLib] Add garrow_array_filter() > > > Key: ARROW-6039 > URL: https://issues.apache.org/jira/browse/ARROW-6039 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Add bindings of a boolean selection filter. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores
[ https://issues.apache.org/jira/browse/ARROW-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902399#comment-16902399 ] lidavidm commented on ARROW-6165: - We'll also have to find free ports for the Flight tests, as right now they assume a hardcoded port. (Not hard to do, fortunately.) > [Integration] Use multiprocessing to run integration tests on multiple CPU > cores > > > Key: ARROW-6165 > URL: https://issues.apache.org/jira/browse/ARROW-6165 > Project: Apache Arrow > Issue Type: Improvement > Components: Integration >Reporter: Wes McKinney >Priority: Major > > The stdout/stderr will have to be captured appropriately so that the console > output when run in parallel is still readable -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902398#comment-16902398 ] Wes McKinney commented on ARROW-5559: - I agree it's weird > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores
Wes McKinney created ARROW-6165: --- Summary: [Integration] Use multiprocessing to run integration tests on multiple CPU cores Key: ARROW-6165 URL: https://issues.apache.org/jira/browse/ARROW-6165 Project: Apache Arrow Issue Type: Improvement Components: Integration Reporter: Wes McKinney The stdout/stderr will have to be captured appropriately so that the console output when run in parallel is still readable -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (ARROW-6059) [Python] Regression memory issue when calling pandas.read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-6059. --- Resolution: Duplicate Fix Version/s: 0.15.0 This should be resolved with the fix for ARROW-6060. If you can verify from master that would be helpful. If you run into more issues please reopen an issue > [Python] Regression memory issue when calling pandas.read_parquet > - > > Key: ARROW-6059 > URL: https://issues.apache.org/jira/browse/ARROW-6059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0, 0.14.1 >Reporter: Francisco Sanchez >Priority: Major > Fix For: 0.15.0 > > Attachments: Memory_profile_0.13.png, Memory_profile_0.13_rs.png, > Memory_profile_0.14.1_use_thread_FALSE.png, > Memory_profile_0.14.1_use_thread_false_rs.png, > Memory_profile_0.14.1_use_thread_true.png > > > I have a ~3MB parquet file with the next schema: > {code:java} > bag_stamp: timestamp[ns] > transforms_[]_.header.seq: list > child 0, item: int64 > transforms_[]_.header.stamp: list > child 0, item: timestamp[ns] > transforms_[]_.header.frame_id: list > child 0, item: string > transforms_[]_.child_frame_id: list > child 0, item: string > transforms_[]_.transform.translation.x: list > child 0, item: double > transforms_[]_.transform.translation.y: list > child 0, item: double > transforms_[]_.transform.translation.z: list > child 0, item: double > transforms_[]_.transform.rotation.x: list > child 0, item: double > transforms_[]_.transform.rotation.y: list > child 0, item: double > transforms_[]_.transform.rotation.z: list > child 0, item: double > transforms_[]_.transform.rotation.w: list > child 0, item: double > {code} > If I read it with *pandas.read_parquet()* using pyarrow 0.13.0 all seems > fine and it takes no time to load. If I try the same with 0.14.0 or 0.14.1 it > takes a lot of time and uses ~10GB of RAM. Many times if I don't have enough > available memory it will just be killed OOM. 
Now, if I use the following code > snippet instead it works perfectly with all the versions: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > parquet_file = pq.ParquetFile(input_file) > tables = [] > for row_group in range(parquet_file.num_row_groups): > tables.append(parquet_file.read_row_group(row_group, columns=columns, > use_pandas_metadata=True)) > df = pa.concat_tables(tables).to_pandas() > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5559: -- Labels: pull-request-available (was: ) > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6164) [Docs][Format] Document project versioning schema and forward/backward compatibility policies
Wes McKinney created ARROW-6164: --- Summary: [Docs][Format] Document project versioning schema and forward/backward compatibility policies Key: ARROW-6164 URL: https://issues.apache.org/jira/browse/ARROW-6164 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney Fix For: 1.0.0 Based on policy adopted via vote on mailing list -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6163) [C++] Misnamed test
Dmitry Kalinkin created ARROW-6163: -- Summary: [C++] Misnamed test Key: ARROW-6163 URL: https://issues.apache.org/jira/browse/ARROW-6163 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.14.1 Reporter: Dmitry Kalinkin "arrow-dataset-file_test" is defined in https://github.com/apache/arrow/blob/49badd25804af85dfe9019ab1390c649a02c89fa/cpp/src/arrow/dataset/CMakeLists.txt#L49 but the existing naming convention seems to be "foo-bar-test", not "foo-bar_test". The test needs to be renamed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902341#comment-16902341 ] Wes McKinney commented on ARROW-3246: - Writing BYTE_ARRAY can also definitely be made more efficient. See logic at https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L858 The dictionary page size issue is usually handled through the WriterProperties https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L178 If the dictionary is written all at once then this property can be circumvented, that would be my plan. > [Python][Parquet] direct reading/writing of pandas categoricals in parquet > -- > > Key: ARROW-3246 > URL: https://issues.apache.org/jira/browse/ARROW-3246 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Assignee: Wes McKinney >Priority: Minor > Labels: parquet > Fix For: 1.0.0 > > > Parquet supports "dictionary encoding" of column data in a manner very > similar to the concept of Categoricals in pandas. It is natural to use this > encoding for a column which originated as a categorical. Conversely, when > loading, if the file metadata says that a given column came from a pandas (or > arrow) categorical, then we can trust that the whole of the column is > dictionary-encoded and load the data directly into a categorical column, > rather than expanding the labels upon load and recategorising later. > If the data does not have the pandas metadata, then the guarantee cannot > hold, and we cannot assume either that the whole column is dictionary encoded > or that the labels are the same throughout. In this case, the current > behaviour is fine. 
> > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero
Prudhvi Porandla created ARROW-6162: --- Summary: [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero Key: ARROW-6162 URL: https://issues.apache.org/jira/browse/ARROW-6162 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6144) [C++][Gandiva] Implement random function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla updated ARROW-6144: Summary: [C++][Gandiva] Implement random function in Gandiva (was: Implement random function in Gandiva) > [C++][Gandiva] Implement random function in Gandiva > --- > > Key: ARROW-6144 > URL: https://issues.apache.org/jira/browse/ARROW-6144 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Implement random(), random(int seed) functions -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5559: - Assignee: Antoine Pitrou > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
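The API-stability argument behind {{IpcOptions}} is the classic options-struct pattern: bundle all knobs into one object with defaults so that adding an option never changes a function signature. A hedged Python sketch of the pattern (field names such as `allow_64bit` come from this discussion; the function is hypothetical, not the real Arrow API):

```python
from dataclasses import dataclass

# Hypothetical Python rendering of the options-struct pattern; field
# names mirror the ticket's discussion, not the real C++ IpcOptions.
@dataclass(frozen=True)
class IpcOptions:
    allow_64bit: bool = False
    max_recursion_depth: int = 64

def write_record_batch(batch, options=None):
    # Adding a new field to IpcOptions (with a default) later will not
    # break this signature or any existing call site.
    options = options or IpcOptions()
    return options.allow_64bit

print(write_record_batch(None))                                # False
print(write_record_batch(None, IpcOptions(allow_64bit=True)))  # True
```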
[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures
[ https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6161: -- Labels: datasets (was: ) > [C++] Implements dataset::ParquetFile and associated Scan structures > > > Key: ARROW-6161 > URL: https://issues.apache.org/jira/browse/ARROW-6161 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: datasets > > This is the first baby step in supporting datasets. The initial implementation > will be minimal and trivial: no parallelism, no schema adaptation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures
[ https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6161: -- Component/s: C++ > [C++] Implements dataset::ParquetFile and associated Scan structures > > > Key: ARROW-6161 > URL: https://issues.apache.org/jira/browse/ARROW-6161 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > This is the first baby step in supporting datasets. The initial implementation > will be minimal and trivial: no parallelism, no schema adaptation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures
[ https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6161: - Assignee: Francois Saint-Jacques > [C++] Implements dataset::ParquetFile and associated Scan structures > > > Key: ARROW-6161 > URL: https://issues.apache.org/jira/browse/ARROW-6161 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > This is the first baby step in supporting datasets. The initial implementation > will be minimal and trivial: no parallelism, no schema adaptation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures
[ https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6161: -- Description: This is the first baby step in supporting datasets. The initial implementation will be minimal and trivial: no parallelism, no schema adaptation. > [C++] Implements dataset::ParquetFile and associated Scan structures > > > Key: ARROW-6161 > URL: https://issues.apache.org/jira/browse/ARROW-6161 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Francois Saint-Jacques >Priority: Major > > This is the first baby step in supporting datasets. The initial implementation > will be minimal and trivial: no parallelism, no schema adaptation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures
Francois Saint-Jacques created ARROW-6161: - Summary: [C++] Implements dataset::ParquetFile and associated Scan structures Key: ARROW-6161 URL: https://issues.apache.org/jira/browse/ARROW-6161 Project: Apache Arrow Issue Type: New Feature Reporter: Francois Saint-Jacques -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902176#comment-16902176 ] Joris Van den Bossche commented on ARROW-5610: -- > But which also means that you lose all information about the extension type > defined elsewhere (as discussed above). Correcting myself: this is not fully true. The _type_ is no longer an extension type (but the storage type), but the _field_ in the schema still has the metadata. For example, reading an IPC file with Python for a non-registered type (created from C++ where the 'ext' column was a Uuid type as defined in the tests): {code} In [31]: f_ext = pa.ipc.open_stream("repos/arrow/cpp/build/examples/arrow/arrow-example-ipc-extension.arrow") In [32]: table = f_ext.read_all() In [33]: table Out[33]: pyarrow.Table int: int64 ext: int64 In [35]: table.schema.field_by_name('ext') Out[35]: pyarrow.Field In [36]: table.schema.field_by_name('ext').metadata Out[36]: {b'ARROW:extension:metadata': b'uuid-type-unique-code', b'ARROW:extension:name': b'uuid'} {code} > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902129#comment-16902129 ] Joris Van den Bossche commented on ARROW-5610: -- [~lidavidm] no apologies needed, it was just a question to check the status :) > > You want to transfer a table containing a column of that type to and from > > Python. Right now, you can read that data from Python, but you can't create > > a table with that type > > I'm curious, which error do you get when trying to do so? I tried this out, and so if you have an IPC message that contains an extension type unknown to Python / C++, and you read that into a (pyarrow) Table, you don't get an error at the moment, but it falls back to the storage type. But which also means that you lose all information about the extension type defined elsewhere (as discussed above). To me, it seems that we would need some way to have an "unknown extension type" in C++ that can have arbitrary name and metadata to be able to receive such data. > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors
[ https://issues.apache.org/jira/browse/ARROW-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6160: -- Labels: pull-request-available (was: ) > [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex > child vectors > > > Key: ARROW-6160 > URL: https://issues.apache.org/jira/browse/ARROW-6160 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > > Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type > child vectors are recursively expanded into primitive vectors; other complex types like > {{ListVector}} and {{UnionVector}} are treated as primitive types and returned > directly. > For example, for Struct(List(Int), Struct(Int, Varchar)), {{getPrimitiveVectors}} > should return {{[IntVector, IntVector, VarCharVector]}} instead of > [ListVector, IntVector, VarCharVector] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors
Ji Liu created ARROW-6160: - Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors Key: ARROW-6160 URL: https://issues.apache.org/jira/browse/ARROW-6160 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type child vectors are recursively expanded into primitive vectors; other complex types like {{ListVector}} and {{UnionVector}} are treated as primitive types and returned directly. For example, for Struct(List(Int), Struct(Int, Varchar)), {{getPrimitiveVectors}} should return {{[IntVector, IntVector, VarCharVector]}} instead of [ListVector, IntVector, VarCharVector] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902122#comment-16902122 ] Antoine Pitrou commented on ARROW-5559: --- Also {{RecordBatchWriter::WriteTable}} hardcodes {{allow_64bit = true}} when calling {{WriteRecordBatch}}, which is weird. > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902118#comment-16902118 ] Antoine Pitrou commented on ARROW-5559: --- Question: why is {{allow_64bit}} passed to {{RecordBatchWriter::WriteRecordBatch}} rather than at construction time? > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5559) [C++] Introduce IpcOptions struct object for better API-stability when adding new options
[ https://issues.apache.org/jira/browse/ARROW-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902118#comment-16902118 ] Antoine Pitrou edited comment on ARROW-5559 at 8/7/19 2:46 PM: --- Question: why is {{allow_64bit}} passed to {{RecordBatchWriter::WriteRecordBatch}} rather than at construction time? Since that method is supposed to be overridden by implementors, it makes it a bit delicate to change its signature... Though better to do it before 1.0.0. was (Author: pitrou): Question: why is {{allow_64bit}} passed to {{RecordBatchWriter::WriteRecordBatch}} rather than at construction time? > [C++] Introduce IpcOptions struct object for better API-stability when adding > new options > - > > Key: ARROW-5559 > URL: https://issues.apache.org/jira/browse/ARROW-5559 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Related to ARROW-2006. There are various IPC-related options like allowing > 64-bit lengths that might be better encapsulated in an options struct rather > than littered around different public APIs -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing identation for first line
[ https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6159: - Labels: beginner (was: ) > [C++] PrettyPrint of arrow::Schema missing identation for first line > > > Key: ARROW-6159 > URL: https://issues.apache.org/jira/browse/ARROW-6159 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Joris Van den Bossche >Priority: Minor > Labels: beginner > > Minor issue, but I noticed when printing a Schema with indentation, like: > {code} > std::shared_ptr<arrow::Field> field1 = arrow::field("column1", > arrow::int32()); > std::shared_ptr<arrow::Field> field2 = arrow::field("column2", > arrow::utf8()); > std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2}); > arrow::PrettyPrintOptions options{4}; > arrow::PrettyPrint(*schema, options, &std::cout); > {code} > you get > {code} > column1: int32 > column2: string > {code} > so not applying the indent for the first line. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing identation for first line
Joris Van den Bossche created ARROW-6159: Summary: [C++] PrettyPrint of arrow::Schema missing identation for first line Key: ARROW-6159 URL: https://issues.apache.org/jira/browse/ARROW-6159 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.1 Reporter: Joris Van den Bossche Minor issue, but I noticed when printing a Schema with indentation, like: {code} std::shared_ptr<arrow::Field> field1 = arrow::field("column1", arrow::int32()); std::shared_ptr<arrow::Field> field2 = arrow::field("column2", arrow::utf8()); std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2}); arrow::PrettyPrintOptions options{4}; arrow::PrettyPrint(*schema, options, &std::cout); {code} you get {code} column1: int32 column2: string {code} so not applying the indent for the first line. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing identation for first line
[ https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6159: - Labels: (was: first) > [C++] PrettyPrint of arrow::Schema missing identation for first line > > > Key: ARROW-6159 > URL: https://issues.apache.org/jira/browse/ARROW-6159 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Joris Van den Bossche >Priority: Minor > > Minor issue, but I noticed when printing a Schema with indentation, like: > {code} > std::shared_ptr<arrow::Field> field1 = arrow::field("column1", > arrow::int32()); > std::shared_ptr<arrow::Field> field2 = arrow::field("column2", > arrow::utf8()); > std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2}); > arrow::PrettyPrintOptions options{4}; > arrow::PrettyPrint(*schema, options, &std::cout); > {code} > you get > {code} > column1: int32 > column2: string > {code} > so not applying the indent for the first line. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing identation for first line
[ https://issues.apache.org/jira/browse/ARROW-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6159: - Labels: first (was: ) > [C++] PrettyPrint of arrow::Schema missing identation for first line > > > Key: ARROW-6159 > URL: https://issues.apache.org/jira/browse/ARROW-6159 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Joris Van den Bossche >Priority: Minor > Labels: first > > Minor issue, but I noticed when printing a Schema with indentation, like: > {code} > std::shared_ptr<arrow::Field> field1 = arrow::field("column1", > arrow::int32()); > std::shared_ptr<arrow::Field> field2 = arrow::field("column2", > arrow::utf8()); > std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2}); > arrow::PrettyPrintOptions options{4}; > arrow::PrettyPrint(*schema, options, &std::cout); > {code} > you get > {code} > column1: int32 > column2: string > {code} > so not applying the indent for the first line. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6154) [Rust] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902001#comment-16902001 ] Yesh edited comment on ARROW-6154 at 8/7/19 11:36 AM: -- Thanks for ack. Below is the error message. Additional data point is that it is able to dump schema via parquet-schema . {code:java} thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5{code} was (Author: madras): Thanks for ack. Here is the error message. {code:java} thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5{code} > [Rust] Too many open files (os error 24) > > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. 
Unfortunately won't be able to upload file.* > {code:java} > stack backtrace: > 0: std::panicking::default_hook::{{closure}} > 1: std::panicking::default_hook > 2: std::panicking::rust_panic_with_hook > 3: std::panicking::continue_panic_fmt > 4: rust_begin_unwind > 5: core::panicking::panic_fmt > 6: core::result::unwrap_failed > 7: parquet::util::io::FileSource::new > 8: as > parquet::file::reader::RowGroupReader>::get_column_page_reader > 9: as > parquet::file::reader::RowGroupReader>::get_column_reader > 10: parquet::record::reader::TreeBuilder::reader_tree > 11: parquet::record::reader::TreeBuilder::reader_tree > 12: parquet::record::reader::TreeBuilder::reader_tree > 13: parquet::record::reader::TreeBuilder::reader_tree > 14: parquet::record::reader::TreeBuilder::reader_tree > 15: parquet::record::reader::TreeBuilder::build > 16: core::iter::traits::iterator::Iterator>::next > 17: parquet_read::main > 18: std::rt::lang_start::{{closure}} > 19: std::panicking::try::do_call > 20: __rust_maybe_catch_panic > 21: std::rt::lang_start_internal > 22: main{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6154) [Rust] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902001#comment-16902001 ] Yesh commented on ARROW-6154: - Thanks for ack. Here is the error message. {code:java} thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5{code} > [Rust] Too many open files (os error 24) > > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. Unfortunately won't be able to upload file.* > {code:java} > stack backtrace: > 0: std::panicking::default_hook::{{closure}} > 1: std::panicking::default_hook > 2: std::panicking::rust_panic_with_hook > 3: std::panicking::continue_panic_fmt > 4: rust_begin_unwind > 5: core::panicking::panic_fmt > 6: core::result::unwrap_failed > 7: parquet::util::io::FileSource::new > 8: as > parquet::file::reader::RowGroupReader>::get_column_page_reader > 9: as > parquet::file::reader::RowGroupReader>::get_column_reader > 10: parquet::record::reader::TreeBuilder::reader_tree > 11: parquet::record::reader::TreeBuilder::reader_tree > 12: parquet::record::reader::TreeBuilder::reader_tree > 13: parquet::record::reader::TreeBuilder::reader_tree > 14: parquet::record::reader::TreeBuilder::reader_tree > 15: parquet::record::reader::TreeBuilder::build > 16: core::iter::traits::iterator::Iterator>::next > 17: parquet_read::main > 18: std::rt::lang_start::{{closure}} > 19: std::panicking::try::do_call > 20: __rust_maybe_catch_panic > 21: std::rt::lang_start_internal > 22: main{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
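Since "os error 24" is EMFILE (the process exhausted its file descriptors), one user-side workaround while the reader opens the file once per column is to raise the soft descriptor limit. A POSIX-only Python sketch (the 4096 fallback is an arbitrary choice, and this does not fix the underlying Rust reader behaviour):

```python
import resource

# "os error 24" is EMFILE: the soft RLIMIT_NOFILE was exhausted.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to the hard limit, or to an arbitrary 4096 if
# the hard limit is unlimited. Raising up to the hard limit needs no
# extra privileges.
new_soft = 4096 if hard == resource.RLIM_INFINITY else hard
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```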
[jira] [Commented] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901993#comment-16901993 ] Antoine Pitrou commented on ARROW-6158: --- Validation should really catch this. > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > Using the Python interface as example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to discussion in ARROW-6132, I would also expect that this the > {{ValidateArray}} catches this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901988#comment-16901988 ] Hatem Helal commented on ARROW-3246: Adding {{TypedColumnWriter::WriteArrow(const ::arrow::Array&)}} makes a lot of sense to me. [~wesmckinn] do you have a list of cases that you know can be optimized? The main one I'm aware of is the [dictionary array|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1079] case, but I'm curious if there are other arrow types that could be handled more efficiently. As an aside, has it ever been considered to automatically tune the size of the dictionary page? I think for the limited case of writing {{arrow::DictionaryArray}} we might want to ensure that the encoder doesn't fall back to plain encoding. That could be handled as a separate feature. > [Python][Parquet] direct reading/writing of pandas categoricals in parquet > -- > > Key: ARROW-3246 > URL: https://issues.apache.org/jira/browse/ARROW-3246 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Assignee: Wes McKinney >Priority: Minor > Labels: parquet > Fix For: 1.0.0 > > > Parquet supports "dictionary encoding" of column data in a manner very > similar to the concept of Categoricals in pandas. It is natural to use this > encoding for a column which originated as a categorical. Conversely, when > loading, if the file metadata says that a given column came from a pandas (or > arrow) categorical, then we can trust that the whole of the column is > dictionary-encoded and load the data directly into a categorical column, > rather than expanding the labels upon load and recategorising later. > If the data does not have the pandas metadata, then the guarantee cannot > hold, and we cannot assume either that the whole column is dictionary encoded > or that the labels are the same throughout. In this case, the current > behaviour is fine. 
> > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901978#comment-16901978 ] Joris Van den Bossche commented on ARROW-6158: -- Found an example where it starts to give errors: after taking a subset with {{Take}}. {code} In [6]: subset = a.take(pa.array([0, 2])) In [7]: subset Out[7]: -- is_valid: all not null -- child 0 type: int32 [ 1, 2 ] -- child 1 type: double [ 2.122e-314, 0 ] In [8]: subset.validate() In [9]: subset.to_pandas() Out[9]: array([{'a': 1, 'b': 2.121995791e-314}, {'a': 2, 'b': 0.0}], dtype=object) {code} > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > Using the Python interface as example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? 
(it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to the discussion in ARROW-6132, I would also expect that > {{ValidateArray}} catches this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
Joris Van den Bossche created ARROW-6158: Summary: [Python] possible to create StructArray with type that conflicts with child array's types Key: ARROW-6158 URL: https://issues.apache.org/jira/browse/ARROW-6158 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Using the Python interface as example. This creates a {{StructArray}} where the field types don't match the child array types: {code} a = pa.array([1, 2, 3], type=pa.int64()) b = pa.array(['a', 'b', 'c'], type=pa.string()) inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) {code} The above works fine. I didn't find anything that errors (eg conversion to pandas, slicing), also validation passes, but the type actually has the inconsistent child types: {code} In [2]: a Out[2]: -- is_valid: all not null -- child 0 type: int64 [ 1, 2, 3 ] -- child 1 type: string [ "a", "b", "c" ] In [3]: a.type Out[3]: StructType(struct) In [4]: a.to_pandas() Out[4]: array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], dtype=object) In [5]: a.validate() {code} Shouldn't this be disallowed somehow? (it could be checked in the Python {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already checks for the number of fields vs arrays and a consistent array length). Similarly to discussion in ARROW-6132, I would also expect that this the {{ValidateArray}} catches this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968 ] Antoine Pitrou edited comment on ARROW-6132 at 8/7/19 10:46 AM:

My expectation is that {{ValidateArray}} is O(1), not O(n) in array size. Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

Edit: oh, you're right about {{ValidateArray(ListArray)}}...

was (Author: pitrou): My expectation is that {{ValidateArray}} is O(1), not O(n) in array size. Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6132
>                 URL: https://issues.apache.org/jira/browse/ARROW-6132
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no validation that the offsets start with 0 and end with the length of the values array (but is that required? the docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.")).
> The array you get "seems" OK (the repr), but on conversion to Python or flattening, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))
> In [62]: a
> Out[62]:
> [
>   [1, 2],
>   [3, 4]
> ]
> In [63]: a.flatten()
> Out[63]:
> [0, 1, 2, 3, 4]  # <--- includes the 0
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes garbage elements
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe: as the caller you need to ensure the data is correct, or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation as the default in Python as well? Or should we call {{validate}} here?
> A quick search seems to indicate that {{pa.Array.from_buffers}} does validation, but other {{from_arrays}} methods don't seem to explicitly do this.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
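The offset invariants being discussed can be sketched in plain Python (the function name is hypothetical, not pyarrow API): a safe list-array constructor would require the first offset to be 0, the offsets to be non-decreasing, and the last offset to equal the length of the values array.

```python
def validate_list_offsets(offsets, values_length):
    """Check the list-layout invariants from the Arrow format spec."""
    if not offsets or offsets[0] != 0:
        raise ValueError("first offset must be 0")
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be non-decreasing")
    if offsets[-1] != values_length:
        raise ValueError("Final offset invariant not equal to values length: "
                         "%d!=%d" % (offsets[-1], values_length))

validate_list_offsets([0, 2, 5], 5)      # OK: encodes [[v0, v1], [v2, v3, v4]]
try:
    validate_list_offsets([1, 3, 10], 5)  # the offsets from the report above
except ValueError as exc:
    print(exc)  # fails the first-offset invariant (and the final-offset one)
```

This check is O(number of offsets), which is why doing it in the safe constructor (rather than the fast one) has a cost.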
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968 ] Antoine Pitrou commented on ARROW-6132: --- My expectation is that {{ValidateArray}} is O(1), not O(n) in array size. Perhaps we need a separate {{ValidateArrayData}} that digs deeper... -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901966#comment-16901966 ] Joris Van den Bossche commented on ARROW-6132: -- I was actually just planning to open an issue for that: should {{ValidateArray}} check the indices of a DictionaryArray? Not knowing in detail how {{ValidateArray}} is used internally or what its purpose is, I would expect so from the function's name, but from your response it seems it might not? Note that {{ValidateArray}} for a ListArray does walk all offsets and check that they are consistent. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
[ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6157: - Description: From the Python side, you can create an "invalid" UnionArray: {code} binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') int64 = pa.array([1, 2, 3], type='int64') types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8') # <- value of 2 is out of bounds for the number of children value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32') a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64]) {code} E.g. on conversion to Python this leads to a segfault: {code} In [7]: a.to_pylist() Segmentation fault (core dumped) {code} On the other hand, an explicit validation does not give an error: {code} In [8]: a.validate() {code} Should the validation raise an error for this case? (The C++ {{ValidateVisitor}} for UnionArray does nothing.) (So that this can be called from the Python API to avoid creating invalid arrays / segfaults there.) was: From the Python side, you can create an "invalid" UnionArray: {code} binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') int64 = pa.array([1, 2, 3], type='int64') types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8') # <- value of 2 is out of bounds for the number of children value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32') a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64]) {code} E.g. on conversion to Python this leads to a segfault: {code} In [7]: a.to_pylist() Segmentation fault (core dumped) {code} On the other hand, an explicit validation does not give an error: {code} In [8]: a.validate() {code} Should the validation raise an error for this case? (The C++ {{ValidateVisitor}} for UnionArray does nothing.) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
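A sketch, in plain Python (not the actual C++ {{ValidateVisitor}}), of the checks a dense-union validator could perform to catch the invalid array above: every type id must index an existing child, and every value offset must fall within that child's length. The function name is hypothetical.

```python
def validate_dense_union(types, value_offsets, children):
    """Check dense-union invariants: type ids and offsets in bounds."""
    for i, (type_id, offset) in enumerate(zip(types, value_offsets)):
        if not 0 <= type_id < len(children):
            raise ValueError("type id %d at slot %d out of bounds for %d children"
                             % (type_id, i, len(children)))
        if not 0 <= offset < len(children[type_id]):
            raise ValueError("offset %d at slot %d out of bounds" % (offset, i))

# The data from the report: type id 2 with only two children.
children = [[b'a', b'b', b'c', b'd'], [1, 2, 3]]   # binary child, int64 child
types = [0, 1, 0, 0, 2, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]
try:
    validate_dense_union(types, value_offsets, children)
except ValueError as exc:
    print(exc)  # type id 2 at slot 4 out of bounds for 2 children
```

The check is O(n) in array length, so it likely belongs in a "deep" validation pass rather than a constant-time one.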
[jira] [Resolved] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True
[ https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6060. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5016 [https://github.com/apache/arrow/pull/5016] > [Python] too large memory cost using pyarrow.parquet.read_table with > use_threads=True > - > > Key: ARROW-6060 > URL: https://issues.apache.org/jira/browse/ARROW-6060 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: Kun Liu >Assignee: Benjamin Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 3h > Remaining Estimate: 0h > > I tried to load a parquet file of about 1.8 GB using the following code. It > crashed due to an out-of-memory issue. > {code:java} > import pyarrow.parquet as pq > pq.read_table('/tmp/test.parquet'){code} > However, it worked well with use_threads=False as follows: > {code:java} > pq.read_table('/tmp/test.parquet', use_threads=False){code} > If pyarrow is downgraded to 0.12.1, there is no such problem. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
Joris Van den Bossche created ARROW-6157:
--------------------------------------------

Summary: [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
Key: ARROW-6157
URL: https://issues.apache.org/jira/browse/ARROW-6157
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Joris Van den Bossche

From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')  # <- value of 2 is out of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

E.g. on conversion to Python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise an error for this case? (The C++ {{ValidateVisitor}} for UnionArray does nothing.)

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer
[ https://issues.apache.org/jira/browse/ARROW-6156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6156: -- Labels: pull-request-available (was: ) > [Java] Support compare semantics for ArrowBufPointer > > > Key: ARROW-6156 > URL: https://issues.apache.org/jira/browse/ARROW-6156 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > Compare two arrow buffer pointers by their content in lexicographic order. > null is smaller and shorter buffer is smaller. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer
Liya Fan created ARROW-6156: --- Summary: [Java] Support compare semantics for ArrowBufPointer Key: ARROW-6156 URL: https://issues.apache.org/jira/browse/ARROW-6156 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Compare two Arrow buffer pointers by their contents in lexicographic order; a null pointer compares smallest, and on a tie the shorter buffer is smaller. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
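The comparison semantics proposed above can be sketched in a few lines of plain Python (illustrative only; the actual feature is a Java method on ArrowBufPointer): byte-wise lexicographic comparison, with null treated as smallest and the shorter buffer smaller when one is a prefix of the other.

```python
def compare_buffers(a, b):
    """Return -1, 0, or 1; a and b are bytes objects or None (null pointer)."""
    if a is None or b is None:
        # null is smallest; two nulls compare equal
        return (a is not None) - (b is not None)
    for x, y in zip(a, b):
        if x != y:                      # first differing byte decides
            return -1 if x < y else 1
    # common prefix: the shorter buffer is smaller
    return (len(a) > len(b)) - (len(a) < len(b))

assert compare_buffers(None, b"") == -1    # null < empty buffer
assert compare_buffers(b"ab", b"abc") == -1  # prefix is smaller
assert compare_buffers(b"ac", b"ab") == 1
assert compare_buffers(b"ab", b"ab") == 0
```

This total ordering makes the pointers usable as sort keys, e.g. for ordered dictionaries or merge joins over buffers.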
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901954#comment-16901954 ] Antoine Pitrou commented on ARROW-6132: --- Well, {{DictionaryArray.from_arrays(safe=True)}} is really expensive: it will walk all indices and check that they are within bounds. {{ValidateArray}} doesn't do such a thing. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
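The cost being discussed can be made concrete with a plain-Python sketch (the function name is hypothetical): a safe bounds check must visit every index, which is what makes {{safe=True}} O(n) rather than O(1).

```python
def check_dictionary_indices(indices, dictionary_size):
    """Walk every index and verify it points into the dictionary.

    None entries model null slots, which carry no index to check.
    """
    for i, idx in enumerate(indices):
        if idx is not None and not 0 <= idx < dictionary_size:
            raise ValueError("index %d at position %d out of bounds "
                             "for dictionary of size %d"
                             % (idx, i, dictionary_size))

check_dictionary_indices([0, 2, 1, None], 3)   # OK
try:
    check_dictionary_indices([0, 3], 3)        # 3 is out of bounds
except ValueError as exc:
    print(exc)
```

An O(1) validation, by contrast, can only check metadata (buffer counts, lengths, null counts), never the index values themselves.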
[jira] [Commented] (ARROW-5151) [C++] Support take from UnionArray, ListArray, StructArray
[ https://issues.apache.org/jira/browse/ARROW-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901953#comment-16901953 ] Joris Van den Bossche commented on ARROW-5151: -- [~bkietz] Based on your comment in ARROW-772, this is resolved? > [C++] Support take from UnionArray, ListArray, StructArray > -- > > Key: ARROW-5151 > URL: https://issues.apache.org/jira/browse/ARROW-5151 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Benjamin Kietzman >Priority: Minor > > {{arrow::compute::Take}} cannot take from UnionArray, ListArray, or > StructArray. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901945#comment-16901945 ] Joris Van den Bossche commented on ARROW-6132: -- {{DictionaryArray.from_arrays}} has a {{safe=True/False}} argument (with {{safe=True}} as the default) that allows disabling the validity checking. Although it is not exactly the same under the hood (DictionaryArray does not use the {{ValidateArray}} method), for users it is similar functionality, so I could add such a keyword to {{ListArray.from_arrays}} as well (I didn't do that yet in the PR https://github.com/apache/arrow/pull/5029). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6132: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6132: Assignee: Joris Van den Bossche -- This message was sent by Atlassian JIRA (v7.6.14#76016)