[jira] [Resolved] (ARROW-6809) [RUBY] Gem does not install on macOS due to glib2 3.3.7 compilation failure
[ https://issues.apache.org/jira/browse/ARROW-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-6809. - Fix Version/s: 0.15.1 1.0.0 Resolution: Fixed ARROW-6777 solves this. > [RUBY] Gem does not install on macOS due to glib2 3.3.7 compilation failure > --- > > Key: ARROW-6809 > URL: https://issues.apache.org/jira/browse/ARROW-6809 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 0.15.0 > Environment: macOS Mojave 10.14.6 > Ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18] > Xcode 10.3 >Reporter: Keith Wedinger >Assignee: Kouhei Sutou >Priority: Blocker > Fix For: 1.0.0, 0.15.1 > > > *System information:* > * macOS Mojave 10.14.6 > * Ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18] managed via > rbenv > *Reproduction steps:* > Run {{gem install red-arrow}} > *Observe:* > The following compilation errors occur during compilation of dependent gem > glib2 3.3.7: > {code} > Building native extensions. This could take a while... > ERROR: Error installing red-arrow: > ERROR: Failed to build gem native extension. > current directory: > /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2 > /Users/kwedinger/.rbenv/versions/2.6.3/bin/ruby -I > /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/2.6.0 -r > ./siteconf20191007-84053-1y4ly2q.rb extconf.rb > checking for --enable-debug-build option... no > checking for -Wall option to compiler... yes > checking for -Waggregate-return option to compiler... yes > checking for -Wcast-align option to compiler... yes > checking for -Wextra option to compiler... no > checking for -Wformat=2 option to compiler... yes > checking for -Winit-self option to compiler... yes > checking for -Wlarger-than-65500 option to compiler... yes > checking for -Wmissing-declarations option to compiler... yes > checking for -Wmissing-format-attribute option to compiler... yes > checking for -Wmissing-include-dirs option to compiler... yes > checking for -Wmissing-noreturn option to compiler... yes > checking for -Wmissing-prototypes option to compiler... yes > checking for -Wnested-externs option to compiler... yes > checking for -Wold-style-definition option to compiler... yes > checking for -Wpacked option to compiler... yes > checking for -Wp,-D_FORTIFY_SOURCE=2 option to compiler... yes > checking for -Wpointer-arith option to compiler... yes > checking for -Wswitch-default option to compiler... yes > checking for -Wswitch-enum option to compiler... yes > checking for -Wundef option to compiler... yes > checking for -Wout-of-line-declaration option to compiler... yes > checking for -Wunsafe-loop-optimizations option to compiler... no > checking for -Wwrite-strings option to compiler... yes > checking for Homebrew... yes > checking for gobject-2.0 version (>= 2.12.0)... yes > checking for gthread-2.0... yes > checking for unistd.h... yes > checking for io.h... no > checking for g_spawn_close_pid() in glib.h... yes > checking for g_thread_init() in glib.h... yes > checking for g_main_depth() in glib.h... yes > checking for g_listenv() in glib.h... yes > checking for rb_check_array_type() in ruby.h... yes > checking for rb_check_hash_type() in ruby.h... yes > checking for rb_exec_recursive() in ruby.h... yes > checking for rb_errinfo() in ruby.h... yes > checking for rb_thread_call_without_gvl() in ruby.h... yes > checking for ruby_native_thread_p() in ruby.h... yes > checking for rb_thread_call_with_gvl() in ruby.h... 
yes > checking for rb_gc_register_mark_object() in ruby.h... yes > checking for rb_exc_new_str() in ruby.h... yes > checking for rb_enc_str_new_static() in ruby.h... yes > checking for curr_thread in ruby.h,node.h... no > checking for rb_curr_thread in ruby.h,node.h... no > creating ruby-glib2.pc > creating glib-enum-types.c > creating glib-enum-types.h > creating Makefile > current directory: > /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2 > make "DESTDIR=" clean > current directory: > /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2 > make "DESTDIR=" > compiling rbglib-gc.c > compiling rbgobj_signal.c > compiling rbglib_int64.c > compiling rbglib_convert.c > compiling rbglib_bookmarkfile.c > compiling rbglib-variant.c > compiling glib-enum-types.c > glib-enum-types.c:632:9: warning: 'G_SPAWN_ERROR_2BIG' is deprecated: Use > 'G_SPAWN_ERROR_TOO_BIG' instead [-Wdeprecated-declarations] > { G_SPAWN_ERROR_2BIG, "G_SPAWN_ERROR_2BIG", "2big" }, > ^ > /usr/local/Cellar/glib/2.62.1/include/glib-2.0/glib/gspawn.h:76:22: note: > 'G_SPAWN_ERROR_2BIG' has been explicitly marked
[jira] [Updated] (ARROW-7014) [Developer] Write script to verify Linux wheels given local environment with conda or virtualenv
[ https://issues.apache.org/jira/browse/ARROW-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7014: -- Labels: pull-request-available (was: ) > [Developer] Write script to verify Linux wheels given local environment with > conda or virtualenv > > > Key: ARROW-7014 > URL: https://issues.apache.org/jira/browse/ARROW-7014 > Project: Apache Arrow > Issue Type: New Feature > Components: Developer Tools, Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Facilitate testing RC wheels. Also test checksum and sig -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6340) [R] Implements low-level bindings to Dataset classes
[ https://issues.apache.org/jira/browse/ARROW-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6340: --- Description: The following classes should be accessible from R: * class DataSource * class DataSourceDiscovery * class Dataset * class ScanContext, ScanOptions, ScanTask * class ScannerBuilder * class Scanner The end result is reading a directory of parquet files as a single stream. One should be able to re-implement [https://github.com/apache/arrow/pull/5720] in R. See also [https://github.com/apache/arrow/pull/5675/files] for another end-to-end example in C++. was: The following classes should be accessible from R: * class DataSource * class DataSourceDiscovery * class Dataset * class ScanContext, ScanOptions, ScanTask * class ScannerBuilder * class Scanner The end result is reading a directory of parquet files as a single stream. One should be able to re-implement [https://github.com/apache/arrow/pull/5720] in R. > [R] Implements low-level bindings to Dataset classes > > > Key: ARROW-6340 > URL: https://issues.apache.org/jira/browse/ARROW-6340 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Dataset, R >Reporter: Francois Saint-Jacques >Assignee: Romain Francois >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The following classes should be accessible from R: > * class DataSource > * class DataSourceDiscovery > * class Dataset > * class ScanContext, ScanOptions, ScanTask > * class ScannerBuilder > * class Scanner > The end result is reading a directory of parquet files as a single stream. > One should be able to re-implement > [https://github.com/apache/arrow/pull/5720] in R. > See also [https://github.com/apache/arrow/pull/5675/files] for another > end-to-end example in C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6987) [CI] Travis OSX failing to install sdk headers
[ https://issues.apache.org/jira/browse/ARROW-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6987: --- Labels: (was: pull-request-available) > [CI] Travis OSX failing to install sdk headers > -- > > Key: ARROW-6987 > URL: https://issues.apache.org/jira/browse/ARROW-6987 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Francois Saint-Jacques >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > {code:java} > sudo installer -pkg > /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg > -target /343installer: Package name is > macOS_SDK_headers_for_macOS_10.14344installer: Certificate used to sign > package is not trusted. Use -allowUntrusted to override.345The command > "$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library --homebrew" > failed and exited with 1 during . > {code} > See [https://travis-ci.org/apache/arrow/jobs/602434884#L342-L345] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
[ https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961597#comment-16961597 ] Wes McKinney commented on ARROW-7017: - I think the jury is out (for example, I'm not totally convinced) on having LLVM as a hard requirement for running simple expressions. I'm not sure what will end up being most desirable long term > [C++] Refactor AddKernel to support other operations and types > -- > > Key: ARROW-7017 > URL: https://issues.apache.org/jira/browse/ARROW-7017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > > * Should avoid using builders (and/or NULLs) since the output shape is known > at compute time. > * Should be refactored to support other operations, e.g. Subtraction, > Multiplication. > * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
[ https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961597#comment-16961597 ] Wes McKinney edited comment on ARROW-7017 at 10/29/19 1:51 AM: --- I think the jury is out (for example, I'm not totally convinced) on having LLVM compilation as a hard requirement for running simple expressions. I'm not sure what will end up being most desirable long term was (Author: wesmckinn): I think the jury is out (for example, I'm not totally convinced) on having LLVM as a hard requirement for running simple expressions. I'm not sure what will end up being most desirable long term > [C++] Refactor AddKernel to support other operations and types > -- > > Key: ARROW-7017 > URL: https://issues.apache.org/jira/browse/ARROW-7017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > > * Should avoid using builders (and/or NULLs) since the output shape is known > at compute time. > * Should be refactored to support other operations, e.g. Subtraction, > Multiplication. > * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6993) [CI] Macos SDK installation fails on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961592#comment-16961592 ] Kouhei Sutou commented on ARROW-6993: - This duplicates ARROW-6987. > [CI] Macos SDK installation fails on Travis > > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 > Pass the -allowUntrusted flag during the installation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6993) [CI] Macos SDK installation fails on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-6993. - Resolution: Duplicate > [CI] Macos SDK installation fails on Travis > > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 > Pass the -allowUntrusted flag during the installation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete
[ https://issues.apache.org/jira/browse/ARROW-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-7013. - Resolution: Fixed Issue resolved by pull request 5747 [https://github.com/apache/arrow/pull/5747] > [C++] arrow-dataset pkgconfig is incomplete > --- > > Key: ARROW-7013 > URL: https://issues.apache.org/jira/browse/ARROW-7013 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so > passing the result of what is found by pkgconfig results in the lib still not > being found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
[ https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961565#comment-16961565 ] Jacques Nadeau commented on ARROW-7017: --- What's the thinking behind building these a second time here, as opposed to just adding utility methods over Gandiva for specific patterns? My experience is that it is very rare that people only need to do a single expression. > [C++] Refactor AddKernel to support other operations and types > -- > > Key: ARROW-7017 > URL: https://issues.apache.org/jira/browse/ARROW-7017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > > * Should avoid using builders (and/or NULLs) since the output shape is known > at compute time. > * Should be refactored to support other operations, e.g. Subtraction, > Multiplication. > * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6993) [CI] Macos SDK installation fails on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961561#comment-16961561 ] Neal Richardson commented on ARROW-6993: [~kou] this is causing the GLib tests to fail on master > [CI] Macos SDK installation fails on Travis > > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 > Pass the -allowUntrusted flag during the installation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
[ https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7017: -- Component/s: C++ > [C++] Refactor AddKernel to support other operations and types > -- > > Key: ARROW-7017 > URL: https://issues.apache.org/jira/browse/ARROW-7017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > > * Should avoid using builders (and/or NULLs) since the output shape is known > at compute time. > * Should be refactored to support other operations, e.g. Subtraction, > Multiplication. > * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
Francois Saint-Jacques created ARROW-7017: - Summary: [C++] Refactor AddKernel to support other operations and types Key: ARROW-7017 URL: https://issues.apache.org/jira/browse/ARROW-7017 Project: Apache Arrow Issue Type: Improvement Components: C++ - Compute Reporter: Francois Saint-Jacques * Should avoid using builders (and/or NULLs) since the output shape is known at compute time. * Should be refactored to support other operations, e.g. Subtraction, Multiplication. * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
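For illustration only, a minimal Python/NumPy sketch of the two ideas in the description above — preallocating the output because its shape is known at compute time (no builder), plus an explicit overflow-detection mode. The real refactor concerns the Arrow C++ kernels; the function name below is hypothetical and not part of any Arrow API.
{code:python}
import numpy as np

def add_checked(a, b):
    """Illustrative element-wise add: preallocated output plus overflow detection."""
    # The output shape equals the input shape and is known before any value is
    # computed, so allocate it up front instead of appending through a builder.
    out = np.empty_like(a)
    np.add(a, b, out=out)
    # Signed-integer overflow: the wrapped result's sign differs from both operands'.
    overflow = ((a > 0) & (b > 0) & (out < 0)) | ((a < 0) & (b < 0) & (out > 0))
    if overflow.any():
        raise OverflowError("overflow detected in element-wise add")
    return out

print(add_checked(np.array([1, 2], dtype=np.int64), np.array([3, 4], dtype=np.int64)))
{code}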
[jira] [Created] (ARROW-7016) [Developer][Python] Write script to verify Windows wheels given local environment with conda
Wes McKinney created ARROW-7016: --- Summary: [Developer][Python] Write script to verify Windows wheels given local environment with conda Key: ARROW-7016 URL: https://issues.apache.org/jira/browse/ARROW-7016 Project: Apache Arrow Issue Type: New Feature Components: Developer Tools, Python Reporter: Wes McKinney Fix For: 1.0.0 Windows version of ARROW-7014 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7014) [Developer] Write script to verify Linux wheels given local environment with conda or virtualenv
Wes McKinney created ARROW-7014: --- Summary: [Developer] Write script to verify Linux wheels given local environment with conda or virtualenv Key: ARROW-7014 URL: https://issues.apache.org/jira/browse/ARROW-7014 Project: Apache Arrow Issue Type: New Feature Components: Developer Tools, Python Reporter: Wes McKinney Fix For: 1.0.0 Facilitate testing RC wheels. Also test checksum and sig -- This message was sent by Atlassian Jira (v8.3.4#803005)
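A rough sketch, in Python, of what such a Linux wheel verification script might do: check the published checksum and GPG signature, then install the wheel into a throwaway virtualenv and run an import smoke test. The wheel file name, the {{.sha512}}/{{.asc}} extensions, and the presence of {{gpg}} on the PATH are assumptions for illustration, not details taken from the issue.
{code:python}
import hashlib
import subprocess
import venv
from pathlib import Path

wheel = Path("pyarrow-0.15.1-cp37-cp37m-manylinux2010_x86_64.whl")  # hypothetical file name

# 1. Checksum: compare against the published .sha512 file next to the wheel.
expected = Path(str(wheel) + ".sha512").read_text().split()[0]
actual = hashlib.sha512(wheel.read_bytes()).hexdigest()
assert actual == expected, "checksum mismatch"

# 2. Signature: delegate to gpg; assumes the release manager's key is already imported.
subprocess.run(["gpg", "--verify", str(wheel) + ".asc", str(wheel)], check=True)

# 3. Install into a fresh virtualenv and run a trivial import smoke test.
env = Path("wheel-verify-env")
venv.create(env, with_pip=True)
subprocess.run([str(env / "bin" / "pip"), "install", str(wheel)], check=True)
subprocess.run([str(env / "bin" / "python"), "-c",
                "import pyarrow; print(pyarrow.__version__)"], check=True)
{code}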
[jira] [Created] (ARROW-7015) [Developer] Write script to verify macOS wheels given local environment with conda or virtualenv
Wes McKinney created ARROW-7015: --- Summary: [Developer] Write script to verify macOS wheels given local environment with conda or virtualenv Key: ARROW-7015 URL: https://issues.apache.org/jira/browse/ARROW-7015 Project: Apache Arrow Issue Type: New Feature Components: Developer Tools, Python Reporter: Wes McKinney Fix For: 1.0.0 macOS analogue to ARROW-7014 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete
[ https://issues.apache.org/jira/browse/ARROW-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7013: -- Labels: pull-request-available (was: ) > [C++] arrow-dataset pkgconfig is incomplete > --- > > Key: ARROW-7013 > URL: https://issues.apache.org/jira/browse/ARROW-7013 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so > passing the result of what is found by pkgconfig results in the lib still not > being found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete
Neal Richardson created ARROW-7013: -- Summary: [C++] arrow-dataset pkgconfig is incomplete Key: ARROW-7013 URL: https://issues.apache.org/jira/browse/ARROW-7013 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so passing the result of what is found by pkgconfig results in the lib still not being found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
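To make the symptom concrete: with no {{Libs}} line in the generated {{.pc}} file, {{pkg-config --libs}} has nothing to report, so a build that relies on it never passes the dataset library to the linker. A small check along these lines shows the difference (the package name {{arrow-dataset}} and the expected {{-larrow_dataset}} flag are assumptions here):
{code:python}
import subprocess

# Query pkg-config the same way a build script would.
result = subprocess.run(["pkg-config", "--libs", "arrow-dataset"],
                        capture_output=True, text=True, check=True)
flags = result.stdout.strip()
print(repr(flags))
# With the incomplete arrow-dataset.pc this prints '' instead of something
# like '-L/usr/local/lib -larrow_dataset', so the library is never linked.
{code}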
[jira] [Commented] (ARROW-2880) [Packaging] Script like verify-release-candidate.sh for automated testing of conda and wheel Python packages in ASF dist
[ https://issues.apache.org/jira/browse/ARROW-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961499#comment-16961499 ] Wes McKinney commented on ARROW-2880: - We can use Crossbow to do the verification in controlled environments. > [Packaging] Script like verify-release-candidate.sh for automated testing of > conda and wheel Python packages in ASF dist > > > Key: ARROW-2880 > URL: https://issues.apache.org/jira/browse/ARROW-2880 > Project: Apache Arrow > Issue Type: New Feature > Components: Packaging >Reporter: Wes McKinney >Priority: Major > > We have a script for verifying a source release candidate. We should make a > similar script to test out the wheels and conda packages for the supported > Python versions (2.7, 3.5, 3.6, soon 3.7) in an automated fashion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7007) [C++] Enable mmap option for LocalFs
[ https://issues.apache.org/jira/browse/ARROW-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961494#comment-16961494 ] Wes McKinney commented on ARROW-7007: - Might consider whether there is another approach to this problem. Consider how TensorFlow handles this (I think) https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h#L95 > [C++] Enable mmap option for LocalFs > > > Key: ARROW-7007 > URL: https://issues.apache.org/jira/browse/ARROW-7007 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
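For context, pyarrow already exposes memory mapping as a standalone entry point; below is a sketch of the kind of read a LocalFs mmap option would presumably route through. The file name is hypothetical, and whether the filesystem layer should grow such an option (or take another approach, as the comment above suggests) is exactly what the issue leaves open.
{code:python}
import pyarrow as pa

# Open a local IPC file through the OS page cache instead of buffered reads.
with pa.memory_map("batches.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
print(table.num_rows)
{code}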
[jira] [Resolved] (ARROW-6980) [R] dplyr backend for RecordBatch/Table
[ https://issues.apache.org/jira/browse/ARROW-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6980. Resolution: Fixed Issue resolved by pull request 5661 [https://github.com/apache/arrow/pull/5661] > [R] dplyr backend for RecordBatch/Table > --- > > Key: ARROW-6980 > URL: https://issues.apache.org/jira/browse/ARROW-6980 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6784) [C++][R] Move filter and take code from Rcpp to C++ library
[ https://issues.apache.org/jira/browse/ARROW-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961441#comment-16961441 ] Neal Richardson commented on ARROW-6784: Followup issues: ARROW-6959, ARROW-7009, ARROW-7012 > [C++][R] Move filter and take code from Rcpp to C++ library > --- > > Key: ARROW-6784 > URL: https://issues.apache.org/jira/browse/ARROW-6784 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Followup to ARROW-3808 and some other previous work. Of particular interest: > * Filter and Take methods for ChunkedArray, in r/src/compute.cpp > * Methods for that and some other things that apply Array and ChunkedArray > methods across the columns of a RecordBatch or Table, respectively > * RecordBatch__select and Table__select to take columns -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6784) [C++][R] Move filter and take code from Rcpp to C++ library
[ https://issues.apache.org/jira/browse/ARROW-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6784: --- Summary: [C++][R] Move filter and take code from Rcpp to C++ library (was: [C++][R] Move filter, take, select C++ code from Rcpp to C++ library) > [C++][R] Move filter and take code from Rcpp to C++ library > --- > > Key: ARROW-6784 > URL: https://issues.apache.org/jira/browse/ARROW-6784 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Followup to ARROW-3808 and some other previous work. Of particular interest: > * Filter and Take methods for ChunkedArray, in r/src/compute.cpp > * Methods for that and some other things that apply Array and ChunkedArray > methods across the columns of a RecordBatch or Table, respectively > * RecordBatch__select and Table__select to take columns -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7012) [C++] Clarify ChunkedArray chunking strategy and policy
Neal Richardson created ARROW-7012: -- Summary: [C++] Clarify ChunkedArray chunking strategy and policy Key: ARROW-7012 URL: https://issues.apache.org/jira/browse/ARROW-7012 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson See discussion on ARROW-6784 and [https://github.com/apache/arrow/pull/5686]. Among the questions: * Do Arrow users control the chunking, or is it an internal implementation detail they should not manage? * If users control it, how do they control it? E.g. if I call Take and use a ChunkedArray for the indices to take, does the chunking follow how the indices are chunked? Or should we attempt to preserve the mapping of data to their chunks in the input table/chunked array? * If it's an implementation detail, what is the optimal chunk size? And when is it worth reshaping (concatenating, slicing) input data to attain this optimal size? -- This message was sent by Atlassian Jira (v8.3.4#803005)
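As things stand, the chunk layout is directly visible to, and supplied by, the caller — which is what the first question above is getting at. A small illustration with the existing Python API:
{code:python}
import pyarrow as pa

# The caller decides where the chunk boundaries fall; nothing normalizes them.
ca = pa.chunked_array([[1, 2, 3], [4, 5]])
print(ca.num_chunks)   # 2 -- exactly the chunks that were passed in
print(ca.chunk(0))     # first chunk: [1, 2, 3]
print(len(ca))         # 5 logical elements across both chunks
{code}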
[jira] [Created] (ARROW-7011) [C++] Implement casts from float/double to decimal128
Wes McKinney created ARROW-7011: --- Summary: [C++] Implement casts from float/double to decimal128 Key: ARROW-7011 URL: https://issues.apache.org/jira/browse/ARROW-7011 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney see also ARROW-5905, ARROW-7010 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7010) [C++] Support lossy casts from decimal128 to float32 and float64/double
Wes McKinney created ARROW-7010: --- Summary: [C++] Support lossy casts from decimal128 to float32 and float64/double Key: ARROW-7010 URL: https://issues.apache.org/jira/browse/ARROW-7010 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 I do not believe such casts are implemented. This can be helpful for people analyzing data where the precision of decimal128 is not needed -- This message was sent by Atlassian Jira (v8.3.4#803005)
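A sketch of the intended usage from Python, assuming the new kernel is exposed through the ordinary {{Array.cast}} path. At the time of this issue the call below raises because no decimal128-to-float cast is implemented; once implemented it would be lossy by design.
{code:python}
from decimal import Decimal
import pyarrow as pa

arr = pa.array([Decimal("123.45"), Decimal("0.01")], type=pa.decimal128(10, 2))
# Expected to fail until this issue is resolved; afterwards it should yield
# [123.45, 0.01] as float64, possibly losing precision for wide decimals.
floats = arr.cast(pa.float64())
print(floats)
{code}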
[jira] [Created] (ARROW-7009) [C++] Refactor filter/take kernels to use Datum instead of overloads
Neal Richardson created ARROW-7009: -- Summary: [C++] Refactor filter/take kernels to use Datum instead of overloads Key: ARROW-7009 URL: https://issues.apache.org/jira/browse/ARROW-7009 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 Followup to ARROW-6784. See discussion on [https://github.com/apache/arrow/pull/5686,|https://github.com/apache/arrow/pull/5686] as well as ARROW-6959. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability
[ https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-7006. Resolution: Fixed Issue resolved by pull request 5744 [https://github.com/apache/arrow/pull/5744] > [Rust] Bump flatbuffers version to avoid vulnerability > -- > > Key: ARROW-7006 > URL: https://issues.apache.org/jira/browse/ARROW-7006 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > From GitHub user emilk: > [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output: > > {{ID: RUSTSEC-2019-0028 > Crate: flatbuffers > Version: 0.5.0 > Date: 2019-10-20 > URL: https://github.com/google/flatbuffers/issues/5530 > Title: Unsound `impl Follow for bool`}} > The fix should be as simple as editing > [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from > {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}} > A longer-term improvement is to add a call to {{cargo audit}} in your CI to > catch these problems as early as possible > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:24 PM: -- [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)._ This may be because the index on test3.hdf is Int64Index and I see [pyarrow docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas] say default behavior is to "store the index as a column", except for rage indexes. This unfortunately makes the bug more prevalent. was (Author: goodiegoodman): [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)._ This may be because the index on test3.hdf is Int64Index and I see [pyarrow docs|#pyarrow.Table.from_pandas]] say default behavior is to "store the index as a column", except for rage indexes. This unfortunately makes the bug more prevalent. > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return
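One possible way to sidestep the KeyError while keeping '__index_level_0__' in the schema is to materialize the index as an ordinary column under that serialized name before handing the schema back. This is an untested suggestion against 0.15.0, the DataFrame below is a stand-in for the attached test3.hdf, and the pandas metadata written this way records the field as a regular column rather than an index level:
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1.0, 2.0]}, index=pd.Index([10, 20]))  # stand-in for test3.hdf
schema = pa.Table.from_pandas(df).schema  # includes '__index_level_0__'

# Turn the unnamed Int64Index into a real column with the serialized name, so
# the schema lookup inside from_pandas finds it instead of raising KeyError.
df2 = df.rename_axis("__index_level_0__").reset_index()
table = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)
print(table.schema.names)
{code}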
[jira] [Updated] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability
[ https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan updated ARROW-7006: --- Component/s: Rust Fix Version/s: 1.0.0 > [Rust] Bump flatbuffers version to avoid vulnerability > -- > > Key: ARROW-7006 > URL: https://issues.apache.org/jira/browse/ARROW-7006 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > From GitHub user emilk: > [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output: > > {{ID: RUSTSEC-2019-0028 > Crate: flatbuffers > Version: 0.5.0 > Date: 2019-10-20 > URL: https://github.com/google/flatbuffers/issues/5530 > Title: Unsound `impl Follow for bool`}} > The fix should be as simple as editing > [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from > {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}} > A longer-term improvement is to add a call to {{cargo audit}} in your CI to > catch these problems as early as possible > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:21 PM: -- [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)._ This may be because the index on test3.hdf is Int64Index and I see [pyarrow docs|#pyarrow.Table.from_pandas]] say default behavior is to "store the index as a column", except for rage indexes. This unfortunately makes the bug more prevalent. was (Author: goodiegoodman): [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)._ This may be because the index on test3.hdf is Int64Index and I see [pyarrow docs|[https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]] say default behavior is to "store the index as a column", except for rage indexes) > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File
[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:15 PM: -- [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)._ This may be because the index on test3.hdf is Int64Index and I see [pyarrow docs|[https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]] say default behavior is to "store the index as a column", except for rage indexes) was (Author: goodiegoodman): [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)_, do you? > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in >
[jira] [Updated] (ARROW-6758) [Release] Install ephemeral node/npm/npx in release verification script
[ https://issues.apache.org/jira/browse/ARROW-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6758: -- Labels: pull-request-available (was: ) > [Release] Install ephemeral node/npm/npx in release verification script > --- > > Key: ARROW-6758 > URL: https://issues.apache.org/jira/browse/ARROW-6758 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > > Installing node with nvm isn't terribly difficult; to add this to the release > verification script would make it easier for people to verify more of the > release -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6758) [Release] Install ephemeral node/npm/npx in release verification script
[ https://issues.apache.org/jira/browse/ARROW-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6758: --- Assignee: Wes McKinney > [Release] Install ephemeral node/npm/npx in release verification script > --- > > Key: ARROW-6758 > URL: https://issues.apache.org/jira/browse/ARROW-6758 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > > Installing node with nvm isn't terribly difficult; to add this to the release > verification script would make it easier for people to verify more of the > release -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman edited comment on ARROW-6999 at 10/28/19 6:32 PM: -- [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__' (+without+ specifying preserve_index)_, do you? was (Author: goodiegoodman): [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__', do you? > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call
[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman edited comment on ARROW-6999 at 10/28/19 6:13 PM: -- [~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__', do you? was (Author: goodiegoodman): [~jorisvandenbossche] please try this with the attached test3.hdf (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__', do you? > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File >
[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317 ] Tom Goodman commented on ARROW-6999: [~jorisvandenbossche] please try this with the attached test3.hdf (not empty) {code:java} df2 = pd.read_hdf('test3.hdf','foo') pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code} I still get KeyError: '__index_level_0__', do you? > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. > > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > 
"/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 3326, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 5, in > pa_table = pa.Table.from_pandas(df, > schema=pa.Table.from_pandas(df).schema) > File "pyarrow/table.pxi", line 1057, in
[jira] [Updated] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Goodman updated ARROW-6999: --- Attachment: test3.hdf > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > Attachments: test3.hdf > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. > > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > 
pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 3326, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 5, in > pa_table = pa.Table.from_pandas(df, > schema=pa.Table.from_pandas(df).schema) > File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 517, in dataframe_to_arrays > columns) > File >
[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961145#comment-16961145 ] Joris Van den Bossche commented on ARROW-6999: -- So this case is clearly a bug in the new implementation, I would say: {code} In [23]: import pandas as pd ...: import pyarrow as pa ...: df = pd.DataFrame({'a': [1, 2, 3]}) ...: schema = pa.Table.from_pandas(df, preserve_index=True).schema ...: pa_table = pa.Table.from_pandas(df, schema=schema, preserve_index=True) ... KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index" {code} So if you specify {{preserve_index=True}}, and there is an index in the schema that did not have a name in the DataFrame (so ending up as the generated {{\_\_index_level_i\_\_}}), the above should work when passing an explicit schema matching that. Will look into fixing this (it's a pity that 0.15.1 is already released, it would have been nice to include this). > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", > line 2489, in _get_item_cache > values = self._data.get(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", > line 4115, in get > loc = self.items.get_loc(item) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3080, in get_loc > return self._engine.get_loc(self._maybe_cast_indexer(key)) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi",
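Until a fix lands, one possible workaround for the {{preserve_index=True}} case is to materialise the index as a real column under the generated name before converting, so the explicit schema can resolve every field without relying on index handling. The sketch below reuses the toy DataFrame from the comment above; the {{rename_axis}}/{{reset_index}} approach is an assumption and has not been verified against the affected releases, so treat it as a starting point rather than the fix tracked here.

{code:python}
# Workaround sketch, not the fix for ARROW-6999: expose the index as an actual
# '__index_level_0__' column so every field in the explicit schema is a column.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3]})
schema = pa.Table.from_pandas(df, preserve_index=True).schema

df2 = df.rename_axis('__index_level_0__').reset_index()
table = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)
{code}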
[jira] [Assigned] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn reassigned ARROW-7008: --- Assignee: Uwe Korn > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7008: -- Labels: pull-request-available (was: ) > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961084#comment-16961084 ] Uwe Korn commented on ARROW-7008: - No, this is a different one and I can reproduce with 0.15 and master. > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079 ] Artem KOZHEVNIKOV commented on ARROW-7008: -- is > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079 ] Artem KOZHEVNIKOV edited comment on ARROW-7008 at 10/28/19 2:10 PM: is it the same issue as https://issues.apache.org/jira/browse/ARROW-6857 ? was (Author: artemk): is > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn updated ARROW-7008: Summary: [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers (was: [Python] pyarrow.chunked_array([array]) fails on array with ) > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with
Uwe Korn created ARROW-7008: --- Summary: [Python] pyarrow.chunked_array([array]) fails on array with Key: ARROW-7008 URL: https://issues.apache.org/jira/browse/ARROW-7008 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.0 Reporter: Uwe Korn Minimal reproducer: {code} import pyarrow as pa pa.chunked_array([pa.array([], type=pa.string()).dictionary_encode().dictionary]) {code} Traceback {code} (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x20) * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status arrow::internal::ValidateVisitor::ValidateOffsets(arrow::BinaryArray const&) + 94 frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status arrow::VisitArrayInline(arrow::Array const&, arrow::internal::ValidateVisitor*) + 915 frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() const + 829 frame #3: 0x000112e3ea19 libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 frame #4: 0x000112b8eb7d lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, _object*, _object*) + 3661 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
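The crash happens while validating an array whose buffers are all absent; the dictionary values of an empty dictionary-encoded string array are one way to end up with such an array. The snippet below only inspects the buffers and deliberately does not call {{pa.chunked_array}}, so it avoids the segfault; the all-None buffer layout noted in the comments is an expectation for this case, not something asserted by the issue.

{code:python}
# Inspection sketch: the dictionary of an empty dictionary-encoded string array
# typically carries only None buffers, which is what Validate() trips over.
import pyarrow as pa

dictionary = pa.array([], type=pa.string()).dictionary_encode().dictionary
print(dictionary)            # empty string array
print(dictionary.buffers())  # expected: all None for this case
{code}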
[jira] [Updated] (ARROW-7007) [C++] Enable mmap option for LocalFs
[ https://issues.apache.org/jira/browse/ARROW-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7007: -- Component/s: C++ > [C++] Enable mmap option for LocalFs > > > Key: ARROW-7007 > URL: https://issues.apache.org/jira/browse/ARROW-7007 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7007) [C++] Enable mmap option for LocalFs
Francois Saint-Jacques created ARROW-7007: - Summary: [C++] Enable mmap option for LocalFs Key: ARROW-7007 URL: https://issues.apache.org/jira/browse/ARROW-7007 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
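For context, memory-mapping a local file is already possible through the explicit IO APIs; the proposal is about exposing it as an option on the local filesystem implementation so readers get it transparently. A rough Python-side illustration of today's explicit route follows; {{data.arrow}} is a placeholder path, and this is not the C++ {{LocalFs}} change being proposed.

{code:python}
# Sketch of an mmap-backed read of a local Arrow IPC file via the explicit API.
import pyarrow as pa

source = pa.memory_map('data.arrow', 'r')
reader = pa.ipc.open_file(source)
table = reader.read_all()
{code}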
[jira] [Updated] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability
[ https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7006: -- Labels: pull-request-available (was: ) > [Rust] Bump flatbuffers version to avoid vulnerability > -- > > Key: ARROW-7006 > URL: https://issues.apache.org/jira/browse/ARROW-7006 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > > From GitHub use emilk: > [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output: > > {{ID: RUSTSEC-2019-0028 > Crate: flatbuffers > Version: 0.5.0 > Date: 2019-10-20 > URL: https://github.com/google/flatbuffers/issues/5530 > Title: Unsound `impl Follow for bool`}} > The fix should be as simple as editing > [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from > {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}} > A more longterm improvement is to add a call to {{cargo audit}} in your CI to > catch these problems as early as possible > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability
[ https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan reassigned ARROW-7006: -- Assignee: Paddy Horan > [Rust] Bump flatbuffers version to avoid vulnerability > -- > > Key: ARROW-7006 > URL: https://issues.apache.org/jira/browse/ARROW-7006 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > > From GitHub use emilk: > [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output: > > {{ID: RUSTSEC-2019-0028 > Crate: flatbuffers > Version: 0.5.0 > Date: 2019-10-20 > URL: https://github.com/google/flatbuffers/issues/5530 > Title: Unsound `impl Follow for bool`}} > The fix should be as simple as editing > [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from > {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}} > A more longterm improvement is to add a call to {{cargo audit}} in your CI to > catch these problems as early as possible > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability
Paddy Horan created ARROW-7006: -- Summary: [Rust] Bump flatbuffers version to avoid vulnerability Key: ARROW-7006 URL: https://issues.apache.org/jira/browse/ARROW-7006 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.15.0 Reporter: Paddy Horan From GitHub user emilk: [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output: {{ID: RUSTSEC-2019-0028 Crate: flatbuffers Version: 0.5.0 Date: 2019-10-20 URL: https://github.com/google/flatbuffers/issues/5530 Title: Unsound `impl Follow for bool`}} The fix should be as simple as editing [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}} A longer-term improvement is to add a call to {{cargo audit}} in your CI to catch these problems as early as possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)

[jira] [Created] (ARROW-7005) [Rust] run "cargo audit" in CI
Paddy Horan created ARROW-7005: -- Summary: [Rust] run "cargo audit" in CI Key: ARROW-7005 URL: https://issues.apache.org/jira/browse/ARROW-7005 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Paddy Horan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema
[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960931#comment-16960931 ] Joris Van den Bossche commented on ARROW-6999: -- [~goodiegoodman] thanks for the report! Your "steps to reproduce" actually do work if you do not use an empty dataframe: {code} In [15]: import pandas as pd ...: import pyarrow as pa ...: df = pd.DataFrame({'a': [1, 2, 3]}) ...: schema = pa.Table.from_pandas(df).schema ...: pa_table = pa.Table.from_pandas(df, schema=schema) In [16]: schema Out[16]: a: int64 metadata {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "' b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_' b'name": null, "pandas_type": "unicode", "numpy_type": "object", "' b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f' b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", ' b'"metadata": null}], "creator": {"library": "pyarrow", "version":' b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669' b'.g3c29114b1"}'} {code} The empty dataframe is tricky edge-case regarding the index, because in such a case the index is not a RangeIndex but a empty object-dtype Index (see ARROW-5104 for a similar report about that aspect). That said, if passing an explicit schema, and if there is a column not found that has a "\_\_index_level_i\_\_" pattern, we should try to handle this (certainly in case of passing {{preserve_index=True}}). > [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own > schema > --- > > Key: ARROW-6999 > URL: https://issues.apache.org/jira/browse/ARROW-6999 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0 > Environment: pandas==0.23.4 > pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0 >Reporter: Tom Goodman >Priority: Major > Fix For: 1.0.0 > > > Steps to reproduce: > # Generate any DataFrame's pyarrow Schema using Table.from_pandas > # Pass the generated schema as input into Table.from_pandas > # Causes KeyError: '__index_level_0__' > We did not have this issue with pyarrow==0.11.0 which we used to write many > partitions across years. Our goal now is to use pyarrow==0.15.0 and produce > schema going forward that are *backwards compatible* (i.e. also have > '__index_level_0__'), so we should not need to re-generate all prior years' > partitions when we migrate to 0.15.0. > We cannot set _preserve_index=False_, since that effectively deletes > '__index_level_0__', causing inconsistent schema across earlier partitions > that had been written using pyarrow==0.11.0. 
> > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame() > schema = pa.Table.from_pandas(df).schema > pa_table = pa.Table.from_pandas(df, schema=schema) > {code} > {noformat} > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3078, in get_loc > return self._engine.get_loc(key) > File "pandas/_libs/index.pyx", line 140, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/index.pyx", line 162, in > pandas._libs.index.IndexEngine.get_loc > File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in > pandas._libs.hashtable.PyObjectHashTable.get_item > File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in > pandas._libs.hashtable.PyObjectHashTable.get_item > KeyError: '__index_level_0__' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 408, in _get_columns_to_convert_given_schema > col = df[name] > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2688, in __getitem__ > return self._getitem_column(key) > File > "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", > line 2695, in _getitem_column > return self._get_item_cache(key) > File >
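The index edge case described above is easy to see directly: with the pandas releases of that era an empty DataFrame carries an empty object-dtype Index rather than a RangeIndex, so from_pandas serialises the index as a real '__index_level_0__' column instead of range metadata. A small illustration (behaviour may differ on newer pandas versions):

{code:python}
# Illustration of the empty-DataFrame index edge case described above.
import pandas as pd

print(type(pd.DataFrame({'a': [1, 2, 3]}).index))  # RangeIndex
print(type(pd.DataFrame().index))                  # object-dtype Index on pandas 0.23-0.25
{code}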
[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960912#comment-16960912 ] Joris Van den Bossche commented on ARROW-5379: -- So the pandas -> arrow/feather conversion already works with pandas master and the latest Arrow release (0.15). If you want to use this feature without relying on pandas master, you can use this monkeypatch (it's basically what is added in the development version of pandas master): {code} import pandas as pd import pyarrow pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type) {code} > [Python] support pandas' nullable Integer type in from_pandas > - > > Key: ARROW-5379 > URL: https://issues.apache.org/jira/browse/ARROW-5379 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > From https://github.com/apache/arrow/issues/4168. We should add support for > pandas' nullable Integer extension dtypes, as those could map nicely to > arrows integer types. > Ideally this happens in a generic way though, and not specific for this > extension type, which is discussed in ARROW-5271 -- This message was sent by Atlassian Jira (v8.3.4#803005)
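A usage sketch for the monkeypatch above, assuming pandas >= 0.24 (for the {{Int64}} extension dtype and the {{pd.arrays}} namespace) and pyarrow 0.15: once {{__arrow_array__}} is patched in, nullable-integer data converts to an int64 Arrow array with a validity bitmap instead of raising. It relies on private {{IntegerArray}} attributes, so it is a stopgap until the protocol ships in a pandas release.

{code:python}
# Usage sketch for the monkeypatch from the comment above (a stopgap only).
import pandas as pd
import pyarrow as pa

pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pa.array(
    self._data, mask=self._mask, type=type)

s = pd.Series([1, None, 3], dtype='Int64')
arr = pa.array(s.array)                               # Int64Array with one null
table = pa.Table.from_pandas(pd.DataFrame({'a': s}))  # conversion no longer raises
print(arr)
{code}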
[jira] [Closed] (ARROW-7002) Support pandas nullable integer type Int64
[ https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-7002. Resolution: Duplicate Closing as a duplicate of ARROW-5379 > Support pandas nullable integer type Int64 > -- > > Key: ARROW-7002 > URL: https://issues.apache.org/jira/browse/ARROW-7002 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Christian Roth >Priority: Major > > Pandas has a nullable integer type Int64 which does not seem to be supported > by feather yet. > {code:python} > from pyarrow import feather > import pandas as pd > col1 = pd.Series([0, None, 1, 23]).astype('Int64') > col2 = pd.Series([1, 3, 2, 1]).astype('Int64') > df = pd.DataFrame({'a': col1, 'b': col2}) > feather.write_feather(df, '/tmp/foo') > {code} > Gives following error message: > {code:java} > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 feather.write_feather(df, '/tmp/foo') > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write_feather(df, dest) > 181 writer = FeatherWriter(dest) > 182 try: > --> 183 writer.write(df) > 184 except Exception: > 185 # Try to make sure the resource is closed > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write(self, df) > 92 # TODO(wesm): Remove this length check, see ARROW-1732 > 93 if len(df.columns) > 0: > ---> 94 table = Table.from_pandas(df, preserve_index=False) > 95 for i, name in enumerate(table.schema.names): > 96 col = table[i] > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 551 if nthreads == 1: > 552 arrays = [convert_column(c, f) > --> 553 for c, f in zip(columns_to_convert, convert_fields)] > 554 else: > 555 from concurrent import futures > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in (.0) > 551 if nthreads == 1: > 552 arrays = [convert_column(c, f) > --> 553 for c, f in zip(columns_to_convert, convert_fields)] > 554 else: > 555 from concurrent import futures > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 542 e.args += ("Conversion failed for column {0!s} with type > {1!s}" > 543.format(col.name, col.dtype),) > --> 544 raise e > 545 if not field_nullable and result.null_count > 0: > 546 raise ValueError("Field {} was non-nullable but pandas > column " > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 536 > 537 try: > --> 538 result = pa.array(col, type=type_, from_pandas=True, > safe=safe) > 539 except (pa.ArrowInvalid, > 540 pa.ArrowNotImplementedError, > ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for > column a with type Int64') > {code} > xref: > [https://stackoverflow.com/questions/58571419/exporting-dataframe-with-null-able-int64-from-pandas-to-r] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7002) Support pandas nullable integer type Int64
[ https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960902#comment-16960902 ] Joris Van den Bossche edited comment on ARROW-7002 at 10/28/19 10:18 AM: - Writing is already supported with pandas master and latest arrow (v0.15), so it is waiting on the next pandas release to have it in a stable version. {code} In [1]: from pyarrow import feather ...: import pandas as pd ...: ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') ...: ...: df = pd.DataFrame({'a': col1, 'b': col2}) ...: ...: feather.write_feather(df, '/tmp/foo') ...: In [2]: pd.read_feather('/tmp/foo') Out[2]: a b 0 0.0 1 1 NaN 3 2 1.0 2 3 23.0 1 {code} So converting to R should work properly. Reading it back in with Python will still give you a float array (if there were NaNs), as that is the default conversion of arrow integer to pandas. There is work going on to also preserve those specific pandas types in that case (see ARROW-2428). was (Author: jorisvandenbossche): Writing is already supported with pandas master and latest arrow (0.15), so it is waiting on the next pandas release to have it in a stable version. {code} In [1]: from pyarrow import feather ...: import pandas as pd ...: ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') ...: ...: df = pd.DataFrame({'a': col1, 'b': col2}) ...: ...: feather.write_feather(df, '/tmp/foo') ...: In [2]: pd.read_feather('/tmp/foo') Out[2]: a b 0 0.0 1 1 NaN 3 2 1.0 2 3 23.0 1 {code} Reading it back in will still give you a float array (if there were NaNs), as that is the default conversion of arrow integer to pandas. There is work going on to also preserve those specific pandas types in that case (see ARROW-2428). > Support pandas nullable integer type Int64 > -- > > Key: ARROW-7002 > URL: https://issues.apache.org/jira/browse/ARROW-7002 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Christian Roth >Priority: Major > > Pandas has a nullable integer type Int64 which does not seem to be supported > by feather yet. 
> {code:python} > from pyarrow import feather > import pandas as pd > col1 = pd.Series([0, None, 1, 23]).astype('Int64') > col2 = pd.Series([1, 3, 2, 1]).astype('Int64') > df = pd.DataFrame({'a': col1, 'b': col2}) > feather.write_feather(df, '/tmp/foo') > {code} > Gives following error message: > {code:java} > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 feather.write_feather(df, '/tmp/foo') > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write_feather(df, dest) > 181 writer = FeatherWriter(dest) > 182 try: > --> 183 writer.write(df) > 184 except Exception: > 185 # Try to make sure the resource is closed > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write(self, df) > 92 # TODO(wesm): Remove this length check, see ARROW-1732 > 93 if len(df.columns) > 0: > ---> 94 table = Table.from_pandas(df, preserve_index=False) > 95 for i, name in enumerate(table.schema.names): > 96 col = table[i] > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 551 if nthreads == 1: > 552 arrays = [convert_column(c, f) > --> 553 for c, f in zip(columns_to_convert, convert_fields)] > 554 else: > 555 from concurrent import futures >
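Until the read side also restores the pandas extension type (ARROW-2428), the float columns that come back can be cast to the nullable dtype explicitly. A hedged sketch, assuming a pandas version whose {{pd.array}} accepts floats with missing values for the {{Int64}} dtype (recent releases do) and the {{/tmp/foo}} file written in the example above:

{code:python}
# Round-trip sketch for the read-back limitation described above.
import pandas as pd

df = pd.read_feather('/tmp/foo')
df['a'] = pd.array(df['a'], dtype='Int64')  # NaN entries become missing values again
print(df.dtypes)
{code}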
[jira] [Commented] (ARROW-7002) Support pandas nullable integer type Int64
[ https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960902#comment-16960902 ] Joris Van den Bossche commented on ARROW-7002: -- Writing is already supported with pandas master and latest arrow (0.15), so it is waiting on the next pandas release to have it in a stable version. {code} In [1]: from pyarrow import feather ...: import pandas as pd ...: ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') ...: ...: df = pd.DataFrame({'a': col1, 'b': col2}) ...: ...: feather.write_feather(df, '/tmp/foo') ...: In [2]: pd.read_feather('/tmp/foo') Out[2]: a b 0 0.0 1 1 NaN 3 2 1.0 2 3 23.0 1 {code} Reading it back in will still give you a float array (if there were NaNs), as that is the default conversion of arrow integer to pandas. There is work going on to also preserve those specific pandas types in that case (see ARROW-2428). > Support pandas nullable integer type Int64 > -- > > Key: ARROW-7002 > URL: https://issues.apache.org/jira/browse/ARROW-7002 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Christian Roth >Priority: Major > > Pandas has a nullable integer type Int64 which does not seem to be supported > by feather yet. > {code:python} > from pyarrow import feather > import pandas as pd > col1 = pd.Series([0, None, 1, 23]).astype('Int64') > col2 = pd.Series([1, 3, 2, 1]).astype('Int64') > df = pd.DataFrame({'a': col1, 'b': col2}) > feather.write_feather(df, '/tmp/foo') > {code} > Gives following error message: > {code:java} > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 feather.write_feather(df, '/tmp/foo') > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write_feather(df, dest) > 181 writer = FeatherWriter(dest) > 182 try: > --> 183 writer.write(df) > 184 except Exception: > 185 # Try to make sure the resource is closed > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in > write(self, df) > 92 # TODO(wesm): Remove this length check, see ARROW-1732 > 93 if len(df.columns) > 0: > ---> 94 table = Table.from_pandas(df, preserve_index=False) > 95 for i, name in enumerate(table.schema.names): > 96 col = table[i] > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 551 if nthreads == 1: > 552 arrays = [convert_column(c, f) > --> 553 for c, f in zip(columns_to_convert, convert_fields)] > 554 else: > 555 from concurrent import futures > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in (.0) > 551 if nthreads == 1: > 552 arrays = [convert_column(c, f) > --> 553 for c, f in zip(columns_to_convert, convert_fields)] > 554 else: > 555 from concurrent import futures > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 542 e.args += ("Conversion failed for column {0!s} with type > {1!s}" > 543.format(col.name, col.dtype),) > --> 544 raise e > 545 if not field_nullable and result.null_count > 0: > 546 raise ValueError("Field {} was non-nullable but pandas > column " > ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 536 > 537 try: > --> 538 result = pa.array(col, type=type_, from_pandas=True, > safe=safe) > 539 except (pa.ArrowInvalid, 
> 540 pa.ArrowNotImplementedError, > ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for > column a with type Int64') > {code} > xref: > [https://stackoverflow.com/questions/58571419/exporting-dataframe-with-null-able-int64-from-pandas-to-r] > -- This message was sent by Atlassian Jira
[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960896#comment-16960896 ] Casey commented on ARROW-6985: -- So it sounds like this is just a known use case where parquet is not well suited. For my own knowledge, why exactly is the heap fragmenting? Shouldn't the heap allocation just grab the same memory that was used in the previous iteration? Anyway, happy to have the issue closed as not needed and I'll restructure our data to work within these limitations. > [Python] Steadily increasing time to load file using read_parquet > - > > Key: ARROW-6985 > URL: https://issues.apache.org/jira/browse/ARROW-6985 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0, 0.15.0 >Reporter: Casey >Priority: Minor > Attachments: image-2019-10-25-14-52-46-165.png, > image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png > > > I've noticed that reading from parquet using pandas read_parquet function is > taking steadily longer with each invocation. I've seen the other ticket about > memory usage but I'm seeing no memory impact just steadily increasing read > time until I restart the python session. > Below is some code to reproduce my results. I notice it's particularly bad on > wide matrices, especially using pyarrow==0.15.0 > {code:python} > import pyarrow.parquet as pq > import pyarrow as pa > import pandas as pd > import os > import numpy as np > import time > file = "skinny_matrix.pq" > if not os.path.isfile(file): > mat = np.zeros((6000, 26000)) > mat.ravel()[::100] = np.random.randn(60 * 26000) > df = pd.DataFrame(mat.T) > table = pa.Table.from_pandas(df) > pq.write_table(table, file) > n_timings = 50 > timings = np.empty(n_timings) > for i in range(n_timings): > start = time.time() > new_df = pd.read_parquet(file) > end = time.time() > timings[i] = end - start > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
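One way to narrow down where the time goes is to time the Arrow-level read and the pandas conversion separately, since the Parquet decoding and the construction of the very wide DataFrame stress the allocator differently. A diagnostic sketch, assuming the {{skinny_matrix.pq}} file from the report already exists:

{code:python}
# Diagnostic sketch: separate the Parquet read from the DataFrame conversion.
import time
import pyarrow.parquet as pq

for i in range(10):
    t0 = time.time()
    table = pq.read_table("skinny_matrix.pq")
    t1 = time.time()
    df = table.to_pandas()
    t2 = time.time()
    print("read_table: %.2fs  to_pandas: %.2fs" % (t1 - t0, t2 - t1))
{code}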
[jira] [Updated] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache
[ https://issues.apache.org/jira/browse/ARROW-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7004: -- Labels: pull-request-available (was: ) > [Plasma] Make it possible to bump up object in LRU cache > > > Key: ARROW-7004 > URL: https://issues.apache.org/jira/browse/ARROW-7004 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Major > Labels: pull-request-available > > To avoid evicting objects too early, we sometimes want to bump up a number of > objects up in the LRU cache. While it would be possible to call Get() on > these objects, this can be undesirable, since it is blocking on the objects > if they are not available and will make it necessary to call Release() on > them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache
Philipp Moritz created ARROW-7004: - Summary: [Plasma] Make it possible to bump up object in LRU cache Key: ARROW-7004 URL: https://issues.apache.org/jira/browse/ARROW-7004 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Philipp Moritz Assignee: Philipp Moritz To avoid evicting objects too early, we sometimes want to bump a number of objects up in the LRU cache. While it would be possible to call Get() on these objects, this can be undesirable, since it blocks on the objects if they are not available and makes it necessary to call Release() on them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
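The Get()/Release() workaround the description alludes to looks roughly like the following from Python; the exact calls are assumptions based on the {{pyarrow.plasma}} client API, with {{/tmp/plasma}} and the object id as placeholders. Fetching the buffers touches the objects (keeping them warm in the LRU cache), but it pins them until the buffers are dropped, which is what the proposed bump API would avoid.

{code:python}
# Sketch of the existing workaround, not the proposed bump API.
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")
object_ids = [plasma.ObjectID(20 * b"a")]               # placeholder id
buffers = client.get_buffers(object_ids, timeout_ms=0)  # returns None entries if absent
del buffers  # dropping the buffers lets the store evict the objects again
{code}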
[jira] [Updated] (ARROW-4223) [Python] Support scipy.sparse integration
[ https://issues.apache.org/jira/browse/ARROW-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-4223: -- Fix Version/s: 1.0.0 > [Python] Support scipy.sparse integration > - > > Key: ARROW-4223 > URL: https://issues.apache.org/jira/browse/ARROW-4223 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Kenta Murata >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available, sparse > Fix For: 1.0.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > > It would be great to support integration with scipy.sparse. -- This message was sent by Atlassian Jira (v8.3.4#803005)