[jira] [Created] (ARROW-8580) Pyarrow exceptions are not helpful
Soroush Radpour created ARROW-8580: -- Summary: Pyarrow exceptions are not helpful Key: ARROW-8580 URL: https://issues.apache.org/jira/browse/ARROW-8580 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Soroush Radpour I'm trying to understand an exception raised by code that uses pyarrow, and the message is not very helpful. {{ File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: IOError: b'Service Unavailable'. Detail: Python exception: RuntimeError}} It would be great if each of the three exceptions were unwrapped, with the full stack trace and error message that came with it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
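The unwrapping the reporter asks for is essentially Python's exception chaining. A minimal sketch of the requested behavior (not pyarrow's actual internals; the function names here are invented for illustration):

```python
# Sketch: wrap a low-level error while keeping the original exception
# and its traceback attached via `raise ... from`.

def low_level_read():
    # Stand-in for whatever raised the underlying RuntimeError.
    raise RuntimeError("Service Unavailable")

def open_parquet():
    try:
        low_level_read()
    except RuntimeError as exc:
        # Chaining stores the inner exception as __cause__, so both
        # tracebacks are printed, instead of a single flattened message
        # like "OSError: IOError: b'Service Unavailable'".
        raise OSError("failed to open Parquet file") from exc

try:
    open_parquet()
except OSError as err:
    print(type(err.__cause__).__name__)  # the wrapped RuntimeError survives
```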
[jira] [Resolved] (ARROW-8473) [Rust] "Statistics support" in rust/parquet readme is incorrect
[ https://issues.apache.org/jira/browse/ARROW-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved ARROW-8473. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6951 [https://github.com/apache/arrow/pull/6951] > [Rust] "Statistics support" in rust/parquet readme is incorrect > --- > > Key: ARROW-8473 > URL: https://issues.apache.org/jira/browse/ARROW-8473 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Krzysztof Stanisławek >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Statistics are not actually supported in the Rust implementation of Parquet. See > [https://github.com/apache/arrow/blob/3e3712a14a3242d70145fb9d3d6f0f4b8c374e68/rust/parquet/src/column/writer.rs#L522] > or similar lines in this file, or writer.rs. > https://github.com/apache/arrow/pull/6951
[jira] [Created] (ARROW-8579) [C++] AVX512 part for SIMD operations of DecodeSpaced/EncodeSpaced
Frank Du created ARROW-8579: --- Summary: [C++] AVX512 part for SIMD operations of DecodeSpaced/EncodeSpaced Key: ARROW-8579 URL: https://issues.apache.org/jira/browse/ARROW-8579 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Frank Du Assignee: Frank Du As part of https://issues.apache.org/jira/browse/PARQUET-1841, an AVX512 path was identified with the help of the mask_compress_/mask_expand_ APIs. This Jira covers the spaced benchmark, unit tests, the AVX512 path, and other groundwork for further potential SIMD opportunities with SSE/AVX2.
[jira] [Assigned] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-8577: --- Assignee: Kouhei Sutou > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Assignee: Kouhei Sutou >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091124#comment-17091124 ] Kouhei Sutou commented on ARROW-8577: - Could you show a program that reproduces this problem? > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091125#comment-17091125 ] Wes McKinney commented on ARROW-8578: - Hm, I rebooted my laptop and it works now, so the warning above may be a red herring. It's curious that something would go wrong with my networking. > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Updated] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8577: Summary: [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device (was: [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device) > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Updated] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8577: Component/s: GLib > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091119#comment-17091119 ] Wes McKinney commented on ARROW-8578: - The executable appears to just hang, which is also a bad failure mode. > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091115#comment-17091115 ] Wes McKinney commented on ARROW-8578: - [~lidavidm] do you know what this is about? > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Created] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
Wes McKinney created ARROW-8578: --- Summary: [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system" Key: ARROW-8578 URL: https://issues.apache.org/jira/browse/ARROW-8578 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Wes McKinney Fix For: 1.0.0 Tried compiling and running this today (with grpc 1.28.1) {code} $ release/arrow-flight-benchmark Using standalone server: false Server running with pid 22385 Testing method: DoGet Server host: localhost Server port: 31337 E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT unavailable on compiling system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} Server host: localhost {code} my Linux kernel {code} $ uname -a Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux {code}
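The gRPC message distinguishes the *compiling* system from the running one: SO_REUSEPORT support is detected at build time, so a binary built without it warns even on a kernel that supports the option. A quick runtime probe for whether the running system accepts SO_REUSEPORT (a standalone sketch, unrelated to Arrow's own code):

```python
import socket

def so_reuseport_available() -> bool:
    """Return True if the running kernel accepts SO_REUSEPORT."""
    if not hasattr(socket, "SO_REUSEPORT"):
        # Constant absent: the Python build itself lacks SO_REUSEPORT.
        return False
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        return True
    except OSError:
        # Constant exists but the kernel rejects it.
        return False
    finally:
        sock.close()

print(so_reuseport_available())
```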
[jira] [Created] (ARROW-8577) [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
Tanveer created ARROW-8577: -- Summary: [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device Key: ARROW-8577 URL: https://issues.apache.org/jira/browse/ARROW-8577 Project: Apache Arrow Issue Type: Bug Reporter: Tanveer Hi all, Previously, I was using the c_glib Plasma library (build 0.12) for creating plasma objects. It was working as expected. But now I want to use Arrow's newest build. I encountered the following error: /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: IOError: Cuda error 100 in function 'cuInit': [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected I think the plasma client options (gplasma_client_options_new()) which I am using with default settings are enabling a check for my CUDA device, and I have no CUDA device attached to my system. How can I disable this check? Any help will be highly appreciated. Thanks
[jira] [Created] (ARROW-8576) [Rust] Implement ArrayEqual for UnionArray
Paddy Horan created ARROW-8576: -- Summary: [Rust] Implement ArrayEqual for UnionArray Key: ARROW-8576 URL: https://issues.apache.org/jira/browse/ARROW-8576 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan
[jira] [Resolved] (ARROW-8516) [Rust] Slow BufferBuilder inserts within PrimitiveBuilder::append_slice
[ https://issues.apache.org/jira/browse/ARROW-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8516. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6980 [https://github.com/apache/arrow/pull/6980] > [Rust] Slow BufferBuilder inserts within > PrimitiveBuilder::append_slice > > > Key: ARROW-8516 > URL: https://issues.apache.org/jira/browse/ARROW-8516 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Trivial > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > BufferBuilder<BooleanType>::append_slice is called by PrimitiveBuilder::append_slice with a > constructed vector of true values. > Even in release builds the associated allocations and > iterations are not optimised out, resulting in a third of the time to parse a > parquet file containing single integers being spent in > PrimitiveBuilder::append_slice. > This PR adds an append_n method to the BufferBuilderTrait that > allows this to be handled more efficiently. My rather unscientific testing > shows it to halve the amount of time spent in this method, yielding an ~20% > speedup for my particular workload.
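The inefficiency described above — materializing a temporary vector of identical values only to copy it element by element — versus the proposed `append_n`-style bulk write can be sketched outside Rust. A hypothetical Python analogy (the function names mirror the Rust methods but are invented for illustration):

```python
# Element-wise path (analogous to append_slice with a temporary vector):
# every value is pushed individually.
def append_slice(buf: bytearray, values) -> None:
    for v in values:
        buf.append(v)

# Bulk path (analogous to the proposed append_n): write n copies of one
# value in a single operation, skipping the temporary vector and the
# per-element loop.
def append_n(buf: bytearray, n: int, value: int) -> None:
    buf.extend(bytes([value]) * n)

a, b = bytearray(), bytearray()
append_slice(a, [1] * 1000)   # allocates the temporary [1, 1, ..., 1]
append_n(b, 1000, 1)          # no temporary, one bulk extend
assert a == b                 # same result, fewer allocations
```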
[jira] [Resolved] (ARROW-8552) [Rust] support column iteration for parquet row
[ https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8552. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7009 [https://github.com/apache/arrow/pull/7009] > [Rust] support column iteration for parquet row > --- > > Key: ARROW-8552 > URL: https://issues.apache.org/jira/browse/ARROW-8552 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > It would be useful to be able to iterate through all the columns in a parquet > row.
[jira] [Assigned] (ARROW-8552) [Rust] support column iteration for parquet row
[ https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan reassigned ARROW-8552: -- Assignee: QP Hou > [Rust] support column iteration for parquet row > --- > > Key: ARROW-8552 > URL: https://issues.apache.org/jira/browse/ARROW-8552 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It would be useful to be able to iterate through all the columns in a parquet > row.
[jira] [Resolved] (ARROW-8541) [Release] Don't remove previous source releases automatically
[ https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8541. - Resolution: Fixed Issue resolved by pull request 6998 [https://github.com/apache/arrow/pull/6998] > [Release] Don't remove previous source releases automatically > - > > Key: ARROW-8541 > URL: https://issues.apache.org/jira/browse/ARROW-8541 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We should keep at least the last three source tarballs.
[jira] [Updated] (ARROW-8575) [Developer] Add issue_comment workflow to rebase a PR
[ https://issues.apache.org/jira/browse/ARROW-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8575: -- Labels: pull-request-available (was: ) > [Developer] Add issue_comment workflow to rebase a PR > - > > Key: ARROW-8575 > URL: https://issues.apache.org/jira/browse/ARROW-8575 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Created] (ARROW-8575) [Developer] Add issue_comment workflow to rebase a PR
Neal Richardson created ARROW-8575: -- Summary: [Developer] Add issue_comment workflow to rebase a PR Key: ARROW-8575 URL: https://issues.apache.org/jira/browse/ARROW-8575 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Neal Richardson Assignee: Neal Richardson
[jira] [Resolved] (ARROW-8564) [Website] Add Ubuntu 20.04 LTS to supported package list
[ https://issues.apache.org/jira/browse/ARROW-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8564. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 56 [https://github.com/apache/arrow-site/pull/56] > [Website] Add Ubuntu 20.04 LTS to supported package list > > > Key: ARROW-8564 > URL: https://issues.apache.org/jira/browse/ARROW-8564 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Resolved] (ARROW-7950) [Python] When initializing pandas API shim, inform user if their installed pandas version is too old
[ https://issues.apache.org/jira/browse/ARROW-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7950. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6992 [https://github.com/apache/arrow/pull/6992] > [Python] When initializing pandas API shim, inform user if their installed > pandas version is too old > > > Key: ARROW-7950 > URL: https://issues.apache.org/jira/browse/ARROW-7950 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h >
[jira] [Updated] (ARROW-8572) [Python] Expose UnionArray.array and other fields
[ https://issues.apache.org/jira/browse/ARROW-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8572: -- Labels: pull-request-available (was: ) > [Python] Expose UnionArray.array and other fields > - > > Key: ARROW-8572 > URL: https://issues.apache.org/jira/browse/ARROW-8572 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently in Python, you can construct a UnionArray easily, but getting the > data back out (without copying) is near-impossible. We should expose the > getter for UnionArray.array so we can pull out the constituent arrays. We > should also expose fields like mode while we're at it. > The use case is: in Flight, we'd like to write multiple distinct datasets > (with distinct schemas) in a single logical call; using UnionArrays lets us > combine these datasets into a single logical dataset.
[jira] [Updated] (ARROW-7391) [Python] Remove unnecessary classes from the binding layer
[ https://issues.apache.org/jira/browse/ARROW-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7391: -- Labels: dataset pull-request-available (was: dataset) > [Python] Remove unnecessary classes from the binding layer > -- > > Key: ARROW-7391 > URL: https://issues.apache.org/jira/browse/ARROW-7391 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Several Python classes introduced by > https://github.com/apache/arrow/pull/5237 are unnecessary and can be removed > in favor of simple functions which produce opaque pointers, including the > PartitionScheme and Expression classes. These should be removed to reduce > cognitive overhead of the Python datasets API and to loosen coupling between > Python and C++.
[jira] [Commented] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090916#comment-17090916 ] Neal Richardson commented on ARROW-8566: Great, thanks for debugging with me. I created https://github.com/sparklyr/sparklyr/issues/2439 because I think the current {{arrow}} behavior is correct (certainly the 0.16 behavior was not correct, unless you happen to live in UTC) so this might need to be worked around in {{sparklyr}}. > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > ``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted.
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at >
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandl
[jira] [Updated] (ARROW-2260) [C++][Plasma] plasma_store should show usage
[ https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2260: -- Labels: pull-request-available (was: ) > [C++][Plasma] plasma_store should show usage > > > Key: ARROW-2260 > URL: https://issues.apache.org/jira/browse/ARROW-2260 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently the options exposed by the {{plasma_store}} executable aren't very > discoverable: > {code:bash} > $ plasma_store -h > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store --help > plasma_store: invalid option -- '-' > {code}
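The behavior being requested — `-h`/`--help` printing a proper usage summary instead of an error — is what standard option parsers provide out of the box. The actual fix belongs in the C++ `plasma_store` binary; as a sketch of the desired interface, here is a Python `argparse` analogy (only the `-s` switch comes from the report; the other option names and help strings are invented for illustration):

```python
import argparse

# Illustrative sketch of a discoverable plasma_store-like CLI.
parser = argparse.ArgumentParser(
    prog="plasma_store",
    description="Shared-memory object store (illustrative sketch).",
)
parser.add_argument("-s", "--socket", required=True,
                    help="socket path for incoming connections")
parser.add_argument("-m", "--memory", type=int, default=0,
                    help="memory limit in bytes")

# argparse generates -h/--help automatically, so `plasma_store --help`
# prints a usage summary instead of "invalid option -- '-'".
args = parser.parse_args(["-s", "/tmp/plasma"])
print(args.socket)
```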
[jira] [Updated] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-8574: Description: Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. (was: Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. If no objection arises, I would like to implement Debug for major plain structs around.) > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no plain type (like RecordBatch) or any array implementation > implements Debug, so peeking into columns and looking at metadata is quite > cumbersome.
[jira] [Created] (ARROW-8574) [Rust] Implement Debug for all plain types
Mahmut Bulut created ARROW-8574: --- Summary: [Rust] Implement Debug for all plain types Key: ARROW-8574 URL: https://issues.apache.org/jira/browse/ARROW-8574 Project: Apache Arrow Issue Type: Improvement Reporter: Mahmut Bulut Assignee: Mahmut Bulut Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. If no objection arises, I would like to implement Debug for major plain structs around.
[jira] [Updated] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-8574: Component/s: Rust > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no type plain type (like RecordBatch) or any array implementation > implements debug. So peeking into columns and looking to metadata quite a bit > cumbersome. > > If no objection arises, I would like to implement Debug for major plain > structs around. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090865#comment-17090865 ] Mahmut Bulut commented on ARROW-8574: - If no objection arises, I would like to implement Debug for major plain structs around. > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no type plain type (like RecordBatch) or any array implementation > implements debug. So peeking into columns and looking to metadata quite a bit > cumbersome. -- This message was sent by Atlassian Jira (v8.3.4#803005)
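The ARROW-8574 request is Rust-specific (deriving or hand-writing `std::fmt::Debug` for `RecordBatch` and the array types), but the underlying idea is the same as Python's `__repr__`: give every container a readable dump so you can peek into columns without manual field access. A toy sketch, with a hypothetical `ToyRecordBatch` standing in for the real struct:

```python
class ToyRecordBatch:
    """Minimal stand-in for a RecordBatch: named columns of equal length."""
    def __init__(self, schema, columns):
        self.schema = list(schema)      # column names
        self.columns = list(columns)    # one list of values per name

    def __repr__(self):
        # The kind of output a Debug/repr impl makes cheap to get:
        # row count plus a preview of each column.
        cols = ", ".join(
            f"{name}: {col[:3]}{'...' if len(col) > 3 else ''}"
            for name, col in zip(self.schema, self.columns))
        return f"ToyRecordBatch(rows={len(self.columns[0])}, {cols})"

batch = ToyRecordBatch(["id", "score"], [[1, 2, 3, 4], [0.5, 0.9, 0.1, 0.7]])
print(repr(batch))
```

In Rust the equivalent is `#[derive(Debug)]` where all fields are themselves `Debug`, or a manual `impl fmt::Debug` when a truncated preview like the one above is preferable to dumping every buffer.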
[jira] [Updated] (ARROW-8573) [Rust] Upgrade to Rust 1.44 nightly
[ https://issues.apache.org/jira/browse/ARROW-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8573: -- Labels: pull-request-available (was: ) > [Rust] Upgrade to Rust 1.44 nightly > --- > > Key: ARROW-8573 > URL: https://issues.apache.org/jira/browse/ARROW-8573 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Rust 1.43.0 was just released, so we should update to 1.44 nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8573) [Rust] Upgrade to Rust 1.44 nightly
Andy Grove created ARROW-8573: - Summary: [Rust] Upgrade to Rust 1.44 nightly Key: ARROW-8573 URL: https://issues.apache.org/jira/browse/ARROW-8573 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 1.0.0 Rust 1.43.0 was just released, so we should update to 1.44 nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8559) [Rust] Consolidate Record Batch iterator traits in main arrow crate
[ https://issues.apache.org/jira/browse/ARROW-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090863#comment-17090863 ] Mahmut Bulut commented on ARROW-8559: - This looks good. I am perfectly ok with this. > [Rust] Consolidate Record Batch iterator traits in main arrow crate > --- > > Key: ARROW-8559 > URL: https://issues.apache.org/jira/browse/ARROW-8559 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > > We have the `BatchIterator` trait in DataFusion and the `RecordBatchReader` > trait in the main arrow crate. > They differ in that `BatchIterator` is Send + Sync. They should both be in > the Arrow crate and be named `BatchIterator` and `SendableBatchIterator` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090861#comment-17090861 ] Curt Bergmann commented on ARROW-8566: -- Assigning the timezone solved the problem. Even though Sys.time() when printed shows a time zone, apparently the tzone attribute is not set. When I set it then I have success writing to the file. The column type that gets created also comes back as a posixct. Following is a run that shows the failure followed by success. To avoid re-showing the long java trace I just print "Failed" for when it fails. Thank you! library(DBI) library(sparklyr) library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp sc <- spark_connect(master = "local") sparklyr::spark_version(sc) #> [1] '2.4.4' Sys.timezone() #> [1] "America/Chicago" x <- data.frame(y = Sys.time()) x$y #> [1] "2020-04-23 14:17:18 CDT" lubridate::tz(x$y) #> [1] "" tryCatch(dbWriteTable(sc, "test_posixct", x), error = function(e) print("Failed")) #> [1] "Failed" attr(x$y, "tzone") <- Sys.timezone() x$y #> [1] "2020-04-23 14:17:18 CDT" lubridate::tz(x$y) #> [1] "America/Chicago" dbWriteTable(sc, "test_posixct", x) result_df <- dbReadTable(sc, "test_posixct") lubridate::tz(x$y) #> [1] "America/Chicago" > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = 
"local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. > #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[jira] [Created] (ARROW-8572) [Python] Expose UnionArray.array and other fields
David Li created ARROW-8572: --- Summary: [Python] Expose UnionArray.array and other fields Key: ARROW-8572 URL: https://issues.apache.org/jira/browse/ARROW-8572 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.17.0 Reporter: David Li Assignee: David Li Currently in Python, you can construct a UnionArray easily, but getting the data back out (without copying) is near-impossible. We should expose the getter for UnionArray.array so we can pull out the constituent arrays. We should also expose fields like mode while we're at it. The use case is: in Flight, we'd like to write multiple distinct datasets (with distinct schemas) in a single logical call; using UnionArrays lets us combine these datasets into a single logical dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
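ARROW-8572 asks to expose in Python getters that already exist in C++. As background on why those getters matter, here is a sketch of the dense union layout: per-slot type ids select a child array, and offsets index into that child. The class and field names below are illustrative, not pyarrow's actual API.

```python
class ToyDenseUnion:
    """Dense union sketch: type_ids pick a child array, offsets index into it."""
    def __init__(self, type_ids, offsets, children):
        self.type_ids = type_ids    # which child each slot comes from
        self.offsets = offsets      # position within that child
        self.children = children    # the constituent arrays

    def value(self, i):
        return self.children[self.type_ids[i]][self.offsets[i]]

# Child 0 holds ints, child 1 holds strings; slots interleave the two.
u = ToyDenseUnion(
    type_ids=[0, 1, 0, 1],
    offsets=[0, 0, 1, 1],
    children=[[7, 8], ["a", "b"]],
)
print([u.value(i) for i in range(4)])  # [7, 'a', 8, 'b']
```

Exposing `children` directly, as the ticket proposes for `UnionArray`, is what makes zero-copy extraction possible: the caller takes a reference to an existing child array instead of rebuilding it slot by slot.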
[jira] [Resolved] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
[ https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8508. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7006 [https://github.com/apache/arrow/pull/7006] > [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets > > > Key: ARROW-8508 > URL: https://issues.apache.org/jira/browse/ARROW-8508 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Christian Beilschmidt >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I created an example of storing multi points with Arrow. > # A coordinate consists of two floats (Float64Builder) > # A multi point consists of one or more coordinates (FixedSizeListBuilder) > # A list of multi points consists of multiple multi points (ListBuilder) > This is the corresponding code snippet: > {code:java} > let float_builder = arrow::array::Float64Builder::new(0); > let coordinate_builder = > arrow::array::FixedSizeListBuilder::new(float_builder, 2); > let mut multi_point_builder = > arrow::array::ListBuilder::new(coordinate_builder); > multi_point_builder > .values() > .values() > .append_slice(&[0.0, 0.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[1.0, 1.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder.append(true).unwrap(); // first multi point > multi_point_builder > .values() > .values() > .append_slice(&[2.0, 2.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[3.0, 3.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[4.0, 4.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > 
multi_point_builder.append(true).unwrap(); // second multi point > let multi_point = dbg!(multi_point_builder.finish()); > let first_multi_point_ref = multi_point.value(0); > let first_multi_point: &arrow::array::FixedSizeListArray = > first_multi_point_ref.as_any().downcast_ref().unwrap(); > let coordinates_ref = first_multi_point.values(); > let coordinates: &Float64Array = > coordinates_ref.as_any().downcast_ref().unwrap(); > assert_eq!(coordinates.value_slice(0, 2 * 2), &[0.0, 0.1, 1.0, 1.1]); > let second_multi_point_ref = multi_point.value(1); > let second_multi_point: &arrow::array::FixedSizeListArray = > second_multi_point_ref.as_any().downcast_ref().unwrap(); > let coordinates_ref = second_multi_point.values(); > let coordinates: &Float64Array = > coordinates_ref.as_any().downcast_ref().unwrap(); > assert_eq!(coordinates.value_slice(0, 2 * 3), &[2.0, 2.1, 3.0, 3.1, 4.0, > 4.1]); > {code} > The second assertion fails and the output is {{[0.0, 0.1, 1.0, 1.1, 2.0, > 2.1]}}. > Moreover, the debug output produced from {{dbg!}} confirms this: > {noformat} > [ > FixedSizeListArray<2> > [ > PrimitiveArray > [ > 0.0, > 0.1, > ], > PrimitiveArray > [ > 1.0, > 1.1, > ], > ], > FixedSizeListArray<2> > [ > PrimitiveArray > [ > 0.0, > 0.1, > ], > PrimitiveArray > [ > 1.0, > 1.1, > ], > PrimitiveArray > [ > 2.0, > 2.1, > ], > ], > ]{noformat} > The second list should contain the values 2-4. > > So either I am using the builder wrong or there is a bug with the offsets. I > used {{0.16}} as well as the current {{master}} from GitHub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
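The ARROW-8508 report (resolved by pull request 7006) comes down to offset bookkeeping: the outer `ListBuilder` must record offsets in units of child slots (coordinates), so the second multi point starts at coordinate 2, not 0. A sketch of the intended layout, with plain Python lists standing in for the Arrow buffers:

```python
# Flat storage for List<FixedSizeList<Float64, 2>>:
FIXED_SIZE = 2                                   # floats per coordinate
flat = [0.0, 0.1, 1.0, 1.1,                      # multi point 0: 2 coords
        2.0, 2.1, 3.0, 3.1, 4.0, 4.1]            # multi point 1: 3 coords
list_offsets = [0, 2, 5]                         # in coordinates, not floats

def multi_point(i):
    """Slice out multi point i; offsets count coordinates, so scale by 2."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return flat[start * FIXED_SIZE : end * FIXED_SIZE]

print(multi_point(0))  # [0.0, 0.1, 1.0, 1.1]
print(multi_point(1))  # [2.0, 2.1, 3.0, 3.1, 4.0, 4.1]
```

The buggy output in the report, where the second multi point re-reads coordinates starting from 0, is roughly what you get if the second offset pair fails to advance past the first list's coordinates.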
[jira] [Updated] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8566: --- Summary: [R] error when writing POSIXct to spark (was: Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark) > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
[jira] [Updated] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8566: --- Priority: Major (was: Blocker) > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360
[jira] [Resolved] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
[ https://issues.apache.org/jira/browse/ARROW-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8569. Resolution: Fixed Issue resolved by pull request 7019 [https://github.com/apache/arrow/pull/7019] > [CI] Upgrade xcode version for testing homebrew formulae > > > Key: ARROW-8569 > URL: https://issues.apache.org/jira/browse/ARROW-8569 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > To prevent as many bottles from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-8065. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7000 [https://github.com/apache/arrow/pull/7000] > [C++][Dataset] Untangle Dataset, Fragment and ScanOptions > - > > Key: ARROW-8065 > URL: https://issues.apache.org/jira/browse/ARROW-8065 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently: a fragment is a product of a scan; it is a lazy collection of scan > tasks corresponding to a data source which is logically singular (like a > single file, a single row group, ...). It would be more useful if instead a > fragment were the direct object of a scan; one scans a fragment (or a > collection of fragments): > # Remove {{ScanOptions}} from Fragment's properties and move it into > {{Fragment::Scan}} parameters. > # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an > overload to support predicate pushdown in FileSystemDataset and UnionDataset > {{Dataset::GetFragments(std::shared_ptr predicate)}}. > # Expose lazy accessor to Fragment::physical_schema() > # Consolidate ScanOptions and ScanContext > This will lessen the cognitive dissonance between fragments and files since > fragments will no longer include references to scan properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090811#comment-17090811 ] Scott Wilson commented on ARROW-8199: - Ah. That will be very cool. Thanks for your feedback. I’ll continue with this approach, we’re moving our ML pipeline from python to c++, until yours materializes. On Thu, Apr 23, 2020 at 10:47 AM Wes McKinney (Jira) -- Sent from Gmail Mobile > [C++] Guidance for creating multi-column sort on Table example? > --- > > Key: ARROW-8199 > URL: https://issues.apache.org/jira/browse/ARROW-8199 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.16.0 >Reporter: Scott Wilson >Priority: Minor > Labels: c++, newbie > Attachments: ArrowCsv.cpp > > > I'm just coming up to speed with Arrow and am noticing a dearth of examples > ... maybe I can help here. > I'd like to implement multi-column sorting for Tables and just want to ensure > that I'm not duplicating existing work or proposing a bad design. > My thought was to create a Table-specific version of SortToIndices() where > you can specify the columns and sort order. > Then I'd create Array "views" that use the Indices to remap from the original > Array values to the values in sorted order. (Original data is not sorted, but > could be as a second step.) I noticed some of the array list variants keep > offsets, but didn't see anything that supports remapping per a list of > indices, but this may just be my oversight? > Thanks in advance, Scott -- This message was sent by Atlassian Jira (v8.3.4#803005)
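The proposal in this thread, a Table-level `SortToIndices` plus index-remapping views, can be sketched independently of Arrow: compute one permutation from several key columns, then "take" every column through it. The column names and the `(name, ascending)` sort-spec shape below are invented for illustration, not a proposed API.

```python
def sort_to_indices(columns, sort_keys):
    """Return a permutation sorting row indices by multiple columns.

    columns: dict of name -> list of values (all equal length)
    sort_keys: list of (name, ascending) pairs, highest priority first
    """
    n = len(next(iter(columns.values())))
    indices = list(range(n))
    # Stable sort: apply keys from lowest to highest priority.
    for name, ascending in reversed(sort_keys):
        indices.sort(key=lambda i: columns[name][i], reverse=not ascending)
    return indices

def take(column, indices):
    """Materialize a column in sorted order (the 'view' remap step)."""
    return [column[i] for i in indices]

cols = {"city": ["b", "a", "b", "a"], "pop": [3, 1, 2, 4]}
idx = sort_to_indices(cols, [("city", True), ("pop", False)])
print(take(cols["city"], idx), take(cols["pop"], idx))
# ['a', 'a', 'b', 'b'] [4, 1, 3, 2]
```

Because only `indices` is built, the original columns stay unsorted; a lazy view can remap on access, and materializing via `take` is an optional second step, matching the two-phase design described in the ticket.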
[jira] [Closed] (ARROW-8570) [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows)
[ https://issues.apache.org/jira/browse/ARROW-8570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-8570. - Resolution: Duplicate > [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows) > -- > > Key: ARROW-8570 > URL: https://issues.apache.org/jira/browse/ARROW-8570 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Blocker > > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/32391335/job/ptbl9h9fffu0s5he > {code} >Creating library release\arrow_flight.lib and object > release\arrow_flight.exp > absl_str_format_internal.lib(float_conversion.cc.obj) : error LNK2019: > unresolved external symbol __std_reverse_trivially_swappable_1 referenced in > function "void __cdecl std::_Reverse_unchecked1(char * const,char * > const,struct std::integral_constant)" > (??$_Reverse_unchecked1@PEAD@std@@YAXQEAD0U?$integral_constant@_K$00@0@@Z) > absl_strings.lib(charconv_bigint.cc.obj) : error LNK2001: unresolved external > symbol __std_reverse_trivially_swappable_1 > release\arrow_flight.dll : fatal error LNK1120: 1 unresolved externals > {code} > This is probably an issue with a conda-forge package: > https://github.com/conda-forge/grpc-cpp-feedstock/issues/58 > In the meantime we could pin {{grpc-cpp}} on your CI configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-8571. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7023 [https://github.com/apache/arrow/pull/7023] > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090795#comment-17090795 ] Wes McKinney commented on ARROW-8199: - > Mainly I'd like to know if this looks like the direction you're thinking for > arrow::DataFrame? No, to be honest from a glance it's a different direction from what I've been thinking. My thoughts there actually are for the data frame internally to be a mix of yet-to-be-scanned Datasets (e.g. from CSV or Parquet files), manifest (materialized in-memory) chunked arrays, and unevaluated expressions. Analytics requests are translated into physical query plans to be executed by the to-be-developed query engine. I haven't been able to give this my full attention since writing the design docs last year but I intend to spend a large fraction of my time on it the rest of the year. The reasoning for wanting to push data frame operations into a query engine is to get around the memory use issues and performance problems associated with "eager evaluation" data frame libraries like pandas (for example, a join in pandas materializes the entire joined data frame in memory). There are similar issues around sorting (particularly with the knowledge of what you want to do with the sorted data -- e.g. sort followed by a slice can be executed as a Top-K operation for substantially less memory use). That said, I know a number of people have expressed interest in having STL interface layers in Arrow to the data structures. This would be a valuable thing to contribute to the project. It's not mutually exclusive with the stuff I wrote above but wanted to give some idea of my thinking. > [C++] Guidance for creating multi-column sort on Table example? 
> --- > > Key: ARROW-8199 > URL: https://issues.apache.org/jira/browse/ARROW-8199 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.16.0 >Reporter: Scott Wilson >Priority: Minor > Labels: c++, newbie > Attachments: ArrowCsv.cpp > > > I'm just coming up to speed with Arrow and am noticing a dearth of examples > ... maybe I can help here. > I'd like to implement multi-column sorting for Tables and just want to ensure > that I'm not duplicating existing work or proposing a bad design. > My thought was to create a Table-specific version of SortToIndices() where > you can specify the columns and sort order. > Then I'd create Array "views" that use the Indices to remap from the original > Array values to the values in sorted order. (Original data is not sorted, but > could be as a second step.) I noticed some of the array list variants keep > offsets, but didn't see anything that supports remapping per a list of > indices, but this may just be my oversight? > Thanks in advance, Scott -- This message was sent by Atlassian Jira (v8.3.4#803005)
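Wes's point that "sort followed by a slice can be executed as a Top-K operation for substantially less memory use" can be illustrated with a bounded heap: only k elements are ever retained, versus materializing a full sorted copy of the input.

```python
import heapq

def top_k(values, k):
    """Smallest k values in sorted order, keeping at most k items at once.

    heapq.nsmallest maintains a bounded heap internally, so extra memory
    is O(k) rather than the O(n) of sorted(values)[:k].
    """
    return heapq.nsmallest(k, values)

data = [9, 1, 7, 3, 8, 2, 6]
print(top_k(data, 3))  # [1, 2, 3] -- same result as sorted(data)[:3]
```

A query engine that sees the sort and the slice together can make this substitution automatically, which is exactly the kind of optimization an eager-evaluation library cannot perform.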
[jira] [Updated] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Wilson updated ARROW-8199: Attachment: ArrowCsv.cpp Hi Wes, I hope you and yours are staying healthy in this strange new world! I've taken a stab at creating a DataFrame-like cover for arrow::Table. My first milestone was to see if I could come up with a df.eval()-like representation for single-line transforms -- see the EVAL2 macro. Attached is my code; I'm not quite sure where, if anywhere, I should post it to get your thoughts, so I'm sending this email. (I posted an earlier version on Jira ARROW-602.) Mainly I'd like to know if this looks like the direction you're thinking for arrow::DataFrame? Thanks, Scott

Code, also included as attachment:

// [16 #include directives; the header names were stripped by the mailing-list archive]

using namespace std;
using namespace arrow;

// SBW 2020.04.15 For ArrayCoverRaw::iterator, we can simply use the pointer interface.
// Wes suggests returning std::optional, but sizeof(double) < sizeof(std::optional<double>) and
// it is not a drop-in replacement for T, i.e. optional can't be used in expressions, need optional.value().

// STL container-like cover for arrow::Array.
// Only works for Array types that support raw_values().
template <typename ArrType>
class ArrayCoverRaw {
public:
  using T = typename ArrType::value_type;
  using pointer = T*;
  using const_pointer = const T*;
  using reference = T&;
  using const_reference = const T&;
  // Match size_type to Array offsets rather than using size_t and ptrdiff_t.
  using size_type = int64_t;
  using difference_type = int64_t;
  using iterator = pointer;
  using const_iterator = const_pointer;
  using reverse_iterator = pointer;
  using const_reverse_iterator = const_pointer;

  ArrayCoverRaw(std::shared_ptr<ArrType>& array) : _array(array) {}

  size_type size() const { return _array->length(); }

  // Should non-const versions fail if Array is immutable?
  iterator begin() { return const_cast<pointer>(_array->raw_values()); }
  iterator end() { return const_cast<pointer>(_array->raw_values() + _array->length()); }
  reverse_iterator rbegin() { return const_cast<pointer>(_array->raw_values() + _array->length() - 1); }
  reverse_iterator rend() { return const_cast<pointer>(_array->raw_values() - 1); }
  const_iterator cbegin() const { return _array->raw_values(); }
  const_iterator cend() const { return _array->raw_values() + _array->length(); }
  const_reverse_iterator crbegin() const { return _array->raw_values() + _array->length() - 1; }
  const_reverse_iterator crend() const { return _array->raw_values() - 1; }

  // We could return std::optional to encapsulate IsNull() info, but this would seem to break the expected semantics.
  reference operator[](const difference_type off) {
    assert(_array->data()->is_mutable());
    return *(const_cast<pointer>(_array->raw_values()) + off);
  }
  const_reference operator[](const difference_type off) const { return *(_array->raw_values() + off); }

  // ISSUE: is there an interface for setting IsNull() if the array is mutable?
  bool IsNull(difference_type off) const { return _array->IsNull(off); }

protected:
  std::shared_ptr<ArrType> _array;
};

// TODO: Add ArrayCoverString and iterators, perhaps others.

// Use template on RefType so we can create iterator and const_iterator by changing Value.
// Use class specializations to support Arrays that don't have raw_values().
template <typename CType, typename RefType>
class ChunkedArrayIterator
    : public boost::iterator_facade<ChunkedArrayIterator<CType, RefType>, RefType,
                                    boost::random_access_traversal_tag> {
public:
  using difference_type = int64_t;
  using T = CType;
  using ArrayType = typename CTypeTraits<CType>::ArrayType;
  using pointer = T*;

  explicit ChunkedArrayIterator(std::shared_ptr<ChunkedArray> ch_arr = nullptr, difference_type pos = 0)
      : _ch_arr(ch_arr) { set_position(pos); }

  bool IsNull() const {
    auto arr = _ch_arr->chunk(_chunk_index);
    return arr->IsNull(_current - _first);
  }

private:
  friend class boost::iterator_core_access;

  bool equal(ChunkedArrayIterator const& other) const { return this->_position == other._position; }

  void increment() {
    _position++;
    // Need to move to next chunk?
    if ((_current == _last) && ((_chunk_index + 1) < _ch_arr->num_chunks())) {
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _first;
    } else {
      _current++;
    }
  }

  void decrement() {
    _position--;
    // Need to move to previous chunk?
    if ((_current == _first) && (_chunk_index > 0)) {
      _chunk_index--;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _last;
    } else {
      _current--;
    }
  }

  RefType& dereference() const { return *_current; }

  void advance(difference_type n) {
    _position += n;
    while (n > 0) {
      difference_type max_delta = _last - _current;
      if ((max_delta >= n) || ((_chunk_index +
// [remainder of the message truncated in the archive]
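The trickiest part of the attached ChunkedArrayIterator is the bookkeeping that moves _chunk_index/_current across chunk boundaries. The invariant it maintains is a mapping from a global position to a (chunk, offset) pair, which can be sketched in plain Python (lists standing in for Arrow chunks; not the Arrow API):

```python
def locate(chunks, position):
    # Map a global element position in a chunked sequence to
    # (chunk_index, offset_within_chunk) -- the invariant that
    # increment()/decrement()/advance() maintain incrementally.
    for chunk_index, chunk in enumerate(chunks):
        if position < len(chunk):
            return chunk_index, position
        position -= len(chunk)
    raise IndexError("position out of range")

chunks = [[10, 11], [12], [13, 14, 15]]
# Global positions 0..5 walk through all three chunks in order.
flat = [chunks[c][o] for c, o in (locate(chunks, p) for p in range(6))]
assert flat == [10, 11, 12, 13, 14, 15]
assert locate(chunks, 2) == (1, 0)  # first element of the second chunk
```

The iterator avoids recomputing this scan on every step by caching the current chunk's raw_values() pointer and only hopping chunks when _current hits _first or _last.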
[jira] [Updated] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8571: -- Labels: pull-request-available (was: ) > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn updated ARROW-8571: Description: conda-forge did the switch, so we should follow this. > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
Uwe Korn created ARROW-8571: --- Summary: [C++] Switch AppVeyor image to VS 2017 Key: ARROW-8571 URL: https://issues.apache.org/jira/browse/ARROW-8571 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090772#comment-17090772 ] Neal Richardson commented on ARROW-8566: Hmm. Unfortunately, {{java.lang.UnsupportedOperationException}} doesn't tell me anything about what is unsupported. The only thing about posixt types that changed in the last {{arrow}} release was a fix for ARROW-3543, specifically https://github.com/apache/arrow/commit/507762fa51d17e61f08d36d3626ab8b8df716198. I wonder, does it work if you explicitly set {{tz="GMT"}} on a POSIXct and send that? > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler
[jira] [Created] (ARROW-8570) [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows)
Antoine Pitrou created ARROW-8570: - Summary: [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows) Key: ARROW-8570 URL: https://issues.apache.org/jira/browse/ARROW-8570 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou See e.g. https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/32391335/job/ptbl9h9fffu0s5he {code} Creating library release\arrow_flight.lib and object release\arrow_flight.exp absl_str_format_internal.lib(float_conversion.cc.obj) : error LNK2019: unresolved external symbol __std_reverse_trivially_swappable_1 referenced in function "void __cdecl std::_Reverse_unchecked1(char * const,char * const,struct std::integral_constant)" (??$_Reverse_unchecked1@PEAD@std@@YAXQEAD0U?$integral_constant@_K$00@0@@Z) absl_strings.lib(charconv_bigint.cc.obj) : error LNK2001: unresolved external symbol __std_reverse_trivially_swappable_1 release\arrow_flight.dll : fatal error LNK1120: 1 unresolved externals {code} This is probably an issue with a conda-forge package: https://github.com/conda-forge/grpc-cpp-feedstock/issues/58 In the meantime we could pin {{grpc-cpp}} in our CI configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090742#comment-17090742 ] Mayur Srivastava commented on ARROW-8562: - [~apitrou], [~lidavidm], I've created a PR for this work: [https://github.com/apache/arrow/pull/7022] Please take a look when you get a chance. Thanks, Mayur > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size: Ideal I/O request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3-compatible storage, there are two main metrics: > 1. Seek time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called the Bandwidth-Delay Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connection, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. 
> We define two more config parameters with suggested default values to control > the slice size and to balance the two effects, with the goal of > maximizing net data load performance. > BW_util (ideal bandwidth utilization): > This is the fraction of per-connection bandwidth that should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This is the maximum single request size (in bytes) used to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW, which will then be > passed to the reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
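The closed-form result in the proposal is easy to sanity-check numerically; here is a small plain-Python sketch (the names mirror the description above, not the actual io::CacheOptions API):

```python
def coalescing_params(ttfb_s, bw_bytes_per_s, bw_util=0.9,
                      max_ideal_request_size=64 * 1024 * 1024):
    # max_io_gap: the bandwidth-delay product -- a hole this size is
    # cheaper to read through than to pay another request's TTFB.
    max_io_gap = ttfb_s * bw_bytes_per_s
    # Request size at which effective bandwidth reaches bw_util * BW,
    # capped at MAX_IDEAL_REQUEST_SIZE to preserve parallelism.
    ideal_request_size = min(max_ideal_request_size,
                             max_io_gap * bw_util / (1 - bw_util))
    return max_io_gap, ideal_request_size

# e.g. 50 ms TTFB at 100 MB/s with the suggested defaults:
gap, req = coalescing_params(0.05, 100_000_000)
assert round(gap) == 5_000_000    # coalesce across holes up to ~5 MB
assert round(req) == 45_000_000   # ~45 MB requests reach 90% utilization
```

Plugging req back into eff_BW = req / (TTFB + req / BW) gives 45 MB / 0.5 s = 90 MB/s, i.e. the target 0.9 * BW, which confirms the substitution step in the description.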
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090735#comment-17090735 ] Mayur Srivastava commented on ARROW-8562: - I'm going to send a PR soon > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidt-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connections, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. 
> BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8562: -- Labels: pull-request-available (was: ) > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidt-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connections, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. 
> BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090719#comment-17090719 ] Curt Bergmann commented on ARROW-8566: -- Yes, this fails every time. It is also reproduced on my colleague's machine. The failure is only with a posixct data type. This data type did not fail in version 16. It seems to be associated with this in the traceback: java.lang.UnsupportedOperationException at org.apache.spark.sql.vectorized.ArrowColumnVector.(ArrowColumnVector.java:173) > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandle
[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64
[ https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090700#comment-17090700 ] Antoine Pitrou commented on ARROW-3329: --- Thank you [~jacek.pliszka] for doing this. The PR unexpectedly uncovered two issues: ARROW-8567 and ARROW-8568. > [Python] Error casting decimal(38, 4) to int64 > -- > > Key: ARROW-3329 > URL: https://issues.apache.org/jira/browse/ARROW-3329 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Python version : 3.6.5 > Pyarrow version : 0.10.0 >Reporter: Kavita Sheth >Assignee: Jacek Pliszka >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > GitHub issue link: https://github.com/apache/arrow/issues/2627 > I want to cast a pyarrow table column from decimal(38,4) to int64. > col.cast(pa.int64()) > Error: > File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast > File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, > 4) to int64 > Python version : 3.6.5 > Pyarrow version : 0.10.0 > Is it not implemented yet, or am I not using it correctly? If not implemented > yet, is there any workaround to cast columns? -- This message was sent by Atlassian Jira (v8.3.4#803005)
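For context, the semantics such a cast needs are small enough to sketch in plain Python with the decimal module (illustrative only -- this is not pyarrow's implementation): a decimal(38, 4) value fits in int64 only when its fractional part is zero and the integer part is in range.

```python
from decimal import Decimal

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def decimal_to_int64(value, allow_truncate=False):
    # Mirrors a "safe" decimal -> int64 cast: reject values that would
    # lose the fractional part or overflow a 64-bit integer.
    i = int(value)  # truncates toward zero
    if not allow_truncate and value != i:
        raise ValueError(f"cast would lose data: {value}")
    if not (INT64_MIN <= i <= INT64_MAX):
        raise OverflowError(f"value out of int64 range: {value}")
    return i

assert decimal_to_int64(Decimal("1234.0000")) == 1234
assert decimal_to_int64(Decimal("-7.5000"), allow_truncate=True) == -7
```

With 38 digits of precision, values can exceed the roughly 19 digits an int64 holds, so the overflow check is not optional even when the scale-4 fractional part is zero.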
[jira] [Resolved] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64
[ https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-3329. --- Resolution: Fixed Issue resolved by pull request 6846 [https://github.com/apache/arrow/pull/6846] > [Python] Error casting decimal(38, 4) to int64 > -- > > Key: ARROW-3329 > URL: https://issues.apache.org/jira/browse/ARROW-3329 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Python version : 3.6.5 > Pyarrow version : 0.10.0 >Reporter: Kavita Sheth >Assignee: Jacek Pliszka >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > Git issue LInk : https://github.com/apache/arrow/issues/2627 > I want to cast pyarrow table column from decimal(38,4) to int64. > col.cast(pa.int64()) > Error: > File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast > File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, > 4) to int64 > Python version : 3.6.5 > Pyarrow version : 0.10.0 > is it not implemented yet or I am not using it correctly? If not implemented > yet, then any work around to cast columns? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
[ https://issues.apache.org/jira/browse/ARROW-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8569: -- Labels: pull-request-available (was: ) > [CI] Upgrade xcode version for testing homebrew formulae > > > Key: ARROW-8569 > URL: https://issues.apache.org/jira/browse/ARROW-8569 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > To prevent as many bottles as possible from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
Neal Richardson created ARROW-8569: -- Summary: [CI] Upgrade xcode version for testing homebrew formulae Key: ARROW-8569 URL: https://issues.apache.org/jira/browse/ARROW-8569 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 To prevent as many bottles as possible from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090673#comment-17090673 ] Neal Richardson commented on ARROW-8566: Is this consistently reproducible? Do any other data types cause issues? I can't tell from the spark traceback what is failing exactly. > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > ``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted.
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090664#comment-17090664 ] Remi Dettai commented on ARROW-8565: Yes, but I could not manage to make it work with the static build of arrow, and I don't want to use the shared lib of arrow as it generates a set of binaries that is too large for my use case (embedding into AWS Lambda with optimized cold start) > [C++] Static build with AWS SDK > --- > > Key: ARROW-8565 > URL: https://issues.apache.org/jira/browse/ARROW-8565 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.0 >Reporter: Remi Dettai >Priority: Major > Labels: aws-s3, build-problem > > I can't find my way around the build system when using the S3 client. > It seems that only the shared target is allowed when the S3 feature is ON. In the > thirdparty toolchain, when printing: > ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong > libcrypto"?? > What is actually meant is that the static build will not work, correct? If that is > the case, should libarrow.a be generated at all when the S3 feature is on? > What can be done to fix this? What does it mean that the SDK links to the > wrong libcrypto? Is it fixable? Or is there a way to have the static build > but maintain a dynamic link to a shared version of the SDK? > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8560) [Rust] Docs for MutableBuffer resize are incorrect
[ https://issues.apache.org/jira/browse/ARROW-8560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8560. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7015 [https://github.com/apache/arrow/pull/7015] > [Rust] Docs for MutableBuffer resize are incorrect > -- > > Key: ARROW-8560 > URL: https://issues.apache.org/jira/browse/ARROW-8560 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8532) [C++][CSV] Add support for sentinel values.
[ https://issues.apache.org/jira/browse/ARROW-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090648#comment-17090648 ] Francois Saint-Jacques commented on ARROW-8532: --- Is this a duplicate of ARROW-8348? > [C++][CSV] Add support for sentinel values. > --- > > Key: ARROW-8532 > URL: https://issues.apache.org/jira/browse/ARROW-8532 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ravil Bikbulatov >Priority: Major > > Some systems still use sentinel values to store nulls. It would be good if > read_csv could place sentinel values so the user wouldn't need to convert > null bitmaps to sentinel values. > Adding this support doesn't contradict the Arrow specification, as null values are > undefined. It also wouldn't add any overhead to read_csv. Since Arrow is a > general-purpose framework, I think we can relieve users from the pain of > converting bitmaps to sentinel values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
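[Editor's note: the post-read conversion the reporter wants to avoid can be sketched in plain Python. The helper name is hypothetical, for illustration only; it is user code, not a proposed Arrow API.]

```python
def nulls_to_sentinel(values, sentinel):
    """Replace nulls (represented here as None) with a sentinel value.

    This is the manual step users currently perform after read_csv when
    their downstream system expects sentinel-encoded nulls.
    """
    return [sentinel if v is None else v for v in values]

# A column read from CSV with missing entries:
# nulls_to_sentinel([1, None, 3], -9999) -> [1, -9999, 3]
```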
[jira] [Commented] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls
[ https://issues.apache.org/jira/browse/ARROW-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090647#comment-17090647 ] Francois Saint-Jacques commented on ARROW-8348: --- I'm not proposing this as a format change, just a C++ interface nicety. It could be done with Metadata, but string typing is a pain. > [C++] Support optional sentinel values in primitive Array for nulls > --- > > Key: ARROW-8348 > URL: https://issues.apache.org/jira/browse/ARROW-8348 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > This is an optional feature where a sentinel value is stored in null cells > and is exposed via an accessor method, e.g. `optional > Array::HasSentinel() const;`. This would allow zero-copy bi-directional > conversion with R. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090637#comment-17090637 ] Francois Saint-Jacques commented on ARROW-8565: --- Have you tried with a shared build of aws sdk? > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8568) [C++][Python] Crash on decimal cast in debug mode
Antoine Pitrou created ARROW-8568: - Summary: [C++][Python] Crash on decimal cast in debug mode Key: ARROW-8568 URL: https://issues.apache.org/jira/browse/ARROW-8568 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.0 Reporter: Antoine Pitrou
{code:python}
>>> arr = pa.array([Decimal('123.45')])
>>> arr
[ 123.45 ]
>>> arr.type
Decimal128Type(decimal(5, 2))
>>> arr.cast(pa.decimal128(4, 2))
../src/arrow/util/basic_decimal.cc:626: Check failed: (original_scale) != (new_scale)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
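[Editor's note: the cast in the report reduces precision, not scale (both types have scale 2), which is why the scale-rescale check fires. The underlying fit question can be sketched in stdlib Python; the helper below is a hypothetical illustration, not Arrow's actual implementation.]

```python
from decimal import Decimal

def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Check whether `value` fits a decimal(precision, scale) type exactly."""
    sign, digits, exponent = value.as_tuple()
    if -exponent > scale:
        # more fractional digits than the target scale allows
        return False
    # total digits required once the value is rescaled to `scale`
    needed = len(digits) + (scale + exponent)
    return needed <= precision

# Decimal('123.45') needs 5 digits at scale 2, so it fits decimal(5, 2)
# but not decimal(4, 2), matching the failing cast above.
```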
[jira] [Updated] (ARROW-8563) [Go] Minor change to make newBuilder public
[ https://issues.apache.org/jira/browse/ARROW-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8563: Summary: [Go] Minor change to make newBuilder public (was: Minor change to make newBuilder public) > [Go] Minor change to make newBuilder public > --- > > Key: ARROW-8563 > URL: https://issues.apache.org/jira/browse/ARROW-8563 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Amol Umbarkar >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This minor change makes newBuilder() public to reduce verbosity for > downstream users. > To give you an example, I am working on parquet read / write into an Arrow Record > Batch, where the parquet data types are mapped to Arrow data types. > My Repo: [https://github.com/mindhash/arrow-parquet-go] > In such cases, it would be nice for the builder API (newBuilder) to be generic: > accept a data type and return the corresponding array builder. > I am looking at a similar situation for a JSON reader. I think this change will > make the builder API much easier to use for upstream as well as internal packages. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8532) [C++][CSV] Add support for sentinel values.
[ https://issues.apache.org/jira/browse/ARROW-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090615#comment-17090615 ] Wes McKinney commented on ARROW-8532: - If you want to propose a configuration option to place some non-zero value in the null slots, please feel free to do so. > [C++][CSV] Add support for sentinel values. > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090605#comment-17090605 ] Remi Dettai edited comment on ARROW-8565 at 4/23/20, 1:14 PM: -- That's what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when I try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awful _undefined reference to `Aws::XXX`_ errors :( was (Author: rdettai): That what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when i try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awefull _undefined reference to `Aws::XXX`_ errors :( > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090605#comment-17090605 ] Remi Dettai commented on ARROW-8565: That's what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when I try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awful _undefined reference to `Aws::XXX`_ errors :( > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8567) [Python] pa.array() sometimes ignore "safe=False"
Antoine Pitrou created ARROW-8567: - Summary: [Python] pa.array() sometimes ignores "safe=False" Key: ARROW-8567 URL: https://issues.apache.org/jira/browse/ARROW-8567 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.0 Reporter: Antoine Pitrou Generally, {{pa.array(data).cast(sometype, safe=...)}} is equivalent to {{pa.array(data, sometype, safe=...)}}. Consider the following:
{code:python}
>>> pa.array([Decimal('12.34')]).cast(pa.int32(), safe=False)
[ 12 ]
>>> pa.array([Decimal('12.34')], pa.int32(), safe=False)
[ 12 ]
{code}
However, that is not always the case:
{code:python}
>>> pa.array([Decimal('1234')]).cast(pa.int8(), safe=False)
[ -46 ]
>>> pa.array([Decimal('1234')], pa.int8(), safe=False)
Traceback (most recent call last):
...
ArrowInvalid: Value 1234 too large to fit in C integer type
{code}
I don't think this is very important: first because you can call cast() directly, second because the results are unusable anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005)
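[Editor's note: the [ -46 ] result above is consistent with ordinary two's-complement truncation of 1234 into 8 bits. A stdlib-only sketch of that wrap-around (illustration, not pyarrow code):]

```python
def truncate_to_int8(value: int) -> int:
    """Wrap an integer into the signed 8-bit range, as an unsafe
    narrowing cast does."""
    wrapped = value & 0xFF                       # keep the low 8 bits
    return wrapped - 256 if wrapped >= 128 else wrapped

# 1234 = 0x4D2; the low byte 0xD2 is 210 unsigned, i.e. -46 as a signed int8.
```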
[jira] [Assigned] (ARROW-8497) [Archery] Add missing component to builds
[ https://issues.apache.org/jira/browse/ARROW-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-8497: - Assignee: Francois Saint-Jacques > [Archery] Add missing component to builds > - > > Key: ARROW-8497 > URL: https://issues.apache.org/jira/browse/ARROW-8497 > Project: Apache Arrow > Issue Type: Improvement > Components: Archery, Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8497) [Archery] Add missing component to builds
[ https://issues.apache.org/jira/browse/ARROW-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8497. --- Resolution: Fixed Issue resolved by pull request 6966 [https://github.com/apache/arrow/pull/6966] > [Archery] Add missing component to builds > - > > Key: ARROW-8497 > URL: https://issues.apache.org/jira/browse/ARROW-8497 > Project: Apache Arrow > Issue Type: Improvement > Components: Archery, Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
Curt Bergmann created ARROW-8566: Summary: Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark Key: ARROW-8566 URL: https://issues.apache.org/jira/browse/ARROW-8566 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 0.17.0 Environment: #> R version 3.6.3 (2020-02-29) #> Platform: x86_64-apple-darwin15.6.0 (64-bit) #> Running under: macOS Mojave 10.14.6 sparklyr::spark_version(sc) #> [1] '2.4.5' Reporter: Curt Bergmann ``` r library(DBI) library(sparklyr) library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp sc <- spark_connect(master = "local") sparklyr::spark_version(sc) #> [1] '2.4.5' x <- data.frame(y = Sys.time()) dbWriteTable(sc, "test_posixct", x) #> Error: org.apache.spark.SparkException: Job aborted. #> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) #> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) #> at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) #> at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) #> at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) #> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) #> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) #> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) #> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) #> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) #> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) #> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) #> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) #> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) #> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) #> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) #> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) #> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) #> at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) #> at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) #> at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) #> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) #> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) #> at java.lang.reflect.Method.invoke(Method.java:498) #> at sparklyr.Invoke.invoke(invoke.scala:147) #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) #> at sparklyr.StreamHandler.read(stream.scala:61) #> at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) #> at scala.util.control.Breaks.breakable(Breaks.scala:38) #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) #> 
at sparklyr.BackendHandler.channelRead0(handler.scala:14) #> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) #> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) #> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) #> at io.netty.channel.AbstractChann
[jira] [Commented] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090527#comment-17090527 ] Neville Dipale commented on ARROW-8536: --- Done, please take a look > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Assignee: Neville Dipale >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a "format" directory in the root of your file > system (or at least at a higher level than where cargo is building code) and > place the Flight.proto file there (making sure to use the 0.17.0 version, > which can be found in the source release [1]). > [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
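[Editor's note: the workaround described above can be scripted. A minimal sketch in Python; the helper name and directory layout are placeholders for your own setup, not part of Arrow.]

```python
import shutil
from pathlib import Path

def install_flight_proto(source: Path, parent_dir: Path) -> Path:
    """Copy Flight.proto into <parent_dir>/format so the arrow-flight
    crate's build script can find it by walking up parent directories
    from wherever cargo builds the code."""
    fmt = parent_dir / "format"
    fmt.mkdir(parents=True, exist_ok=True)
    dest = fmt / "Flight.proto"
    shutil.copy(source, dest)
    return dest
```

Make sure `source` points at the 0.17.0 version of Flight.proto from the source release, and that `parent_dir` sits above the cargo build directory.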
[jira] [Assigned] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-8536: - Assignee: Neville Dipale (was: Andy Grove) > [Rust] Failed to locate format/Flight.proto in any parent directory > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8536: -- Labels: pull-request-available (was: ) > [Rust] Failed to locate format/Flight.proto in any parent directory > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090509#comment-17090509 ] Antoine Pitrou commented on ARROW-8562: --- > The proposal is to create a named constructor in the io::CacheOptions (PR: > https://github.com/apache/arrow/pull/6744 created by David Li) to compute > max_io_gap and ideal_request_size from TTFB and BW which will then be passed > to reader to configure the I/O coalescing. This sounds like a good idea in principle. Can you submit a PR? > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connection, i.e. transfer > large amounts of data to amortize the seek overhead. 
> But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. > BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
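[Editor's note: the arithmetic in the description above can be checked numerically. A short sketch with the suggested defaults; variable names follow the description, and this is an illustration, not the io::CacheOptions API.]

```python
def coalescing_params(ttfb_s: float, bandwidth_bps: float,
                      bw_util: float = 0.9,
                      max_ideal_request_size: int = 64 * 1024 * 1024):
    """Derive I/O coalescing parameters from storage metrics.

    max_io_gap:         bandwidth-delay product, TTFB * BW
    ideal_request_size: max_io_gap * BW_util / (1 - BW_util),
                        capped at MAX_IDEAL_REQUEST_SIZE
    """
    max_io_gap = ttfb_s * bandwidth_bps
    ideal_request_size = min(max_ideal_request_size,
                             max_io_gap * bw_util / (1 - bw_util))
    return int(max_io_gap), int(ideal_request_size)

# e.g. TTFB = 50 ms, BW = 100 MB/s: gap = 5 MB, request = 45 MB
```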
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090499#comment-17090499 ] Antoine Pitrou commented on ARROW-8565: --- The solution is to build and install the AWS SDK separately, or to use pre-built binaries. Hopefully CMake will be able to pick them up, if they are installed in the right place. > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090493#comment-17090493 ] Remi Dettai commented on ARROW-8565: Thanks! With my mediocre understanding of CMake, I'm definitely not the man for the job... ;) As a workaround, is it possible to build Arrow statically and only keep a shared dependency on the SDK?
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090463#comment-17090463 ] Antoine Pitrou commented on ARROW-8565: --- As far as I remember, this means that the AWS SDK build procedure will by default compile its own version of OpenSSL. I would say it's probably fixable, but you will have to find out how and submit a PR for it :-)
[jira] [Created] (ARROW-8565) [C++] Static build with AWS SDK
Remi Dettai created ARROW-8565: -- Summary: [C++] Static build with AWS SDK Key: ARROW-8565 URL: https://issues.apache.org/jira/browse/ARROW-8565 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.0 Reporter: Remi Dettai I can't find my way around the build system when using the S3 client. It seems that only the shared target is allowed when the S3 feature is ON. In the third-party toolchain, it prints: ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong libcrypto"?? What is actually meant is that a static build will not work, correct? If that is the case, should libarrow.a be generated at all when the S3 feature is on? What can be done to fix this? What does it mean that the SDK links to the wrong libcrypto? Is it fixable? Or is there a way to have a static build but maintain a dynamic link to a shared version of the SDK?
[jira] [Created] (ARROW-8564) [Website] Add Ubuntu 20.04 LTS to supported package list
Kouhei Sutou created ARROW-8564: --- Summary: [Website] Add Ubuntu 20.04 LTS to supported package list Key: ARROW-8564 URL: https://issues.apache.org/jira/browse/ARROW-8564 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090391#comment-17090391 ] Jacek Pliszka commented on ARROW-8545: -- Int to decimal is not implemented either, though it is much simpler than float to decimal, as no rounding handling is needed (we have no negative scale at the moment). If no one steps up before May, I may have some time then to do it. It is similar to what we did with decimal-to-decimal and decimal-to-int casting: https://issues.apache.org/jira/browse/ARROW-3329 > [Python] Allow fast writing of Decimal column to parquet > > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that fastparquet does not write decimals at > all. However, the writing is *very* slow, by a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column could > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown that object-typed columns are being > converted, which is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. For example, pass an int and then > shift the dot "x" places to the left. > (It is already possible to pass an int > column and specify a "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()`, but then it simply becomes a decimal without > decimals.) Also, it might be nice if it could be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to use a workaround where I use only ints > (the original number multiplied by 1000). > *Snippet* > {code:python} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": >     d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s {code}
[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090385#comment-17090385 ] Joris Van den Bossche commented on ARROW-8545: -- As [~jacek.pliszka] says, there are indeed two different parts to this: 1) converting a pandas DataFrame with decimal objects to an Arrow Table, and 2) writing the Arrow Table to Parquet. From a quick timing, the slowdown you see with decimals is almost entirely due to step 1 (so not related to writing Parquet itself). Using the same dataframe creation as your code above (only using 10x less data to easily fit on my laptop): {code:python} ... df1 = pd.DataFrame(d) # second dataframe with the decimal column df2 = df1.copy() df2["a"] = df2["a"].round(decimals=3).astype(str).map(decimal.Decimal) # convert each of them to a pyarrow.Table table1 = pa.table(df1) table2 = pa.table(df2) {code} Timing the conversion to pyarrow.Table: {code} In [13]: %timeit pa.table(df1) 32 ms ± 7.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) In [14]: %timeit pa.table(df2) 1.54 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} and then timing the writing of the pyarrow.Table to Parquet: {code} In [16]: import pyarrow.parquet as pq In [17]: %timeit pq.write_table(table1, "/tmp/testabc.parquet") 710 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [18]: %timeit pq.write_table(table2, "/tmp/testabc.parquet") 750 ms ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} and timing {{to_parquet()}} more or less gives the sum of the two steps above: {code} In [20]: %timeit df1.to_parquet("/tmp/testabc.pq", engine="pyarrow") 793 ms ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [21]: %timeit df2.to_parquet("/tmp/testabc.pq", engine="pyarrow") 2.01 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} So you can see here that the actual writing to Parquet is only slightly slower, and that the large slowdown comes from converting the Python Decimal objects to a pyarrow decimal column. Of course, when starting from a pandas DataFrame to write to Parquet, this conversion of pandas to pyarrow.Table is part of the overall process. But to improve this, I think the only solution is to _not_ use Python {{decimal.Decimal}} objects in an object-dtype column. Some options for this: * Do the casting to decimal on the pyarrow side. However, as [~jacek.pliszka] linked, this is not yet implemented for floats (ARROW-8557). I am not directly sure if other conversions are possible right now in pyarrow (like converting as ints and converting those to decimal with a factor). * Use a pandas ExtensionDtype to store decimals in a pandas DataFrame differently (not as Python objects). I am not aware of an existing project that already does this (except for Fletcher, which experiments with storing arrow types in pandas dataframes in general). It might be that this Python Decimal object -> pyarrow decimal array conversion is not fully optimized; however, since it involves dealing with a numpy array of Python objects, it will never be as fast as converting a numpy float array to pyarrow.
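The scaled-integer workaround mentioned in the issue (storing the original numbers multiplied by 1000) can be sketched as follows. The column names, scale factor, and file path here are illustrative, not from the original report:

```python
import numpy as np
import pandas as pd

SCALE = 1000  # three decimal places, as in the issue's lat/lon workaround

df = pd.DataFrame({"lat": [52.3702, 48.8566]})
# Store exact scaled integers instead of Decimal objects: the column stays a
# native int64 dtype, so conversion to Arrow and writing to Parquet stay fast,
# and comparisons on the stored values remain exact.
df["lat_scaled"] = (df["lat"] * SCALE).round().astype("int64")
# df[["lat_scaled"]].to_parquet("coords.parquet", engine="pyarrow")
# Divide by SCALE when reading back to recover the decimal values.
```

This trades schema clarity (the consumer must know the scale) for a fast native-dtype path end-to-end.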
[jira] [Commented] (ARROW-8455) [Rust] [Parquet] Arrow column read on partially compatible files
[ https://issues.apache.org/jira/browse/ARROW-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090358#comment-17090358 ] Remi Dettai commented on ARROW-8455: [~csun] can you take a look at this? > [Rust] [Parquet] Arrow column read on partially compatible files > > > Key: ARROW-8455 > URL: https://issues.apache.org/jira/browse/ARROW-8455 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Remi Dettai >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Seen behavior: when reading a Parquet file into Arrow with > `get_record_reader_by_columns`, it fails if one of the columns of the file > is a list (or has any other unsupported type). > Expected behavior: it should only fail if you are actually reading a column > with an unsupported type.