[jira] [Commented] (ARROW-12795) [CI] R-Hub Ubuntu GCC (Docker) can't install bit64
[ https://issues.apache.org/jira/browse/ARROW-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346519#comment-17346519 ] Mauricio 'Pachá' Vargas Sepúlveda commented on ARROW-12795: --- I need to send a PR to R-Hub and fix bit64 installation on the Docker image. > [CI] R-Hub Ubuntu GCC (Docker) can't install bit64 > -- > > Key: ARROW-12795 > URL: https://issues.apache.org/jira/browse/ARROW-12795 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > therefore, the arrow package can't be installed > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=5160&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=200 > the Docker image was updated 16 hours ago, which explains this new problem > https://hub.docker.com/r/rhub/ubuntu-gcc-release -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11581) [Packaging][C++] Formalize distribution through vcpkg
[ https://issues.apache.org/jira/browse/ARROW-11581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346516#comment-17346516 ] Ian Cook commented on ARROW-11581: -- Draft PR at https://github.com/microsoft/vcpkg/pull/17975 > [Packaging][C++] Formalize distribution through vcpkg > - > > Key: ARROW-11581 > URL: https://issues.apache.org/jira/browse/ARROW-11581 > Project: Apache Arrow > Issue Type: Task > Components: C++, Packaging >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > > Currently there is a port of Arrow on vcpkg [1] that is maintained by folks > outside the core Arrow developer community. We should consider formalizing > distribution of Arrow releases through vcpkg, in collaboration with the > existing maintainers of the Arrow vcpkg port if they are interested in > staying involved. > [1] https://github.com/microsoft/vcpkg/tree/master/ports/arrow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12817) [CI] StackOverflow error while building SQL Catalyst
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12817: - Summary: [CI] StackOverflow error while building SQL Catalyst Key: ARROW-12817 URL: https://issues.apache.org/jira/browse/ARROW-12817 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Reporter: Mauricio 'Pachá' Vargas Sepúlveda _Exception when compiling 474 sources to /spark/sql/catalyst/target/scala-2.12/classes_ See https://github.com/ursacomputing/crossbow/runs/2597929901#step:7:9761 This appeared for the first time on 2021-05-13, and before the error we see 'no dependency information available' warnings (https://github.com/ursacomputing/crossbow/runs/2597929901#step:7:9627) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10117) [C++] Implement work-stealing scheduler / multiple queues in ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346470#comment-17346470 ] Weston Pace commented on ARROW-10117: - I'm a little confused by the word "workload" in the second paragraph. Traditional work stealing attempts to keep tasks together based on thread/core to preserve cache coherency. This is what appears to be described in the first paragraph. In the second paragraph are you asking for the capability to also group tasks based on workload? If so, I'm not sure what the benefit would be. If not, I don't think we'll end up needing to modify the API. A task can keep a thread_local reference to its queue. > [C++] Implement work-stealing scheduler / multiple queues in ThreadPool > --- > > Key: ARROW-10117 > URL: https://issues.apache.org/jira/browse/ARROW-10117 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > This involves a change from a single task queue shared amongst all threads to > a per-thread task queue and the ability for idle threads to take tasks from > other threads' queues (work stealing). > As part of this, the task submission API would need to be evolved in some > fashion to allow for tasks related to a particular workload to end up in the > same task queue -- This message was sent by Atlassian Jira (v8.3.4#803005)
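For readers following the discussion above, the single-shared-queue to per-thread-queue change can be sketched compactly. This is an illustrative pure-Python toy, not Arrow's C++ ThreadPool; the class and method names are invented for the sketch. Each worker pops LIFO from its own deque (the cache-warm end) and, when idle, steals FIFO from other workers' deques (the coldest tasks). The `workload_id` parameter shows one possible shape for the "tasks related to a particular workload end up in the same queue" part of the submission API.

```python
import collections
import threading

class WorkStealingPool:
    """Toy work-stealing pool: one deque per worker, idle workers steal."""

    def __init__(self, n_workers=4):
        self.queues = [collections.deque() for _ in range(n_workers)]
        self.lock = threading.Lock()
        self.n_workers = n_workers

    def submit(self, workload_id, task):
        # Tasks sharing a workload_id land in the same queue, so they tend
        # to run on the same thread (the cache-coherency goal).
        with self.lock:
            self.queues[workload_id % self.n_workers].append(task)

    def _next_task(self, worker_id):
        with self.lock:
            if self.queues[worker_id]:
                return self.queues[worker_id].pop()   # own queue: LIFO (hot end)
            for q in self.queues:
                if q:
                    return q.popleft()                # steal: FIFO (cold end)
        return None  # everything drained

    def run(self):
        # Simplification: workers exit as soon as all queues are empty, so
        # this sketch does not support tasks that submit further tasks.
        def worker(worker_id):
            while (task := self._next_task(worker_id)) is not None:
                task()
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(self.n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

A real implementation would use lock-free or per-queue-locked deques rather than one global lock, which exists here only to keep the sketch short.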
[jira] [Resolved] (ARROW-12689) [R] Implement ArrowArrayStream C interface
[ https://issues.apache.org/jira/browse/ARROW-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-12689. - Resolution: Fixed Issue resolved by pull request 10307 [https://github.com/apache/arrow/pull/10307] > [R] Implement ArrowArrayStream C interface > -- > > Key: ARROW-12689 > URL: https://issues.apache.org/jira/browse/ARROW-12689 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > See > https://github.com/apache/arrow/commit/97879eb970bac52d93d2247200b9ca7acf6f3f93, > which adds it and also adds Python bindings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12813) [C++] Expose a `full` (array creation) capability to python/R
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346461#comment-17346461 ] Weston Pace commented on ARROW-12813: - I'd be fine with exposing a utility function instead of a compute function. I still don't really understand how to make the distinction or what the impact of such a choice would be. > [C++] Expose a `full` (array creation) capability to python/R > - > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
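For concreteness, the semantics being requested are tiny. A pure-Python sketch, with a plain list standing in for an Arrow array (the hypothetical Arrow version would additionally take a DataType and return an Array):

```python
def full(value, length):
    """Repeat one scalar `length` times (cf. numpy.full).

    Pure-Python sketch of the proposed semantics only; not an Arrow API.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return [value] * length
```

In pyarrow today one can get a similar effect with `pa.array([value] * length)`, at the cost of materializing the Python list first, which is part of the motivation for a native capability.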
[jira] [Updated] (ARROW-12813) [C++] Expose a `full` (array creation) capability to python/R
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-12813: Summary: [C++] Expose a `full` (array creation) capability to python/R (was: [C++] Support for a `full` compute function) > [C++] Expose a `full` (array creation) capability to python/R > - > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12816) [C++] C++14 laundry list
[ https://issues.apache.org/jira/browse/ARROW-12816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346454#comment-17346454 ] Ben Kietzman commented on ARROW-12816: -- [~apitrou] [~westonpace] [~lidavidm] > [C++] C++14 laundry list > > > Key: ARROW-12816 > URL: https://issues.apache.org/jira/browse/ARROW-12816 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Priority: Major > > Improvements to make/be aware of once C++14 is available: > - Ensure that lambda closure members are moved into wherever > possible/appropriate. We have a lot of local variables whose only function is > getting copied into a closure, including some demotions to shared_ptr since > move-only types can't be closed over in C++11 > - visitor pattern can be used more fluidly with template lambdas, for example > we could have a utility like {{ VisitInts(offset_width, offset_bytes, > [&](auto* offsets) { /*mutate offsets*/ }) }} > - constexpr switch, for use in type traits functions > - std::enable_if_t > - std::quoted is available for quoting strings > - std::make_unique > - standard {{[[deprecated]]}} attribute > - temporal literals such as ""s, ""ns, ... > - binary literals with place markers such as 0b1100_ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12816) [C++] C++14 laundry list
Ben Kietzman created ARROW-12816: Summary: [C++] C++14 laundry list Key: ARROW-12816 URL: https://issues.apache.org/jira/browse/ARROW-12816 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Ben Kietzman Improvements to make/be aware of once C++14 is available: - Ensure that lambda closure members are moved into wherever possible/appropriate. We have a lot of local variables whose only function is getting copied into a closure, including some demotions to shared_ptr since move-only types can't be closed over in C++11 - visitor pattern can be used more fluidly with template lambdas, for example we could have a utility like {{ VisitInts(offset_width, offset_bytes, [&](auto* offsets) { /*mutate offsets*/ }) }} - constexpr switch, for use in type traits functions - std::enable_if_t - std::quoted is available for quoting strings - std::make_unique - standard {{[[deprecated]]}} attribute - temporal literals such as ""s, ""ns, ... - binary literals with place markers such as 0b1100_ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9430) [C++/Python] Kernel for SetItem(BooleanArray, values)
[ https://issues.apache.org/jira/browse/ARROW-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346436#comment-17346436 ] Niranda Perera commented on ARROW-9430: --- [~jorisvandenbossche] I agree. I think ARROW-11044 only resolves the 'scalar replacement'. > [C++/Python] Kernel for SetItem(BooleanArray, values) > - > > Key: ARROW-9430 > URL: https://issues.apache.org/jira/browse/ARROW-9430 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Uwe Korn >Priority: Major > > We should have a kernel that allows overriding the values of an array by > supplying a boolean mask and a scalar or an array of equal length. -- This message was sent by Atlassian Jira (v8.3.4#803005)
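To make the requested kernel concrete, here is a pure-Python sketch of the boolean-mask SetItem semantics described in ARROW-9430. The function name and argument order are assumptions for illustration, not an existing Arrow API; plain lists stand in for Arrow arrays.

```python
def replace_with_mask(values, mask, replacements):
    """Return a copy of `values` where slots with mask == True are
    overwritten. `replacements` is either a scalar (broadcast to every
    True slot) or a sequence with exactly one entry per True slot."""
    n_true = sum(1 for m in mask if m)
    if not isinstance(replacements, (list, tuple)):
        replacements = [replacements] * n_true
    if len(replacements) != n_true:
        raise ValueError("need one replacement per True slot in the mask")
    it = iter(replacements)
    return [next(it) if m else v for v, m in zip(values, mask)]
```

Note that unlike in-place numpy assignment (`arr[mask] = x`), an Arrow kernel with these semantics would return a new immutable array.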
[jira] [Commented] (ARROW-9431) [C++/Python] Kernel for SetItem(IntegerArray, values)
[ https://issues.apache.org/jira/browse/ARROW-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346435#comment-17346435 ] Niranda Perera commented on ARROW-9431: --- Identical issue > [C++/Python] Kernel for SetItem(IntegerArray, values) > - > > Key: ARROW-9431 > URL: https://issues.apache.org/jira/browse/ARROW-9431 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Affects Versions: 2.0.0 >Reporter: Uwe Korn >Priority: Major > > We should have a kernel that allows overriding the values of an array using > an integer array as the indexer and a scalar or array of equal length as the > values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
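As with the boolean-mask variant above, the integer-indexer semantics of ARROW-9431 can be sketched in pure Python; names and calling convention are invented for illustration, not Arrow API.

```python
def replace_at_indices(values, indices, replacements):
    """Return a copy of `values` with the positions in `indices`
    overwritten. `replacements` is a scalar (broadcast to every index)
    or a sequence of exactly len(indices) values."""
    if not isinstance(replacements, (list, tuple)):
        replacements = [replacements] * len(indices)
    if len(replacements) != len(indices):
        raise ValueError("need one replacement per index")
    out = list(values)
    for i, r in zip(indices, replacements):
        out[i] = r
    return out
```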
[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation
[ https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346422#comment-17346422 ] Nic Crane commented on ARROW-12789: --- Also, feel free to push back on this if you don't think there's a huge amount of application beyond R - I can always look to implement it in the R package's C++ code if this is looking to be a particularly special case that won't be needed elsewhere? > [C++] Support for scalar value recycling in RecordBatch/Table creation > -- > > Key: ARROW-12789 > URL: https://issues.apache.org/jira/browse/ARROW-12789 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nic Crane >Priority: Major > > Please can we have the capability to recycle scalar values during > table creation? It would work as follows: > Upon creation of a new Table/RecordBatch, the length of each column is > checked. If: > * number of columns is > 1 and > * any columns have length 1 and > * not all columns have length 1 > then, the value in the length 1 column(s) should be repeated to make it as > long as the other columns. > This should only occur if all columns either have length 1 or N (where N is > some value greater than 1), and if any column lengths are values other than > 1 or N, we should still get an error as we do now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
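The length rules in the issue description translate almost directly into code. A pure-Python sketch, with plain lists standing in for Arrow arrays and a function name invented for the sketch:

```python
def recycle_columns(columns):
    """Recycle length-1 columns up to the common length N.

    Valid inputs: every column has length 1 or length N. Any other mix
    of lengths raises, matching the error behaviour the issue asks to
    preserve."""
    lengths = {len(col) for col in columns.values()}
    non_unit = lengths - {1}
    if len(non_unit) > 1:
        raise ValueError(
            f"column lengths must all be 1 or N, got {sorted(lengths)}")
    n = non_unit.pop() if non_unit else 1
    return {name: list(col) * n if len(col) == 1 else list(col)
            for name, col in columns.items()}
```

This mirrors R's vector recycling (restricted to length 1, as the issue specifies) rather than numpy-style broadcasting over arbitrary compatible shapes.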
[jira] [Commented] (ARROW-12744) [C++][Compute] Add rounding kernel
[ https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346417#comment-17346417 ] Ben Kietzman commented on ARROW-12744: -- 1. IMHO, compute kernels should not rely on (or be affected in any way by) the floating point environment. Users may have a need to adjust this for their own applications and arrow's kernels should produce correct output regardless 2. Output should be of the same floating point type as the input since the extent of rounding is configurable (probably via a function option like {{RoundOptions::ndigits}}) whereas integral output is only well formed if we're rounding to the nearest one. > [C++][Compute] Add rounding kernel > -- > > Key: ARROW-12744 > URL: https://issues.apache.org/jira/browse/ARROW-12744 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Kernel to round an array of floating point numbers. Should return an array of > the same type as the input. Should have an option to control how many digits > after the decimal point (default value 0 meaning round to the nearest > integer). > Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from > zero (up for positive numbers, down for negative numbers). -- This message was sent by Atlassian Jira (v8.3.4#803005)
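Both of Ben's points are easy to state in code. A pure-Python reference for round-half-away-from-zero with a digits option, returning the same floating-point type it was given; the option name `ndigits` mirrors the comment above and is a placeholder, not a settled API.

```python
import math

def round_half_away_from_zero(x, ndigits=0):
    """Round to `ndigits` decimal places; exact halves move away from
    zero (0.5 -> 1.0, -0.5 -> -1.0), unlike Python's built-in round(),
    which uses banker's rounding (round half to even)."""
    scale = 10 ** ndigits
    return math.copysign(math.floor(abs(x) * scale + 0.5), x) / scale
```

Independent of the rounding rule chosen, decimal fractions such as 2.675 have no exact binary representation, so a kernel specified this way still inherits the usual floating-point caveats.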
[jira] [Updated] (ARROW-12814) [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and TRUNC functions
[ https://issues.apache.org/jira/browse/ARROW-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12814: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and > TRUNC functions > > > Key: ARROW-12814 > URL: https://issues.apache.org/jira/browse/ARROW-12814 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva >Reporter: Anthony Louis Gotlib Ferreira >Assignee: Anthony Louis Gotlib Ferreira >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12815) [C++] Warning when compiling on Ubuntu 21.04
Nate Clark created ARROW-12815: -- Summary: [C++] Warning when compiling on Ubuntu 21.04 Key: ARROW-12815 URL: https://issues.apache.org/jira/browse/ARROW-12815 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 5.0.0 Reporter: Nate Clark Warning generated when compiling using GCC 10.2 on Ubuntu 21.04 {noformat} In file included from apache-arrow/cpp/src/arrow/chunked_array.h:26, from apache-arrow/cpp/src/arrow/table.h:25, from apache-arrow/cpp/src/arrow/table.cc:18, from src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_4_cxx.cxx:4: apache-arrow/cpp/src/arrow/tensor.cc: In member function ‘arrow::Tensor::CountNonZero() const’: apache-arrow/cpp/src/arrow/result.h:446:5: warning: ‘MEM[(long int &)&counter + 8]’ may be used uninitialized in this function [-Wmaybe-uninitialized] 446 | new (&data_) T(std::forward(u)); | ^~ In file included from src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_4_cxx.cxx:6: apache-arrow/cpp/src/arrow/tensor.cc:337:18: note: ‘MEM[(long int &)&counter + 8]’ was declared here 337 | NonZeroCounter counter(*this); | ^~~ {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel
[ https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12744: --- Labels: pull-request-available (was: ) > [C++][Compute] Add rounding kernel > -- > > Key: ARROW-12744 > URL: https://issues.apache.org/jira/browse/ARROW-12744 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Kernel to round an array of floating point numbers. Should return an array of > the same type as the input. Should have an option to control how many digits > after the decimal point (default value 0 meaning round to the nearest > integer). > Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from > zero (up for positive numbers, down for negative numbers). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel
[ https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce updated ARROW-12744: -- Summary: [C++][Compute] Add rounding kernel (was: [C++] Add rounding kernel) > [C++][Compute] Add rounding kernel > -- > > Key: ARROW-12744 > URL: https://issues.apache.org/jira/browse/ARROW-12744 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > > Kernel to round an array of floating point numbers. Should return an array of > the same type as the input. Should have an option to control how many digits > after the decimal point (default value 0 meaning round to the nearest > integer). > Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from > zero (up for positive numbers, down for negative numbers). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12755) [C++][Compute] Add quotient and modulo kernels
[ https://issues.apache.org/jira/browse/ARROW-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce updated ARROW-12755: -- Summary: [C++][Compute] Add quotient and modulo kernels (was: [C++] Add quotient and modulo kernels) > [C++][Compute] Add quotient and modulo kernels > -- > > Key: ARROW-12755 > URL: https://issues.apache.org/jira/browse/ARROW-12755 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Add a pair of binary kernels to compute the: > * quotient (result after division, discarding any fractional part, a.k.a > integer division) > * mod or modulo (remainder after division, a.k.a {{%}} / {{%%}} / modulus). > The returned array should have the same data type as the input arrays or > promote to an appropriate type to avoid loss of precision if the input types > differ. -- This message was sent by Atlassian Jira (v8.3.4#803005)
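The truncation semantics are the subtle part of this pair: C-style integer division truncates toward zero, while Python's `//` floors. A pure-Python sketch of the truncating pair, chosen so that `a == b * quotient(a, b) + modulo(a, b)` holds and the remainder carries the dividend's sign (one plausible convention; the issue leaves this open):

```python
def quotient(a, b):
    """Integer division truncated toward zero (C semantics), rather
    than Python's floor division."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def modulo(a, b):
    """Remainder paired with truncating division: takes the sign of
    `a`, unlike Python's %, which takes the sign of `b`."""
    return a - b * quotient(a, b)
```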
[jira] [Updated] (ARROW-12814) [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and TRUNC functions
[ https://issues.apache.org/jira/browse/ARROW-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Louis Gotlib Ferreira updated ARROW-12814: -- Summary: [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and TRUNC functions (was: [C++][Gandiva] Implements math functions) > [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and > TRUNC functions > > > Key: ARROW-12814 > URL: https://issues.apache.org/jira/browse/ARROW-12814 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva >Reporter: Anthony Louis Gotlib Ferreira >Assignee: Anthony Louis Gotlib Ferreira >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12745) [C++][Compute] Add floor and ceiling kernels
[ https://issues.apache.org/jira/browse/ARROW-12745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce updated ARROW-12745: -- Summary: [C++][Compute] Add floor and ceiling kernels (was: [C++] Add floor and ceiling kernels) > [C++][Compute] Add floor and ceiling kernels > > > Key: ARROW-12745 > URL: https://issues.apache.org/jira/browse/ARROW-12745 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > > Kernels to round each value in an array of floating point numbers to: > * the nearest integer less than or equal to it ({{floor}}) > * the nearest integer greater than or equal to it ({{ceiling}}) > Should return an array of the same type as the input (not an integer type) -- This message was sent by Atlassian Jira (v8.3.4#803005)
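Per element these kernels are one-liners; the only subtlety is the "same type as the input" requirement, since floor/ceil naturally produce integers. A pure-Python sketch with lists of floats standing in for Arrow floating-point arrays (function names invented):

```python
import math

def floor_kernel(values):
    # math.floor returns int; cast back so the output type matches the input.
    return [float(math.floor(v)) for v in values]

def ceiling_kernel(values):
    return [float(math.ceil(v)) for v in values]
```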
[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346391#comment-17346391 ] Antoine Pitrou commented on ARROW-12813: I don't dispute it's useful. I'm just not convinced we need to make a compute function out of it. > [C++] Support for a `full` compute function > --- > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12814) [C++][Gandiva] Implements math functions
Anthony Louis Gotlib Ferreira created ARROW-12814: - Summary: [C++][Gandiva] Implements math functions Key: ARROW-12814 URL: https://issues.apache.org/jira/browse/ARROW-12814 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Anthony Louis Gotlib Ferreira Assignee: Anthony Louis Gotlib Ferreira -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
[ https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook resolved ARROW-12604. -- Resolution: Fixed > [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds > --- > > Key: ARROW-12604 > URL: https://issues.apache.org/jira/browse/ARROW-12604 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, R >Affects Versions: 4.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 5.0.0, 4.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
[ https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook reopened ARROW-12604: -- > [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds > --- > > Key: ARROW-12604 > URL: https://issues.apache.org/jira/browse/ARROW-12604 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, R >Affects Versions: 4.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 5.0.0, 4.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12788) [C++] arrow::compute::Expression::type_id() function
[ https://issues.apache.org/jira/browse/ARROW-12788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook closed ARROW-12788. > [C++] arrow::compute::Expression::type_id() function > > > Key: ARROW-12788 > URL: https://issues.apache.org/jira/browse/ARROW-12788 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > > There is a function {{type()}} that returns the type of a post-bind > {{Expression}} as a {{std::shared_ptr}}: > > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/expression.h#L105] > It would be convenient to also have a function {{type_id()}} that returns > this as an {{arrow::Type::type}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12810) [Python] Run tests with AWS_EC2_METADATA_DISABLED=true
[ https://issues.apache.org/jira/browse/ARROW-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12810: --- Labels: pull-request-available (was: ) > [Python] Run tests with AWS_EC2_METADATA_DISABLED=true > -- > > Key: ARROW-12810 > URL: https://issues.apache.org/jira/browse/ARROW-12810 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This explains why some tests are so slow. There's already a few tests that > work around this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
[ https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook closed ARROW-12604. Resolution: Fixed > [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds > --- > > Key: ARROW-12604 > URL: https://issues.apache.org/jira/browse/ARROW-12604 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, R >Affects Versions: 4.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 5.0.0, 4.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12619) [Python] pyarrow sdist should not require git
[ https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-12619. - Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10342 [https://github.com/apache/arrow/pull/10342] > [Python] pyarrow sdist should not require git > - > > Key: ARROW-12619 > URL: https://issues.apache.org/jira/browse/ARROW-12619 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Kouhei Sutou >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0, 4.0.1 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > {noformat} > FROM ubuntu:20.04 > RUN apt update && apt install -y python3-pip > RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > {noformat} > {noformat} > $ docker build . > ... > Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > ---> Running in 28d363e1c397 > Collecting pyarrow==4.0.0 > Downloading pyarrow-4.0.0.tar.gz (710 kB) > Installing build dependencies: started > Installing build dependencies: still running... 
> Installing build dependencies: finished with status 'done' > Getting requirements to build wheel: started > Getting requirements to build wheel: finished with status 'done' > Preparing wheel metadata: started > Preparing wheel metadata: finished with status 'error' > ERROR: Command errored out with exit status 1: > command: /usr/bin/python3 /tmp/tmp5rqecai7 > prepare_metadata_for_build_wheel /tmp/tmpc49gha3r > cwd: /tmp/pip-install-or1g7own/pyarrow > Complete output (42 lines): > Traceback (most recent call last): > File "/tmp/tmp5rqecai7", line 280, in > main() > File "/tmp/tmp5rqecai7", line 263, in main > json_out['return_val'] = hook(**hook_input['kwargs']) > File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel > return hook(metadata_directory, config_settings) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 166, in prepare_metadata_for_build_wheel > self.run_setup() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 258, in run_setup > super(_BuildMetaLegacyBackend, > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 150, in run_setup > exec(compile(code, __file__, 'exec'), locals()) > File "setup.py", line 585, in > setup( > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py", > line 153, in setup > return distutils.core.setup(**attrs) > File "/usr/lib/python3.8/distutils/core.py", line 108, in setup > _setup_distribution = dist = klass(attrs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 434, in __init__ > _Distribution.__init__(self, { > File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__ > self.finalize_options() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 743, in finalize_options > ep(self) > 
File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 750, in _finalize_setup_keywords > ep.load()(self, ep.name, value) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py", > line 24, in version_keyword > dist.metadata.version = _get_version(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 173, in _get_version > parsed_version = _do_parse(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 119, in _do_parse > parse_result = _call_entrypoint_fn(config.absolute_root, config, > config.parse) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 54, in _call_entrypoint_fn > return fn(root) > File "setup.py", line 546, in parse_git > return parse(root, **kwargs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py", > line 115, in parse > require_command("git") > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py", > line 142, in require_command > raise OSError("%r was not found" % name) > OSError: 'git' was not found
[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346337#comment-17346337 ] Weston Pace commented on ARROW-12813: - I think it comes up often enough in cases where there is a default value for a column. For example, if you are reading into datasets from two different sources that are similar but not quite the same and you want to unify them. I should also mention that, if you are reading in data as a dataset scan, you can achieve this with projection (project a name to a scalar and the scalar will be broadcast). > [C++] Support for a `full` compute function > --- > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation
[ https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346330#comment-17346330 ] Weston Pace commented on ARROW-12789: - [~bkietz] I'm going to tag you for input because you've done some work with broadcasting in the past (e.g. with regard to projection) that seems quite similar. Perhaps this feature could be implemented by a "broadcast" compute function that takes in a vector of arrays (which, I suppose, is an odd shape for input into the compute layer)? > [C++] Support for scalar value recycling in RecordBatch/Table creation > -- > > Key: ARROW-12789 > URL: https://issues.apache.org/jira/browse/ARROW-12789 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nic Crane >Priority: Major > > Please can we have the capability to be able to recycle scalar values during > table creation? It would work as follows: > Upon creation of a new Table/RecordBatch, the length of each column is > checked. If: > * number of columns is > 1 and > * any columns have length 1 and > * not all columns have length 1 > then, the value in the length 1 column(s) should be repeated to make it as > long as the other columns. > This should only occur if all columns either have length 1 or N (where N is > some value greater than 1), and if any columns lengths are values other than > 1 or N, we should still get an error as we do now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346324#comment-17346324 ] Antoine Pitrou commented on ARROW-12813: It could be exposed in Python/R without being a compute function. That said, {{np.full}} is intrinsically more useful than an Arrow equivalent because Numpy arrays are mutable. > [C++] Support for a `full` compute function > --- > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation
[ https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346319#comment-17346319 ] Weston Pace commented on ARROW-12789: - [~jorisvandenbossche] I've just opened ARROW-12813 for that as it has come up a few times. > [C++] Support for scalar value recycling in RecordBatch/Table creation > -- > > Key: ARROW-12789 > URL: https://issues.apache.org/jira/browse/ARROW-12789 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nic Crane >Priority: Major > > Please can we have the capability to be able to recycle scalar values during > table creation? It would work as follows: > Upon creation of a new Table/RecordBatch, the length of each column is > checked. If: > * number of columns is > 1 and > * any columns have length 1 and > * not all columns have length 1 > then, the value in the length 1 column(s) should be repeated to make it as > long as the other columns. > This should only occur if all columns either have length 1 or N (where N is > some value greater than 1), and if any columns lengths are values other than > 1 or N, we should still get an error as we do now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-12813) [C++] Support for a `full` compute function
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reopened ARROW-12813: - I was going to close as duplicate but looking at ARROW-12789 more closely I think they are asking for similar but slightly different interfaces. > [C++] Support for a `full` compute function > --- > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12813) [C++] Support for a `full` compute function
[ https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace closed ARROW-12813. --- Resolution: Duplicate > [C++] Support for a `full` compute function > --- > > Key: ARROW-12813 > URL: https://issues.apache.org/jira/browse/ARROW-12813 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > Given a scalar value and a length return an array where all values are equal > to the scalar value. > The name "full" is derived from > [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if > anyone has a more clever name please recommend it. > There are a number of utility functions in C++ that do this already. > However, exposing this as a compute function would allow R/Python to easily > generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12812) [Packaging][Java] Improve JNI jars build
[ https://issues.apache.org/jira/browse/ARROW-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12812: --- Labels: pull-request-available (was: ) > [Packaging][Java] Improve JNI jars build > > > Key: ARROW-12812 > URL: https://issues.apache.org/jira/browse/ARROW-12812 > Project: Apache Arrow > Issue Type: Improvement > Components: Java, Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > - to better align with the manylinux scripts > - also build the pure java packages > - add dynamic dependency check functionality to archery -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12813) [C++] Support for a `full` compute function
Weston Pace created ARROW-12813: --- Summary: [C++] Support for a `full` compute function Key: ARROW-12813 URL: https://issues.apache.org/jira/browse/ARROW-12813 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Given a scalar value and a length return an array where all values are equal to the scalar value. The name "full" is derived from [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if anyone has a more clever name please recommend it. There are a number of utility functions in C++ that do this already. However, exposing this as a compute function would allow R/Python to easily generate arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12812) [Packaging][Java] Improve JNI jars build
Krisztian Szucs created ARROW-12812: --- Summary: [Packaging][Java] Improve JNI jars build Key: ARROW-12812 URL: https://issues.apache.org/jira/browse/ARROW-12812 Project: Apache Arrow Issue Type: Improvement Components: Java, Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 5.0.0 - to better align with the manylinux scripts - also build the pure java packages - add dynamic dependency check functionality to archery -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12744) [C++] Add rounding kernel
[ https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346314#comment-17346314 ] Eduardo Ponce commented on ARROW-12744: --- C++ provides a [std::round()|https://en.cppreference.com/w/cpp/numeric/math/round] function, and the floating-point [rounding mode|https://en.cppreference.com/w/cpp/numeric/fenv/FE_round] can be set at runtime (strictly speaking, std::round itself always rounds half away from zero; std::nearbyint and std::rint are the functions that honor the current mode). Note that library implementations can provide additional rounding modes or support only a subset. It seems there is no *round-half-to-even/odd* mode defined in the spec. 1. Should the Arrow *round* kernel build on the standard rounding functions, extend the rounding modes to support *round-half-to-even/odd*, and implement only those cases explicitly? 2. Also, *std::round()* provides variants (std::lround/std::llround) that output integral data instead of floating-point. Are these variants desirable in Arrow? > [C++] Add rounding kernel > - > > Key: ARROW-12744 > URL: https://issues.apache.org/jira/browse/ARROW-12744 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > > Kernel to round an array of floating point numbers. Should return an array of > the same type as the input. Should have an option to control how many digits > after the decimal point (default value 0 meaning round to the nearest > integer). > Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from > zero (up for positive numbers, down for negative numbers). -- This message was sent by Atlassian Jira (v8.3.4#803005)
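For reference, the half-away-from-zero midpoint rule the ticket specifies differs from Python's built-in round(), which ties to even. A small pure-Python sketch of the requested rule (the `round_half_away` helper is illustrative only, and the `ndigits` handling ignores the usual binary floating-point caveats):

```python
import math

def round_half_away(x, ndigits=0):
    # Midpoint values go away from zero: up for positive numbers,
    # down for negative numbers, as the proposed Arrow kernel specifies.
    factor = 10 ** ndigits
    return math.copysign(math.floor(abs(x) * factor + 0.5), x) / factor
```

This makes the contrast with the C++ standard-library ties-to-even behavior easy to demonstrate: round_half_away(2.5) gives 3.0 while Python's round(2.5) gives 2.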
[jira] [Created] (ARROW-12811) [C++] [Dataset] Dataset repartition / filter / update
Weston Pace created ARROW-12811: --- Summary: [C++] [Dataset] Dataset repartition / filter / update Key: ARROW-12811 URL: https://issues.apache.org/jira/browse/ARROW-12811 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This feature would be to add support for an "update" workflow which scanned a set of batches and wrote them (potentially filtered/modified) back out to the same place. The existing dataset read / dataset write features wouldn't work because they would append the data. There is some discussion in ARROW-12358 and ARROW-12509 of an "overwrite mode" but an "overwrite partition" workflow wouldn't work unless you can scan in entire partitions at once (and in general this should probably be avoided). A naive "write to a different directory and rename" approach could work but it would be inefficient since it would require a copy of the entire dataset to modify a small part of it. The feature could be implemented using temporary directories in place that get renamed on top of the existing directory at the end. Files that are unchanged would be moved into the temporary directory instead of copied. Presumable no ACID guarantees would be made (and they would be quite hard to guarantee) since Arrow datasets do not make ACID guarantees of any kind currently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
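The temporary-directory-plus-rename scheme proposed above could look roughly like this at the file level (pure stdlib, no Arrow involved; `rewrite_in_place` and the bytes-to-bytes `transform` callback are hypothetical names for illustration):

```python
import os
import shutil
import tempfile

def rewrite_in_place(dataset_dir, transform):
    # Write transformed copies of every file into a temporary sibling
    # directory, then swap it in via renames. As the issue notes, this
    # gives no ACID guarantees: a crash between the two renames leaves
    # the dataset under the ".old" name.
    parent = os.path.dirname(os.path.abspath(dataset_dir))
    tmp_dir = tempfile.mkdtemp(dir=parent)
    for name in os.listdir(dataset_dir):
        with open(os.path.join(dataset_dir, name), "rb") as src:
            data = transform(src.read())
        with open(os.path.join(tmp_dir, name), "wb") as dst:
            dst.write(data)
    old_dir = dataset_dir + ".old"
    os.rename(dataset_dir, old_dir)
    os.rename(tmp_dir, dataset_dir)
    shutil.rmtree(old_dir)

# Demo on a throwaway directory:
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
os.mkdir(data_dir)
with open(os.path.join(data_dir, "part-0.txt"), "wb") as f:
    f.write(b"abc")
rewrite_in_place(data_dir, bytes.upper)
with open(os.path.join(data_dir, "part-0.txt"), "rb") as f:
    result = f.read()
```

A real implementation would move unchanged files into the temporary directory instead of copying them, per the issue, avoiding a full rewrite of the dataset.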
[jira] [Created] (ARROW-12810) [Python] Run tests with AWS_EC2_METADATA_DISABLED=true
David Li created ARROW-12810: Summary: [Python] Run tests with AWS_EC2_METADATA_DISABLED=true Key: ARROW-12810 URL: https://issues.apache.org/jira/browse/ARROW-12810 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: David Li Assignee: David Li The EC2 metadata lookup explains why some tests are so slow. There are already a few tests that work around this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
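The slowdown comes from the AWS SDK probing the EC2 instance-metadata endpoint and timing out when run outside EC2; setting the variable before the S3 filesystem is first used skips the probe. A minimal sketch (placing this in a pytest conftest.py is an assumption about where it would go):

```python
import os

# Disable the AWS SDK's EC2 instance-metadata lookup so S3-related tests
# do not stall waiting for the metadata endpoint to time out when running
# outside EC2. Must be set before the S3 filesystem is initialized.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"
```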
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346298#comment-17346298 ] Weston Pace commented on ARROW-12358: - Looking at this with fresh eyes, the "overwrite mode" feature is fairly different from an "update" feature, so I don't think update-related topics are relevant for this ticket. Update generally (and specifically in [~ldacey] 's case) implies reading and writing to the same set of files; overwrite-partition mode wouldn't allow for that. Overwrite-partition mode could be useful in some limited circumstances (e.g. someone regenerates an entire new set of data for one or more partitions), but I think those cases are rare enough, and would be handled by a general "update" feature anyway, that I don't see much benefit in creating a separate feature; the added complexity would just confuse users. So I'll walk back my earlier comment: I'd now argue that dataset write should only allow "append" and "error" options. Dataset update, meaning scanning and rewriting a dataset (or parts thereof), could be created as a separate Jira ticket (I'll go ahead and draft one). > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 5.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). 
There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC
[ https://issues.apache.org/jira/browse/ARROW-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-12807. -- Resolution: Fixed Issue resolved by pull request 10347 [https://github.com/apache/arrow/pull/10347] > [C++] Fix merge conflicts with Future refactor/async IPC > > > Key: ARROW-12807 > URL: https://issues.apache.org/jira/browse/ARROW-12807 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but > the result doesn't build) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12809) [C++] Add StrptimeOptions defaults
Neal Richardson created ARROW-12809: --- Summary: [C++] Add StrptimeOptions defaults Key: ARROW-12809 URL: https://issues.apache.org/jira/browse/ARROW-12809 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Per https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1744 there are no default options for strptime (format, unit). But the TimestampType constructor has a default unit of milliseconds (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1236), and a reasonable default for {{format}} would be ISO8601. cc [~bkietz] [~wesm] for opinions as the authors of this code (according to {{git blame}}) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12744) [C++] Add rounding kernel
[ https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce reassigned ARROW-12744: - Assignee: Eduardo Ponce > [C++] Add rounding kernel > - > > Key: ARROW-12744 > URL: https://issues.apache.org/jira/browse/ARROW-12744 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > > Kernel to round an array of floating point numbers. Should return an array of > the same type as the input. Should have an option to control how many digits > after the decimal point (default value 0 meaning round to the nearest > integer). > Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from > zero (up for positive numbers, down for negative numbers). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Webb Phillips resolved ARROW-12802. --- Resolution: Not A Problem Thanks for helping me resolve the problem with my build environment! > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. > No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346236#comment-17346236 ] Webb Phillips commented on ARROW-12802: --- You are totally correct! Uninstalled all traces of homebrew and install.packages works fine now :D > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. > No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12808) [JS] Document browser support
Brian Hulette created ARROW-12808: - Summary: [JS] Document browser support Key: ARROW-12808 URL: https://issues.apache.org/jira/browse/ARROW-12808 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette For example in https://github.com/apache/arrow/pull/10340 we're explicitly removing support for IE. We should at least document that IE support is an explicit non-goal. Even better if we can identify supported version ranges for major browsers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
[ https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-12806: Fix Version/s: 4.0.1 > [Python] test_write_to_dataset_filesystem missing a dataset mark > > > Key: ARROW-12806 > URL: https://issues.apache.org/jira/browse/ARROW-12806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0, 4.0.1 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346231#comment-17346231 ] Neal Richardson edited comment on ARROW-12802 at 5/17/21, 3:29 PM: --- The {{*** Using Homebrew apache-arrow}} in the output suggests that it has found arrow installed by Homebrew ([specifically|https://github.com/apache/arrow/blob/master/r/configure#L108], {{brew ls --versions apache-arrow}} returned something). Judging from the compile error, you may have a very old version of apache-arrow installed by brew. You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment variables to {{true}} to ignore the homebrew apache-arrow. was (Author: npr): The {{*** Using Homebrew apache-arrow}} in the output suggests that it has found arrow installed by Homebrew (specifically, {{brew ls --versions apache-arrow}} returned something). Judging from the compile error, you may have a very old version of apache-arrow installed by brew. You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment variables to {{true}} to ignore the homebrew apache-arrow. > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. 
> No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346231#comment-17346231 ] Neal Richardson commented on ARROW-12802: - The {{*** Using Homebrew apache-arrow}} in the output suggests that it has found arrow installed by Homebrew (specifically, {{brew ls --versions apache-arrow}} returned something). Judging from the compile error, you may have a very old version of apache-arrow installed by brew. You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment variables to {{true}} to ignore the homebrew apache-arrow. > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. > No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC
[ https://issues.apache.org/jira/browse/ARROW-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12807: --- Labels: pull-request-available (was: ) > [C++] Fix merge conflicts with Future refactor/async IPC > > > Key: ARROW-12807 > URL: https://issues.apache.org/jira/browse/ARROW-12807 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but > the result doesn't build) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221 ] Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:18 PM: - Using install.packages as in the docs would be ideal. but does not work for me: {code:java} mac$ R --vanilla ... > install.packages('arrow') ... * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using Homebrew apache-arrow PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies ** libs /opt/local/bin/clang++-mp-9.0 -std=gnu++11 -I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -arch x86_64 -c array.cpp -o array.o In file included from array.cpp:18: ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found {code} This is with no other arrow installed. Could be install.packages works using Apple /usr/bin/clang++ instead of MacPorts /opt/local/bin/clang++-mp-9.0 was (Author: webbp): Using install.packages as in the docs would be ideal. but does not work for me: {code:java} mac$ R --vanilla ... > install.packages('arrow') ... * installing *source* package ‘arrow’ ... 
** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using Homebrew apache-arrow PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies ** libs /opt/local/bin/clang++-mp-9.0 -std=gnu++11 -I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -arch x86_64 -c array.cpp -o array.o In file included from array.cpp:18: ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found {code} This is with no other arrow installed. > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. 
> No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221 ] Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:11 PM: - Using install.packages as in the docs would be ideal. but does not work for me: {code:java} mac$ R --vanilla ... > install.packages('arrow') ... * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using Homebrew apache-arrow PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies ** libs /opt/local/bin/clang++-mp-9.0 -std=gnu++11 -I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -arch x86_64 -c array.cpp -o array.o In file included from array.cpp:18: ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found {code} was (Author: webbp): Using install.packages as in the docs would be ideal. but does not work for me: {code:java} mac$ R --vanilla ... > install.packages('arrow') ... * installing *source* package ‘arrow’ ... 
** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using Homebrew apache-arrow PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies ** libs /opt/local/bin/clang++-mp-9.0 -std=gnu++11 -I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -arch x86_64 -c array.cpp -o array.o In file included from array.cpp:18: ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found {code} > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. 
> No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221 ] Webb Phillips commented on ARROW-12802: --- Using `install.packages` as in the docs would be ideal, but it does not work for me: {code:java} mac$ R --vanilla ... > install.packages('arrow') ... * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using Homebrew apache-arrow PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies ** libs /opt/local/bin/clang++-mp-9.0 -std=gnu++11 -I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -arch x86_64 -c array.cpp -o array.o In file included from array.cpp:18: ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found {code} > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. 
This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. > No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
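Given the diagnosis above (the CSV component apparently stopped being enabled by default in 4.0.0), one workaround for source builds is to re-enable it explicitly when configuring libarrow. This is only a sketch of the configure step: the `../cpp` build-directory layout is a placeholder, and the exact set of feature flags the R package needs may differ.

```shell
# Sketch: re-enable the CSV component when configuring libarrow from a
# build directory, since the R bindings need headers such as
# arrow/csv/type_fwd.h. ARROW_PARQUET/ARROW_DATASET match the features
# the R build log above compiles against.
cmake ../cpp \
  -DARROW_CSV=ON \
  -DARROW_PARQUET=ON \
  -DARROW_DATASET=ON
```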
[jira] [Resolved] (ARROW-12785) [CI] the r-devdocs build errors when brew installing gcc
[ https://issues.apache.org/jira/browse/ARROW-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-12785. Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10328 [https://github.com/apache/arrow/pull/10328] > [CI] the r-devdocs build errors when brew installing gcc > > > Key: ARROW-12785 > URL: https://issues.apache.org/jira/browse/ARROW-12785 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This affects github-test-r-devdocs (R devdocs macOS-latest). The brew step to > install gcc fails, and then OpenBLAS fails as well. > See https://github.com/ursacomputing/crossbow/runs/2573031778#step:8:1494 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC
David Li created ARROW-12807: Summary: [C++] Fix merge conflicts with Future refactor/async IPC Key: ARROW-12807 URL: https://issues.apache.org/jira/browse/ARROW-12807 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li Assignee: David Li Fix For: 5.0.0 ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but the result doesn't build) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow
[ https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346216#comment-17346216 ] Neal Richardson commented on ARROW-12802: - We don't recommend {{install_github}}, though with the right env vars it works on most platforms. Moreover, you don't need to install the arrow C++ library separately from the R package, the R package installation will take care of everything for you. So you may be making things harder for yourself than you need. See https://arrow.apache.org/docs/r/#installation as well as the longer installation vignette linked there for details. > No more default ARROW_CSV=ON in libarrow build breaks R arrow > - > > Key: ARROW-12802 > URL: https://issues.apache.org/jira/browse/ARROW-12802 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.1 >Reporter: Webb Phillips >Priority: Major > > libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed > since 4.0.0. This causes R install.packages('arrow') to fail with: > {code:java} > make: *** > [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: > array.o] Error 1 > In file included from recordbatch.cpp:18: > ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not > found{code} > Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both > apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79. > No other type_fwd.h are missing: > {code:java} > find .../arrow/cpp/src -name type_fwd.h | wc -l > 10 > find .../include -name type_fwd.h | wc -l > 9{code} > Best guess: default value of cmake ARROW_CSV changed and R arrow requires > ARROW_CSV=ON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
[ https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-12806. -- Resolution: Fixed Issue resolved by pull request 10346 [https://github.com/apache/arrow/pull/10346] > [Python] test_write_to_dataset_filesystem missing a dataset mark > > > Key: ARROW-12806 > URL: https://issues.apache.org/jira/browse/ARROW-12806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10425) [Python] Support reading (compressed) CSV file from remote file / binary blob
[ https://issues.apache.org/jira/browse/ARROW-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346205#comment-17346205 ] Joris Van den Bossche commented on ARROW-10425: --- Yes, reading from a buffer works, but your example is not using a compressed buffer. So for example: {code} from pyarrow import fs, csv s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.csv.gz") as file: table = csv.read_csv(file) {code} currently doesn't work? (tried it with LocalFileSystem) I am not sure this _should_ work, as right now we just infer the compression from the file path, and not from the actual content of the file. But, then it might be nice that something like {{csv.read_csv("s3://bucket/data.csv.gz")}} would work so that it could be detected from the file path. > [Python] Support reading (compressed) CSV file from remote file / binary blob > - > > Key: ARROW-10425 > URL: https://issues.apache.org/jira/browse/ARROW-10425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: csv > > From > https://stackoverflow.com/questions/64588076/how-can-i-read-a-csv-gz-file-with-pyarrow-from-a-file-object > Currently {{pyarrow.csv.read_csv}} happily takes a path to a compressed file > and automatically decompresses it, but AFAIK this only works for local paths. > It would be nice to in general support reading CSV from remote files (with > URI / specifying a filesystem), and in that case also support compression. > In addition we could also read a compressed file from a BytesIO / file-like > object, but not sure we want that (as it would require a keyword to indicate > the used compression). -- This message was sent by Atlassian Jira (v8.3.4#803005)
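The path-based inference described in the comment can be illustrated with a small plain-Python sketch. This is not pyarrow's internal code; the helper name and the extension-to-codec map are assumptions for illustration only.

```python
import pathlib

# Hypothetical sketch of extension-based compression inference, mirroring
# the behavior described above: the codec is picked from the file suffix,
# never from the file's contents.
_EXT_TO_CODEC = {".gz": "gzip", ".bz2": "bz2", ".zst": "zstd", ".lz4": "lz4"}

def infer_compression(path):
    """Return a codec name based on the path suffix, or None if plain."""
    return _EXT_TO_CODEC.get(pathlib.PurePosixPath(path).suffix)

print(infer_compression("s3://bucket/data.csv.gz"))  # gzip
print(infer_compression("bucket/data.csv"))          # None
```

For the file-like case where no path is available, one way to supply the codec explicitly is to wrap the stream in pyarrow's {{CompressedInputStream}} before handing it to {{csv.read_csv}}.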
[jira] [Updated] (ARROW-12776) [Archery][Integration] Fix decimal case generation in write_js_test_json
[ https://issues.apache.org/jira/browse/ARROW-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-12776: Fix Version/s: 4.0.1 > [Archery][Integration] Fix decimal case generation in write_js_test_json > > > Key: ARROW-12776 > URL: https://issues.apache.org/jira/browse/ARROW-12776 > Project: Apache Arrow > Issue Type: Bug > Components: Archery, Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0, 4.0.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The integration build has started to fail on master: > https://github.com/apache/arrow/runs/2575265526#step:9:4265 > I don't entirely understand the reason why we see this error; in order to > call that function we would need to pass {{--write_generated_json}} to the > archery command, but we don't. The implementation is clearly wrong though. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12004) [C++] Result is annoying
[ https://issues.apache.org/jira/browse/ARROW-12004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-12004. -- Resolution: Fixed Issue resolved by pull request 10205 [https://github.com/apache/arrow/pull/10205] > [C++] Result is annoying > --- > > Key: ARROW-12004 > URL: https://issues.apache.org/jira/browse/ARROW-12004 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Weston Pace >Priority: Major > Labels: async-util, pull-request-available > Fix For: 5.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > When I add a callback (using {{AddCallback}} or {{Then}}) to a {{Future}}, > I would like the callback to take a {{Status}} rather than a > {{Result}}. > I managed to get this done for {{AddCallback}}, but {{Then}} is another pile > of complication due to template hackery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12790) [Python] Cannot read from HDFS with blanks in path names
[ https://issues.apache.org/jira/browse/ARROW-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-12790: -- Labels: filesystem hdfs (was: ) > [Python] Cannot read from HDFS with blanks in path names > > > Key: ARROW-12790 > URL: https://issues.apache.org/jira/browse/ARROW-12790 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 4.0.0 >Reporter: Armin Müller >Priority: Critical > Labels: filesystem, hdfs > > I have a Hadoop FS with blanks in path and filenames. > Running this > {{hdfs = fs.HadoopFileSystem('namenode', 8020)}} > {{files = hdfs.get_file_info(fs.FileSelector("/", recursive=True))}} > throws a > {{pyarrow.lib.ArrowInvalid: Cannot parse URI: 'hdfs://namenode:8020/data/Path > with Blank'}} > How can I avoid that? > Strangely enough, reading a file with > {{hdfs.open_input_file(csv_file)}} > works just fine regardless of the blanks? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays
[ https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-12769. -- Resolution: Fixed Issue resolved by pull request 10341 [https://github.com/apache/arrow/pull/10341] > [Python] Negative out of range slices yield invalid arrays > -- > > Key: ARROW-12769 > URL: https://issues.apache.org/jira/browse/ARROW-12769 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0, 4.0.0 >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0, 4.0.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Tested on pyarrow 2.0 and pyarrow 4.0 wheels. The errors are slightly > different between the two versions. Below is a script from 4.0 > > This is taken from the result of test_slice_array > {{ }} > {{ >>> import pyarrow as pa}} > {{ >>> pa.array(range(0,10))}} > {{ }} > {{ [}} > {{ 0,}} > {{ 1,}} > {{ 2,}} > {{ 3,}} > {{ 4,}} > {{ 5,}} > {{ 6,}} > {{ 7,}} > {{ 8,}} > {{ 9}} > {{ ]}} > {{ >>> a=pa.array(range(0,10))}} > {{ >>> a[-9:-20]}} > {{ }} > {{ []}} > {{ >>> len(a[-9:-20])}} > {{ Traceback (most recent call last):}} > {{ File "<stdin>", line 1, in <module>}} > {{ SystemError: <built-in function len> returned NULL without setting an > error}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
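The intended behaviour here matches Python's own sequence slicing, where an out-of-range negative slice simply normalizes to an empty result. A plain-Python illustration (deliberately not using pyarrow) of the semantics the fix restores:

```python
# Python's slice semantics, which Array.__getitem__ is expected to mirror:
# on a length-10 sequence, a[-9:-20] normalizes to start=1, stop=0, i.e.
# an empty slice, so len() should be 0 rather than raising SystemError.
a = list(range(10))
s = a[-9:-20]
print(s, len(s))  # [] 0
```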
[jira] [Assigned] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
[ https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-12806: - Assignee: Joris Van den Bossche > [Python] test_write_to_dataset_filesystem missing a dataset mark > > > Key: ARROW-12806 > URL: https://issues.apache.org/jira/browse/ARROW-12806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
[ https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12806: --- Labels: pull-request-available (was: ) > [Python] test_write_to_dataset_filesystem missing a dataset mark > > > Key: ARROW-12806 > URL: https://issues.apache.org/jira/browse/ARROW-12806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
[ https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-12806: -- Fix Version/s: 5.0.0 > [Python] test_write_to_dataset_filesystem missing a dataset mark > > > Key: ARROW-12806 > URL: https://issues.apache.org/jira/browse/ARROW-12806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark
Joris Van den Bossche created ARROW-12806: - Summary: [Python] test_write_to_dataset_filesystem missing a dataset mark Key: ARROW-12806 URL: https://issues.apache.org/jira/browse/ARROW-12806 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12569) [R] [CI] Run revdep in CI
[ https://issues.apache.org/jira/browse/ARROW-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12569: --- Summary: [R] [CI] Run revdep in CI (was: [R] [CI] Can we run revdep in CI) > [R] [CI] Run revdep in CI > - > > Key: ARROW-12569 > URL: https://issues.apache.org/jira/browse/ARROW-12569 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Probably should be on demand, might be difficult to make it print/fail > usefully. > Use https://github.com/r-lib/revdepcheck? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12569) [R] [CI] Can we run revdep in CI
[ https://issues.apache.org/jira/browse/ARROW-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12569: --- Labels: pull-request-available (was: ) > [R] [CI] Can we run revdep in CI > > > Key: ARROW-12569 > URL: https://issues.apache.org/jira/browse/ARROW-12569 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Probably should be on demand, might be difficult to make it print/fail > usefully. > Use https://github.com/r-lib/revdepcheck? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12758) [R] Add examples to more function documentation
[ https://issues.apache.org/jira/browse/ARROW-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12758: --- Labels: pull-request-available (was: ) > [R] Add examples to more function documentation > --- > > Key: ARROW-12758 > URL: https://issues.apache.org/jira/browse/ARROW-12758 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nic Crane >Assignee: Nic Crane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12619) [Python] pyarrow sdist should not require git
[ https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-12619: --- Assignee: Krisztian Szucs > [Python] pyarrow sdist should not require git > - > > Key: ARROW-12619 > URL: https://issues.apache.org/jira/browse/ARROW-12619 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Kouhei Sutou >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 4.0.1 > > Time Spent: 40m > Remaining Estimate: 0h > > {noformat} > FROM ubuntu:20.04 > RUN apt update && apt install -y python3-pip > RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > {noformat} > {noformat} > $ docker build . > ... > Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > ---> Running in 28d363e1c397 > Collecting pyarrow==4.0.0 > Downloading pyarrow-4.0.0.tar.gz (710 kB) > Installing build dependencies: started > Installing build dependencies: still running... > Installing build dependencies: finished with status 'done' > Getting requirements to build wheel: started > Getting requirements to build wheel: finished with status 'done' > Preparing wheel metadata: started > Preparing wheel metadata: finished with status 'error' > ERROR: Command errored out with exit status 1: > command: /usr/bin/python3 /tmp/tmp5rqecai7 > prepare_metadata_for_build_wheel /tmp/tmpc49gha3r > cwd: /tmp/pip-install-or1g7own/pyarrow > Complete output (42 lines): > Traceback (most recent call last): > File "/tmp/tmp5rqecai7", line 280, in > main() > File "/tmp/tmp5rqecai7", line 263, in main > json_out['return_val'] = hook(**hook_input['kwargs']) > File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel > return hook(metadata_directory, config_settings) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 166, in prepare_metadata_for_build_wheel > self.run_setup() > File > 
"/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 258, in run_setup > super(_BuildMetaLegacyBackend, > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 150, in run_setup > exec(compile(code, __file__, 'exec'), locals()) > File "setup.py", line 585, in > setup( > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py", > line 153, in setup > return distutils.core.setup(**attrs) > File "/usr/lib/python3.8/distutils/core.py", line 108, in setup > _setup_distribution = dist = klass(attrs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 434, in __init__ > _Distribution.__init__(self, { > File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__ > self.finalize_options() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 743, in finalize_options > ep(self) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 750, in _finalize_setup_keywords > ep.load()(self, ep.name, value) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py", > line 24, in version_keyword > dist.metadata.version = _get_version(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 173, in _get_version > parsed_version = _do_parse(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 119, in _do_parse > parse_result = _call_entrypoint_fn(config.absolute_root, config, > config.parse) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 54, in _call_entrypoint_fn > return fn(root) > File "setup.py", line 546, in parse_git > return parse(root, **kwargs) > File > 
"/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py", > line 115, in parse > require_command("git") > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py", > line 142, in require_command > raise OSError("%r was not found" % name) > OSError: 'git' was not found > > ERROR: Command errored out with exit status 1: /usr/bin/py
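The failure above reduces to setuptools_scm checking for a `git` executable before deriving a version; since an sdist carries no `.git` directory, the check fails. A minimal sketch of that check (not the exact setuptools_scm source, just the observable behaviour):

```python
import shutil

def require_command(name):
    # Sketch of the check setuptools_scm performs: building the pyarrow
    # sdist fails with this OSError when `git` is not on PATH.
    if shutil.which(name) is None:
        raise OSError("%r was not found" % name)
```

Installing git in the build image (or shipping a static version in the sdist) sidesteps this check.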
[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation
[ https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346113#comment-17346113 ] Joris Van den Bossche commented on ARROW-12789: --- I was going to comment that, alternatively, C++ could also provide the utility to easily/efficiently create an array of a given length from a scalar, and then leave it up to the bindings to check for scalars and create the appropriate array. But it seems (from the PR for the other issue) this is what is already being done with {{MakeArrayFromScalar}}. > [C++] Support for scalar value recycling in RecordBatch/Table creation > -- > > Key: ARROW-12789 > URL: https://issues.apache.org/jira/browse/ARROW-12789 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nic Crane >Priority: Major > > Please can we have the capability to be able to recycle scalar values during > table creation? It would work as follows: > Upon creation of a new Table/RecordBatch, the length of each column is > checked. If: > * number of columns is > 1 and > * any columns have length 1 and > * not all columns have length 1 > then, the value in the length 1 column(s) should be repeated to make it as > long as the other columns. > This should only occur if all columns either have length 1 or N (where N is > some value greater than 1), and if any columns lengths are values other than > 1 or N, we should still get an error as we do now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
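In C++ the building block already exists as {{MakeArrayFromScalar}}; the recycling rule the issue proposes can be sketched in plain Python (hypothetical helper, not an Arrow API):

```python
def recycle_columns(columns):
    # Every column must have length 1 or N; length-1 columns are repeated
    # to length N, and any other mix of lengths is an error, matching the
    # rule described in the issue.
    lengths = {len(col) for col in columns.values()}
    non_scalar = sorted(lengths - {1})
    if len(non_scalar) > 1:
        raise ValueError("column lengths must all be 1 or N, got %s" % sorted(lengths))
    n = non_scalar[0] if non_scalar else 1
    return {name: col * n if len(col) == 1 else col
            for name, col in columns.items()}
```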
[jira] [Resolved] (ARROW-12773) [Docs] Clarify Java support for ORC and Parquet via JNI bindings
[ https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-12773. -- Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10312 [https://github.com/apache/arrow/pull/10312] > [Docs] Clarify Java support for ORC and Parquet via JNI bindings > > > Key: ARROW-12773 > URL: https://issues.apache.org/jira/browse/ARROW-12773 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Attachments: image-2021-05-13-20-31-52-890.png, > image-2021-05-13-20-34-29-206.png > > Time Spent: 1h > Remaining Estimate: 0h > > The ["Implementation Status" > document](https://arrow.apache.org/docs/status.html#third-party-data-formats) > says that Java support Parquet format by JNI while not support ORC format. > However, the [source > code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that > Java support ORC format by JNI while not support Parquet format. See the > attached snapshots for further details. > !image-2021-05-13-20-31-52-890.png! > !image-2021-05-13-20-34-29-206.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346110#comment-17346110 ] Lance Dacey commented on ARROW-12358: - Being able to update and replace specific rows would be very powerful. For my use case, I am basically overwriting the entire partition in order to update a (sometimes tiny) subset of rows. That means that I need to read the existing data for that partition which was saved previously, and the new data with updated or new rows. Then I need to sort and drop duplicates (I use pandas because there is no simple .drop_duplicates() for a pyarrow table, but adding a step with pandas can add some complication sometimes with data types), then I need to overwrite the partition (I use the partition_filename_cb to guarantee that the final file for the partition is the same). > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 5.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. 
And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
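The "mode" idea above can be sketched in plain Python (hypothetical option names, not the Arrow API): erroring on a non-empty target by default, keeping the fixed template for overwrite, and generating a unique basename template for append.

```python
import os
import uuid

def resolve_write_mode(target_dir, mode="error"):
    # "error": refuse to write into a directory that already has data.
    # "overwrite": keep the fixed template, clobbering previous parts.
    # "append": pick a unique basename template so old parts survive.
    existing = os.listdir(target_dir) if os.path.isdir(target_dir) else []
    if mode == "error":
        if existing:
            raise FileExistsError("%s already contains data" % target_dir)
        return "part-{i}.parquet"
    if mode == "overwrite":
        return "part-{i}.parquet"
    if mode == "append":
        return "part-" + uuid.uuid4().hex + "-{i}.parquet"
    raise ValueError("unknown mode: %r" % mode)
```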
[jira] [Updated] (ARROW-12773) [Docs] Clarify Java support for ORC and Parquet via JNI bindings
[ https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12773: - Summary: [Docs] Clarify Java support for ORC and Parquet via JNI bindings (was: [Docs] Implementation Status say Java support Parquet but actually no) > [Docs] Clarify Java support for ORC and Parquet via JNI bindings > > > Key: ARROW-12773 > URL: https://issues.apache.org/jira/browse/ARROW-12773 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Labels: pull-request-available > Attachments: image-2021-05-13-20-31-52-890.png, > image-2021-05-13-20-34-29-206.png > > Time Spent: 50m > Remaining Estimate: 0h > > The ["Implementation Status" > document](https://arrow.apache.org/docs/status.html#third-party-data-formats) > says that Java support Parquet format by JNI while not support ORC format. > However, the [source > code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that > Java support ORC format by JNI while not support Parquet format. See the > attached snapshots for further details. > !image-2021-05-13-20-31-52-890.png! > !image-2021-05-13-20-34-29-206.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12773) [Docs] Implementation Status say Java support Parquet but actually no
[ https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12773: - Component/s: Documentation > [Docs] Implementation Status say Java support Parquet but actually no > - > > Key: ARROW-12773 > URL: https://issues.apache.org/jira/browse/ARROW-12773 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Labels: pull-request-available > Attachments: image-2021-05-13-20-31-52-890.png, > image-2021-05-13-20-34-29-206.png > > Time Spent: 50m > Remaining Estimate: 0h > > The ["Implementation Status" > document](https://arrow.apache.org/docs/status.html#third-party-data-formats) > says that Java support Parquet format by JNI while not support ORC format. > However, the [source > code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that > Java support ORC format by JNI while not support Parquet format. See the > attached snapshots for further details. > !image-2021-05-13-20-31-52-890.png! > !image-2021-05-13-20-34-29-206.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12773) [Docs] Implementation Status say Java support Parquet but actually no
[ https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-12773: Assignee: Shuai Zhang > [Docs] Implementation Status say Java support Parquet but actually no > - > > Key: ARROW-12773 > URL: https://issues.apache.org/jira/browse/ARROW-12773 > Project: Apache Arrow > Issue Type: Bug >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Labels: pull-request-available > Attachments: image-2021-05-13-20-31-52-890.png, > image-2021-05-13-20-34-29-206.png > > Time Spent: 50m > Remaining Estimate: 0h > > The ["Implementation Status" > document](https://arrow.apache.org/docs/status.html#third-party-data-formats) > says that Java support Parquet format by JNI while not support ORC format. > However, the [source > code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that > Java support ORC format by JNI while not support Parquet format. See the > attached snapshots for further details. > !image-2021-05-13-20-31-52-890.png! > !image-2021-05-13-20-34-29-206.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12619) [Python] pyarrow sdist should not require git
[ https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12619: --- Labels: pull-request-available (was: ) > [Python] pyarrow sdist should not require git > - > > Key: ARROW-12619 > URL: https://issues.apache.org/jira/browse/ARROW-12619 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 4.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > FROM ubuntu:20.04 > RUN apt update && apt install -y python3-pip > RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > {noformat} > {noformat} > $ docker build . > ... > Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > ---> Running in 28d363e1c397 > Collecting pyarrow==4.0.0 > Downloading pyarrow-4.0.0.tar.gz (710 kB) > Installing build dependencies: started > Installing build dependencies: still running... > Installing build dependencies: finished with status 'done' > Getting requirements to build wheel: started > Getting requirements to build wheel: finished with status 'done' > Preparing wheel metadata: started > Preparing wheel metadata: finished with status 'error' > ERROR: Command errored out with exit status 1: > command: /usr/bin/python3 /tmp/tmp5rqecai7 > prepare_metadata_for_build_wheel /tmp/tmpc49gha3r > cwd: /tmp/pip-install-or1g7own/pyarrow > Complete output (42 lines): > Traceback (most recent call last): > File "/tmp/tmp5rqecai7", line 280, in > main() > File "/tmp/tmp5rqecai7", line 263, in main > json_out['return_val'] = hook(**hook_input['kwargs']) > File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel > return hook(metadata_directory, config_settings) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 166, in prepare_metadata_for_build_wheel > self.run_setup() > File > 
"/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 258, in run_setup > super(_BuildMetaLegacyBackend, > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 150, in run_setup > exec(compile(code, __file__, 'exec'), locals()) > File "setup.py", line 585, in > setup( > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py", > line 153, in setup > return distutils.core.setup(**attrs) > File "/usr/lib/python3.8/distutils/core.py", line 108, in setup > _setup_distribution = dist = klass(attrs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 434, in __init__ > _Distribution.__init__(self, { > File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__ > self.finalize_options() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 743, in finalize_options > ep(self) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 750, in _finalize_setup_keywords > ep.load()(self, ep.name, value) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py", > line 24, in version_keyword > dist.metadata.version = _get_version(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 173, in _get_version > parsed_version = _do_parse(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 119, in _do_parse > parse_result = _call_entrypoint_fn(config.absolute_root, config, > config.parse) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 54, in _call_entrypoint_fn > return fn(root) > File "setup.py", line 546, in parse_git > return parse(root, **kwargs) > File > 
"/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py", > line 115, in parse > require_command("git") > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py", > line 142, in require_command > raise OSError("%r was not found" % name) > OSError: 'git' was not found > > ERROR: Command errored out with exit status 1: /usr/bin/python3 > /tmp/tmp5rqecai7 prepare_
[jira] [Commented] (ARROW-12805) [Python] Use consistent memory_pool / pool keyword argument name
[ https://issues.apache.org/jira/browse/ARROW-12805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346108#comment-17346108 ] David Li commented on ARROW-12805: -- Note this isn't even all that consistent in C++; there's no keyword arguments of course, but while it's usually just {{pool}}, it seems some of the dataset/compute code calls it {{memory_pool}} in places like getters. > [Python] Use consistent memory_pool / pool keyword argument name > > > Key: ARROW-12805 > URL: https://issues.apache.org/jira/browse/ARROW-12805 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > Fix For: 5.0.0 > > > Most of the functions taking a MemoryPool have a {{memory_pool}} keyword for > this, but a few take a {{pool}} keyword instead (eg > {{ListArray.from_arrays}}). > We should make this consistent and have all functions use {{memory_pool}} > (probably best with deprecating {{pool}} first). -- This message was sent by Atlassian Jira (v8.3.4#803005)
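A deprecation period for {{pool}} could look like the following shim (hypothetical signature, not the actual pyarrow method): accept both keywords, warn on the old one, and forward its value.

```python
import warnings

def list_array_from_arrays(offsets, values, memory_pool=None, pool=None):
    # Accept the legacy `pool` keyword with a DeprecationWarning and map
    # it onto `memory_pool`, so existing callers keep working for a release.
    if pool is not None:
        warnings.warn("'pool' is deprecated, use 'memory_pool' instead",
                      DeprecationWarning, stacklevel=2)
        if memory_pool is None:
            memory_pool = pool
    return offsets, values, memory_pool
```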
[jira] [Commented] (ARROW-12619) [Python] pyarrow sdist should not require git
[ https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346107#comment-17346107 ] Krisztian Szucs commented on ARROW-12619: - There is a fallback_version configuration option for setuptools_scm which we don't use: https://github.com/pypa/setuptools_scm#configuration-parameters Although this setting seems to have issues according to https://github.com/pypa/setuptools_scm/issues/549 We already have a workaround in setup.py for the functionality of the fallback_version option, but it is disabled for the case of sdist: https://github.com/apache/arrow/blob/master/python/setup.py#L529 > [Python] pyarrow sdist should not require git > - > > Key: ARROW-12619 > URL: https://issues.apache.org/jira/browse/ARROW-12619 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Kouhei Sutou >Priority: Major > Fix For: 4.0.1 > > > {noformat} > FROM ubuntu:20.04 > RUN apt update && apt install -y python3-pip > RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > {noformat} > {noformat} > $ docker build . > ... > Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0 > ---> Running in 28d363e1c397 > Collecting pyarrow==4.0.0 > Downloading pyarrow-4.0.0.tar.gz (710 kB) > Installing build dependencies: started > Installing build dependencies: still running... 
> Installing build dependencies: finished with status 'done' > Getting requirements to build wheel: started > Getting requirements to build wheel: finished with status 'done' > Preparing wheel metadata: started > Preparing wheel metadata: finished with status 'error' > ERROR: Command errored out with exit status 1: > command: /usr/bin/python3 /tmp/tmp5rqecai7 > prepare_metadata_for_build_wheel /tmp/tmpc49gha3r > cwd: /tmp/pip-install-or1g7own/pyarrow > Complete output (42 lines): > Traceback (most recent call last): > File "/tmp/tmp5rqecai7", line 280, in > main() > File "/tmp/tmp5rqecai7", line 263, in main > json_out['return_val'] = hook(**hook_input['kwargs']) > File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel > return hook(metadata_directory, config_settings) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 166, in prepare_metadata_for_build_wheel > self.run_setup() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 258, in run_setup > super(_BuildMetaLegacyBackend, > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", > line 150, in run_setup > exec(compile(code, __file__, 'exec'), locals()) > File "setup.py", line 585, in > setup( > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py", > line 153, in setup > return distutils.core.setup(**attrs) > File "/usr/lib/python3.8/distutils/core.py", line 108, in setup > _setup_distribution = dist = klass(attrs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 434, in __init__ > _Distribution.__init__(self, { > File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__ > self.finalize_options() > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 743, in finalize_options > ep(self) > 
File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py", > line 750, in _finalize_setup_keywords > ep.load()(self, ep.name, value) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py", > line 24, in version_keyword > dist.metadata.version = _get_version(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 173, in _get_version > parsed_version = _do_parse(config) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 119, in _do_parse > parse_result = _call_entrypoint_fn(config.absolute_root, config, > config.parse) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py", > line 54, in _call_entrypoint_fn > return fn(root) > File "setup.py", line 546, in parse_git > return parse(root, **kwargs) > File > "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py", > line 115, in parse > require
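If the {{fallback_version}} option mentioned above did work reliably for sdists, the configuration would be a small setup.py fragment along these lines (sketch only; "4.0.0" is an assumed placeholder version):

```python
from setuptools import setup

# fallback_version is used by setuptools_scm when no git checkout is
# present, e.g. when building from an sdist; see the linked
# setuptools_scm issue #549 for its known caveats.
setup(
    use_scm_version={"fallback_version": "4.0.0"},
    setup_requires=["setuptools_scm"],
)
```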
[jira] [Updated] (ARROW-11673) [C++] Casting dictionary type to use different index type
[ https://issues.apache.org/jira/browse/ARROW-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11673: -- Fix Version/s: 5.0.0 > [C++] Casting dictionary type to use different index type > - > > Key: ARROW-11673 > URL: https://issues.apache.org/jira/browse/ARROW-11673 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It's currently not implemented to cast from one dictionary type to another > dictionary type to change the index type. > Small example: > {code} > In [2]: arr = pa.array(['a', 'b', 'a']).dictionary_encode() > In [3]: arr.type > Out[3]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>) > In [5]: arr.cast(pa.dictionary(pa.int8(), pa.string())) > ... > ArrowNotImplementedError: Unsupported cast from dictionary<values=string, indices=int32, ordered=0> to dictionary<values=string, indices=int8, ordered=0> (no available cast function for target type) > ../src/arrow/compute/cast.cc:112 > GetCastFunctionInternal(cast_options->to_type, args[0].type().get()) > {code} > From > https://stackoverflow.com/questions/66223730/how-to-change-column-datatype-with-pyarrow -- This message was sent by Atlassian Jira (v8.3.4#803005)
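Conceptually, casting a dictionary type to a different index type only needs to re-encode the indices child, with an overflow check against the narrower target. A pure-Python sketch of that check (hypothetical helper, not the Arrow kernel):

```python
def cast_dictionary_indices(indices, target_bits=8):
    # Re-encoding int32 dictionary indices as e.g. int8 is safe as long
    # as every index fits the target range; otherwise the cast must fail.
    lo, hi = -(2 ** (target_bits - 1)), 2 ** (target_bits - 1) - 1
    for i in indices:
        if i is not None and not (lo <= i <= hi):
            raise OverflowError("index %d does not fit int%d" % (i, target_bits))
    return list(indices)
```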
[jira] [Commented] (ARROW-12751) [C++] Add variadic row-wise min/max kernels (least/greatest)
[ https://issues.apache.org/jira/browse/ARROW-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346102#comment-17346102 ] Joris Van den Bossche commented on ARROW-12751: --- Numpy has this as {{np.minimum}} and {{np.maximum}}, although those are limited to a fixed number of 2 input arrays (so a binary min/max, not variadic) > [C++] Add variadic row-wise min/max kernels (least/greatest) > > > Key: ARROW-12751 > URL: https://issues.apache.org/jira/browse/ARROW-12751 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > > Add a pair of variadic functions equivalent to SQL's {{least}}/{{greatest}} > or R's {{pmin}}/{{pmax}}. Should take 0, 1, 2, ... same-length numeric arrays > as input and return an array giving the minimum/maximum of the values found > in each position of the input arrays. For example, in the case of these 2 > input arrays: > {code:java} > Array > [ > 1, > 4 > ] > Array > [ > 2, > 3 > ] > {code} > {{least}} would return: > {code:java} > Array > [ > 1, > 3 > ] > {code} > and {{greatest}} would return > {code:java} > Array > [ > 2, > 4 > ] > {code} > The returned array should have the same data type as the input arrays, or > follow promotion rules if the numeric types of the input arrays differ. > Should also accept scalar numeric inputs and recycle their values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
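The proposed semantics, including scalar recycling, can be sketched in plain Python (function names follow the issue; this is not the Arrow implementation):

```python
def _broadcast(arrays):
    # Recycle scalar inputs to the common length of the list inputs.
    n = max((len(a) for a in arrays if isinstance(a, list)), default=1)
    return [a if isinstance(a, list) else [a] * n for a in arrays]

def least(*arrays):
    # Row-wise minimum across any number of same-length arrays/scalars.
    return [min(vals) for vals in zip(*_broadcast(arrays))]

def greatest(*arrays):
    # Row-wise maximum across any number of same-length arrays/scalars.
    return [max(vals) for vals in zip(*_broadcast(arrays))]
```

For example, `least([1, 4], [2, 3])` gives `[1, 3]` and `greatest([1, 4], [2, 3])` gives `[2, 4]`.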
[jira] [Comment Edited] (ARROW-12762) [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
[ https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346100#comment-17346100 ] Joris Van den Bossche edited comment on ARROW-12762 at 5/17/21, 11:37 AM: -- [~jjgalvez] thanks for opening the issue. I can't reproduce this without pyspark; when writing the pandas dataframe to parquet with pyarrow, it seems to work: {code} In [11]: import pyarrow.parquet as pq In [12]: table = pa.table(df) In [13]: pq.write_table(table, "test_list_str.parquet") In [14]: ds = pq.ParquetDataset("test_list_str.parquet") In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema Out[15]: True In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema() Out[16]: True {code} Could you try to check what the difference is between the schemas before and after pickling? (eg if you print both, do you see a difference? Or is it in schema.metadata?) was (Author: jorisvandenbossche): [~jjgalvez] thanks for opening the issue. 
I can't reproduce this without pyspark; when writing the pandas dataframe to parquet with pyarrow, it seems to work: {code} In [12]: import pyarrow.parquet as pq In [13]: pq.write_table(table, "test_list_str.parquet") In [14]: ds = pq.ParquetDataset("test_list_str.parquet") In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema Out[15]: True In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema() Out[16]: True {code} > [Python] pyarrow.lib.Schema equality fails after pickle and unpickle > > > Key: ARROW-12762 > URL: https://issues.apache.org/jira/browse/ARROW-12762 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Juan Galvez >Priority: Major > > Here is a small reproducer: > {code:python} > import pandas as pd > from pyspark.sql import SparkSession > import pyarrow.parquet as pq > import pickle > df = pd.DataFrame( > { > "A": [ > ["aa", "bb "], > ["c"], > ["d", "ee", "", "f"], > ["ggg", "H"], > [""], > ] > } > ) > spark = SparkSession.builder.appName("GenSparkData").getOrCreate() > spark_df = spark.createDataFrame(df) > spark_df.write.parquet("list_str.pq", "overwrite") > ds = pq.ParquetDataset("list_str.pq") > assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES > assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == > ds.schema.to_arrow_schema() # FAILS > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12762) [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
[ https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346100#comment-17346100 ] Joris Van den Bossche commented on ARROW-12762: --- [~jjgalvez] thanks for opening the issue. I can't reproduce this without pyspark; when writing the pandas dataframe to parquet with pyarrow, it seems to work: {code} In [12]: import pyarrow.parquet as pq In [13]: pq.write_table(table, "test_list_str.parquet") In [14]: ds = pq.ParquetDataset("test_list_str.parquet") In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema Out[15]: True In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema() Out[16]: True {code} > [Python] pyarrow.lib.Schema equality fails after pickle and unpickle > > > Key: ARROW-12762 > URL: https://issues.apache.org/jira/browse/ARROW-12762 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Juan Galvez >Priority: Major > > Here is a small reproducer: > {code:python} > import pandas as pd > from pyspark.sql import SparkSession > import pyarrow.parquet as pq > import pickle > df = pd.DataFrame( > { > "A": [ > ["aa", "bb "], > ["c"], > ["d", "ee", "", "f"], > ["ggg", "H"], > [""], > ] > } > ) > spark = SparkSession.builder.appName("GenSparkData").getOrCreate() > spark_df = spark.createDataFrame(df) > spark_df.write.parquet("list_str.pq", "overwrite") > ds = pq.ParquetDataset("list_str.pq") > assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES > assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == > ds.schema.to_arrow_schema() # FAILS > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays
[ https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12769: --- Labels: pull-request-available (was: ) > [Python] Negative out of range slices yield invalid arrays > -- > > Key: ARROW-12769 > URL: https://issues.apache.org/jira/browse/ARROW-12769 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0, 4.0.0 >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0, 4.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > Tested on pyarrow 2.0 and pyarrow 4.0 wheels. The errors are slightly > different between the 2.0. Below is a script from 4.0 > > This is taken from the result of test_slice_array > {{ }} > {{ >>> import pyarrow as pa}} > {{ >>> pa.array(range(0,10))}} > {{ }} > {{ [}} > {{ 0,}} > {{ 1,}} > {{ 2,}} > {{ 3,}} > {{ 4,}} > {{ 5,}} > {{ 6,}} > {{ 7,}} > {{ 8,}} > {{ 9}} > {{ ]}} > {{ >>> a=pa.array(range(0,10))}} > {{ >>> a[-9:-20]}} > {{ }} > {{ []}} > {{ >>> len(a[-9:-20])}} > {{ Traceback (most recent call last):}} > {{ File "", line 1, in }} > {{ SystemError: returned NULL without setting an > error}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays
[ https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346099#comment-17346099 ] Joris Van den Bossche commented on ARROW-12769: --- It also seems to happen simply when start > stop with positive indices (eg {{arr[5:3]}}) > [Python] Negative out of range slices yield invalid arrays > -- > > Key: ARROW-12769 > URL: https://issues.apache.org/jira/browse/ARROW-12769 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0, 4.0.0 >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 5.0.0, 4.0.1 > > > Tested on pyarrow 2.0 and pyarrow 4.0 wheels. The errors are slightly > different between the 2.0. Below is a script from 4.0 > > This is taken from the result of test_slice_array > {{ }} > {{ >>> import pyarrow as pa}} > {{ >>> pa.array(range(0,10))}} > {{ }} > {{ [}} > {{ 0,}} > {{ 1,}} > {{ 2,}} > {{ 3,}} > {{ 4,}} > {{ 5,}} > {{ 6,}} > {{ 7,}} > {{ 8,}} > {{ 9}} > {{ ]}} > {{ >>> a=pa.array(range(0,10))}} > {{ >>> a[-9:-20]}} > {{ }} > {{ []}} > {{ >>> len(a[-9:-20])}} > {{ Traceback (most recent call last):}} > {{ File "", line 1, in }} > {{ SystemError: returned NULL without setting an > error}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
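For reference, Python itself normalizes out-of-range slice bounds via `slice.indices()`, so an "impossible" slice is simply empty; that well-formed empty result is the behaviour the bindings should match instead of producing an invalid array:

```python
a = list(range(10))

# -9 clamps to 1 and -20 clamps to 0, leaving an empty (start >= stop) range.
assert slice(-9, -20).indices(len(a)) == (1, 0, 1)
assert a[-9:-20] == []
assert len(a[-9:-20]) == 0  # len() stays well-defined, unlike the bug

# start > stop with positive indices is also just an empty slice.
assert a[5:3] == []
```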
[jira] [Assigned] (ARROW-12792) [R] DatasetFactory could sniff file formats
[ https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nic Crane reassigned ARROW-12792: - Assignee: (was: Nic Crane) > [R] DatasetFactory could sniff file formats > --- > > Key: ARROW-12792 > URL: https://issues.apache.org/jira/browse/ARROW-12792 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nic Crane >Priority: Minor > > I was running the following code: > {code:java} > tf <- tempfile() > dir.create(tf) > on.exit(unlink(tf)) > write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv")) > write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv")) > # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, > "file2.csv"))) > ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), >schema = Table$create(mtcars)$schema >) > {code} > But when I print the ds object, it reports that the files are Parquet files > not CSVs > {code:java} > > ds > FileSystemDataset with 2 Parquet files > mpg: double > cyl: double > disp: double > hp: double > drat: double > wt: double > qsec: double > vs: double > am: double > gear: double > carb: double{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12792) [R] DatasetFactory could sniff file formats
[ https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nic Crane resolved ARROW-12792. --- Resolution: Won't Fix > [R] DatasetFactory could sniff file formats > --- > > Key: ARROW-12792 > URL: https://issues.apache.org/jira/browse/ARROW-12792 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nic Crane >Assignee: Nic Crane >Priority: Minor > > I was running the following code: > {code:java} > tf <- tempfile() > dir.create(tf) > on.exit(unlink(tf)) > write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv")) > write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv")) > # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, > "file2.csv"))) > ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), >schema = Table$create(mtcars)$schema >) > {code} > But when I print the ds object, it reports that the files are Parquet files > not CSVs > {code:java} > > ds > FileSystemDataset with 2 Parquet files > mpg: double > cyl: double > disp: double > hp: double > drat: double > wt: double > qsec: double > vs: double > am: double > gear: double > carb: double{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12792) [R] DatasetFactory could sniff file formats
[ https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346084#comment-17346084 ] Nic Crane commented on ARROW-12792: --- After thinking about this, there are no sensible code-based changes - instead, I will add examples to the documentation on loading in non-parquet files. > [R] DatasetFactory could sniff file formats > --- > > Key: ARROW-12792 > URL: https://issues.apache.org/jira/browse/ARROW-12792 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nic Crane >Assignee: Nic Crane >Priority: Minor > > I was running the following code: > {code:java} > tf <- tempfile() > dir.create(tf) > on.exit(unlink(tf)) > write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv")) > write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv")) > # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, > "file2.csv"))) > ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), >schema = Table$create(mtcars)$schema >) > {code} > But when I print the ds object, it reports that the files are Parquet files > not CSVs > {code:java} > > ds > FileSystemDataset with 2 Parquet files > mpg: double > cyl: double > disp: double > hp: double > drat: double > wt: double > qsec: double > vs: double > am: double > gear: double > carb: double{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays
[ https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-12769: - Assignee: Joris Van den Bossche > [Python] Negative out of range slices yield invalid arrays > -- > > Key: ARROW-12769 > URL: https://issues.apache.org/jira/browse/ARROW-12769 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0, 4.0.0 >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 5.0.0, 4.0.1 > > > Tested on pyarrow 2.0 and pyarrow 4.0 wheels. The errors are slightly > different between the 2.0. Below is a script from 4.0 > > This is taken from the result of test_slice_array > {{ }} > {{ >>> import pyarrow as pa}} > {{ >>> pa.array(range(0,10))}} > {{ }} > {{ [}} > {{ 0,}} > {{ 1,}} > {{ 2,}} > {{ 3,}} > {{ 4,}} > {{ 5,}} > {{ 6,}} > {{ 7,}} > {{ 8,}} > {{ 9}} > {{ ]}} > {{ >>> a=pa.array(range(0,10))}} > {{ >>> a[-9:-20]}} > {{ }} > {{ []}} > {{ >>> len(a[-9:-20])}} > {{ Traceback (most recent call last):}} > {{ File "", line 1, in }} > {{ SystemError: returned NULL without setting an > error}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
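For comparison, Python's own sequence slicing clamps out-of-range negative bounds rather than producing an invalid object; a plain-list sketch of the semantics the fix presumably targets:

```python
# Python list slicing clamps out-of-range negative indices: the result is a
# valid (here empty) sequence, and len() works on it.
a = list(range(0, 10))

empty = a[-9:-20]    # stop is far left of start -> valid empty slice
clamped = a[-20:3]   # out-of-range start is clamped to the beginning
```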
[jira] [Commented] (ARROW-12695) [Python] bool value of scalars depends on data type
[ https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346056#comment-17346056 ] Joris Van den Bossche commented on ARROW-12695: --- Ah, I see that ARROW-12609 has some more background. > [Python] bool value of scalars depends on data type > --- > > Key: ARROW-12695 > URL: https://issues.apache.org/jira/browse/ARROW-12695 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 4.0.0 > Environment: Windows 10 > python 3.9.4 >Reporter: Sergey Mozharov >Priority: Major > > `pyarrow.Scalar` and its subclasses do not implement `__bool__` method. The > default implementation does not seem to do the right thing. For example: > {code:java} > >>> import pyarrow as pa > >>> na_value = pa.scalar(None, type=pa.int32()) > >>> bool(na_value) > True > >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())])) > >>> bool(na_value) > False > >>> bool(pa.scalar(None, type=pa.list_(pa.int32( > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__ > TypeError: object of type 'NoneType' has no len() > >>> > {code} > Please consider implementing `___bool` method. It seems reasonable to > delegate to the `bool___` method of the wrapped object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12609) [Python] TypeError when accessing length of an invalid ListScalar
[ https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346055#comment-17346055 ] Joris Van den Bossche commented on ARROW-12609: --- bq. why not return {{NullScalar}} in such case? It seems to me that {{pa.list_(pa.int32())}} means a schema that supports null values in the list, then the array should just return a null value when it hits one. [~amol-] The returned ListScalar _is_ a null value, though. Because each type supports null values, each scalar type also supports it's own null scalars. A {{NullScalar}} is what you would get when accessing a single element of a {{NullArray}}: {code} >>> arr = pa.array([None, None]) >>> arr 2 nulls >>> arr[0] {code} bq. Expected behavior: length is expected to be 0. [~mosalx] I think you could also argue that a missing list scalar has "no defined length" (why would it be zero? it's an empty list that has zero length) The problem, though, is that Python doesn't support this kind of missing or undefined values for integers ({{\_\_len\_\_}} needs to return an integer, or error) For example, if not using Python's builtin {{len}}, but using the pyarrow compute kernel to get the length of list element, we actually "propagate" the null, and the null list has a null length: {code} >>> import pyarrow.compute as pc >>> pc.list_value_length(pa.scalar([1, 2], type=pa.list_(pa.int32( >>> pc.list_value_length(pa.scalar(None, type=pa.list_(pa.int32( {code} > [Python] TypeError when accessing length of an invalid ListScalar > - > > Key: ARROW-12609 > URL: https://issues.apache.org/jira/browse/ARROW-12609 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0, 4.0.0 > Environment: Windows 10 > python=3.9.2 > pyarrow=4.0.0 (3.0.0 has the same behavior) >Reporter: Sergey Mozharov >Priority: Major > > For List-like data types, the scalar corresponding to a missing value has > '___len___' attribute, but TypeError is raised when 
it is accessed > {code:java} > import pyarrow as pa > data_type = pa.list_(pa.struct([ > ('a', pa.int64()), > ('b', pa.bool_()) > ])) > data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None] > arr = pa.array(data, type=data_type) > missing_scalar = arr[1] # > assert hasattr(missing_scalar, '__len__') > assert len(missing_scalar) == 0 # --> TypeError: object of type 'NoneType' > has no len() > {code} > Expected behavior: length is expected to be 0. > This issue causes several pandas unit tests to fail when an ExtensionArray > backed by arrow array with this data type is built. > This behavior is also inconsistent with a similar example where the data type > is a struct: > {code:java} > import pyarrow as pa > data_type = pa.struct([ > ('a', pa.int64()), > ('b', pa.bool_()) > ]) > data = [{'a': 1, 'b': False}, None] > arr = pa.array(data, type=data_type) > missing_scalar = arr[1] # > assert hasattr(missing_scalar, '__len__') > assert len(missing_scalar) == 0 # Ok > {code} > In this second example the TypeError is not raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
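The distinction being argued above (a *null* list is not the same as an *empty* list, and {{\_\_len\_\_}} must return an integer) can be sketched in plain Python with a hypothetical helper — {{list_len}} is illustrative, not a pyarrow API — that propagates null the way {{pc.list_value_length}} does:

```python
# Hypothetical helper mirroring pc.list_value_length semantics in plain
# Python: a null list propagates to a null (None) length, while an empty
# list has length 0. Python's built-in len() cannot express the null case,
# which is why ListScalar.__len__ has no obviously correct integer to return.
def list_len(value):
    if value is None:     # null list: propagate the null
        return None
    return len(value)     # empty or populated list: ordinary length

null_length = list_len(None)
empty_length = list_len([])
full_length = list_len([1, 2])
```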
[jira] [Comment Edited] (ARROW-12695) [Python] bool value of scalars depends on data type
[ https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346046#comment-17346046 ] Joris Van den Bossche edited comment on ARROW-12695 at 5/17/21, 10:17 AM: -- Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python will then always return True by default, but it seems that if your object is "sequence-like" (having a {{\_\_len\_\_}}), it will check the length. This is described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing So here the underlying reason is that this fails: {code} >>> len(pa.scalar([1, 2], type=pa.list_(pa.int32( 2 >>> len(pa.scalar(None, type=pa.list_(pa.int32( ... TypeError: object of type 'NoneType' has no len() {code} But the question is also, what should this return instead? Returning 0 in this case also doesn't feel correct, as you can also have an empty list scalar with a length of zero. In general, I think it will be hard to give a nice and consistent interface for pyarrow scalars involving null scalars (we could provide better error messages though?) [~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and what do you think it should return? (also True as the other scalars?) was (Author: jorisvandenbossche): Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python will then always return True by default, but it seems that if your object is "sequence-like" (having a {\_\_len\_\_}}), it will check the length. This is described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing So here the underlying reason is that this fails: {code} >>> len(pa.scalar([1, 2], type=pa.list_(pa.int32( 2 >>> len(pa.scalar(None, type=pa.list_(pa.int32( ... TypeError: object of type 'NoneType' has no len() {code} But the question is also, what should this return instead? 
Returning 0 in this case also doesn't feel correct, as you can also have an empty list scalar with a length of zero. In general, I think it will be hard to give a nice and consistent interface for pyarrow scalars involving null scalars (we could provide better error messages though?) [~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and what do you think it should return? (also True as the other scalars?) > [Python] bool value of scalars depends on data type > --- > > Key: ARROW-12695 > URL: https://issues.apache.org/jira/browse/ARROW-12695 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 4.0.0 > Environment: Windows 10 > python 3.9.4 >Reporter: Sergey Mozharov >Priority: Major > > `pyarrow.Scalar` and its subclasses do not implement `__bool__` method. The > default implementation does not seem to do the right thing. For example: > {code:java} > >>> import pyarrow as pa > >>> na_value = pa.scalar(None, type=pa.int32()) > >>> bool(na_value) > True > >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())])) > >>> bool(na_value) > False > >>> bool(pa.scalar(None, type=pa.list_(pa.int32( > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__ > TypeError: object of type 'NoneType' has no len() > >>> > {code} > Please consider implementing `___bool` method. It seems reasonable to > delegate to the `bool___` method of the wrapped object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12695) [Python] bool value of scalars depends on data type
[ https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346046#comment-17346046 ] Joris Van den Bossche commented on ARROW-12695: --- Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python will then always return True by default, but it seems that if your object is "sequence-like" (having a {{\_\_len\_\_}}), it will check the length. This is described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing So here the underlying reason is that this fails: {code} >>> len(pa.scalar([1, 2], type=pa.list_(pa.int32()))) 2 >>> len(pa.scalar(None, type=pa.list_(pa.int32()))) ... TypeError: object of type 'NoneType' has no len() {code} But the question is also, what should this return instead? Returning 0 in this case also doesn't feel correct, as you can also have an empty list scalar with a length of zero. In general, I think it will be hard to give a nice and consistent interface for pyarrow scalars involving null scalars (we could provide better error messages though?) [~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and what do you think it should return? (also True as the other scalars?) > [Python] bool value of scalars depends on data type > --- > > Key: ARROW-12695 > URL: https://issues.apache.org/jira/browse/ARROW-12695 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 4.0.0 > Environment: Windows 10 > python 3.9.4 >Reporter: Sergey Mozharov >Priority: Major > > `pyarrow.Scalar` and its subclasses do not implement the `__bool__` method. The > default implementation does not seem to do the right thing.
For example: > {code:java} > >>> import pyarrow as pa > >>> na_value = pa.scalar(None, type=pa.int32()) > >>> bool(na_value) > True > >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())])) > >>> bool(na_value) > False > >>> bool(pa.scalar(None, type=pa.list_(pa.int32()))) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__ > TypeError: object of type 'NoneType' has no len() > >>> > {code} > Please consider implementing the `__bool__` method. It seems reasonable to > delegate to the `__bool__` method of the wrapped object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
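The truth-value fallback described in the comment can be reproduced without pyarrow: instances are truthy by default, unless the class defines {{\_\_len\_\_}}, in which case {{bool()}} falls back to the length. A minimal sketch with throwaway classes:

```python
# Python's truth-value testing: with no __bool__ and no __len__, every
# instance is truthy; defining __len__ makes bool() fall back to the length,
# and a __len__ that raises makes bool() raise too -- which is what happens
# with a null ListScalar.
class PlainScalar:
    pass

class SequenceScalar:
    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)  # raises TypeError if items is None

always_true = bool(PlainScalar())
empty_false = bool(SequenceScalar([]))
try:
    bool(SequenceScalar(None))
    raised = False
except TypeError:
    raised = True
```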
[jira] [Commented] (ARROW-12666) [Python] Array construction from numpy array is unclear about zero copy behaviour
[ https://issues.apache.org/jira/browse/ARROW-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346032#comment-17346032 ] Joris Van den Bossche commented on ARROW-12666: --- bq. {{copy=False}} would probably have to throw an exception in some cases where we can't guarantee zero copy, like when building from a Python List Or {{copy=False}} could also not guarantee that no copy is made, but will only try to not make a copy if possible. That's basically the behaviour of the {{copy}} keyword in {{numpy.array(..)}} On the general issue, I agree that the current behaviour is not ideal and potentially being confusing/having surprising effects. But I also think it's not that easy to change. I think a lot of people rely on the zero-copy behaviour to avoid unnecessary copies (eg if you just convert to Arrow to then directly write that to Parquet file, then you don't want to make an additional copy). > [Python] Array construction from numpy array is unclear about zero copy > behaviour > - > > Key: ARROW-12666 > URL: https://issues.apache.org/jira/browse/ARROW-12666 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 4.0.0 >Reporter: Alessandro Molina >Assignee: Alessandro Molina >Priority: Major > > When building an Arrow array from a numpy array it's very confusing from the > user point of view that the result is not always a new array. 
> Under the hood Arrow sometimes reuses the memory if no casting is needed > {code:python} > npa = np.array([1, 2, 3]*3) > arrow_array = pa.array(npa, type=pa.int64()) > npa[npa == 2] = 10 > print(arrow_array.to_pylist()) > # Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3] > {code} > and sometimes doesn't if a cast is involved > {code:python} > npa = np.array([1, 2, 3]*3) > arrow_array = pa.array(npa, type=pa.int32()) > npa[npa == 2] = 10 > print(arrow_array.to_pylist()) > # Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3] > {code} > For non-primitive types it instead always copies > {code:python} > npa = np.array(["a", "b", "c"]*3) > arrow_array = pa.array(npa, type=pa.string()) > npa[npa == "b"] = "X" > print(arrow_array.to_pylist()) > # Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'] > # Different from numpy array that was modified > {code} > This behaviour needs a lot of attention from the user and understanding of > what's going on, which makes pyarrow hard to use. > A {{copy=True/False}} should be added to {{pa.array}} and the default value > should probably be {{copy=True}} so that by default you can always create an > arrow array out of a numpy one (as {{copy=False}} would probably have to > throw an exception in some cases where we can't guarantee zero copy, like > when building from a Python List) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type
[ https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346022#comment-17346022 ] Joris Van den Bossche commented on ARROW-12680: --- (in any case, we also need to document this better, to not each time have to look into old discussions / guess from the behaviour and source code, when such a question comes up ..) > [Python] StructScalar Timestamp using .to_pandas() loses/converts type > -- > > Key: ARROW-12680 > URL: https://issues.apache.org/jira/browse/ARROW-12680 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: Tim Ryles >Priority: Major > > Hi, > We're noticing an issue where we lose type and formatting on conversion to a > pandas dataframe for a particular dataset we house, which contains a struct, > and the underlying type of the child is Timestamp rather than > datetime.datetime (which we believed synonymous from Pandas documentation). > > Inside the StructArray we can see nicely formatted timestamp values, but when > we call .to_pandas() on it, we end up with epoch stamps for the date child. > {code:java} > import pyarrow.parquet as pq > tbl=pq.read_table("part-9-47f62157-cb6f-41a8-9ad6-ace65df94c6e-c000.snappy.parquet") > tbl.column("observations").chunk(0).values pyarrow.lib.StructArray object at > 0x7fc8eb0cab40> > – is_valid: all not null > – child 0 type: timestamp[ns] > [ > 2000-01-01 00:00:00.0, > 2001-01-01 00:00:00.0, > 2002-01-01 00:00:00.0, > 2003-01-01 00:00:00.0, > 2004-01-01 00:00:00.0, > 2005-01-01 00:00:00.0, > 2006-01-01 00:00:00.0, > 2007-01-01 00:00:00.0, > 2008-01-01 00:00:00.0, > 2009-01-01 00:00:00.0, > ... 
> 2018-07-01 00:00:00.0, > 2018-10-01 00:00:00.0, > 2019-01-01 00:00:00.0, > 2019-04-01 00:00:00.0, > 2019-07-01 00:00:00.0, > 2019-10-01 00:00:00.0, > 2020-01-01 00:00:00.0, > 2020-04-01 00:00:00.0, > 2020-07-01 00:00:00.0, > 2020-10-01 00:00:00.0 > ] > – child 1 type: double > [ > -2.69685, > 9.27988, > 7.26902, > -7.55753, > -1.62137, > 6.84773, > -8.21204, > -8.97041, > -1.14405, > -0.710153, > ... > 2.1658, > 3.05588, > 2.3868, > 2.10805, > 2.39984, > 2.54855, > -7.26804, > -2.35179, > -0.867518, > 0.150593 > ] > {code} > {code:java} > > tbl.to_pandas()['observations'] > [{'date': 9466848000, 'value': -2.6968... 1 [{'date': > 9466848000, 'value': 57.9608... 2 [{'date': 14832288000, > 'value': 95.904... 3 [{'date': 12148704000, 'value': 19.021... 4 > [{'date': 11991456000, 'value': 1.2011... ... 636 [\{'date': > 10729152000, 'value': 5.418}... 637 [{'date': 9466848000, > 'value': 110.695... 638 [{'date': 10098432000, 'value': 3.0094... 639 > [{'date': 12228192000, 'value': 48.365... 640 [{'date': > 11991456000, 'value': 1.5600... Name: observations, Length: 641, > dtype: object > In [12]: tbl.to_pandas()["observations"].iloc[0][0] > Out[12]: {'date': 10413792000, 'value': 249.523242} > # date is now type Int{code} > > We notice that if we take the same table, save it back out to a file first, > and then check the chunk(0).values as above, the underlying type changes from > *Timestamp* to *datetime.datetime*, and that will now convert .to_pandas() > correctly. > {code:java} > pq.write_table(tbl, "output.parquet") > tbl2=pq.read_table("output.parquet") > tbl2.column("observations").chunk(0).values[0] > Out[17]: 'value': 249.523242}> > tbl2.column("observations").chunk(0).to_pandas() > Out[18]: > 0[{'date': 2003-01-01 00:00:00, 'value': 249.52... > 1[{'date': 2008-01-01 00:00:00, 'value': 29.741... > 2[{'date': 2000-01-01 00:00:00, 'value': 2.3454... > 3[{'date': 2006-01-01 00:00:00, 'value': 1.2048... > 4[{'date': 2008-01-01 00:00:00, 'value': 196546... >... 
> 29489[{'date': 2010-01-01 00:00:00, 'value': 19.155... > 29490[{'date': 2012-04-30 00:00:00, 'value': 0.0}, ... > 29491[{'date': 2012-04-30 00:00:00, 'value': 0.0}, ... > 29492[{'date': 2012-04-30 00:00:00, 'value': 0.0}, ... > 29493[{'date': 2012-04-30 00:00:00, 'value': 10.0},... > Length: 29494, dtype: object > tbl2.to_pandas()["observations"].iloc[0][0] > Out[8]: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242} > # date remains as datetime.datetime{code} > > Thanks in advance, and apologies if I have not followed Issue protocol on > this board. > If there is a parameter that we just need to pass into .to_pandas for this to > take place (I can see there is date_
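The integers appearing in the converted frame are raw epoch offsets from the {{timestamp[ns]}} child. A stdlib sketch of recovering a readable datetime from such a value (the epoch integer below is illustrative, not taken from the dataset above):

```python
from datetime import datetime, timezone

# An illustrative nanosecond epoch integer, as to_pandas() currently emits
# for a timestamp[ns] field inside a struct: 2000-01-01T00:00:00Z.
epoch_ns = 946_684_800_000_000_000

# Scale nanoseconds down to seconds to interpret it with the stdlib.
dt = datetime.fromtimestamp(epoch_ns / 1_000_000_000, tz=timezone.utc)
```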
[jira] [Commented] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type
[ https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346020#comment-17346020 ] Joris Van den Bossche commented on ARROW-12680: --- Reading the comments in https://github.com/apache/arrow/pull/6322 again, I _think_ the fact that for nanosecond timestamps inside a struct, we currently return integer epoch is kind of deliberate? (because of the lack of a better alternative, since {{datetime.datetime}} cannot represent nanoseconds) Although for struct _scalars_, we actually use pandas.Timestamp for nanosecond resolution columns: {code} In [42]: sarr[0] Out[42]: In [43]: sarr[0].as_py() Out[43]: {'ms': datetime.datetime(2021, 5, 17, 11, 3, 58, 947000), 'ns': Timestamp('2021-05-17 11:03:58.947224')} {code} Of course, that is only possible if pandas is installed. And so maybe that's the reason that for array conversion we simply always use the "safe" integer epoch option. But it's certainly somewhat inconsistent. > [Python] StructScalar Timestamp using .to_pandas() loses/converts type > -- > > Key: ARROW-12680 > URL: https://issues.apache.org/jira/browse/ARROW-12680 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: Tim Ryles >Priority: Major > > Hi, > We're noticing an issue where we lose type and formatting on conversion to a > pandas dataframe for a particular dataset we house, which contains a struct, > and the underlying type of the child is Timestamp rather than > datetime.datetime (which we believed synonymous from Pandas documentation). > > Inside the StructArray we can see nicely formatted timestamp values, but when > we call .to_pandas() on it, we end up with epoch stamps for the date child. 
> {code:java} > import pyarrow.parquet as pq > tbl=pq.read_table("part-9-47f62157-cb6f-41a8-9ad6-ace65df94c6e-c000.snappy.parquet") > tbl.column("observations").chunk(0).values pyarrow.lib.StructArray object at > 0x7fc8eb0cab40> > – is_valid: all not null > – child 0 type: timestamp[ns] > [ > 2000-01-01 00:00:00.0, > 2001-01-01 00:00:00.0, > 2002-01-01 00:00:00.0, > 2003-01-01 00:00:00.0, > 2004-01-01 00:00:00.0, > 2005-01-01 00:00:00.0, > 2006-01-01 00:00:00.0, > 2007-01-01 00:00:00.0, > 2008-01-01 00:00:00.0, > 2009-01-01 00:00:00.0, > ... > 2018-07-01 00:00:00.0, > 2018-10-01 00:00:00.0, > 2019-01-01 00:00:00.0, > 2019-04-01 00:00:00.0, > 2019-07-01 00:00:00.0, > 2019-10-01 00:00:00.0, > 2020-01-01 00:00:00.0, > 2020-04-01 00:00:00.0, > 2020-07-01 00:00:00.0, > 2020-10-01 00:00:00.0 > ] > – child 1 type: double > [ > -2.69685, > 9.27988, > 7.26902, > -7.55753, > -1.62137, > 6.84773, > -8.21204, > -8.97041, > -1.14405, > -0.710153, > ... > 2.1658, > 3.05588, > 2.3868, > 2.10805, > 2.39984, > 2.54855, > -7.26804, > -2.35179, > -0.867518, > 0.150593 > ] > {code} > {code:java} > > tbl.to_pandas()['observations'] > [{'date': 9466848000, 'value': -2.6968... 1 [{'date': > 9466848000, 'value': 57.9608... 2 [{'date': 14832288000, > 'value': 95.904... 3 [{'date': 12148704000, 'value': 19.021... 4 > [{'date': 11991456000, 'value': 1.2011... ... 636 [\{'date': > 10729152000, 'value': 5.418}... 637 [{'date': 9466848000, > 'value': 110.695... 638 [{'date': 10098432000, 'value': 3.0094... 639 > [{'date': 12228192000, 'value': 48.365... 640 [{'date': > 11991456000, 'value': 1.5600... 
Name: observations, Length: 641, > dtype: object > In [12]: tbl.to_pandas()["observations"].iloc[0][0] > Out[12]: {'date': 10413792000, 'value': 249.523242} > # date is now type Int{code} > > We notice that if we take the same table, save it back out to a file first, > and then check the chunk(0).values as above, the underlying type changes from > *Timestamp* to *datetime.datetime*, and that will now convert .to_pandas() > correctly. > {code:java} > pq.write_table(tbl, "output.parquet") > tbl2=pq.read_table("output.parquet") > tbl2.column("observations").chunk(0).values[0] > Out[17]: 'value': 249.523242}> > tbl2.column("observations").chunk(0).to_pandas() > Out[18]: > 0[{'date': 2003-01-01 00:00:00, 'value': 249.52... > 1[{'date': 2008-01-01 00:00:00, 'value': 29.741... > 2[{'date': 2000-01-01 00:00:00, 'value': 2.3454... > 3[{'date': 2006-01-01 00:00:00, 'value': 1.2048... > 4[{'date': 2008-01-01 00:00:00, 'value': 196546... >... > 29489[{'date': 2010-01-01 00:00:00, 'value': 19.155... > 29490[{'date': 2012-04-30 00:00
[jira] [Comment Edited] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type
[ https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341003#comment-17341003 ] Joris Van den Bossche edited comment on ARROW-12680 at 5/17/21, 9:05 AM: - I think there is indeed a bug here. Let me try and demystify some of what is going on. There are 5+ temporal types in pyarrow but everything you are doing is currently related to just one, the timestamp type. The timestamp type represents seconds, milliseconds, microseconds, or nanoseconds from the epoch. In addition there may or may not be a time zone string. Finally, these types may or may not be in a struct (which shouldn't matter but does here...which is the bug). In pandas there are 3+ ways to represent temporal information. The datetime.datetime object, A pandas.Timestamp, and an integer. When you first read in your table you are getting a struct where the 'date' field is a timestamp with **nanosecond** resolution. When you save your table and then reload it the timestamp is being truncated. This is because pq.write_table with version==1.0 (the default in pyarrow 3) will truncate nanosecond timestamps down to microseconds. So when you next read in your table you are getting a struct where the 'date' field is a timestamp with **microsecond** resolution. Finally, It seems this may be a regression of https://issues.apache.org/jira/browse/ARROW-7723 {code:java} import pyarrow as pa import datetime pylist = [datetime.datetime.now()] arr1 = pa.array(pylist, pa.timestamp(unit='ms')) arr2 = pa.array(pylist, pa.timestamp(unit='ns')) sarr = pa.StructArray.from_arrays([arr1, arr2], names=['ms', 'ns']) table = pa.Table.from_arrays([arr1, arr2, sarr], ['ms', 'ns', 'struct']) print(table.to_pandas()) {code} {code:java} ms ns struct 0 2021-05-07 08:46:15.898 2021-05-07 08:46:15.898716 {'ms': 2021-05-07 08:46:15.898000, 'ns': 16203... 
{code} As for workarounds...if your schema is reliable you could cast from nanosecond resolution to us resolution (struct casting isn't working quite right (ARROW-1888) so it's a bit clunky): {code:java} import pyarrow as pa import pyarrow.compute as pc dates = pa.array([datetime.datetime.now()], pa.timestamp(unit='ns')) values = pa.array([200.37], pa.float64()) observations = pa.StructArray.from_arrays([dates, values], names=['dates', 'values']) desired_type = pa.struct([pa.field('dates', pa.timestamp(unit='us')), pa.field('values', pa.float64())]) tbl = pa.Table.from_arrays([observations], ['observations']) print(tbl.to_pandas()) bad_observations = tbl.column('observations').chunks values = [chunk.field('values') for chunk in bad_observations] bad_dates = [chunk.field('dates') for chunk in bad_observations] good_dates = [pc.cast(bad_dates_chunk, pa.timestamp(unit='us')) for bad_dates_chunk in bad_dates] good_observations_chunks = [] for i in range(len(good_dates)): good_observations_chunks.append(pa.StructArray.from_arrays([good_dates[i], values[i]], names=['dates', 'values'])) good_observations = pa.chunked_array(good_observations_chunks) tbl = tbl.set_column(0, 'observations', good_observations) print(tbl.to_pandas()) {code} was (Author: westonpace): I think there is indeed a bug here. Let me try and demystify some of what is going on. There are 5+ temporal types in pyarrow but everything you are doing is currently related to just one, the timestamp type. The timestamp type represents seconds, milliseconds, microseconds, or nanoseconds from the epoch. In addition there may or may not be a time zone string. Finally, these types may or may not be in a struct (which shouldn't matter but does here...which is the bug). In pandas there are 3+ ways to represent temporal information. The datetime.datetime object, A pandas.Timestamp, and an integer. 
When you first read in your table you are getting a struct where the 'date' field is a timestamp with **nanosecond** resolution. When you save your table and then reload it the timestamp is being truncated. This is because pq.write_table with version==1.0 (the default in pyarrow 3) will truncate nanosecond timestamps down to microseconds. So when you next read in your table you are getting a struct where the 'date' field is a timestamp with **microsecond** resolution. Finally, It seems this may be a regression of https://issues.apache.org/jira/browse/ARROW-7723 {code:java} import pyarrow as pa import datetimepylist = [datetime.datetime.now()] arr1 = pa.array(pylist, pa.timestamp(unit='ms')) arr2 = pa.array(pylist, pa.timestamp(unit='ns')) sarr = pa.StructArray.from_arrays([arr1, arr2], names=['ms', 'ns']) table = pa.Table.from_arrays([arr1, arr2, sarr], ['ms', 'ns', 'struct']) print(table.to_pandas()) {code} {code:java} ms