[jira] [Commented] (ARROW-12795) [CI] R-Hub Ubuntu GCC (Docker) can't install bit64

2021-05-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346519#comment-17346519
 ] 

Mauricio 'Pachá' Vargas Sepúlveda commented on ARROW-12795:
---

I need to send a PR to R-Hub to fix the bit64 installation in the Docker image.

> [CI] R-Hub Ubuntu GCC (Docker) can't install bit64
> --
>
> Key: ARROW-12795
> URL: https://issues.apache.org/jira/browse/ARROW-12795
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
>
> Therefore, the arrow package can't be installed:
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=5160&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=200
> the Docker image was updated 16 hours ago, which explains this new problem
> https://hub.docker.com/r/rhub/ubuntu-gcc-release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11581) [Packaging][C++] Formalize distribution through vcpkg

2021-05-17 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346516#comment-17346516
 ] 

Ian Cook commented on ARROW-11581:
--

Draft PR at https://github.com/microsoft/vcpkg/pull/17975

> [Packaging][C++] Formalize distribution through vcpkg
> -
>
> Key: ARROW-11581
> URL: https://issues.apache.org/jira/browse/ARROW-11581
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Packaging
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>
> Currently there is a port of Arrow on vcpkg [1] that is maintained by folks 
> outside the core Arrow developer community. We should consider formalizing 
> distribution of Arrow releases through vcpkg, in collaboration with the 
> existing maintainers of the Arrow vcpkg port if they are interested in 
> staying involved.
> [1] https://github.com/microsoft/vcpkg/tree/master/ports/arrow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12817) [CI] StackOverflow error while building SQL Catalyst

2021-05-17 Thread Jira
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12817:
-

 Summary: [CI] StackOverflow error while building SQL Catalyst
 Key: ARROW-12817
 URL: https://issues.apache.org/jira/browse/ARROW-12817
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Mauricio 'Pachá' Vargas Sepúlveda


_Exception when compiling 474 sources to 
/spark/sql/catalyst/target/scala-2.12/classes_
See https://github.com/ursacomputing/crossbow/runs/2597929901#step:7:9761

This appeared for the first time on 2021-05-13, and before the error we see 
'no dependency information available' warnings 
(https://github.com/ursacomputing/crossbow/runs/2597929901#step:7:9627)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10117) [C++] Implement work-stealing scheduler / multiple queues in ThreadPool

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346470#comment-17346470
 ] 

Weston Pace commented on ARROW-10117:
-

I'm a little confused by the word "workload" in the second paragraph.  
Traditional work stealing attempts to keep tasks together based on thread/core 
to preserve cache coherency.  This is what appears to be described in the first 
paragraph.

 

In the second paragraph are you asking for the capability to also group tasks 
based on workload?  If so, I'm not sure what the benefit would be.  If not, I 
don't think we'll end up needing to modify the API.  A task can keep a 
thread_local reference to its queue.
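
A minimal sketch of the per-thread-queue design under discussion (mutex-based 
for clarity; not Arrow's actual ThreadPool API): workers pop from the back of 
their own deque and steal from the front of others', and a thread_local index 
plays the role of the "reference to its queue" mentioned above.

{code:cpp}
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct WorkQueue {
  std::mutex mu;
  std::deque<Task> tasks;
};

// Each worker thread sets this to its own queue index before its run loop,
// so tasks it spawns can be pushed to the local queue without an API change.
thread_local size_t my_queue_index = 0;

class StealingScheduler {
 public:
  explicit StealingScheduler(size_t n_threads) : queues_(n_threads) {}

  void Push(Task t) {
    WorkQueue& q = queues_[my_queue_index];
    std::lock_guard<std::mutex> lock(q.mu);
    q.tasks.push_back(std::move(t));
  }

  // Pop from our own queue's back (warm cache); if it is empty, steal from
  // the front (cold end) of another queue, preserving the owner's locality.
  std::optional<Task> Next() {
    const size_t self = my_queue_index;
    for (size_t i = 0; i < queues_.size(); ++i) {
      const size_t victim = (self + i) % queues_.size();
      WorkQueue& q = queues_[victim];
      std::lock_guard<std::mutex> lock(q.mu);
      if (q.tasks.empty()) continue;
      Task t = std::move(victim == self ? q.tasks.back() : q.tasks.front());
      if (victim == self) q.tasks.pop_back(); else q.tasks.pop_front();
      return t;
    }
    return std::nullopt;
  }

 private:
  std::vector<WorkQueue> queues_;  // one queue per worker thread
};
{code}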

> [C++] Implement work-stealing scheduler / multiple queues in ThreadPool
> ---
>
> Key: ARROW-10117
> URL: https://issues.apache.org/jira/browse/ARROW-10117
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This involves a change from a single task queue shared amongst all threads to 
> a per-thread task queue and the ability for idle threads to take tasks from 
> other threads' queues (work stealing). 
> As part of this, the task submission API would need to be evolved in some 
> fashion to allow for tasks related to a particular workload to end up in the 
> same task queue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12689) [R] Implement ArrowArrayStream C interface

2021-05-17 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-12689.
-
Resolution: Fixed

Issue resolved by pull request 10307
[https://github.com/apache/arrow/pull/10307]

> [R] Implement ArrowArrayStream C interface
> --
>
> Key: ARROW-12689
> URL: https://issues.apache.org/jira/browse/ARROW-12689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/97879eb970bac52d93d2247200b9ca7acf6f3f93,
>  which adds it and also adds Python bindings. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12813) [C++] Expose a `full` (array creation) capability to python/R

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346461#comment-17346461
 ] 

Weston Pace commented on ARROW-12813:
-

I'd be fine with exposing a utility function instead of a compute function.  I 
still don't really understand how to make the distinction or what the impact of 
such a choice would be.

> [C++] Expose a `full` (array creation) capability to python/R
> -
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12813) [C++] Expose a `full` (array creation) capability to python/R

2021-05-17 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-12813:

Summary: [C++] Expose a `full` (array creation) capability to python/R  
(was: [C++] Support for a `full` compute function)

> [C++] Expose a `full` (array creation) capability to python/R
> -
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12816) [C++] C++14 laundry list

2021-05-17 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346454#comment-17346454
 ] 

Ben Kietzman commented on ARROW-12816:
--

[~apitrou] [~westonpace] [~lidavidm]

> [C++] C++14 laundry list
> 
>
> Key: ARROW-12816
> URL: https://issues.apache.org/jira/browse/ARROW-12816
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> Improvements to make/be aware of once C++14 is available:
> - Ensure that values are moved into lambda closures wherever 
> possible/appropriate. We have a lot of local variables whose only function is 
> getting copied into a closure, including some demotions to shared_ptr since 
> move-only types can't be closed over in C++11
> - visitor pattern can be used more fluidly with template lambdas, for example 
> we could have a utility like {{ VisitInts(offset_width, offset_bytes, 
> [&](auto* offsets) { /*mutate offsets*/ }) }}
> - constexpr switch, for use in type traits functions
> - std::enable_if_t
> - std::quoted is available for quoting strings
> - std::make_unique
> - standard {{[[deprecated]]}} attribute
> - temporal literals such as ""s, ""ns, ...
> - binary literals with digit separators such as 0b1100'1010



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12816) [C++] C++14 laundry list

2021-05-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-12816:


 Summary: [C++] C++14 laundry list
 Key: ARROW-12816
 URL: https://issues.apache.org/jira/browse/ARROW-12816
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman


Improvements to make/be aware of once C++14 is available (a short sketch of 
several of these follows the list):

- Ensure that values are moved into lambda closures wherever 
possible/appropriate. We have a lot of local variables whose only function is 
getting copied into a closure, including some demotions to shared_ptr since 
move-only types can't be closed over in C++11
- visitor pattern can be used more fluidly with template lambdas, for example 
we could have a utility like {{ VisitInts(offset_width, offset_bytes, [&](auto* 
offsets) { /*mutate offsets*/ }) }}
- constexpr switch, for use in type traits functions
- std::enable_if_t
- std::quoted is available for quoting strings
- std::make_unique
- standard {{[[deprecated]]}} attribute
- temporal literals such as ""s, ""ns, ...
- binary literals with digit separators such as 0b1100'1010
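
A minimal C++14 sketch of several of the items above: init-capture (moving 
into a closure), a generic lambda, std::make_unique, the {{[[deprecated]]}} 
attribute, a temporal literal, and a binary literal with digit separators. 
The names are illustrative, not Arrow code.

{code:cpp}
#include <chrono>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

[[deprecated("use the replacement instead")]] void OldApi() {}

int main() {
  using namespace std::chrono_literals;

  auto big = std::make_unique<std::vector<int>>(3, 42);
  // C++14 init-capture: move the unique_ptr into the closure instead of
  // demoting it to shared_ptr as C++11 would force.
  auto task = [v = std::move(big)] { return v->size(); };
  std::cout << task() << "\n";  // 3

  // Generic (template) lambda: one body, instantiated per argument type.
  auto print = [](const auto& x) { std::cout << x << "\n"; };
  print(10ns.count());  // temporal literal: 10 nanoseconds
  print(0b1100'1010);   // binary literal with digit separators
}
{code}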



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9430) [C++/Python] Kernel for SetItem(BooleanArray, values)

2021-05-17 Thread Niranda Perera (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346436#comment-17346436
 ] 

Niranda Perera commented on ARROW-9430:
---

[~jorisvandenbossche] I agree. I think ARROW-11044 only resolves the 'scalar 
replacement'.

> [C++/Python] Kernel for SetItem(BooleanArray, values)
> -
>
> Key: ARROW-9430
> URL: https://issues.apache.org/jira/browse/ARROW-9430
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Uwe Korn
>Priority: Major
>
> We should have a kernel that allows overriding the values of an array by 
> supplying a boolean mask and a scalar or an array of equal length.
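
A sketch of the requested semantics on plain vectors, to make the kernel 
contract concrete (illustrative, not Arrow's API). The array-valued 
replacement case is shown; the scalar case is the same loop with a single 
value.

{code:cpp}
#include <cassert>
#include <vector>

// Replace values[i] with replacements[i] wherever mask[i] is true.
void SetItem(std::vector<int>& values, const std::vector<bool>& mask,
             const std::vector<int>& replacements) {
  assert(values.size() == mask.size());
  assert(values.size() == replacements.size());
  for (size_t i = 0; i < values.size(); ++i) {
    if (mask[i]) values[i] = replacements[i];
  }
}
{code}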



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9431) [C++/Python] Kernel for SetItem(IntegerArray, values)

2021-05-17 Thread Niranda Perera (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346435#comment-17346435
 ] 

Niranda Perera commented on ARROW-9431:
---

Identical issue

> [C++/Python] Kernel for SetItem(IntegerArray, values)
> -
>
> Key: ARROW-9431
> URL: https://issues.apache.org/jira/browse/ARROW-9431
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Priority: Major
>
> We should have a kernel that allows overriding the values of an array using 
> an integer array as the indexer and a scalar or array of equal length as the 
> values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation

2021-05-17 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346422#comment-17346422
 ] 

Nic Crane commented on ARROW-12789:
---

Also, feel free to push back on this if you don't think there's much 
application beyond R. I can always implement it in the R package's C++ code if 
this turns out to be a special case that won't be needed elsewhere.

> [C++] Support for scalar value recycling in RecordBatch/Table creation
> --
>
> Key: ARROW-12789
> URL: https://issues.apache.org/jira/browse/ARROW-12789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nic Crane
>Priority: Major
>
> Please can we have the capability to recycle scalar values during table 
> creation?  It would work as follows:
> Upon creation of a new Table/RecordBatch, the length of each column is 
> checked.  If:
>  * number of columns is > 1 and
>  * any columns have length 1 and
>  * not all columns have length 1
> then, the value in the length 1 column(s) should be repeated to make it as 
> long as the other columns. 
> This should only occur if all columns either have length 1 or N (where N is 
> some value greater than 1), and if any columns lengths are values other than 
> 1 or N, we should still get an error as we do now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12744) [C++][Compute] Add rounding kernel

2021-05-17 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346417#comment-17346417
 ] 

Ben Kietzman commented on ARROW-12744:
--

1. IMHO, compute kernels should not rely on (or be affected in any way by) the 
floating-point environment. Users may need to adjust it for their own 
applications, and Arrow's kernels should produce correct output regardless.

2. Output should be of the same floating-point type as the input, since the 
extent of rounding is configurable (probably via a function option like 
{{RoundOptions::ndigits}}), whereas integral output is only well-formed when 
rounding to the nearest one.
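
To make point 2 concrete, a sketch of scalar rounding with a configurable 
digit count; {{RoundOptions::ndigits}} is the suggested name above, not 
committed API, and this uses the standard library rather than the Arrow 
kernel. Output stays floating point, and ties go away from zero (std::round's 
behavior, which is independent of the fenv rounding mode).

{code:cpp}
#include <cmath>

// Round x to ndigits decimal places; ties away from zero via std::round.
double RoundToDigits(double x, int ndigits) {
  const double scale = std::pow(10.0, ndigits);
  return std::round(x * scale) / scale;
}
// RoundToDigits(2.5, 0)    ==  3.0   (midpoint rounds away from zero)
// RoundToDigits(-2.5, 0)   == -3.0
// RoundToDigits(1.2345, 2) ==  1.23  (modulo binary floating-point error)
{code}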

> [C++][Compute] Add rounding kernel
> --
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12814) [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and TRUNC functions

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12814:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and 
> TRUNC functions
> 
>
> Key: ARROW-12814
> URL: https://issues.apache.org/jira/browse/ARROW-12814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Anthony Louis Gotlib Ferreira
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12815) [C++] Warning when compiling on ubuntu 21.04

2021-05-17 Thread Nate Clark (Jira)
Nate Clark created ARROW-12815:
--

 Summary: [C++] Warning when compiling on ubuntu 21.04
 Key: ARROW-12815
 URL: https://issues.apache.org/jira/browse/ARROW-12815
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 5.0.0
Reporter: Nate Clark


Warning generated when compiling using gcc 10.2 on ubuntu 21.04
{noformat}
In file included from apache-arrow/cpp/src/arrow/chunked_array.h:26,
from apache-arrow/cpp/src/arrow/table.h:25,
from apache-arrow/cpp/src/arrow/table.cc:18,
from src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_4_cxx.cxx:4:
apache-arrow/cpp/src/arrow/tensor.cc: In member function 
‘arrow::Tensor::CountNonZero() const’:
apache-arrow/cpp/src/arrow/result.h:446:5: warning: ‘MEM[(long int &)&counter + 
8]’ may be used uninitialized in this function [-Wmaybe-uninitialized]
446 | new (&data_) T(std::forward(u));
| ^~
In file included from 
src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_4_cxx.cxx:6:
apache-arrow/cpp/src/arrow/tensor.cc:337:18: note: ‘MEM[(long int &)&counter + 
8]’ was declared here
337 | NonZeroCounter counter(*this);
| ^~~ {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12744:
---
Labels: pull-request-available  (was: )

> [C++][Compute] Add rounding kernel
> --
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel

2021-05-17 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12744:
--
Summary: [C++][Compute] Add rounding kernel  (was: [C++] Add rounding 
kernel)

> [C++][Compute] Add rounding kernel
> --
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12755) [C++][Compute] Add quotient and modulo kernels

2021-05-17 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12755:
--
Summary: [C++][Compute] Add quotient and modulo kernels  (was: [C++] Add 
quotient and modulo kernels)

> [C++][Compute] Add quotient and modulo kernels
> --
>
> Key: ARROW-12755
> URL: https://issues.apache.org/jira/browse/ARROW-12755
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Add a pair of binary kernels to compute the:
>  * quotient (result after division, discarding any fractional part, a.k.a. 
> integer division)
>  * mod or modulo (remainder after division, a.k.a. {{%}} / {{%%}} / modulus).
> The returned array should have the same data type as the input arrays, or 
> promote to an appropriate type to avoid loss of precision if the input types 
> differ.
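
For reference, C++'s built-in integer operators already match the quoted 
semantics for the same-type case: {{/}} truncates toward zero and {{%}} takes 
the sign of the dividend. A small standalone demo (not the Arrow kernels):

{code:cpp}
#include <cstdio>

int main() {
  std::printf("7/2=%d  7%%2=%d\n", 7 / 2, 7 % 2);      // 3, 1
  std::printf("-7/2=%d  -7%%2=%d\n", -7 / 2, -7 % 2);  // -3, -1
  // Note: R's %% is floored (takes the sign of the divisor), so a modulo
  // kernel has to pick one convention; C++ '%' is truncated division.
}
{code}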



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12814) [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and TRUNC functions

2021-05-17 Thread Anthony Louis Gotlib Ferreira (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Louis Gotlib Ferreira updated ARROW-12814:
--
Summary: [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, 
RSHIFT and TRUNC functions  (was: [C++][Gandiva] Implements math functions)

> [C++][Gandiva] Implements ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT and 
> TRUNC functions
> 
>
> Key: ARROW-12814
> URL: https://issues.apache.org/jira/browse/ARROW-12814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Anthony Louis Gotlib Ferreira
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12745) [C++][Compute] Add floor and ceiling kernels

2021-05-17 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12745:
--
Summary: [C++][Compute] Add floor and ceiling kernels  (was: [C++] Add 
floor and ceiling kernels)

> [C++][Compute] Add floor and ceiling kernels
> 
>
> Key: ARROW-12745
> URL: https://issues.apache.org/jira/browse/ARROW-12745
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> Kernels to round each value in an array of floating point numbers to:
>  * the nearest integer less than or equal to it ({{floor}})
>  * the nearest integer greater than or equal to it ({{ceiling}})
> Should return an array of the same type as the input (not an integer type)
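
The standard library already pins down these semantics for scalars; a small 
sketch (the result stays floating point, as the issue asks):

{code:cpp}
#include <cmath>
#include <cstdio>

int main() {
  for (double x : {-1.5, -0.5, 0.5, 1.5}) {
    // floor: nearest integer <= x; ceil: nearest integer >= x.
    std::printf("x=%5.1f  floor=%5.1f  ceil=%5.1f\n", x, std::floor(x),
                std::ceil(x));
  }
}
{code}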



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346391#comment-17346391
 ] 

Antoine Pitrou commented on ARROW-12813:


I don't dispute it's useful. I'm just not convinced we need to make a compute 
function out of it.

> [C++] Support for a `full` compute function
> ---
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12814) [C++][Gandiva] Implements math functions

2021-05-17 Thread Anthony Louis Gotlib Ferreira (Jira)
Anthony Louis Gotlib Ferreira created ARROW-12814:
-

 Summary: [C++][Gandiva] Implements math functions
 Key: ARROW-12814
 URL: https://issues.apache.org/jira/browse/ARROW-12814
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Anthony Louis Gotlib Ferreira
Assignee: Anthony Louis Gotlib Ferreira






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds

2021-05-17 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved ARROW-12604.
--
Resolution: Fixed

> [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
> ---
>
> Key: ARROW-12604
> URL: https://issues.apache.org/jira/browse/ARROW-12604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, R
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 5.0.0, 4.0.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds

2021-05-17 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reopened ARROW-12604:
--

> [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
> ---
>
> Key: ARROW-12604
> URL: https://issues.apache.org/jira/browse/ARROW-12604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, R
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 5.0.0, 4.0.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12788) [C++] arrow::compute::Expression::type_id() function

2021-05-17 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook closed ARROW-12788.


> [C++] arrow::compute::Expression::type_id() function
> 
>
> Key: ARROW-12788
> URL: https://issues.apache.org/jira/browse/ARROW-12788
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>
> There is a function {{type()}} that returns the type of a post-bind 
> {{Expression}} as a {{std::shared_ptr}}:
>  
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/expression.h#L105]
> It would be convenient to also have a function {{type_id()}} that returns 
> this as an {{arrow::Type::type}}.
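
What the requested convenience amounts to, as a hedged sketch: a free helper 
here, where the proposed member would fold the two calls into one.

{code:cpp}
#include "arrow/compute/exec/expression.h"

// Today callers write expr.type()->id(); a type_id() member would do this.
arrow::Type::type TypeId(const arrow::compute::Expression& expr) {
  return expr.type()->id();
}
{code}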



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12810) [Python] Run tests with AWS_EC2_METADATA_DISABLED=true

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12810:
---
Labels: pull-request-available  (was: )

> [Python] Run tests with AWS_EC2_METADATA_DISABLED=true
> --
>
> Key: ARROW-12810
> URL: https://issues.apache.org/jira/browse/ARROW-12810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This explains why some tests are so slow. There's already a few tests that 
> work around this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12604) [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds

2021-05-17 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook closed ARROW-12604.

Resolution: Fixed

> [R][Packaging] Dataset, Parquet off in autobrew and CRAN Mac builds
> ---
>
> Key: ARROW-12604
> URL: https://issues.apache.org/jira/browse/ARROW-12604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, R
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 5.0.0, 4.0.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12619) [Python] pyarrow sdist should not require git

2021-05-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12619.
-
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10342
[https://github.com/apache/arrow/pull/10342]

> [Python] pyarrow sdist should not require git
> -
>
> Key: ARROW-12619
> URL: https://issues.apache.org/jira/browse/ARROW-12619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0, 4.0.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {noformat}
> FROM ubuntu:20.04
> RUN apt update && apt install -y python3-pip
> RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
> {noformat}
> {noformat}
> $ docker build .
> ...
> Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
>  ---> Running in 28d363e1c397
> Collecting pyarrow==4.0.0
>   Downloading pyarrow-4.0.0.tar.gz (710 kB)
>   Installing build dependencies: started
>   Installing build dependencies: still running...
>   Installing build dependencies: finished with status 'done'
>   Getting requirements to build wheel: started
>   Getting requirements to build wheel: finished with status 'done'
> Preparing wheel metadata: started
> Preparing wheel metadata: finished with status 'error'
> ERROR: Command errored out with exit status 1:
>  command: /usr/bin/python3 /tmp/tmp5rqecai7 
> prepare_metadata_for_build_wheel /tmp/tmpc49gha3r
>  cwd: /tmp/pip-install-or1g7own/pyarrow
> Complete output (42 lines):
> Traceback (most recent call last):
>   File "/tmp/tmp5rqecai7", line 280, in 
> main()
>   File "/tmp/tmp5rqecai7", line 263, in main
> json_out['return_val'] = hook(**hook_input['kwargs'])
>   File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel
> return hook(metadata_directory, config_settings)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 166, in prepare_metadata_for_build_wheel
> self.run_setup()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 258, in run_setup
> super(_BuildMetaLegacyBackend,
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 150, in run_setup
> exec(compile(code, __file__, 'exec'), locals())
>   File "setup.py", line 585, in 
> setup(
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py",
>  line 153, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/lib/python3.8/distutils/core.py", line 108, in setup
> _setup_distribution = dist = klass(attrs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 434, in __init__
> _Distribution.__init__(self, {
>   File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__
> self.finalize_options()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 743, in finalize_options
> ep(self)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 750, in _finalize_setup_keywords
> ep.load()(self, ep.name, value)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py",
>  line 24, in version_keyword
> dist.metadata.version = _get_version(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 173, in _get_version
> parsed_version = _do_parse(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 119, in _do_parse
> parse_result = _call_entrypoint_fn(config.absolute_root, config, 
> config.parse)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 54, in _call_entrypoint_fn
> return fn(root)
>   File "setup.py", line 546, in parse_git
> return parse(root, **kwargs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py",
>  line 115, in parse
> require_command("git")
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py",
>  line 142, in require_command
> raise OSError("%r was not found" % name)
> OSError: 'git' was not found

[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346337#comment-17346337
 ] 

Weston Pace commented on ARROW-12813:
-

I think it comes up often enough in cases where there is a default value for a 
column.  For example, if you are reading in datasets from two different 
sources that are similar but not quite the same and you want to unify them.

I should also mention that, if you are reading in data as a dataset scan, you 
can achieve this with projection (project a name to a scalar and the scalar 
will be broadcast).

> [C++] Support for a `full` compute function
> ---
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346330#comment-17346330
 ] 

Weston Pace commented on ARROW-12789:
-

[~bkietz]  I'm going to tag you for input just because you've done some work 
with broadcasting in the past (e.g. in regards to projection) which seems quite 
similar.  Perhaps this request could be satisfied by a "broadcast" compute 
function that takes a vector of arrays (which I suppose is an odd shape for 
input into the compute layer)?
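
A hedged sketch of the recycling rule from the issue below, on plain vectors 
rather than Arrow columns: every column must have length 1 or the common 
length N; length-1 columns are broadcast out to N, and anything else errors.

{code:cpp}
#include <stdexcept>
#include <vector>

void RecycleColumns(std::vector<std::vector<double>>& columns) {
  size_t n = 1;
  for (const auto& col : columns) {
    if (col.size() > 1) n = col.size();
  }
  for (auto& col : columns) {
    if (col.size() == n) continue;
    if (col.size() != 1) {
      throw std::invalid_argument("column length is neither 1 nor N");
    }
    col.assign(n, col[0]);  // repeat the scalar value out to length N
  }
}
{code}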

> [C++] Support for scalar value recycling in RecordBatch/Table creation
> --
>
> Key: ARROW-12789
> URL: https://issues.apache.org/jira/browse/ARROW-12789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nic Crane
>Priority: Major
>
> Please can we have the capability to recycle scalar values during table 
> creation?  It would work as follows:
> Upon creation of a new Table/RecordBatch, the length of each column is 
> checked.  If:
>  * number of columns is > 1 and
>  * any columns have length 1 and
>  * not all columns have length 1
> then, the value in the length 1 column(s) should be repeated to make it as 
> long as the other columns. 
> This should only occur if all columns either have length 1 or N (where N is 
> some value greater than 1), and if any columns lengths are values other than 
> 1 or N, we should still get an error as we do now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346324#comment-17346324
 ] 

Antoine Pitrou commented on ARROW-12813:


It could be exposed in Python/R without being a compute function.
That said, {{np.full}} is intrinsically more useful than an Arrow equivalent 
because Numpy arrays are mutable.

> [C++] Support for a `full` compute function
> ---
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346319#comment-17346319
 ] 

Weston Pace commented on ARROW-12789:
-

[~jorisvandenbossche] I've just opened ARROW-12813 for that as it has come up a 
few times.

> [C++] Support for scalar value recycling in RecordBatch/Table creation
> --
>
> Key: ARROW-12789
> URL: https://issues.apache.org/jira/browse/ARROW-12789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nic Crane
>Priority: Major
>
> Please can we have the capability to recycle scalar values during table 
> creation?  It would work as follows:
> Upon creation of a new Table/RecordBatch, the length of each column is 
> checked.  If:
>  * number of columns is > 1 and
>  * any columns have length 1 and
>  * not all columns have length 1
> then, the value in the length 1 column(s) should be repeated to make it as 
> long as the other columns. 
> This should only occur if all columns either have length 1 or N (where N is 
> some value greater than 1), and if any columns lengths are values other than 
> 1 or N, we should still get an error as we do now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reopened ARROW-12813:
-

I was going to close as duplicate, but looking at ARROW-12789 more closely I 
think they are asking for similar but slightly different interfaces.

> [C++] Support for a `full` compute function
> ---
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace closed ARROW-12813.
---
Resolution: Duplicate

> [C++] Support for a `full` compute function
> ---
>
> Key: ARROW-12813
> URL: https://issues.apache.org/jira/browse/ARROW-12813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Given a scalar value and a length return an array where all values are equal 
> to the scalar value.
> The name "full" is derived from 
> [https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
> anyone has a more clever name please recommend it.
> There are a number of utility functions in C++ that do this already.  
> However, exposing this as a compute function would allow R/Python to easily 
> generate arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12812) [Packaging][Java] Improve JNI jars build

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12812:
---
Labels: pull-request-available  (was: )

> [Packaging][Java] Improve JNI jars build
> 
>
> Key: ARROW-12812
> URL: https://issues.apache.org/jira/browse/ARROW-12812
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java, Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> - to better align with the manylinux scripts
> - also build the pure java packages
> - add dynamic dependency check functionality to archery



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12813) [C++] Support for a `full` compute function

2021-05-17 Thread Weston Pace (Jira)
Weston Pace created ARROW-12813:
---

 Summary: [C++] Support for a `full` compute function
 Key: ARROW-12813
 URL: https://issues.apache.org/jira/browse/ARROW-12813
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Given a scalar value and a length return an array where all values are equal to 
the scalar value.

The name "full" is derived from 
[https://numpy.org/doc/stable/reference/generated/numpy.full.html] but if 
anyone has a more clever name please recommend it.

There are a number of utility functions in C++ that do this already.  However, 
exposing this as a compute function would allow R/Python to easily generate 
arrays.
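
On plain values the requested semantics are a one-liner; a sketch on a 
std::vector, standing in for the existing C++ utility functions mentioned 
above (the name Full is illustrative):

{code:cpp}
#include <vector>

// Produce an array of `length` copies of `value`, i.e. numpy.full's shape.
template <typename T>
std::vector<T> Full(size_t length, const T& value) {
  return std::vector<T>(length, value);
}
// auto fives = Full<double>(4, 5.0);  // {5, 5, 5, 5}
{code}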



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12812) [Packaging][Java] Improve JNI jars build

2021-05-17 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12812:
---

 Summary: [Packaging][Java] Improve JNI jars build
 Key: ARROW-12812
 URL: https://issues.apache.org/jira/browse/ARROW-12812
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java, Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 5.0.0


- to better align with the manylinux scripts
- also build the pure java packages
- add dynamic dependency check functionality to archery



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12744) [C++] Add rounding kernel

2021-05-17 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346314#comment-17346314
 ] 

Eduardo Ponce commented on ARROW-12744:
---

C++ provides a 
[std::round()|https://en.cppreference.com/w/cpp/numeric/math/round] function 
where the [rounding 
mode|https://en.cppreference.com/w/cpp/numeric/fenv/FE_round] can be set at 
runtime. Note that library implementations can provide additional rounding 
modes or support a subset. It seems there is no *round-half-to-even/odd* 
defined in the spec.

1. Should the Arrow *round* kernel make use of *std::round* and extend the 
rounding modes to support *round-half-to-even/odd* and only in these cases 
implement them explicitly?

2. Also, *std::round()* provides versions where it outputs integral data 
instead of floating-point. Are these variants desirable in Arrow?
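
To illustrate the distinction in question 1: std::nearbyint honors the 
current floating-point rounding mode (FE_TONEAREST ties to even), while 
std::round always ties away from zero regardless of the mode. A small 
standalone demo:

{code:cpp}
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  std::fesetround(FE_TONEAREST);  // round to nearest, ties to even
  for (double x : {0.5, 1.5, 2.5, -0.5, -2.5}) {
    std::printf("%5.1f  nearbyint=%5.1f  round=%5.1f\n", x,
                std::nearbyint(x), std::round(x));
  }
  // nearbyint: 0, 2, 2, -0, -2 (half-to-even); round: 1, 2, 3, -1, -3.
}
{code}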

> [C++] Add rounding kernel
> -
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12811) [C++] [Dataset] Dataset repartition / filter / update

2021-05-17 Thread Weston Pace (Jira)
Weston Pace created ARROW-12811:
---

 Summary: [C++] [Dataset] Dataset repartition / filter / update
 Key: ARROW-12811
 URL: https://issues.apache.org/jira/browse/ARROW-12811
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


This feature would add support for an "update" workflow that scans a set of 
batches and writes them (potentially filtered/modified) back out to the same 
place.


The existing dataset read / dataset write features wouldn't work because they 
would append the data.

There is some discussion in ARROW-12358 and ARROW-12509 of an "overwrite mode" 
but an "overwrite partition" workflow wouldn't work unless you can scan in 
entire partitions at once (and in general this should probably be avoided).

A naive "write to a different directory and rename" approach could work but it 
would be inefficient since it would require a copy of the entire dataset to 
modify a small part of it.

 

The feature could be implemented using temporary directories in place that get 
renamed on top of the existing directory at the end.  Files that are unchanged 
would be moved into the temporary directory instead of copied.

Presumably no ACID guarantees would be made (they would be quite hard to 
provide), since Arrow datasets currently make no ACID guarantees of any 
kind.
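
A hedged sketch of the temporary-directory shape described above, using 
std::filesystem stand-ins; as noted, it offers no ACID guarantees, since the 
remove/rename pair is not atomic as a whole.

{code:cpp}
#include <filesystem>

namespace fs = std::filesystem;

// tmp starts as a sibling directory: updated fragments are written into it
// and unchanged files are fs::rename()d (moved, not copied) into it first.
void SwapInUpdatedDataset(const fs::path& dataset, const fs::path& tmp) {
  fs::remove_all(dataset);   // a crash between these two calls loses data
  fs::rename(tmp, dataset);  // atomic per directory entry on POSIX
}
{code}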



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12810) [Python] Run tests with AWS_EC2_METADATA_DISABLED=true

2021-05-17 Thread David Li (Jira)
David Li created ARROW-12810:


 Summary: [Python] Run tests with AWS_EC2_METADATA_DISABLED=true
 Key: ARROW-12810
 URL: https://issues.apache.org/jira/browse/ARROW-12810
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: David Li
Assignee: David Li


This explains why some tests are so slow. There's already a few tests that work 
around this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-05-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346298#comment-17346298
 ] 

Weston Pace commented on ARROW-12358:
-

Looking at this with fresh eyes, the "overwrite mode" feature is fairly 
different from an "update" feature, so I don't think update-related topics are 
relevant for this ticket.  Update generally (and specifically in [~ldacey] 's 
case) implies reading and writing to the same set of files.   
Overwrite-partition mode wouldn't allow for that.  Overwrite-partition mode 
could be useful in some limited circumstances (e.g. someone regenerates an 
entire new set of data for one or more partitions), but I think those cases 
are rare enough, and would be handled by a general "update" feature anyway, 
that I don't see much benefit in creating a separate feature; the added 
complexity would just confuse users.

 

So I'll walk back my earlier comment.  I'd now argue that dataset write should 
only allow "append" and "error" options.

 

Dataset update could be created as a separate Jira ticket (I'll go ahead and 
draft one).  Dataset update would mean scanning and rewriting a dataset (or 
parts thereof).
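
A sketch of the narrowed write behavior argued for here, with illustrative 
names (not Arrow's API): only "append" and "error" are offered, and "error" 
refuses a non-empty target.

{code:cpp}
#include <filesystem>
#include <stdexcept>

enum class ExistingDataBehavior { kError, kAppend };

void CheckWriteTarget(const std::filesystem::path& dir,
                      ExistingDataBehavior mode) {
  namespace fs = std::filesystem;
  if (mode == ExistingDataBehavior::kError && fs::exists(dir) &&
      !fs::is_empty(dir)) {
    throw std::runtime_error("target directory already contains data");
  }
  // kAppend: proceed; unique basename templates keep files from colliding.
}
{code}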

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 5.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-12807.
--
Resolution: Fixed

Issue resolved by pull request 10347
[https://github.com/apache/arrow/pull/10347]

> [C++] Fix merge conflicts with Future refactor/async IPC
> 
>
> Key: ARROW-12807
> URL: https://issues.apache.org/jira/browse/ARROW-12807
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but 
> the result doesn't build)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12809) [C++] Add StrptimeOptions defaults

2021-05-17 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-12809:
---

 Summary: [C++] Add StrptimeOptions defaults
 Key: ARROW-12809
 URL: https://issues.apache.org/jira/browse/ARROW-12809
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson


Per 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1744
 there are no default options for strptime (format, unit). But the 
TimestampType constructor has a default unit of milliseconds 
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1236), and a 
reasonable default for {{format}} would be ISO8601. 

cc [~bkietz] [~wesm] for opinions as the authors of this code (according to 
{{git blame}})
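
For illustration, what an ISO 8601 default format means in strptime terms, 
using std::get_time from the standard library (this is not Arrow's 
StrptimeOptions API):

{code:cpp}
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

int main() {
  std::tm tm = {};
  std::istringstream in("2021-05-17T15:29:00");
  in >> std::get_time(&tm, "%Y-%m-%dT%H:%M:%S");  // ISO 8601-style format
  std::cout << (in.fail() ? "parse failed" : "parsed") << ": "
            << tm.tm_year + 1900 << "-" << tm.tm_mon + 1 << "-"
            << tm.tm_mday << "\n";
}
{code}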



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12744) [C++] Add rounding kernel

2021-05-17 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce reassigned ARROW-12744:
-

Assignee: Eduardo Ponce

> [C++] Add rounding kernel
> -
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Webb Phillips resolved ARROW-12802.
---
Resolution: Not A Problem

Thanks for helping me resolve the problem with my build environment!

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346236#comment-17346236
 ] 

Webb Phillips commented on ARROW-12802:
---

You are totally correct! Uninstalled all traces of homebrew and 
install.packages works fine now :D

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12808) [JS] Document browser support

2021-05-17 Thread Brian Hulette (Jira)
Brian Hulette created ARROW-12808:
-

 Summary: [JS] Document browser support
 Key: ARROW-12808
 URL: https://issues.apache.org/jira/browse/ARROW-12808
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


For example in https://github.com/apache/arrow/pull/10340 we're explicitly 
removing support for IE. We should at least document that IE support is an 
explicit non-goal. Even better if we can identify supported version ranges for 
major browsers. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12806:

Fix Version/s: 4.0.1

> [Python] test_write_to_dataset_filesystem missing a dataset mark
> 
>
> Key: ARROW-12806
> URL: https://issues.apache.org/jira/browse/ARROW-12806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0, 4.0.1
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346231#comment-17346231
 ] 

Neal Richardson edited comment on ARROW-12802 at 5/17/21, 3:29 PM:
---

The {{*** Using Homebrew apache-arrow}} in the output suggests that it has 
found arrow installed by Homebrew 
([specifically|https://github.com/apache/arrow/blob/master/r/configure#L108], 
{{brew ls --versions apache-arrow}} returned something). Judging from the 
compile error, you may have a very old version of apache-arrow installed by 
brew.

You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment 
variables to {{true}} to ignore the homebrew apache-arrow.


was (Author: npr):
The {{*** Using Homebrew apache-arrow}} in the output suggests that it has 
found arrow installed by Homebrew (specifically, {{brew ls --versions 
apache-arrow}} returned something). Judging from the compile error, you may 
have a very old version of apache-arrow installed by brew.

You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment 
variables to {{true}} to ignore the homebrew apache-arrow.

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346231#comment-17346231
 ] 

Neal Richardson commented on ARROW-12802:
-

The {{*** Using Homebrew apache-arrow}} in the output suggests that it has 
found arrow installed by Homebrew (specifically, {{brew ls --versions 
apache-arrow}} returned something). Judging from the compile error, you may 
have a very old version of apache-arrow installed by brew.

You can set either of {{FORCE_AUTOBREW}} or {{FORCE_BUNDLED_BUILD}} environment 
variables to {{true}} to ignore the homebrew apache-arrow.
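
For example, a minimal sketch from an R session (either variable tells the 
configure script to skip the Homebrew-installed apache-arrow):

{code}
Sys.setenv(FORCE_BUNDLED_BUILD = "true")  # or: Sys.setenv(FORCE_AUTOBREW = "true")
install.packages("arrow")
{code}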

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12807:
---
Labels: pull-request-available  (was: )

> [C++] Fix merge conflicts with Future refactor/async IPC
> 
>
> Key: ARROW-12807
> URL: https://issues.apache.org/jira/browse/ARROW-12807
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but 
> the result doesn't build).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221
 ] 

Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:18 PM:
-

Using install.packages as in the docs would be ideal, but it does not work for me:
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
This is with no other arrow installed.

It could be that install.packages works when using Apple's /usr/bin/clang++ 
instead of MacPorts /opt/local/bin/clang++-mp-9.0.


was (Author: webbp):
Using install.packages as in the docs would be ideal, but it does not work for me:
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
This is with no other arrow installed.

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221
 ] 

Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:11 PM:
-

Using install.packages as in the docs would be ideal, but it does not work for me:
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 


was (Author: webbp):
Using install.packages as in the docs would be ideal, but it does not work for me:

 
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221
 ] 

Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:11 PM:
-

Using install.packages as in the docs would be ideal, but it does not work for me:

 
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 


was (Author: webbp):
Using `install.packages` as in the docs would be ideal, but it does not work for 
me:



 
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221
 ] 

Webb Phillips edited comment on ARROW-12802 at 5/17/21, 3:11 PM:
-

Using install.packages as in the docs would be ideal, but it does not work for me:
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
This is with no other arrow installed.


was (Author: webbp):
Using install.packages as in the docs would be ideal, but it does not work for me:
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Webb Phillips (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346221#comment-17346221
 ] 

Webb Phillips commented on ARROW-12802:
---

Using `install.packages` as in the docs would be ideal, but it does not work for 
me:



 
{code:java}
mac$ R --vanilla
...
> install.packages('arrow')
...
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using Homebrew apache-arrow
PKG_CFLAGS=-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET
PKG_LIBS=-L/usr/local/opt/apache-arrow/lib -larrow_dataset -lparquet -larrow 
-larrow_bundled_dependencies
** libs
/opt/local/bin/clang++-mp-9.0 -std=gnu++11 
-I"/opt/local/Library/Frameworks/R.framework/Resources/include" -DNDEBUG 
-I/usr/local/opt/apache-arrow/include -DARROW_R_WITH_ARROW 
-DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-I'/Users/webb/Library/R/4.0/library/cpp11/include' -I/opt/local/include 
-isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fPIC -pipe -Os 
-stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 
-arch x86_64 -c array.cpp -o array.o
In file included from array.cpp:18:
././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not found
{code}
 

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12785) [CI] the r-devdocs build errors when brew installing gcc

2021-05-17 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-12785.

Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10328
[https://github.com/apache/arrow/pull/10328]

> [CI] the r-devdocs build errors when brew installing gcc
> 
>
> Key: ARROW-12785
> URL: https://issues.apache.org/jira/browse/ARROW-12785
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This affects github-test-r-devdocs (R devdocs macOS-latest). The brew step to 
> install gcc fails, and then OpenBLAS fails as well.
> See https://github.com/ursacomputing/crossbow/runs/2573031778#step:8:1494



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12807) [C++] Fix merge conflicts with Future refactor/async IPC

2021-05-17 Thread David Li (Jira)
David Li created ARROW-12807:


 Summary: [C++] Fix merge conflicts with Future refactor/async IPC
 Key: ARROW-12807
 URL: https://issues.apache.org/jira/browse/ARROW-12807
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: David Li
Assignee: David Li
 Fix For: 5.0.0


ARROW-12004 and ARROW-11772 conflict with each other (they merge cleanly but 
the result doesn't build).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12802) No more default ARROW_CSV=ON in libarrow build breaks R arrow

2021-05-17 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346216#comment-17346216
 ] 

Neal Richardson commented on ARROW-12802:
-

We don't recommend {{install_github}}, though with the right env vars it works 
on most platforms. 

Moreover, you don't need to install the arrow C++ library separately from the R 
package; the R package installation will take care of everything for you. So 
you may be making things harder for yourself than you need to. See 
https://arrow.apache.org/docs/r/#installation as well as the longer 
installation vignette linked there for details. 

> No more default ARROW_CSV=ON in libarrow build breaks R arrow
> -
>
> Key: ARROW-12802
> URL: https://issues.apache.org/jira/browse/ARROW-12802
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Webb Phillips
>Priority: Major
>
> libarrow build succeeds, but include/arrow/csv/type_fwd.h isn't installed 
> since 4.0.0. This causes R install.packages('arrow') to fail with:
> {code:java}
> make: *** 
> [/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf:179: 
> array.o] Error 1
> In file included from recordbatch.cpp:18:
> ././arrow_types.h:37:10: fatal error: 'arrow/csv/type_fwd.h' file not 
> found{code}
> Reproduced with Ubuntu 18.04 and with macOS 10.13.6 MacPorts with both 
> apache-arrow-4.0.0 and current HEAD f959141ece4d660bce5f7fa545befc0116a7db79.
> No other type_fwd.h are missing:
> {code:java}
> find .../arrow/cpp/src -name type_fwd.h | wc -l
> 10
> find .../include -name type_fwd.h | wc -l
> 9{code}
> Best guess: default value of cmake ARROW_CSV changed and R arrow requires 
> ARROW_CSV=ON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-12806.
--
Resolution: Fixed

Issue resolved by pull request 10346
[https://github.com/apache/arrow/pull/10346]

> [Python] test_write_to_dataset_filesystem missing a dataset mark
> 
>
> Key: ARROW-12806
> URL: https://issues.apache.org/jira/browse/ARROW-12806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10425) [Python] Support reading (compressed) CSV file from remote file / binary blob

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346205#comment-17346205
 ] 

Joris Van den Bossche commented on ARROW-10425:
---

Yes, reading from a buffer works, but your example is not using a compressed 
buffer. So for example:

{code}
from pyarrow import fs, csv

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.csv.gz") as file:
table = csv.read_csv(file)
{code}

currently doesn't work? (I tried it with LocalFileSystem.)

I am not sure this _should_ work, as right now we just infer the compression 
from the file path, and not from the actual content of the file. 

But, then it might be nice that something like 
{{csv.read_csv("s3://bucket/data.csv.gz")}} would work so that it could be 
detected from the file path. 
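
In the meantime, a possible workaround (a sketch, assuming the caller already 
knows the compression) is to wrap the remote stream so the codec is stated 
explicitly rather than inferred from the path:

{code}
import pyarrow as pa
from pyarrow import fs, csv

s3 = fs.S3FileSystem()

# Decompression is requested explicitly, so no path-based inference is needed.
with s3.open_input_stream("bucket/data.csv.gz") as raw:
    table = csv.read_csv(pa.CompressedInputStream(raw, "gzip"))
{code}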

> [Python] Support reading (compressed) CSV file from remote file / binary blob
> -
>
> Key: ARROW-10425
> URL: https://issues.apache.org/jira/browse/ARROW-10425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: csv
>
> From 
> https://stackoverflow.com/questions/64588076/how-can-i-read-a-csv-gz-file-with-pyarrow-from-a-file-object
> Currently {{pyarrow.csv.read_csv}} happily takes a path to a compressed file 
> and automatically decompresses it, but AFAIK this only works for local paths. 
> It would be nice to in general support reading CSV from remote files (with 
> URI / specifying a filesystem), and in that case also support compression. 
> In addition we could also read a compressed file from a BytesIO / file-like 
> object, but not sure we want that (as it would require a keyword to indicate 
> the used compression).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12776) [Archery][Integration] Fix decimal case generation in write_js_test_json

2021-05-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12776:

Fix Version/s: 4.0.1

> [Archery][Integration] Fix decimal case generation in write_js_test_json
> 
>
> Key: ARROW-12776
> URL: https://issues.apache.org/jira/browse/ARROW-12776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0, 4.0.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The integration build has started to fail on master: 
> https://github.com/apache/arrow/runs/2575265526#step:9:4265
> I don't entirely understand why we see this error: in order to call that 
> function we would need to pass {{--write_generated_json}} to the archery 
> command, but we don't. The implementation is clearly wrong, though.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12004) [C++] Result is annoying

2021-05-17 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-12004.
--
Resolution: Fixed

Issue resolved by pull request 10205
[https://github.com/apache/arrow/pull/10205]

> [C++] Result is annoying
> ---
>
> Key: ARROW-12004
> URL: https://issues.apache.org/jira/browse/ARROW-12004
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Weston Pace
>Priority: Major
>  Labels: async-util, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When I add a callback (using {{AddCallback}} or {{Then}}) to a {{Future}}, 
> I would like the callback to take a {{Status}} rather than a 
> {{Result}}.
> I managed to get this done for {{AddCallback}}, but {{Then}} is another pile 
> of complication due to template hackery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12790) [Python] Cannot read from HDFS with blanks in path names

2021-05-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-12790:
--
Labels: filesystem hdfs  (was: )

> [Python] Cannot read from HDFS with blanks in path names
> 
>
> Key: ARROW-12790
> URL: https://issues.apache.org/jira/browse/ARROW-12790
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0
>Reporter: Armin Müller
>Priority: Critical
>  Labels: filesystem, hdfs
>
> I have a Hadoop FS with blanks in path and filenames.
> Running this
> {{hdfs = fs.HadoopFileSystem('namenode', 8020)}}
> {{files = hdfs.get_file_info(fs.FileSelector("/", recursive=True))}}
> throws a
> {{pyarrow.lib.ArrowInvalid: Cannot parse URI: 'hdfs://namenode:8020/data/Path 
> with Blank'}}
> How can I avoid that?
> Strangely enough, reading a file with
> {{hdfs.open_input_file(csv_file)}}
> works just fine regardless of the blanks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-12769.
--
Resolution: Fixed

Issue resolved by pull request 10341
[https://github.com/apache/arrow/pull/10341]

> [Python] Negative out of range slices yield invalid arrays
> --
>
> Key: ARROW-12769
> URL: https://issues.apache.org/jira/browse/ARROW-12769
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0, 4.0.0
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0, 4.0.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Tested on pyarrow 2.0 and pyarrow 4.0 wheels. The errors are slightly 
> different between the two versions. Below is a script from 4.0.
>  
> This is taken from the result of test_slice_array
> {code:java}
> >>> import pyarrow as pa
> >>> pa.array(range(0,10))
> <pyarrow.lib.Int64Array object at ...>
> [
>   0,
>   1,
>   2,
>   3,
>   4,
>   5,
>   6,
>   7,
>   8,
>   9
> ]
> >>> a=pa.array(range(0,10))
> >>> a[-9:-20]
> <pyarrow.lib.Int64Array object at ...>
> []
> >>> len(a[-9:-20])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> SystemError: <built-in function ...> returned NULL without setting an error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-12806:
-

Assignee: Joris Van den Bossche

> [Python] test_write_to_dataset_filesystem missing a dataset mark
> 
>
> Key: ARROW-12806
> URL: https://issues.apache.org/jira/browse/ARROW-12806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12806:
---
Labels: pull-request-available  (was: )

> [Python] test_write_to_dataset_filesystem missing a dataset mark
> 
>
> Key: ARROW-12806
> URL: https://issues.apache.org/jira/browse/ARROW-12806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-12806:
--
Fix Version/s: 5.0.0

> [Python] test_write_to_dataset_filesystem missing a dataset mark
> 
>
> Key: ARROW-12806
> URL: https://issues.apache.org/jira/browse/ARROW-12806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12806) [Python] test_write_to_dataset_filesystem missing a dataset mark

2021-05-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-12806:
-

 Summary: [Python] test_write_to_dataset_filesystem missing a 
dataset mark
 Key: ARROW-12806
 URL: https://issues.apache.org/jira/browse/ARROW-12806
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From 
https://stackoverflow.com/questions/67526288/modulenotfounderror-no-module-named-pyarrow-dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12569) [R] [CI] Run revdep in CI

2021-05-17 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-12569:
---
Summary: [R] [CI] Run revdep in CI  (was: [R] [CI] Can we run revdep in CI)

> [R] [CI] Run revdep in CI
> -
>
> Key: ARROW-12569
> URL: https://issues.apache.org/jira/browse/ARROW-12569
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Probably should be on demand; it might be difficult to make it print/fail 
> usefully.
> Use https://github.com/r-lib/revdepcheck?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12569) [R] [CI] Can we run revdep in CI

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12569:
---
Labels: pull-request-available  (was: )

> [R] [CI] Can we run revdep in CI
> 
>
> Key: ARROW-12569
> URL: https://issues.apache.org/jira/browse/ARROW-12569
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Probably should be on demand; it might be difficult to make it print/fail 
> usefully.
> Use https://github.com/r-lib/revdepcheck?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12758) [R] Add examples to more function documentation

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12758:
---
Labels: pull-request-available  (was: )

> [R] Add examples to more function documentation
> ---
>
> Key: ARROW-12758
> URL: https://issues.apache.org/jira/browse/ARROW-12758
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12619) [Python] pyarrow sdist should not require git

2021-05-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-12619:
---

Assignee: Krisztian Szucs

> [Python] pyarrow sdist should not require git
> -
>
> Key: ARROW-12619
> URL: https://issues.apache.org/jira/browse/ARROW-12619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {noformat}
> FROM ubuntu:20.04
> RUN apt update && apt install -y python3-pip
> RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
> {noformat}
> {noformat}
> $ docker build .
> ...
> Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
>  ---> Running in 28d363e1c397
> Collecting pyarrow==4.0.0
>   Downloading pyarrow-4.0.0.tar.gz (710 kB)
>   Installing build dependencies: started
>   Installing build dependencies: still running...
>   Installing build dependencies: finished with status 'done'
>   Getting requirements to build wheel: started
>   Getting requirements to build wheel: finished with status 'done'
> Preparing wheel metadata: started
> Preparing wheel metadata: finished with status 'error'
> ERROR: Command errored out with exit status 1:
>  command: /usr/bin/python3 /tmp/tmp5rqecai7 
> prepare_metadata_for_build_wheel /tmp/tmpc49gha3r
>  cwd: /tmp/pip-install-or1g7own/pyarrow
> Complete output (42 lines):
> Traceback (most recent call last):
>   File "/tmp/tmp5rqecai7", line 280, in 
> main()
>   File "/tmp/tmp5rqecai7", line 263, in main
> json_out['return_val'] = hook(**hook_input['kwargs'])
>   File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel
> return hook(metadata_directory, config_settings)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 166, in prepare_metadata_for_build_wheel
> self.run_setup()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 258, in run_setup
> super(_BuildMetaLegacyBackend,
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 150, in run_setup
> exec(compile(code, __file__, 'exec'), locals())
>   File "setup.py", line 585, in 
> setup(
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py",
>  line 153, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/lib/python3.8/distutils/core.py", line 108, in setup
> _setup_distribution = dist = klass(attrs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 434, in __init__
> _Distribution.__init__(self, {
>   File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__
> self.finalize_options()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 743, in finalize_options
> ep(self)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 750, in _finalize_setup_keywords
> ep.load()(self, ep.name, value)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py",
>  line 24, in version_keyword
> dist.metadata.version = _get_version(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 173, in _get_version
> parsed_version = _do_parse(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 119, in _do_parse
> parse_result = _call_entrypoint_fn(config.absolute_root, config, 
> config.parse)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 54, in _call_entrypoint_fn
> return fn(root)
>   File "setup.py", line 546, in parse_git
> return parse(root, **kwargs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py",
>  line 115, in parse
> require_command("git")
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py",
>  line 142, in require_command
> raise OSError("%r was not found" % name)
> OSError: 'git' was not found
> 
> ERROR: Command errored out with exit status 1: /usr/bin/py

[jira] [Commented] (ARROW-12789) [C++] Support for scalar value recycling in RecordBatch/Table creation

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346113#comment-17346113
 ] 

Joris Van den Bossche commented on ARROW-12789:
---

I was going to comment that, alternatively, C++ could also provide the utility 
to easily/efficiently create an array of a given length from a scalar, and then 
leave it up to the bindings to check for scalars and create the appropriate 
array. But it seems (from the PR for the other issue) this is what is already 
being done with {{MakeArrayFromScalar}}.
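
For illustration, a sketch of what a binding-side check could look like in 
pyarrow (the {{as_column}} helper is hypothetical; {{pa.repeat}} creates an 
array of a given length from a scalar):

{code}
import pyarrow as pa

def as_column(value, length):
    # Hypothetical helper: pass arrays through unchanged,
    # recycle a scalar into a full-length array.
    if isinstance(value, (pa.Array, pa.ChunkedArray)):
        return value
    return pa.repeat(value, length)

n = 3
table = pa.table({"x": pa.array([1, 2, 3]), "tag": as_column("a", n)})
{code}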

> [C++] Support for scalar value recycling in RecordBatch/Table creation
> --
>
> Key: ARROW-12789
> URL: https://issues.apache.org/jira/browse/ARROW-12789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nic Crane
>Priority: Major
>
> Please can we have the capability to recycle scalar values during 
> table creation?  It would work as follows:
> Upon creation of a new Table/RecordBatch, the length of each column is 
> checked.  If:
>  * number of columns is > 1 and
>  * any columns have length 1 and
>  * not all columns have length 1
> then, the value in the length 1 column(s) should be repeated to make it as 
> long as the other columns. 
> This should only occur if all columns either have length 1 or N (where N is 
> some value greater than 1); if any column's length is a value other than 
> 1 or N, we should still get an error as we do now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12773) [Docs] Clarify Java support for ORC and Parquet via JNI bindings

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-12773.
--
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10312
[https://github.com/apache/arrow/pull/10312]

> [Docs] Clarify Java support for ORC and Parquet via JNI bindings
> 
>
> Key: ARROW-12773
> URL: https://issues.apache.org/jira/browse/ARROW-12773
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shuai Zhang
>Assignee: Shuai Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
> Attachments: image-2021-05-13-20-31-52-890.png, 
> image-2021-05-13-20-34-29-206.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The ["Implementation Status" 
> document](https://arrow.apache.org/docs/status.html#third-party-data-formats) 
> says that Java supports the Parquet format via JNI but does not support ORC. 
> However, the [source 
> code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that 
> Java supports ORC via JNI but does not support Parquet. See the 
> attached screenshots for further details.
>  !image-2021-05-13-20-31-52-890.png! 
>  !image-2021-05-13-20-34-29-206.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-05-17 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346110#comment-17346110
 ] 

Lance Dacey commented on ARROW-12358:
-

Being able to update and replace specific rows would be very powerful. For my 
use case, I am basically overwriting an entire partition in order to update a 
(sometimes tiny) subset of rows. That means I need to read the existing data 
that was previously saved for that partition, together with the new data 
containing updated or new rows. Then I need to sort and drop duplicates (I use 
pandas because there is no simple .drop_duplicates() for a pyarrow table, 
though the pandas round trip can complicate things with data types), and 
finally overwrite the partition (I use partition_filename_cb to guarantee that 
the final file for the partition keeps the same name); see the sketch below.
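
A rough sketch of that pattern (the paths, column names, and dedup keys here 
are illustrative assumptions, not taken from an actual pipeline):

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Read the previously written data for one partition plus the new rows.
existing = ds.dataset("data", partitioning="hive").to_table(
    filter=ds.field("date") == "2021-05-17").to_pandas()
new = new_rows.to_pandas()  # `new_rows` is an incoming pyarrow Table

# Sort and deduplicate in pandas, keeping the latest version of each row.
merged = (pd.concat([existing, new])
          .sort_values("updated_at")
          .drop_duplicates(subset=["id"], keep="last"))

# Rewrite the partition; the fixed filename replaces the previous file.
pq.write_to_dataset(
    pa.Table.from_pandas(merged, preserve_index=False),
    root_path="data",
    partition_cols=["date"],
    partition_filename_cb=lambda keys: "part-0.parquet",
)
{code}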

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 5.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12773) [Docs] Clarify Java support for ORC and Parquet via JNI bindings

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-12773:
-
Summary: [Docs] Clarify Java support for ORC and Parquet via JNI bindings  
(was: [Docs] Implementation Status say Java support Parquet but actually no)

> [Docs] Clarify Java support for ORC and Parquet via JNI bindings
> 
>
> Key: ARROW-12773
> URL: https://issues.apache.org/jira/browse/ARROW-12773
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shuai Zhang
>Assignee: Shuai Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-05-13-20-31-52-890.png, 
> image-2021-05-13-20-34-29-206.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The ["Implementation Status" 
> document](https://arrow.apache.org/docs/status.html#third-party-data-formats) 
> says that Java supports the Parquet format via JNI but does not support ORC. 
> However, the [source 
> code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that 
> Java supports ORC via JNI but does not support Parquet. See the 
> attached screenshots for further details.
>  !image-2021-05-13-20-31-52-890.png! 
>  !image-2021-05-13-20-34-29-206.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12773) [Docs] Implementation Status say Java support Parquet but actually no

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-12773:
-
Component/s: Documentation

> [Docs] Implementation Status say Java support Parquet but actually no
> -
>
> Key: ARROW-12773
> URL: https://issues.apache.org/jira/browse/ARROW-12773
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shuai Zhang
>Assignee: Shuai Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-05-13-20-31-52-890.png, 
> image-2021-05-13-20-34-29-206.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The ["Implementation Status" 
> document](https://arrow.apache.org/docs/status.html#third-party-data-formats) 
> says that Java supports the Parquet format via JNI but does not support ORC. 
> However, the [source 
> code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that 
> Java supports ORC via JNI but does not support Parquet. See the 
> attached screenshots for further details.
>  !image-2021-05-13-20-31-52-890.png! 
>  !image-2021-05-13-20-34-29-206.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12773) [Docs] Implementation Status say Java support Parquet but actually no

2021-05-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-12773:


Assignee: Shuai Zhang

> [Docs] Implementation Status say Java support Parquet but actually no
> -
>
> Key: ARROW-12773
> URL: https://issues.apache.org/jira/browse/ARROW-12773
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Shuai Zhang
>Assignee: Shuai Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-05-13-20-31-52-890.png, 
> image-2021-05-13-20-34-29-206.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The ["Implementation Status" 
> document](https://arrow.apache.org/docs/status.html#third-party-data-formats) 
> says that Java supports the Parquet format via JNI but does not support ORC. 
> However, the [source 
> code](https://github.com/apache/arrow/tree/aa28470/java/adapter) shows that 
> Java supports ORC via JNI but does not support Parquet. See the 
> attached screenshots for further details.
>  !image-2021-05-13-20-31-52-890.png! 
>  !image-2021-05-13-20-34-29-206.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12619) [Python] pyarrow sdist should not require git

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12619:
---
Labels: pull-request-available  (was: )

> [Python] pyarrow sdist should not require git
> -
>
> Key: ARROW-12619
> URL: https://issues.apache.org/jira/browse/ARROW-12619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> FROM ubuntu:20.04
> RUN apt update && apt install -y python3-pip
> RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
> {noformat}
> {noformat}
> $ docker build .
> ...
> Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
>  ---> Running in 28d363e1c397
> Collecting pyarrow==4.0.0
>   Downloading pyarrow-4.0.0.tar.gz (710 kB)
>   Installing build dependencies: started
>   Installing build dependencies: still running...
>   Installing build dependencies: finished with status 'done'
>   Getting requirements to build wheel: started
>   Getting requirements to build wheel: finished with status 'done'
> Preparing wheel metadata: started
> Preparing wheel metadata: finished with status 'error'
> ERROR: Command errored out with exit status 1:
>  command: /usr/bin/python3 /tmp/tmp5rqecai7 
> prepare_metadata_for_build_wheel /tmp/tmpc49gha3r
>  cwd: /tmp/pip-install-or1g7own/pyarrow
> Complete output (42 lines):
> Traceback (most recent call last):
>   File "/tmp/tmp5rqecai7", line 280, in 
> main()
>   File "/tmp/tmp5rqecai7", line 263, in main
> json_out['return_val'] = hook(**hook_input['kwargs'])
>   File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel
> return hook(metadata_directory, config_settings)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 166, in prepare_metadata_for_build_wheel
> self.run_setup()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 258, in run_setup
> super(_BuildMetaLegacyBackend,
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 150, in run_setup
> exec(compile(code, __file__, 'exec'), locals())
>   File "setup.py", line 585, in 
> setup(
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py",
>  line 153, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/lib/python3.8/distutils/core.py", line 108, in setup
> _setup_distribution = dist = klass(attrs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 434, in __init__
> _Distribution.__init__(self, {
>   File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__
> self.finalize_options()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 743, in finalize_options
> ep(self)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 750, in _finalize_setup_keywords
> ep.load()(self, ep.name, value)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py",
>  line 24, in version_keyword
> dist.metadata.version = _get_version(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 173, in _get_version
> parsed_version = _do_parse(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 119, in _do_parse
> parse_result = _call_entrypoint_fn(config.absolute_root, config, 
> config.parse)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 54, in _call_entrypoint_fn
> return fn(root)
>   File "setup.py", line 546, in parse_git
> return parse(root, **kwargs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py",
>  line 115, in parse
> require_command("git")
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/utils.py",
>  line 142, in require_command
> raise OSError("%r was not found" % name)
> OSError: 'git' was not found
> 
> ERROR: Command errored out with exit status 1: /usr/bin/python3 
> /tmp/tmp5rqecai7 prepare_

[jira] [Commented] (ARROW-12805) [Python] Use consistent memory_pool / pool keyword argument name

2021-05-17 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346108#comment-17346108
 ] 

David Li commented on ARROW-12805:
--

Note this isn't even all that consistent in C++; there are no keyword arguments 
of course, but while it's usually just {{pool}}, it seems some of the 
dataset/compute code calls it {{memory_pool}} in places like getters.
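
For a concrete illustration of the Python-side inconsistency, a minimal sketch 
(the memory pool argument is optional in both calls; it is passed explicitly 
here only to show the two keyword spellings):

{code:python}
import pyarrow as pa

pool = pa.default_memory_pool()

# Most constructors spell the keyword "memory_pool" ...
values = pa.array([1, 2, 3], memory_pool=pool)

# ... but a few, such as ListArray.from_arrays, spell it "pool"
offsets = pa.array([0, 2, 3], type=pa.int32())
lists = pa.ListArray.from_arrays(offsets, values, pool=pool)
{code}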

> [Python] Use consistent memory_pool / pool keyword argument name
> 
>
> Key: ARROW-12805
> URL: https://issues.apache.org/jira/browse/ARROW-12805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
> Fix For: 5.0.0
>
>
> Most of the functions taking a MemoryPool have a {{memory_pool}} keyword for 
> this, but a few take a {{pool}} keyword instead (eg 
> {{ListArray.from_arrays}}). 
> We should make this consistent and have all functions use {{memory_pool}} 
> (probably best with deprecating {{pool}} first). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12619) [Python] pyarrow sdist should not require git

2021-05-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346107#comment-17346107
 ] 

Krisztian Szucs commented on ARROW-12619:
-

There is a fallback_version configuration option for setuptools_scm which we 
don't use: https://github.com/pypa/setuptools_scm#configuration-parameters
Although this setting seems to have issues according to 
https://github.com/pypa/setuptools_scm/issues/549

We already have a workaround in setup.py for the functionality of the 
fallback_version option, but it is disabled for the case of sdist: 
https://github.com/apache/arrow/blob/master/python/setup.py#L529
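
For reference, a minimal sketch of how the option is wired up in a setup.py 
(this is not Arrow's actual setup.py, and the fallback version string is 
purely illustrative):

{code:python}
from setuptools import setup

setup(
    name="example",
    # setuptools_scm falls back to this version when no SCM (e.g. git)
    # metadata is available
    use_scm_version={"fallback_version": "0.0.0"},
    setup_requires=["setuptools_scm"],
)
{code}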

> [Python] pyarrow sdist should not require git
> -
>
> Key: ARROW-12619
> URL: https://issues.apache.org/jira/browse/ARROW-12619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Priority: Major
> Fix For: 4.0.1
>
>
> {noformat}
> FROM ubuntu:20.04
> RUN apt update && apt install -y python3-pip
> RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
> {noformat}
> {noformat}
> $ docker build .
> ...
> Step 3/3 : RUN pip3 install --no-binary pyarrow pyarrow==4.0.0
>  ---> Running in 28d363e1c397
> Collecting pyarrow==4.0.0
>   Downloading pyarrow-4.0.0.tar.gz (710 kB)
>   Installing build dependencies: started
>   Installing build dependencies: still running...
>   Installing build dependencies: finished with status 'done'
>   Getting requirements to build wheel: started
>   Getting requirements to build wheel: finished with status 'done'
> Preparing wheel metadata: started
> Preparing wheel metadata: finished with status 'error'
> ERROR: Command errored out with exit status 1:
>  command: /usr/bin/python3 /tmp/tmp5rqecai7 
> prepare_metadata_for_build_wheel /tmp/tmpc49gha3r
>  cwd: /tmp/pip-install-or1g7own/pyarrow
> Complete output (42 lines):
> Traceback (most recent call last):
>   File "/tmp/tmp5rqecai7", line 280, in 
> main()
>   File "/tmp/tmp5rqecai7", line 263, in main
> json_out['return_val'] = hook(**hook_input['kwargs'])
>   File "/tmp/tmp5rqecai7", line 133, in prepare_metadata_for_build_wheel
> return hook(metadata_directory, config_settings)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 166, in prepare_metadata_for_build_wheel
> self.run_setup()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 258, in run_setup
> super(_BuildMetaLegacyBackend,
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/build_meta.py",
>  line 150, in run_setup
> exec(compile(code, __file__, 'exec'), locals())
>   File "setup.py", line 585, in 
> setup(
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/__init__.py",
>  line 153, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/lib/python3.8/distutils/core.py", line 108, in setup
> _setup_distribution = dist = klass(attrs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 434, in __init__
> _Distribution.__init__(self, {
>   File "/usr/lib/python3.8/distutils/dist.py", line 292, in __init__
> self.finalize_options()
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 743, in finalize_options
> ep(self)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools/dist.py",
>  line 750, in _finalize_setup_keywords
> ep.load()(self, ep.name, value)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/integration.py",
>  line 24, in version_keyword
> dist.metadata.version = _get_version(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 173, in _get_version
> parsed_version = _do_parse(config)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 119, in _do_parse
> parse_result = _call_entrypoint_fn(config.absolute_root, config, 
> config.parse)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/__init__.py",
>  line 54, in _call_entrypoint_fn
> return fn(root)
>   File "setup.py", line 546, in parse_git
> return parse(root, **kwargs)
>   File 
> "/tmp/pip-build-env-d53awzo4/overlay/lib/python3.8/site-packages/setuptools_scm/git.py",
>  line 115, in parse
> require

[jira] [Updated] (ARROW-11673) [C++] Casting dictionary type to use different index type

2021-05-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11673:
--
Fix Version/s: 5.0.0

> [C++] Casting dictionary type to use different index type
> -
>
> Key: ARROW-11673
> URL: https://issues.apache.org/jira/browse/ARROW-11673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It's currently not implemented to cast from one dictionary type to another 
> dictionary type to change the index type. 
> Small example:
> {code}
> In [2]: arr = pa.array(['a', 'b', 'a']).dictionary_encode()
> In [3]: arr.type
> Out[3]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
> In [5]: arr.cast(pa.dictionary(pa.int8(), pa.string()))
> ...
> ArrowNotImplementedError: Unsupported cast from dictionary<values=string, 
> indices=int32, ordered=0> to dictionary<values=string, indices=int8, 
> ordered=0> (no available cast function for target type)
> ../src/arrow/compute/cast.cc:112  
> GetCastFunctionInternal(cast_options->to_type, args[0].type().get())
> {code}
> From 
> https://stackoverflow.com/questions/66223730/how-to-change-column-datatype-with-pyarrow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12751) [C++] Add variadic row-wise min/max kernels (least/greatest)

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346102#comment-17346102
 ] 

Joris Van den Bossche commented on ARROW-12751:
---

Numpy has this as {{np.minimum}} and {{np.maximum}}, although those are limited 
to a fixed number of 2 input arrays (so a binary min/max, not variadic)
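
For example, a small sketch of the numpy behaviour, using the input arrays 
from the description:

{code:python}
import numpy as np

a = np.array([1, 4])
b = np.array([2, 3])

np.minimum(a, b)           # array([1, 3]) -- element-wise, but binary only
np.maximum(a, b)           # array([2, 4])
np.minimum.reduce([a, b])  # reducing with the binary ufunc gives a variadic form
{code}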

> [C++] Add variadic row-wise min/max kernels (least/greatest)
> 
>
> Key: ARROW-12751
> URL: https://issues.apache.org/jira/browse/ARROW-12751
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> Add a pair of variadic functions equivalent to SQL's {{least}}/{{greatest}} 
> or R's {{pmin}}/{{pmax}}. Should take 0, 1, 2, ... same-length numeric arrays 
> as input and return an array giving the minimum/maximum of the values found 
> in each position of the input arrays. For example, in the case of these 2 
> input arrays:
> {code:java}
> Array    Array
> [        [
>   1,       2,
>   4        3
> ]        ]
> {code}
> {{least}} would return:
> {code:java}
> Array
> [ 
>   1,
>   3
> ] 
> {code}
> and {{greatest}} would return
> {code:java}
> Array
> [ 
>   2,
>   4
> ] 
> {code}
> The returned array should have the same data type as the input arrays, or 
> follow promotion rules if the numeric types of the input arrays differ.
> Should also accept scalar numeric inputs and recycle their values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12762) [Python] pyarrow.lib.Schema equality fails after pickle and unpickle

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346100#comment-17346100
 ] 

Joris Van den Bossche edited comment on ARROW-12762 at 5/17/21, 11:37 AM:
--

[~jjgalvez] thanks for opening the issue. 

I can't reproduce this without pyspark; when writing the pandas dataframe to 
parquet with pyarrow, it seems to work:

{code}
In [11]: import pyarrow.parquet as pq

In [12]: table = pa.table(df)

In [13]: pq.write_table(table, "test_list_str.parquet")

In [14]: ds = pq.ParquetDataset("test_list_str.parquet")

In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema
Out[15]: True

In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == 
ds.schema.to_arrow_schema()
Out[16]: True
{code}

Could you try to check what the difference is between the schemas before and 
after pickling? (eg if you print both, do you see a difference? Or in their 
schema.metadata?)
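
For instance, something along these lines (a sketch reusing the {{ds}} object 
from the snippet above):

{code:python}
import pickle

s1 = ds.schema.to_arrow_schema()
s2 = pickle.loads(pickle.dumps(s1))

# Schema.equals can compare with or without the schema-level metadata
print(s1.equals(s2, check_metadata=True))
print(s1.equals(s2, check_metadata=False))
print(s1.metadata == s2.metadata)
{code}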


was (Author: jorisvandenbossche):
[~jjgalvez] thanks for opening the issue. 

I can't reproduce this without pyspark; when writing the pandas dataframe to 
parquet with pyarrow, it seems to work:

{code}
In [12]: import pyarrow.parquet as pq

In [13]: pq.write_table(table, "test_list_str.parquet")

In [14]: ds = pq.ParquetDataset("test_list_str.parquet")

In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema
Out[15]: True

In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == 
ds.schema.to_arrow_schema()
Out[16]: True
{code}


> [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
> 
>
> Key: ARROW-12762
> URL: https://issues.apache.org/jira/browse/ARROW-12762
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Juan Galvez
>Priority: Major
>
> Here is a small reproducer:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyarrow.parquet as pq
> import pickle
> df = pd.DataFrame(
> {
> "A": [
> ["aa", "bb "],
> ["c"],
> ["d", "ee", "", "f"],
> ["ggg", "H"],
> [""],
> ]
> }
> )
> spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
> spark_df = spark.createDataFrame(df)
> spark_df.write.parquet("list_str.pq", "overwrite")
> ds = pq.ParquetDataset("list_str.pq")
> assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES
> assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == 
> ds.schema.to_arrow_schema() # FAILS
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12762) [Python] pyarrow.lib.Schema equality fails after pickle and unpickle

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346100#comment-17346100
 ] 

Joris Van den Bossche commented on ARROW-12762:
---

[~jjgalvez] thanks for opening the issue. 

I can't reproduce this without pyspark; when writing the pandas dataframe to 
parquet with pyarrow, it seems to work:

{code}
In [12]: import pyarrow.parquet as pq

In [13]: pq.write_table(table, "test_list_str.parquet")

In [14]: ds = pq.ParquetDataset("test_list_str.parquet")

In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema
Out[15]: True

In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == 
ds.schema.to_arrow_schema()
Out[16]: True
{code}


> [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
> 
>
> Key: ARROW-12762
> URL: https://issues.apache.org/jira/browse/ARROW-12762
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Juan Galvez
>Priority: Major
>
> Here is a small reproducer:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyarrow.parquet as pq
> import pickle
> df = pd.DataFrame(
> {
> "A": [
> ["aa", "bb "],
> ["c"],
> ["d", "ee", "", "f"],
> ["ggg", "H"],
> [""],
> ]
> }
> )
> spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
> spark_df = spark.createDataFrame(df)
> spark_df.write.parquet("list_str.pq", "overwrite")
> ds = pq.ParquetDataset("list_str.pq")
> assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES
> assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == 
> ds.schema.to_arrow_schema() # FAILS
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays

2021-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12769:
---
Labels: pull-request-available  (was: )

> [Python] Negative out of range slices yield invalid arrays
> --
>
> Key: ARROW-12769
> URL: https://issues.apache.org/jira/browse/ARROW-12769
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0, 4.0.0
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0, 4.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Tested on pyarrow 2.0 and pyarrow 4.0 wheels.  The errors are slightly 
> different between the 2.0 and 4.0 versions.  Below is a script from 4.0.
>  
> This is taken from the result of test_slice_array
> {code}
> >>> import pyarrow as pa
> >>> pa.array(range(0,10))
> <pyarrow.lib.Int64Array object at 0x...>
> [
>   0,
>   1,
>   2,
>   3,
>   4,
>   5,
>   6,
>   7,
>   8,
>   9
> ]
> >>> a=pa.array(range(0,10))
> >>> a[-9:-20]
> <pyarrow.lib.Int64Array object at 0x...>
> []
> >>> len(a[-9:-20])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> SystemError: <built-in function len> returned NULL without setting an error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346099#comment-17346099
 ] 

Joris Van den Bossche commented on ARROW-12769:
---

It also seems to happen simply when start > stop with positive indices (eg 
{{arr[5:3]}})
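
A minimal repro of that variant:

{code:python}
import pyarrow as pa

a = pa.array(range(10))
s = a[5:3]  # start > stop with positive indices
len(s)      # raises SystemError on affected versions
{code}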

> [Python] Negative out of range slices yield invalid arrays
> --
>
> Key: ARROW-12769
> URL: https://issues.apache.org/jira/browse/ARROW-12769
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0, 4.0.0
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.0, 4.0.1
>
>
> Tested on pyarrow 2.0 and pyarrow 4.0 wheels.  The errors are slightly 
> different between the 2.0 and 4.0 versions.  Below is a script from 4.0.
>  
> This is taken from the result of test_slice_array
> {code}
> >>> import pyarrow as pa
> >>> pa.array(range(0,10))
> <pyarrow.lib.Int64Array object at 0x...>
> [
>   0,
>   1,
>   2,
>   3,
>   4,
>   5,
>   6,
>   7,
>   8,
>   9
> ]
> >>> a=pa.array(range(0,10))
> >>> a[-9:-20]
> <pyarrow.lib.Int64Array object at 0x...>
> []
> >>> len(a[-9:-20])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> SystemError: <built-in function len> returned NULL without setting an error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12792) [R] DatasetFactory could sniff file formats

2021-05-17 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-12792:
-

Assignee: (was: Nic Crane)

> [R] DatasetFactory could sniff file formats
> ---
>
> Key: ARROW-12792
> URL: https://issues.apache.org/jira/browse/ARROW-12792
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nic Crane
>Priority: Minor
>
> I was running the following code:
> {code:java}
> tf <- tempfile()
> dir.create(tf)
> on.exit(unlink(tf))
> write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv"))
> write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv"))
> # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, 
> "file2.csv")))
> ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), 
>schema = Table$create(mtcars)$schema
>)
> {code}
> But when I print the ds object, it reports that the files are Parquet files 
> not CSVs
> {code:java}
> > ds
>  FileSystemDataset with 2 Parquet files
>  mpg: double
>  cyl: double
>  disp: double
>  hp: double
>  drat: double
>  wt: double
>  qsec: double
>  vs: double
>  am: double
>  gear: double
>  carb: double{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12792) [R] DatasetFactory could sniff file formats

2021-05-17 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane resolved ARROW-12792.
---
Resolution: Won't Fix

> [R] DatasetFactory could sniff file formats
> ---
>
> Key: ARROW-12792
> URL: https://issues.apache.org/jira/browse/ARROW-12792
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Minor
>
> I was running the following code:
> {code:java}
> tf <- tempfile()
> dir.create(tf)
> on.exit(unlink(tf))
> write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv"))
> write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv"))
> # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, 
> "file2.csv")))
> ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), 
>schema = Table$create(mtcars)$schema
>)
> {code}
> But when I print the ds object, it reports that the files are Parquet files 
> not CSVs
> {code:java}
> > ds
>  FileSystemDataset with 2 Parquet files
>  mpg: double
>  cyl: double
>  disp: double
>  hp: double
>  drat: double
>  wt: double
>  qsec: double
>  vs: double
>  am: double
>  gear: double
>  carb: double{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12792) [R] DatasetFactory could sniff file formats

2021-05-17 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346084#comment-17346084
 ] 

Nic Crane commented on ARROW-12792:
---

After thinking about this, there are no sensible code-based changes - instead, 
I will add examples to the documentation on loading in non-parquet files.

> [R] DatasetFactory could sniff file formats
> ---
>
> Key: ARROW-12792
> URL: https://issues.apache.org/jira/browse/ARROW-12792
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Minor
>
> I was running the following code:
> {code:java}
> tf <- tempfile()
> dir.create(tf)
> on.exit(unlink(tf))
> write_csv_arrow(mtcars[1:5,], file.path(tf, "file1.csv"))
> write_csv_arrow(mtcars[6:11,], file.path(tf, "file2.csv"))
> # ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, 
> "file2.csv")))
> ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")), 
>schema = Table$create(mtcars)$schema
>)
> {code}
> But when I print the ds object, it reports that the files are Parquet files 
> not CSVs
> {code:java}
> > ds
>  FileSystemDataset with 2 Parquet files
>  mpg: double
>  cyl: double
>  disp: double
>  hp: double
>  drat: double
>  wt: double
>  qsec: double
>  vs: double
>  am: double
>  gear: double
>  carb: double{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12769) [Python] Negative out of range slices yield invalid arrays

2021-05-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-12769:
-

Assignee: Joris Van den Bossche

> [Python] Negative out of range slices yield invalid arrays
> --
>
> Key: ARROW-12769
> URL: https://issues.apache.org/jira/browse/ARROW-12769
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0, 4.0.0
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.0, 4.0.1
>
>
> Tested on pyarrow 2.0 and pyarrow 4.0 wheels.  The errors are slightly 
> different between the 2.0 and 4.0 versions.  Below is a script from 4.0.
>  
> This is taken from the result of test_slice_array
> {code}
> >>> import pyarrow as pa
> >>> pa.array(range(0,10))
> <pyarrow.lib.Int64Array object at 0x...>
> [
>   0,
>   1,
>   2,
>   3,
>   4,
>   5,
>   6,
>   7,
>   8,
>   9
> ]
> >>> a=pa.array(range(0,10))
> >>> a[-9:-20]
> <pyarrow.lib.Int64Array object at 0x...>
> []
> >>> len(a[-9:-20])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> SystemError: <built-in function len> returned NULL without setting an error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12695) [Python] bool value of scalars depends on data type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346056#comment-17346056
 ] 

Joris Van den Bossche commented on ARROW-12695:
---

Ah, I see that ARROW-12609 has some more background.

> [Python] bool value of scalars depends on data type
> ---
>
> Key: ARROW-12695
> URL: https://issues.apache.org/jira/browse/ARROW-12695
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0
> Environment: Windows 10
> python 3.9.4
>Reporter: Sergey Mozharov
>Priority: Major
>
> `pyarrow.Scalar` and its subclasses do not implement the `__bool__` method. The 
> default implementation does not seem to do the right thing. For example:
> {code:java}
> >>> import pyarrow as pa
> >>> na_value = pa.scalar(None, type=pa.int32())
> >>> bool(na_value)
> True
> >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())]))
> >>> bool(na_value)
> False
> >>> bool(pa.scalar(None, type=pa.list_(pa.int32())))
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__
> TypeError: object of type 'NoneType' has no len()
> >>>
> {code}
> Please consider implementing the `__bool__` method. It seems reasonable to 
> delegate to the `__bool__` method of the wrapped object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12609) [Python] TypeError when accessing length of an invalid ListScalar

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346055#comment-17346055
 ] 

Joris Van den Bossche commented on ARROW-12609:
---

bq. why not return {{NullScalar}} in such case? It seems to me that 
{{pa.list_(pa.int32())}} means a schema that supports null values in the list, 
then the array should just return a null value when it hits one.

[~amol-] The returned ListScalar _is_ a null value, though. Because each type 
supports null values, each scalar type also supports its own null scalars. A 
{{NullScalar}} is what you would get when accessing a single element of a 
{{NullArray}}:

{code}
>>> arr = pa.array([None, None])
>>> arr
<pyarrow.lib.NullArray object at 0x...>
2 nulls
>>> arr[0]
<pyarrow.NullScalar: None>
{code}

bq. Expected behavior: length is expected to be 0.

[~mosalx] I think you could also argue that a missing list scalar has "no 
defined length" (why would it be zero? it's not an empty list, and only an 
empty list has zero length). The problem, though, is that Python doesn't 
support this kind of missing or undefined value for integers ({{\_\_len\_\_}} 
needs to return an integer, or error).

For example, if not using Python's builtin {{len}}, but using the pyarrow 
compute kernel to get the length of list element, we actually "propagate" the 
null, and the null list has a null length:

{code}
>>> import pyarrow.compute as pc
>>> pc.list_value_length(pa.scalar([1, 2], type=pa.list_(pa.int32())))
<pyarrow.Int32Scalar: 2>
>>> pc.list_value_length(pa.scalar(None, type=pa.list_(pa.int32())))
<pyarrow.Int32Scalar: None>
{code}

> [Python] TypeError when accessing length of an invalid ListScalar
> -
>
> Key: ARROW-12609
> URL: https://issues.apache.org/jira/browse/ARROW-12609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0, 4.0.0
> Environment: Windows 10
> python=3.9.2
> pyarrow=4.0.0 (3.0.0 has the same behavior)
>Reporter: Sergey Mozharov
>Priority: Major
>
> For List-like data types, the scalar corresponding to a missing value has a 
> '__len__' attribute, but a TypeError is raised when it is accessed
> {code:java}
> import pyarrow as pa
> data_type = pa.list_(pa.struct([
> ('a', pa.int64()),
> ('b', pa.bool_())
> ]))
> data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.ListScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # --> TypeError: object of type 'NoneType' 
> has no len()
> {code}
> Expected behavior: length is expected to be 0.
> This issue causes several pandas unit tests to fail when an ExtensionArray 
> backed by an arrow array with this data type is built.
> This behavior is also inconsistent with a similar example where the data type 
> is a struct:
> {code:java}
> import pyarrow as pa
> data_type = pa.struct([
> ('a', pa.int64()),
> ('b', pa.bool_())
> ])
> data = [{'a': 1, 'b': False}, None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.StructScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # Ok
> {code}
>  In this second example the TypeError is not raised.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-12695) [Python] bool value of scalars depends on data type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346046#comment-17346046
 ] 

Joris Van den Bossche edited comment on ARROW-12695 at 5/17/21, 10:17 AM:
--

Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python 
will then always return True by default, but it seems that if your object is 
"sequence-like" (having a {{\_\_len\_\_}}), it will check the length. This is 
described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing

So here the underlying reason is that this fails:

{code}
>>> len(pa.scalar([1, 2], type=pa.list_(pa.int32())))
2

>>> len(pa.scalar(None, type=pa.list_(pa.int32())))
...
TypeError: object of type 'NoneType' has no len()
{code}

But the question is also, what should this return instead? Returning 0 in this 
case also doesn't feel correct, as you can also have an empty list scalar with 
a length of zero.

In general, I think it will be hard to give a nice and consistent interface for 
pyarrow scalars involving null scalars (we could provide better error messages 
though?)

[~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and 
what do you think it should return? (also True as the other scalars?)


was (Author: jorisvandenbossche):
Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python 
will then always return True by default, but it seems that if your object is 
"sequence-like" (having a {\_\_len\_\_}}), it will check the length. This is 
described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing

So here the underlying reason is that this fails:

{code}
>>> len(pa.scalar([1, 2], type=pa.list_(pa.int32())))
2

>>> len(pa.scalar(None, type=pa.list_(pa.int32())))
...
TypeError: object of type 'NoneType' has no len()
{code}

But the question is also, what should this return instead? Returning 0 in this 
case also doesn't feel correct, as you can also have an empty list scalar with 
a length of zero.

In general, I think it will be hard to give a nice and consistent interface for 
pyarrow scalars involving null scalars (we could provide better error messages 
though?)

[~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and 
what do you think it should return? (also True as the other scalars?)

> [Python] bool value of scalars depends on data type
> ---
>
> Key: ARROW-12695
> URL: https://issues.apache.org/jira/browse/ARROW-12695
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0
> Environment: Windows 10
> python 3.9.4
>Reporter: Sergey Mozharov
>Priority: Major
>
> `pyarrow.Scalar` and its subclasses do not implement the `__bool__` method. The 
> default implementation does not seem to do the right thing. For example:
> {code:java}
> >>> import pyarrow as pa
> >>> na_value = pa.scalar(None, type=pa.int32())
> >>> bool(na_value)
> True
> >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())]))
> >>> bool(na_value)
> False
> >>> bool(pa.scalar(None, type=pa.list_(pa.int32())))
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__
> TypeError: object of type 'NoneType' has no len()
> >>>
> {code}
> Please consider implementing the `__bool__` method. It seems reasonable to 
> delegate to the `__bool__` method of the wrapped object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12695) [Python] bool value of scalars depends on data type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346046#comment-17346046
 ] 

Joris Van den Bossche commented on ARROW-12695:
---

Currently pyarrow doesn't implement any {{\_\_bool\_\_}}. In general, Python 
will then always return True by default, but it seems that if your object is 
"sequence-like" (having a {\_\_len\_\_}}), it will check the length. This is 
described at https://docs.python.org/3/library/stdtypes.html#truth-value-testing

So here the underlying reason is that this fails:

{code}
>>> len(pa.scalar([1, 2], type=pa.list_(pa.int32())))
2

>>> len(pa.scalar(None, type=pa.list_(pa.int32())))
...
TypeError: object of type 'NoneType' has no len()
{code}

But the question is also, what should this return instead? Returning 0 in this 
case also doesn't feel correct, as you can also have an empty list scalar with 
a length of zero.

In general, I think it will be hard to give a nice and consistent interface for 
pyarrow scalars involving null scalars (we could provide better error messages 
though?)

[~mosalx] what's your use case for wanting to do {{bool(null_scalar)}}, and 
what do you think it should return? (also True as the other scalars?)
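
(For reference, a small sketch of the Python truth-value fallback described 
above, independent of pyarrow:)

{code:python}
class Wrapper:
    """Mimics ListScalar: without __bool__, truthiness falls back to __len__."""
    def __init__(self, value):
        self.value = value

    def __len__(self):
        return len(self.value)  # len(None) raises TypeError

bool(Wrapper([1, 2]))  # True
bool(Wrapper([]))      # False
bool(Wrapper(None))    # TypeError: object of type 'NoneType' has no len()
{code}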

> [Python] bool value of scalars depends on data type
> ---
>
> Key: ARROW-12695
> URL: https://issues.apache.org/jira/browse/ARROW-12695
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0
> Environment: Windows 10
> python 3.9.4
>Reporter: Sergey Mozharov
>Priority: Major
>
> `pyarrow.Scalar` and its subclasses do not implement the `__bool__` method. The 
> default implementation does not seem to do the right thing. For example:
> {code:java}
> >>> import pyarrow as pa
> >>> na_value = pa.scalar(None, type=pa.int32())
> >>> bool(na_value)
> True
> >>> na_value = pa.scalar(None, type=pa.struct([('a', pa.int32())]))
> >>> bool(na_value)
> False
> >>> bool(pa.scalar(None, type=pa.list_(pa.int32())))
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow\scalar.pxi", line 572, in pyarrow.lib.ListScalar.__len__
> TypeError: object of type 'NoneType' has no len()
> >>>
> {code}
> Please consider implementing the `__bool__` method. It seems reasonable to 
> delegate to the `__bool__` method of the wrapped object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12666) [Python] Array construction from numpy array is unclear about zero copy behaviour

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346032#comment-17346032
 ] 

Joris Van den Bossche commented on ARROW-12666:
---

bq. {{copy=False}}  would probably have to throw an exception in some cases 
where we can't guarantee zero copy, like when building from a Python List

Or {{copy=False}} could also not guarantee that no copy is made, but only try 
to avoid a copy when possible. That's basically the behaviour of the {{copy}} 
keyword in {{numpy.array(..)}}

On the general issue, I agree that the current behaviour is not ideal and 
potentially being confusing/having surprising effects. But I also think it's 
not that easy to change. I think a lot of people rely on the zero-copy 
behaviour to avoid unnecessary copies (eg if you just convert to Arrow to then 
directly write that to Parquet file, then you don't want to make an additional 
copy).
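
For comparison, a quick sketch of numpy's {{copy=False}} semantics as of numpy 
1.x, where a copy is avoided only when possible:

{code:python}
import numpy as np

a = np.arange(3)
b = np.array(a, copy=False)                    # no cast needed -> no copy made
c = np.array(a, dtype=np.float64, copy=False)  # cast required -> copies anyway

a[0] = 99
print(b[0])  # 99  (shares memory with a)
print(c[0])  # 0.0 (independent copy)
{code}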

> [Python] Array construction from numpy array is unclear about zero copy 
> behaviour
> -
>
> Key: ARROW-12666
> URL: https://issues.apache.org/jira/browse/ARROW-12666
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>
> When building an Arrow array from a numpy array it's very confusing from the 
> user point of view that the result is not always a new array.
> Under the hood Arrow sometimes reuses the memory if no casting is needed
> {code:python}
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int64())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]
> {code}
> and sometimes doesn't if a cast is involved
> {code:python}
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int32())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]
> {code}
> For non-primitive types, on the other hand, it always copies
> {code:python}
> npa = np.array(["a", "b", "c"]*3)
> arrow_array = pa.array(npa, type=pa.string())
> npa[npa == "b"] = "X"
> print(arrow_array.to_pylist())
> # Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
> # Different from numpy array that was modified
> {code}
> This behaviour needs a lot of attention from the user and understanding of 
> what's going on, which makes pyarrow hard to use.
> A {{copy=True/False}} should be added to {{pa.array}} and the default value 
> should probably be {{copy=True}} so that by default you can always create an 
> arrow array out of a numpy one (as {{copy=False}}  would probably have to 
> throw an exception in some cases where we can't guarantee zero copy, like 
> when building from a Python List)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346022#comment-17346022
 ] 

Joris Van den Bossche commented on ARROW-12680:
---

(in any case, we also need to document this better, so we don't have to dig 
into old discussions or guess from the behaviour and source code each time 
such a question comes up ..)

> [Python] StructScalar Timestamp using .to_pandas() loses/converts type
> --
>
> Key: ARROW-12680
> URL: https://issues.apache.org/jira/browse/ARROW-12680
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Tim Ryles
>Priority: Major
>
> Hi,
> We're noticing an issue where we lose type and formatting on conversion to a 
> pandas dataframe for a particular dataset we house, which contains a struct, 
> and the underlying type of the child is Timestamp rather than 
> datetime.datetime (which we believed synonymous from Pandas documentation).
>  
> Inside the StructArray we can see nicely formatted timestamp values, but when 
> we call .to_pandas() on it, we end up with epoch stamps for the date child.
> {code:java}
> import pyarrow.parquet as pq
> tbl=pq.read_table("part-9-47f62157-cb6f-41a8-9ad6-ace65df94c6e-c000.snappy.parquet")
> tbl.column("observations").chunk(0).values pyarrow.lib.StructArray object at 
> 0x7fc8eb0cab40>
> – is_valid: all not null
> – child 0 type: timestamp[ns]
> [
> 2000-01-01 00:00:00.0,
> 2001-01-01 00:00:00.0,
> 2002-01-01 00:00:00.0,
> 2003-01-01 00:00:00.0,
> 2004-01-01 00:00:00.0,
> 2005-01-01 00:00:00.0,
> 2006-01-01 00:00:00.0,
> 2007-01-01 00:00:00.0,
> 2008-01-01 00:00:00.0,
> 2009-01-01 00:00:00.0,
> ...
> 2018-07-01 00:00:00.0,
> 2018-10-01 00:00:00.0,
> 2019-01-01 00:00:00.0,
> 2019-04-01 00:00:00.0,
> 2019-07-01 00:00:00.0,
> 2019-10-01 00:00:00.0,
> 2020-01-01 00:00:00.0,
> 2020-04-01 00:00:00.0,
> 2020-07-01 00:00:00.0,
> 2020-10-01 00:00:00.0
> ]
> – child 1 type: double
> [
> -2.69685,
> 9.27988,
> 7.26902,
> -7.55753,
> -1.62137,
> 6.84773,
> -8.21204,
> -8.97041,
> -1.14405,
> -0.710153,
> ...
> 2.1658,
> 3.05588,
> 2.3868,
> 2.10805,
> 2.39984,
> 2.54855,
> -7.26804,
> -2.35179,
> -0.867518,
> 0.150593
> ]
> {code}
> {code:java}
>  
> tbl.to_pandas()['observations'] 
> 0      [{'date': 9466848000, 'value': -2.6968...
> 1      [{'date': 9466848000, 'value': 57.9608...
> 2      [{'date': 14832288000, 'value': 95.904...
> 3      [{'date': 12148704000, 'value': 19.021...
> 4      [{'date': 11991456000, 'value': 1.2011...
>                     ...
> 636    [{'date': 10729152000, 'value': 5.418}...
> 637    [{'date': 9466848000, 'value': 110.695...
> 638    [{'date': 10098432000, 'value': 3.0094...
> 639    [{'date': 12228192000, 'value': 48.365...
> 640    [{'date': 11991456000, 'value': 1.5600...
> Name: observations, Length: 641, dtype: object
> In [12]: tbl.to_pandas()["observations"].iloc[0][0]
> Out[12]: {'date': 10413792000, 'value': 249.523242}
> # date is now type Int{code}
>  
> We notice that if we take the same table, save it back out to a file first, 
> and then check the chunk(0).values as above, the underlying type changes from 
> *Timestamp* to *datetime.datetime*, and that will now convert .to_pandas() 
> correctly.
> {code:java}
> pq.write_table(tbl, "output.parquet")
> tbl2=pq.read_table("output.parquet")
> tbl2.column("observations").chunk(0).values[0]
> Out[17]: <pyarrow.StructScalar: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242}>
> tbl2.column("observations").chunk(0).to_pandas()
> Out[18]: 
> 0        [{'date': 2003-01-01 00:00:00, 'value': 249.52...
> 1        [{'date': 2008-01-01 00:00:00, 'value': 29.741...
> 2        [{'date': 2000-01-01 00:00:00, 'value': 2.3454...
> 3        [{'date': 2006-01-01 00:00:00, 'value': 1.2048...
> 4        [{'date': 2008-01-01 00:00:00, 'value': 196546...
>                        ...
> 29489    [{'date': 2010-01-01 00:00:00, 'value': 19.155...
> 29490    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29491    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29492    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29493    [{'date': 2012-04-30 00:00:00, 'value': 10.0},...
> Length: 29494, dtype: object
> tbl2.to_pandas()["observations"].iloc[0][0]
> Out[8]: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242}
> # date remains as datetime.datetime{code}
>  
> Thanks in advance, and apologies if I have not followed Issue protocol on 
> this board.
> If there is a parameter that we just need to pass into .to_pandas for this to 
> take place (I can see there is date_

[jira] [Commented] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346020#comment-17346020
 ] 

Joris Van den Bossche commented on ARROW-12680:
---

Reading the comments in https://github.com/apache/arrow/pull/6322 again, I 
_think_ the fact that for nanosecond timestamps inside a struct, we currently 
return integer epoch is kind of deliberate? (because of the lack of a better 
alternative, since {{datetime.datetime}} cannot represent nanoseconds)

Although for struct _scalars_, we actually use pandas.Timestamp for nanosecond 
resolution columns:

{code}
In [42]: sarr[0]
Out[42]: <pyarrow.StructScalar: ...>

In [43]: sarr[0].as_py()
Out[43]: 
{'ms': datetime.datetime(2021, 5, 17, 11, 3, 58, 947000),
 'ns': Timestamp('2021-05-17 11:03:58.947224')}
{code}

Of course, that is only possible if pandas is installed. And so maybe that's 
the reason that for array conversion we simply always use the "safe" integer 
epoch option. But it's certainly somewhat inconsistent.

> [Python] StructScalar Timestamp using .to_pandas() loses/converts type
> --
>
> Key: ARROW-12680
> URL: https://issues.apache.org/jira/browse/ARROW-12680
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Tim Ryles
>Priority: Major
>
> Hi,
> We're noticing an issue where we lose type and formatting on conversion to a 
> pandas dataframe for a particular dataset we house, which contains a struct, 
> and the underlying type of the child is Timestamp rather than 
> datetime.datetime (which we believed synonymous from Pandas documentation).
>  
> Inside the StructArray we can see nicely formatted timestamp values, but when 
> we call .to_pandas() on it, we end up with epoch stamps for the date child.
> {code:java}
> import pyarrow.parquet as pq
> tbl=pq.read_table("part-9-47f62157-cb6f-41a8-9ad6-ace65df94c6e-c000.snappy.parquet")
> tbl.column("observations").chunk(0).values pyarrow.lib.StructArray object at 
> 0x7fc8eb0cab40>
> – is_valid: all not null
> – child 0 type: timestamp[ns]
> [
> 2000-01-01 00:00:00.0,
> 2001-01-01 00:00:00.0,
> 2002-01-01 00:00:00.0,
> 2003-01-01 00:00:00.0,
> 2004-01-01 00:00:00.0,
> 2005-01-01 00:00:00.0,
> 2006-01-01 00:00:00.0,
> 2007-01-01 00:00:00.0,
> 2008-01-01 00:00:00.0,
> 2009-01-01 00:00:00.0,
> ...
> 2018-07-01 00:00:00.0,
> 2018-10-01 00:00:00.0,
> 2019-01-01 00:00:00.0,
> 2019-04-01 00:00:00.0,
> 2019-07-01 00:00:00.0,
> 2019-10-01 00:00:00.0,
> 2020-01-01 00:00:00.0,
> 2020-04-01 00:00:00.0,
> 2020-07-01 00:00:00.0,
> 2020-10-01 00:00:00.0
> ]
> – child 1 type: double
> [
> -2.69685,
> 9.27988,
> 7.26902,
> -7.55753,
> -1.62137,
> 6.84773,
> -8.21204,
> -8.97041,
> -1.14405,
> -0.710153,
> ...
> 2.1658,
> 3.05588,
> 2.3868,
> 2.10805,
> 2.39984,
> 2.54855,
> -7.26804,
> -2.35179,
> -0.867518,
> 0.150593
> ]
> {code}
> {code:java}
>  
> tbl.to_pandas()['observations'] 
> 0      [{'date': 9466848000, 'value': -2.6968...
> 1      [{'date': 9466848000, 'value': 57.9608...
> 2      [{'date': 14832288000, 'value': 95.904...
> 3      [{'date': 12148704000, 'value': 19.021...
> 4      [{'date': 11991456000, 'value': 1.2011...
>                     ...
> 636    [{'date': 10729152000, 'value': 5.418}...
> 637    [{'date': 9466848000, 'value': 110.695...
> 638    [{'date': 10098432000, 'value': 3.0094...
> 639    [{'date': 12228192000, 'value': 48.365...
> 640    [{'date': 11991456000, 'value': 1.5600...
> Name: observations, Length: 641, dtype: object
> In [12]: tbl.to_pandas()["observations"].iloc[0][0]
> Out[12]: {'date': 10413792000, 'value': 249.523242}
> # date is now type Int{code}
>  
> We notice that if we take the same table, save it back out to a file first, 
> and then check the chunk(0).values as above, the underlying type changes from 
> *Timestamp* to *datetime.datetime*, and that will now convert .to_pandas() 
> correctly.
> {code:java}
> pq.write_table(tbl, "output.parquet")
> tbl2=pq.read_table("output.parquet")
> tbl2.column("observations").chunk(0).values[0]
> Out[17]: <pyarrow.StructScalar: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242}>
> tbl2.column("observations").chunk(0).to_pandas()
> Out[18]: 
> 0        [{'date': 2003-01-01 00:00:00, 'value': 249.52...
> 1        [{'date': 2008-01-01 00:00:00, 'value': 29.741...
> 2        [{'date': 2000-01-01 00:00:00, 'value': 2.3454...
> 3        [{'date': 2006-01-01 00:00:00, 'value': 1.2048...
> 4        [{'date': 2008-01-01 00:00:00, 'value': 196546...
>                        ...
> 29489    [{'date': 2010-01-01 00:00:00, 'value': 19.155...
> 29490    [{'date': 2012-04-30 00:00

[jira] [Comment Edited] (ARROW-12680) [Python] StructScalar Timestamp using .to_pandas() loses/converts type

2021-05-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341003#comment-17341003
 ] 

Joris Van den Bossche edited comment on ARROW-12680 at 5/17/21, 9:05 AM:
-

I think there is indeed a bug here.  Let me try and demystify some of what is 
going on.  There are 5+ temporal types in pyarrow but everything you are doing 
is currently related to just one, the timestamp type.  The timestamp type 
represents seconds, milliseconds, microseconds, or nanoseconds from the epoch.  
In addition there may or may not be a time zone string.  Finally, these types 
may or may not be in a struct (which shouldn't matter but does here...which is 
the bug).

In pandas there are 3+ ways to represent temporal information: the 
datetime.datetime object, a pandas.Timestamp, and an integer.

 

When you first read in your table you are getting a struct where the 'date' 
field is a timestamp with **nanosecond** resolution.

When you save your table and then reload it the timestamp is being truncated.  
This is because pq.write_table with version==1.0 (the default in pyarrow 3) 
will truncate nanosecond timestamps down to microseconds.

So when you next read in your table you are getting a struct where the 'date' 
field is a timestamp with **microsecond** resolution.

Finally, it seems this may be a regression of 
https://issues.apache.org/jira/browse/ARROW-7723

 
{code:java}
import pyarrow as pa
import datetime
pylist = [datetime.datetime.now()]
arr1 = pa.array(pylist, pa.timestamp(unit='ms'))
arr2 = pa.array(pylist, pa.timestamp(unit='ns'))
sarr = pa.StructArray.from_arrays([arr1, arr2], names=['ms', 'ns'])
table = pa.Table.from_arrays([arr1, arr2, sarr], ['ms', 'ns', 'struct'])
print(table.to_pandas())
{code}
 
{code:java}
                        ms                          ns                                             struct
0  2021-05-07 08:46:15.898  2021-05-07 08:46:15.898716  {'ms': 2021-05-07 08:46:15.898000, 'ns': 16203...

{code}
 

As for workarounds...if your schema is reliable you could cast from nanosecond 
resolution to us resolution (struct casting isn't working quite right 
(ARROW-1888) so it's a bit clunky):

 
{code:java}
import pyarrow as pa
import pyarrow.compute as pc
import datetime

dates = pa.array([datetime.datetime.now()], pa.timestamp(unit='ns'))
values = pa.array([200.37], pa.float64())
observations = pa.StructArray.from_arrays([dates, values], names=['dates', 
'values'])
# target struct type, for reference (direct struct casts aren't supported,
# see ARROW-1888, so the fields are cast individually below)
desired_type = pa.struct([pa.field('dates', pa.timestamp(unit='us')),
                          pa.field('values', pa.float64())])
tbl = pa.Table.from_arrays([observations], ['observations'])
print(tbl.to_pandas())

bad_observations = tbl.column('observations').chunks
values = [chunk.field('values') for chunk in bad_observations]
bad_dates = [chunk.field('dates') for chunk in bad_observations]
good_dates = [pc.cast(bad_dates_chunk, pa.timestamp(unit='us')) for 
bad_dates_chunk in bad_dates]
good_observations_chunks = []
for i in range(len(good_dates)):
    good_observations_chunks.append(pa.StructArray.from_arrays(
        [good_dates[i], values[i]], names=['dates', 'values']))
good_observations = pa.chunked_array(good_observations_chunks)
tbl = tbl.set_column(0, 'observations', good_observations)
print(tbl.to_pandas())
{code}
 


was (Author: westonpace):
I think there is indeed a bug here.  Let me try and demystify some of what is 
going on.  There are 5+ temporal types in pyarrow but everything you are doing 
is currently related to just one, the timestamp type.  The timestamp type 
represents seconds, milliseconds, microseconds, or nanoseconds from the epoch.  
In addition there may or may not be a time zone string.  Finally, these types 
may or may not be in a struct (which shouldn't matter but does here...which is 
the bug).

In pandas there are 3+ ways to represent temporal information: the 
datetime.datetime object, a pandas.Timestamp, and an integer.

 

When you first read in your table you are getting a struct where the 'date' 
field is a timestamp with **nanosecond** resolution.

When you save your table and then reload it the timestamp is being truncated.  
This is because pq.write_table with version==1.0 (the default in pyarrow 3) 
will truncate nanosecond timestamps down to microseconds.

So when you next read in your table you are getting a struct where the 'date' 
field is a timestamp with **microsecond** resolution.

Finally, it seems this may be a regression of 
https://issues.apache.org/jira/browse/ARROW-7723

 
{code:java}
import pyarrow as pa
import datetime
pylist = [datetime.datetime.now()]
arr1 = pa.array(pylist, pa.timestamp(unit='ms'))
arr2 = pa.array(pylist, pa.timestamp(unit='ns'))
sarr = pa.StructArray.from_arrays([arr1, arr2], names=['ms', 'ns'])
table = pa.Table.from_arrays([arr1, arr2, sarr], ['ms', 'ns', 'struct'])
print(table.to_pandas())
{code}
 
{code:java}
   ms  
