[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18121:
---
Labels: pull-request-available  (was: )

> [Release][CI] Use Ubuntu 22.04 for verifying binaries
> -
>
> Key: ARROW-18121
> URL: https://issues.apache.org/jira/browse/ARROW-18121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> APT/Yum verifications use Docker. If we use an old libseccomp on the host, 
> some operations may cause errors:
> e.g.:  
> https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437
> {noformat}
>   + valac --pkg arrow-glib --pkg posix build.vala
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries

2022-10-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18121:
-
Description: 
APT/Yum verifications use Docker. If we use an old libseccomp on the host, 
some operations may cause errors:

e.g.:  
https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437

{noformat}
  + valac --pkg arrow-glib --pkg posix build.vala
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
{noformat}

  was:
APT/Yum verifications use Docker. If we use some old packages (I can't 
remember which packages have a problem...) on the host, some operations may 
cause errors:

e.g.:  
https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437

{noformat}
  + valac --pkg arrow-glib --pkg posix build.vala
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
{noformat}


> [Release][CI] Use Ubuntu 22.04 for verifying binaries
> -
>
> Key: ARROW-18121
> URL: https://issues.apache.org/jira/browse/ARROW-18121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> APT/Yum verifications use Docker. If we use an old libseccomp on the host, 
> some operations may cause errors:
> e.g.:  
> https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437
> {noformat}
>   + valac --pkg arrow-glib --pkg posix build.vala
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries

2022-10-20 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18121:


 Summary: [Release][CI] Use Ubuntu 22.04 for verifying binaries
 Key: ARROW-18121
 URL: https://issues.apache.org/jira/browse/ARROW-18121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


APT/Yum verifications use Docker. If we use some old packages (I can't 
remember which packages have a problem...) on the host, some operations may 
cause errors:

e.g.:  
https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437

{noformat}
  + valac --pkg arrow-glib --pkg posix build.vala
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries

2022-10-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18121:
-
Description: 
APT/Yum verifications use Docker. If we use some old packages (I can't 
remember which packages have a problem...) on the host, some operations may 
cause errors:

e.g.:  
https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437

{noformat}
  + valac --pkg arrow-glib --pkg posix build.vala
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
{noformat}

  was:
APT/Yum verifications use Docker. If we use some old packages (I can't 
remember which packages have a problem...) on the host, some operations may 
cause errors:

e.g.:  
https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437

{noformat}
  + valac --pkg arrow-glib --pkg posix build.vala
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
  error: Failed to close file descriptor for child process (Operation not 
permitted)
{noformat}


> [Release][CI] Use Ubuntu 22.04 for verifying binaries
> -
>
> Key: ARROW-18121
> URL: https://issues.apache.org/jira/browse/ARROW-18121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> APT/Yum verifications use Docker. If we use some old packages (I can't 
> remember which packages have a problem...) on the host, some operations may 
> cause errors:
> e.g.:  
> https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437
> {noformat}
>   + valac --pkg arrow-glib --pkg posix build.vala
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
>   error: Failed to close file descriptor for child process (Operation not 
> permitted)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18093) [CI][Conda][Windows] Failed with missing ORC

2022-10-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18093.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14454
[https://github.com/apache/arrow/pull/14454]

> [CI][Conda][Windows] Failed with missing ORC
> 
>
> Key: ARROW-18093
> URL: https://issues.apache.org/jira/browse/ARROW-18093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=37759=logs=4c86bc1b-1091-5192-4404-c74dfaad23e7=41795ef0-6501-5db4-3ad4-33c0cf085626=497
> {noformat}
> CMake Error at cmake_modules/FindORC.cmake:56 (message):
>   ORC library was required in toolchain and unable to locate
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:280 (find_package)
>   cmake_modules/ThirdpartyToolchain.cmake:4362 (resolve_dependency)
>   CMakeLists.txt:496 (include)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18120) [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18120:
---
Labels: pull-request-available  (was: )

> [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh
> ---
>
> Key: ARROW-18120
> URL: https://issues.apache.org/jira/browse/ARROW-18120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a script (02-source.sh) that runs source verifications.
> But we don't have a script that runs binaries/wheels verifications yet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18120) [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh

2022-10-20 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18120:


 Summary: [Release][Dev] Run binaries/wheels verifications in 
05-binary-upload.sh
 Key: ARROW-18120
 URL: https://issues.apache.org/jira/browse/ARROW-18120
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


We have a script (02-source.sh) that runs source verifications.
But we don't have a script that runs binaries/wheels verifications yet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18091) [Ruby] Arrow::Table#join returns duplicated key columns

2022-10-20 Thread Hirokazu SUZUKI (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621484#comment-17621484
 ] 

Hirokazu SUZUKI commented on ARROW-18091:
-

I mean dplyr's join( ..., keep=FALSE) behavior.
Sorry for the poor explanation.


> [Ruby] Arrow::Table#join returns duplicated key columns
> ---
>
> Key: ARROW-18091
> URL: https://issues.apache.org/jira/browse/ARROW-18091
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Hirokazu SUZUKI
>Priority: Major
>
> `Arrow::Table#join` returns duplicated key columns. Duplicate column 
> names are acceptable in Arrow, but it is preferable to keep only one.
> Also, with `type: :full_outer`, the key column data should be merged.
> table1
> => 
> #
>         KEY     X         
> 0       A       1         
> 1       B       2         
> 2       C       3
> table2
> => 
> #
>         KEY     X
> 0       A       4
> 1       B       5
> 2       D       6
>  
> Should omit the right table's `:KEY`
> table1.join(table2, :KEY)
> => 
> #                   
>         KEY     X       KEY     X                                   
> 0       A       1       A       4                                   
> 1       B       2       B       5
>  
> Should merge `:KEY`s
> table1.join(table2, :KEY, type: :full_outer)
> => 
> #                   
>         KEY          X  KEY          X                              
> 0       A            1  A            4                              
> 1       B            2  B            5                              
> 2       C            3  (null)  (null)                              
> 3       (null)  (null)  D            6
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18119) [C++] Utility method to ensure an array object meets an alignment requirement

2022-10-20 Thread Weston Pace (Jira)
Weston Pace created ARROW-18119:
---

 Summary: [C++] Utility method to ensure an array object meets 
an alignment requirement
 Key: ARROW-18119
 URL: https://issues.apache.org/jira/browse/ARROW-18119
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace


This would look something like:

EnsureAligned(Buffer|Array|ChunkedArray|RecordBatch|Table, int 
minimum_alignment, MemoryPool* memory_pool);

It would fail if the MemoryPool's alignment is less than minimum_alignment.
It would iterate through each buffer of the object; if a buffer is not 
aligned properly, it would reallocate and copy that buffer (using memory_pool).

It would return a new object where every buffer is guaranteed to meet the 
alignment requirements.
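
A minimal C++ sketch of the per-buffer step, assuming hypothetical names 
(EnsureBufferAligned is illustrative, not an existing Arrow API):

{noformat}
#include <cstdint>
#include <cstring>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Illustrative only: return the input buffer unchanged if it already meets
// the alignment requirement, otherwise reallocate from the pool and copy.
arrow::Result<std::shared_ptr<arrow::Buffer>> EnsureBufferAligned(
    std::shared_ptr<arrow::Buffer> buffer, int64_t minimum_alignment,
    arrow::MemoryPool* memory_pool) {
  if (reinterpret_cast<uintptr_t>(buffer->data()) % minimum_alignment == 0) {
    return buffer;  // already aligned; zero-copy
  }
  ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::Buffer> aligned,
                        arrow::AllocateBuffer(buffer->size(), memory_pool));
  std::memcpy(aligned->mutable_data(), buffer->data(), buffer->size());
  return std::shared_ptr<arrow::Buffer>(std::move(aligned));
}
{noformat}

The object-level EnsureAligned would apply this to every buffer and rebuild 
the Array/ChunkedArray/RecordBatch/Table around the (possibly replaced) 
buffers.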



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values

2022-10-20 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431
 ] 

Jonathan Swenson edited comment on ARROW-6575 at 10/21/22 1:14 AM:
---

Is this the correct method to use? It appears as though this still does not 
support negative numbers, and perhaps also does not support non-zero scale. 

If so, this is still present in 9.0.0. 

From the implementation in the C++ source, I believe there are two missing 
pieces. 

+ handling of negative values. (determine if negative and negate before 
rendering) – [determine if 
negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]
 | [negation 
|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]
 

+ scaling values appropriately (insert the decimal place / prepend leading 
zeros as necessary). 
[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]

 

The conversion to integer string is implemented in C++ 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]
 which includes checking if negative before rendering. 

 

I see values like:

+  101 (with scale 10, precision 13) renders as `10100` – where I 
expect this as `101.00`

+ -101 (with scale 10, precision 13) renders as 
`3402823669209385000` - I expect this as `-101.00`

I'm not really a JavaScript expert. A similar approach to the negation 
check and flipping appears to work, but I'm fairly confident I'm missing some 
edge cases.

 

General algorithm:

+ check to see if the high bits are negative 

+ if so, number is negative (prepend with "-") and add 1 to each "chunk" and 
handle overflows appropriately. 

+ render the string using current implementation. 

+ place decimal place (or prepend with 0s) based on the scale (see the sketch below). 
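
A minimal C++ sketch of the chunk-level pieces, assuming illustrative names 
(IsNegative, Negate, and ApplyScale are placeholders, not the actual Arrow or 
arrow-js API; the real fix would live in the JS decimalToString implementation):

{noformat}
#include <cstdint>
#include <string>
#include <vector>

// Chunks are little-endian 32-bit limbs of a two's-complement integer
// (e.g. four limbs for a 128-bit decimal).

// The value is negative iff the top bit of the highest limb is set.
bool IsNegative(const std::vector<uint32_t>& chunks) {
  return (chunks.back() & 0x80000000u) != 0;
}

// Two's-complement negation: invert every limb, then add 1 with carry.
void Negate(std::vector<uint32_t>& chunks) {
  uint64_t carry = 1;
  for (uint32_t& chunk : chunks) {
    uint64_t sum = static_cast<uint64_t>(~chunk) + carry;
    chunk = static_cast<uint32_t>(sum);
    carry = sum >> 32;
  }
}

// Place the decimal point `scale` digits from the right, prepending zeros
// when the unscaled digit string is too short, then apply the sign.
std::string ApplyScale(std::string digits, int scale, bool negative) {
  if (scale > 0) {
    while (static_cast<int>(digits.size()) <= scale) {
      digits.insert(digits.begin(), '0');
    }
    digits.insert(digits.size() - scale, ".");
  }
  return negative ? "-" + digits : digits;
}
{noformat}

toString would then check IsNegative, negate the limbs in place, render the 
unsigned digits with the current implementation, and finish with ApplyScale.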


was (Author: jswenson):
Is this the correct method to use? It appears as though this still does not 
support negative numbers, and perhaps also does not support non-zero scale. 

From the implementation in the C++ source, I believe there are two missing 
pieces. 

+ handling of negative values. (determine if negative and negate before 
rendering) – [determine if 
negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]
[negation | 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]
 

+ scaling values appropriately (insert the decimal place / prepend leading 
zeros as necessary). 
[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]

 

The conversion to integer string is implemented in C++ 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]
 which includes checking if negative before rendering. 

 

I see values like:

+  101 (with scale 10, precision 13) renders as `10100` – where I 
expect this as `101.00`

+ -101 (with scale 10, precision 13) renders as 
`3402823669209385000` - I expect this as `-101.00`

 

I'm not really a JavaScript expert. A similar approach to the negation 
check and flipping appears to work, but I'm fairly confident I'm missing some 
edge cases.

 

General algorithm:

+ check to see if the high bits are negative 

+ if so, number is negative (prepend with "-") and add 1 to each "chunk" and 
handle overflows appropriately. 

+ render the string using current implementation. 

+ place decimal place (or prepend with 0s) based on the scale. 

> [JS] decimal toString does not support negative values
> --
>
> Key: ARROW-6575
> URL: https://issues.apache.org/jira/browse/ARROW-6575
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Andong Zhan
>Priority: Critical
>
> The main description is here: [https://github.com/apache/arrow/issues/5397]
> Also, I have a simple test case (slightly changed generate-test-data.js and 
> generated-data-validators):
> {code:java}
> export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, 
> precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), 
> length, nullCount);
> function fillDecimal(length: number) {
> // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4
> const array = new Uint32Array(length);
> // const max = (2 ** (8 * BPE)) - 1;
> // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 
> ? -1 : 1));
> array[0] = 0;
> array[1] = 1286889712;
> array[2] = 2218195178;
> array[3] = 4282345521;
> array[4] = 0;
> array[5] = 16004768;
> array[6] = 3587851993;
> array[7] 

[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values

2022-10-20 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431
 ] 

Jonathan Swenson edited comment on ARROW-6575 at 10/21/22 1:11 AM:
---

Is this the correct method to use? It appears as though this still does not 
support negative numbers, and perhaps also does not support non-zero scale. 

From the implementation in the C++ source, I believe there are two missing 
pieces. 

+ handling of negative values. (determine if negative and negate before 
rendering) – [determine if 
negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]
[negation | 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]
 

+ scaling values appropriately (insert the decimal place / prepend leading 
zeros as necessary). 
[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]

 

The conversion to integer string is implemented in C++ 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]
 which includes checking if negative before rendering. 

 

I see values like:

+  101 (with scale 10, precision 13) renders as `10100` – where I 
expect this as `101.00`

+ -101 (with scale 10, precision 13) renders as 
`3402823669209385000` - I expect this as `-101.00`

 

I'm not really a JavaScript expert. A similar approach to the negation 
check and flipping appears to work, but I'm fairly confident I'm missing some 
edge cases.

 

General algorithm:

+ check to see if the high bits are negative 

+ if so, number is negative (prepend with "-") and add 1 to each "chunk" and 
handle overflows appropriately. 

+ render the string using current implementation. 

+ place decimal place (or prepend with 0s) based on the scale. 


was (Author: jswenson):
Is this the correct method to use? It appears as though this still does not 
support negative numbers, and perhaps also does not support non-zero scale. 

From the implementation in the C++ source, I believe there are two missing 
pieces. 

+ handling of negative values. (determine if negative and negate before 
rendering) – [[determine if 
negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]][[negation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]]
 

+ scaling values appropriately (insert the decimal place / prepend leading 
zeros as necessary). 
[[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]]

 

The conversion to integer string is implemented in C++ 
[[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]]
 which includes checking if negative before rendering. 

 

I see values like:

+  101 (with scale 10, precision 13) renders as `10100` – where I 
expect this as `101.00`

+ -101 (with scale 10, precision 13) renders as 
`3402823669209385000`- I expect this as `-101.00`

 

I'm not really a JavaScript expert. A similar approach to the negation 
check and flipping appears to work, but I'm fairly confident I'm missing some 
edge cases. 

 

General algorithm:

+ check to see if the high bits are negative 

+ if so, number is negative (prepend with "-") and add 1 to each "chunk" and 
handle overflows appropriately. 

+ render the string using current implementation. 

+ place decimal place (or prepend with 0s) based on the scale. 

> [JS] decimal toString does not support negative values
> --
>
> Key: ARROW-6575
> URL: https://issues.apache.org/jira/browse/ARROW-6575
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Andong Zhan
>Priority: Critical
>
> The main description is here: [https://github.com/apache/arrow/issues/5397]
> Also, I have a simple test case (slightly changed generate-test-data.js and 
> generated-data-validators):
> {code:java}
> export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, 
> precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), 
> length, nullCount);
> function fillDecimal(length: number) {
> // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4
> const array = new Uint32Array(length);
> // const max = (2 ** (8 * BPE)) - 1;
> // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 
> ? -1 : 1));
> array[0] = 0;
> array[1] = 1286889712;
> array[2] = 2218195178;
> array[3] = 4282345521;
> array[4] = 0;
> array[5] = 16004768;
> array[6] = 3587851993;
> array[7] = 126217744;
> return array;
> 

[jira] [Commented] (ARROW-6575) [JS] decimal toString does not support negative values

2022-10-20 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431
 ] 

Jonathan Swenson commented on ARROW-6575:
-

Is this the correct method to use? It appears as though this still does not 
support negative numbers, and perhaps also does not support non-zero scale. 

From the implementation in the C++ source, I believe there are two missing 
pieces. 

+ handling of negative values. (determine if negative and negate before 
rendering) – [[determine if 
negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]][[negation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]]
 

+ scaling values appropriately (insert the decimal place / prepend leading 
zeros as necessary). 
[[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]]

 

The conversion to integer string is implemented in C++ 
[[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]]
 which includes checking if negative before rendering. 

 

I see values like:

+  101 (with scale 10, precision 13) renders as `10100` – where I 
expect this as `101.00`

+ -101 (with scale 10, precision 13) renders as 
`3402823669209385000`- I expect this as `-101.00`

 

I'm not really a JavaScript expert. A similar approach to the negation 
check and flipping appears to work, but I'm fairly confident I'm missing some 
edge cases. 

 

General algorithm:

+ check to see if the high bits are negative 

+ if so, number is negative (prepend with "-") and add 1 to each "chunk" and 
handle overflows appropriately. 

+ render the string using current implementation. 

+ place decimal place (or prepend with 0s) based on the scale. 

> [JS] decimal toString does not support negative values
> --
>
> Key: ARROW-6575
> URL: https://issues.apache.org/jira/browse/ARROW-6575
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Andong Zhan
>Priority: Critical
>
> The main description is here: [https://github.com/apache/arrow/issues/5397]
> Also, I have a simple test case (slightly changed generate-test-data.js and 
> generated-data-validators):
> {code:java}
> export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, 
> precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), 
> length, nullCount);
> function fillDecimal(length: number) {
> // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4
> const array = new Uint32Array(length);
> // const max = (2 ** (8 * BPE)) - 1;
> // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 
> ? -1 : 1));
> array[0] = 0;
> array[1] = 1286889712;
> array[2] = 2218195178;
> array[3] = 4282345521;
> array[4] = 0;
> array[5] = 16004768;
> array[6] = 3587851993;
> array[7] = 126217744;
> return array;
> }
> {code}
> and the expected value should be
> {code:java}
> expect(vector.get(0).toString()).toBe('-1');
> expect(vector.get(1).toString()).toBe('1');
> {code}
> However, the actual first value is 339282366920938463463374607431768211456 
> which is wrong! The second value is correct by the way.
> I believe the bug is in the function decimalToString<T>(a: T), because it 
> cannot return a negative value at all.
> [arrow/js/src/util/bn.ts|https://github.com/apache/arrow/blob/d54425de19b7dbb2764a40355d76d1c785cf64ec/js/src/util/bn.ts#L99]
> Line 99 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621428#comment-17621428
 ] 

David Li commented on ARROW-18113:
--

Ah ok I was forgetting that we could theoretically split up reads. Thanks!

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, making it difficult to implement readers 
> capable of truly performing concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to this new component and keep improving the reading process (that 
> would be part of another set of follow-up tickets).  Similar ideas were 
> mentioned here: https://issues.apache.org/jira/browse/ARROW-17599
> Maybe a good place to implement this new capability is inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621427#comment-17621427
 ] 

Weston Pace edited comment on ARROW-18113 at 10/21/22 12:43 AM:


> Just to be clear: to the filesystem, or on the reader itself?

Oops, I mean on {{RandomAccessFile}}.

> Also, I'm not clear on: "Multiple returned futures may correspond to a single 
> read. Or, a single returned future may be a combined result of several 
> individual reads." Isn't this saying the same thing twice?

I might call:

{noformat}
file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})
{noformat}

The filesystem could then implement this as:

{noformat}
std::vector<Future<std::shared_ptr<Buffer>>> futures;
# The first two futures correspond to the same read
Future<std::shared_ptr<Buffer>> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));

# The third future corresponds to two reads
Future<std::shared_ptr<Buffer>> part_one = ReadAsync(1024, 8Mi);
Future<std::shared_ptr<Buffer>> part_two = ReadAsync(1024+8Mi, 8Mi-1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs =>
Concatenate(bufs)));
{noformat}


was (Author: westonpace):
> Just to be clear: to the filesystem, or on the reader itself?

Oops, I mean on {{RandomAccessFile}}.

> Also, I'm not clear on: "Multiple returned futures may correspond to a single 
> read. Or, a single returned future may be a combined result of several 
> individual reads." Isn't this saying the same thing twice?

I might call {{file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})}}.

The filesystem could then implement this as:

{noformat}
std::vector<Future<std::shared_ptr<Buffer>>> futures;
# The first two futures correspond to the same read
Future<std::shared_ptr<Buffer>> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));

# The third future corresponds to two reads
Future<std::shared_ptr<Buffer>> part_one = ReadAsync(1024, 8Mi);
Future<std::shared_ptr<Buffer>> part_two = ReadAsync(1024+8Mi, 8Mi-1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs =>
Concatenate(bufs)));
{noformat}

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, making it difficult to implement readers 
> capable of truly performing concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to this new component and keep improving the reading process (that 
> would be part of another set of follow-up tickets).  Similar ideas were 
> mentioned here: https://issues.apache.org/jira/browse/ARROW-17599
> Maybe a good place to implement this new capability is inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621427#comment-17621427
 ] 

Weston Pace commented on ARROW-18113:
-

> Just to be clear: to the filesystem, or on the reader itself?

Oops, I mean on {{RandomAccessFile}}.

> Also, I'm not clear on: "Multiple returned futures may correspond to a single 
> read. Or, a single returned future may be a combined result of several 
> individual reads." Isn't this saying the same thing twice?

I might call {{file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})}}.

The filesystem could then implement this as:

{noformat}
std::vector<Future<std::shared_ptr<Buffer>>> futures;
# The first two futures correspond to the same read
Future<std::shared_ptr<Buffer>> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));

# The third future corresponds to two reads
Future<std::shared_ptr<Buffer>> part_one = ReadAsync(1024, 8Mi);
Future<std::shared_ptr<Buffer>> part_two = ReadAsync(1024+8Mi, 8Mi-1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs =>
Concatenate(bufs)));
{noformat}

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, making it difficult to implement readers 
> capable of truly performing concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to this new component and keep improving the reading process (that 
> would be part of another set of follow-up tickets).  Similar ideas were 
> mentioned here: https://issues.apache.org/jira/browse/ARROW-17599
> Maybe a good place to implement this new capability is inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18118) [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18118:
---
Labels: pull-request-available  (was: )

> [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0
> --
>
> Key: ARROW-18118
> URL: https://issues.apache.org/jira/browse/ARROW-18118
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> - Wrong variable names are used
> - Missing variable definitions
> - Requiring multiple environment variables for GitHub Personal Access Token



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621425#comment-17621425
 ] 

Weston Pace commented on ARROW-18102:
-

Supposedly both behaviors are useful (returning null is SQL-standards 
compliant; see 
https://database.guide/sqlite-sum-vs-total-whats-the-difference/).  I think we 
could add an option to the sum function.  See also 
https://github.com/substrait-io/substrait/issues/259; I expect we will 
eventually need both behaviors.

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ---
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
>Reporter: Adam Black
>Priority: Major
>
> I'm using dplyr with FileSystemDataset objects. The expected behavior is 
> similar to (or the same as) data frame behavior. When the FileSystemDataset 
> has zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would 
> expect the result to be 0.
>  
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> path <- tempfile(fileext = ".feather")
> zero_row_dataset <- cars %>% filter(dist < 0)
> # expected behavior
> zero_row_dataset %>% 
>   count()
> #>   n
> #> 1 0
> zero_row_dataset %>% 
>   tally()
> #>   n
> #> 1 0
> nrow(zero_row_dataset)
> #> [1] 0
> # now test behavior with a FileSystemDataset
> write_feather(zero_row_dataset, path)
> ds <- open_dataset(path, format = "feather")
> ds
> #> FileSystemDataset with 1 Feather file
> #> speed: double
> #> dist: double
> #> 
> #> See $metadata for additional Schema metadata
> # actual behavior
> ds %>% 
>   count() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> ds %>% 
>   tally() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> nrow(ds) # works as expected
> #> [1] 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18118) [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0

2022-10-20 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18118:


 Summary: [Release][Dev] 02-source.sh/03-binary-submit.sh didn't 
work for 10.0.0
 Key: ARROW-18118
 URL: https://issues.apache.org/jira/browse/ARROW-18118
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


- Wrong variable names are used
- Missing variable definitions
- Requiring multiple environment variables for GitHub Personal Access Token



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17207:
---
Labels: pull-request-available  (was: )

> [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
> --
>
> Key: ARROW-17207
> URL: https://issues.apache.org/jira/browse/ARROW-17207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`:
> {code:java}
> 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 
> sec
> Running arrow-compute-scalar-test, redirecting output into 
> /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code}
> Job failure example: 
> [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true]
> I've realized that even when it runs successfully it takes around 4 minutes 
> (timeout is 5 minutes):
> {code:java}
>  32/85 Test #29: arrow-compute-scalar-test .   Passed  229.77 
> sec{code}
> Should these tests be split?
> {code:java}
> add_arrow_compute_test(scalar_test
>                        SOURCES
>                        scalar_arithmetic_test.cc
>                        scalar_boolean_test.cc
>                        scalar_cast_test.cc
>                        scalar_compare_test.cc
>                        scalar_if_else_test.cc
>                        scalar_nested_test.cc
>                        scalar_random_test.cc
>                        scalar_set_lookup_test.cc
>                        scalar_string_test.cc
>                        scalar_temporal_test.cc
>                        scalar_validity_test.cc
>                        test_util.cc) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18117) [C++] Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18117.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14465
[https://github.com/apache/arrow/pull/14465]

> [C++] Absl symbols not included in `arrow_bundled_dependencies`
> ---
>
> Key: ARROW-18117
> URL: https://issues.apache.org/jira/browse/ARROW-18117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yves Le Maout
>Assignee: Yves Le Maout
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18117) [C++] Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18117:
-
Summary: [C++] Absl symbols not included in `arrow_bundled_dependencies`  
(was: Absl symbols not included in `arrow_bundled_dependencies`)

> [C++] Absl symbols not included in `arrow_bundled_dependencies`
> ---
>
> Key: ARROW-18117
> URL: https://issues.apache.org/jira/browse/ARROW-18117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yves Le Maout
>Assignee: Yves Le Maout
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621415#comment-17621415
 ] 

David Li commented on ARROW-17783:
--

I guess if you're going for 512 byte alignment then you'd have to deal with 
this anyways though.

> [C++] Aggregate kernel should not mandate alignment
> ---
>
> Key: ARROW-17783
> URL: https://issues.apache.org/jira/browse/ARROW-17783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.0, 8.0.0
>Reporter: Yifei Yang
>Assignee: Weston Pace
>Priority: Major
> Attachments: flight-alignment-test.zip
>
>
> When using Arrow's aggregate kernel with a table transferred from Arrow Flight 
> (DoGet), it may crash at arrow::util::CheckAlignment(). However, using the 
> original data it works well; also, if I first serialize the transferred table 
> into bytes and then recreate an Arrow table from the bytes, it works well.
> "flight-alignment-test" attached is the minimal test that can reproduce the 
> issue, which basically does "sum(total_revenue) group by l_suppkey" using the 
> table from "DoGet()". ("DummyNode" is just used as the producer of the 
> aggregate node, as a producer is required to create the aggregate node.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621414#comment-17621414
 ] 

David Li commented on ARROW-17783:
--

It should also not be too bad to fix this in Flight (given gRPC generally 
forces a copy on us anyways); we would only lose the zero-copy in the case that 
the batch fits in a single gRPC slice (which is presumably relatively small, 
but I'd have to check what a typical size is). 

> [C++] Aggregate kernel should not mandate alignment
> ---
>
> Key: ARROW-17783
> URL: https://issues.apache.org/jira/browse/ARROW-17783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.0, 8.0.0
>Reporter: Yifei Yang
>Assignee: Weston Pace
>Priority: Major
> Attachments: flight-alignment-test.zip
>
>
> When using Arrow's aggregate kernel with a table transferred from Arrow Flight 
> (DoGet), it may crash at arrow::util::CheckAlignment(). However, using the 
> original data it works well; also, if I first serialize the transferred table 
> into bytes and then recreate an Arrow table from the bytes, it works well.
> "flight-alignment-test" attached is the minimal test that can reproduce the 
> issue, which basically does "sum(total_revenue) group by l_suppkey" using the 
> table from "DoGet()". ("DummyNode" is just used as the producer of the 
> aggregate node, as a producer is required to create the aggregate node.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18117:
---
Labels: pull-request-available  (was: )

> Absl symbols not included in `arrow_bundled_dependencies`
> -
>
> Key: ARROW-18117
> URL: https://issues.apache.org/jira/browse/ARROW-18117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yves Le Maout
>Assignee: Yves Le Maout
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread Yves Le Maout (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yves Le Maout updated ARROW-18117:
--
Component/s: C++

> Absl symbols not included in `arrow_bundled_dependencies`
> -
>
> Key: ARROW-18117
> URL: https://issues.apache.org/jira/browse/ARROW-18117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yves Le Maout
>Assignee: Yves Le Maout
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread Yves Le Maout (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yves Le Maout reassigned ARROW-18117:
-

Assignee: Yves Le Maout

> Absl symbols not included in `arrow_bundled_dependencies`
> -
>
> Key: ARROW-18117
> URL: https://issues.apache.org/jira/browse/ARROW-18117
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Yves Le Maout
>Assignee: Yves Le Maout
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`

2022-10-20 Thread Yves Le Maout (Jira)
Yves Le Maout created ARROW-18117:
-

 Summary: Absl symbols not included in `arrow_bundled_dependencies`
 Key: ARROW-18117
 URL: https://issues.apache.org/jira/browse/ARROW-18117
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yves Le Maout






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test

2022-10-20 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-17207:
---

Assignee: Weston Pace

> [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
> --
>
> Key: ARROW-17207
> URL: https://issues.apache.org/jira/browse/ARROW-17207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Weston Pace
>Priority: Major
> Fix For: 11.0.0
>
>
> Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`:
> {code:java}
> 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 
> sec
> Running arrow-compute-scalar-test, redirecting output into 
> /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code}
> Job failure example: 
> [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true]
> I've realized that even when it runs successfully it takes around 4 minutes 
> (timeout is 5 minutes):
> {code:java}
>  32/85 Test #29: arrow-compute-scalar-test .   Passed  229.77 
> sec{code}
> Should these tests be split?
> {code:java}
> add_arrow_compute_test(scalar_test
>                        SOURCES
>                        scalar_arithmetic_test.cc
>                        scalar_boolean_test.cc
>                        scalar_cast_test.cc
>                        scalar_compare_test.cc
>                        scalar_if_else_test.cc
>                        scalar_nested_test.cc
>                        scalar_random_test.cc
>                        scalar_set_lookup_test.cc
>                        scalar_string_test.cc
>                        scalar_temporal_test.cc
>                        scalar_validity_test.cc
>                        test_util.cc) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621404#comment-17621404
 ] 

Weston Pace commented on ARROW-17207:
-

In this case I think I would prefer splitting.  There is no real upper bound on 
the number of scalar kernels we might add, and I don't think most of our scalar 
tests run in parallel, so we might get some performance benefit (probably just 
on many-core dev machines) from splitting.  I'll make a quick PR.

I'll also check to see if any of them seem unreasonably slow.

> [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
> --
>
> Key: ARROW-17207
> URL: https://issues.apache.org/jira/browse/ARROW-17207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 11.0.0
>
>
> Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`:
> {code:java}
> 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 
> sec
> Running arrow-compute-scalar-test, redirecting output into 
> /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code}
> Job failure example: 
> [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true]
> I've realized that even when it runs successfully it takes around 4 minutes 
> (timeout is 5 minutes):
> {code:java}
>  32/85 Test #29: arrow-compute-scalar-test .   Passed  229.77 
> sec{code}
> Should these tests be split?
> {code:java}
> add_arrow_compute_test(scalar_test
>                        SOURCES
>                        scalar_arithmetic_test.cc
>                        scalar_boolean_test.cc
>                        scalar_cast_test.cc
>                        scalar_compare_test.cc
>                        scalar_if_else_test.cc
>                        scalar_nested_test.cc
>                        scalar_random_test.cc
>                        scalar_set_lookup_test.cc
>                        scalar_string_test.cc
>                        scalar_temporal_test.cc
>                        scalar_validity_test.cc
>                        test_util.cc) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621402#comment-17621402
 ] 

Weston Pace commented on ARROW-17783:
-

My concern is less about performance and more about complexity and testing.  
I've made a proposal at ARROW-18115 which would address this (albeit this 
particular case would take a performance hit).

> [C++] Aggregate kernel should not mandate alignment
> ---
>
> Key: ARROW-17783
> URL: https://issues.apache.org/jira/browse/ARROW-17783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.0, 8.0.0
>Reporter: Yifei Yang
>Assignee: Weston Pace
>Priority: Major
> Attachments: flight-alignment-test.zip
>
>
> When using Arrow's aggregate kernel with a table transferred from Arrow Flight 
> (DoGet), it may crash at arrow::util::CheckAlignment(). However, using the 
> original data it works well; also, if I first serialize the transferred table 
> into bytes and then recreate an Arrow table from the bytes, it works well.
> "flight-alignment-test" attached is the minimal test that can reproduce the 
> issue, which basically does "sum(total_revenue) group by l_suppkey" using the 
> table from "DoGet()". ("DummyNode" is just used as the producer of the 
> aggregate node, as a producer is required to create the aggregate node.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621401#comment-17621401
 ] 

David Li commented on ARROW-18113:
--

Just to be clear: to the filesystem, or on the reader itself?

Also, I'm not clear on: "Multiple returned futures may correspond to a single 
read.  Or, a single returned future may be a combined result of several 
individual reads." Isn't this saying the same thing twice?

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to it and keep improving the reading process (that would be part of 
> another set of follow-up tickets).  Similar ideas were mentioned here: 
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621398#comment-17621398
 ] 

Weston Pace commented on ARROW-18113:
-

On reflection, I don't really prefer my automagic suggestion.  I think an 
explicit multi-read API added to the filesystem would be a good way to go.  I 
don't see it as an extension of ReadAsync though.  Something like:

{noformat}
  /// \brief Request multiple reads at once
  ///
  /// The underlying filesystem may optimize these reads by coalescing small 
reads into
  /// large reads or by breaking up large reads into multiple parallel smaller 
reads.  The
  /// reads should be issued in parallel if it makes sense for the filesystem.
  ///
  /// One future will be returned for each input read range.  Multiple returned 
futures
  /// may correspond to a single read.  Or, a single returned future may be a 
combined
  /// result of several individual reads.
  ///
  /// \param[in] ranges The ranges to read
  /// \return For each input range, a future that will complete with the data
  /// from that range
  virtual std::vector<Future<std::shared_ptr<Buffer>>> ReadManyAsync(
      const IOContext&, const std::vector<ReadRange>& ranges);
{noformat}

There could be a default implementation (perhaps relying on configurable 
protected min_hole_size_ and max_contiguous_read_size_ variables) so that 
filesystems would only need to provide a specialized alternative where it made 
sense.  In the future it would be interesting to benchmark and see if 
[preadv|https://linux.die.net/man/2/preadv] can be used to provide a more 
optimized version for the local filesystem.
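
As a rough sketch of what that default could look like (all names here are 
illustrative: ReadRange stands in for the real struct, and min_hole_size / 
max_contiguous_read_size mirror the configurable variables suggested above):

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for the real read-range struct.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Two-pass coalescing: merge ranges separated by holes smaller than
// min_hole_size, then split merged reads larger than
// max_contiguous_read_size so they can be issued in parallel.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t min_hole_size,
                                      int64_t max_contiguous_read_size) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty() &&
        r.offset - (merged.back().offset + merged.back().length) <
            min_hole_size) {
      // Small hole: extend the previous read to also cover this range.
      int64_t end = std::max(merged.back().offset + merged.back().length,
                             r.offset + r.length);
      merged.back().length = end - merged.back().offset;
    } else {
      merged.push_back(r);
    }
  }
  std::vector<ReadRange> out;
  for (const ReadRange& m : merged) {
    for (int64_t off = m.offset; off < m.offset + m.length;
         off += max_contiguous_read_size) {
      out.push_back({off, std::min(max_contiguous_read_size,
                                   m.offset + m.length - off)});
    }
  }
  return out;
}
{code}

The merge pass is how several requested ranges end up sharing one physical 
read, and the split pass is how one requested range ends up spread over 
several; those are the two cases the doc comment's \return text describes.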

I'd also be curious to know how an API like this could be adapted (or whether 
my proposal fits) for something like io_uring [~sakras]

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to it and keep improving the reading process (that would be part of 
> another set of follow-up tickets).  Similar ideas were mentioned here: 
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette

2022-10-20 Thread Stephanie Hazlitt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephanie Hazlitt updated ARROW-18116:
--
Description: 
{{The S3 file paths don't run:}}
{code:java}
> library(arrow) 
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
{{It looks like the file names are `part-0.parquet` not `data.parquet`.}}

{{This runs:}}
{code:java}
read_parquet(file = 
"s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
 

  was:
{{The S3 file paths don't run:}}

 
{code:java}
> library(arrow) 
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
{{It looks like the file names are `part-0.parquet` not `data.parquet`.}}

{{This runs:}}


{code:java}
read_parquet(file = 
"s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
 


> [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
> --
>
> Key: ARROW-18116
> URL: https://issues.apache.org/jira/browse/ARROW-18116
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Major
>
> {{The S3 file paths don't run:}}
> {code:java}
> > library(arrow) 
> > read_parquet(file = 
> > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
> Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
> {{It looks like the file names are `part-0.parquet` not `data.parquet`.}}
> {{This runs:}}
> {code:java}
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette

2022-10-20 Thread Stephanie Hazlitt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephanie Hazlitt updated ARROW-18116:
--
Description: 
{{The S3 file paths don't run:}}

 
{code:java}
> library(arrow) 
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
{{It looks like the file names are `part-0.parquet` not `data.parquet`.}}

{{This runs:}}


{code:java}
read_parquet(file = 
"s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
 

  was:
{{The S3 file paths don't run:}}

{{}}
{code:java}
> library(arrow) 
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
Error in url(file, open = "rb") : URL scheme unsupported by this method{code}

{{It looks like the file names are `part-0.parquet` not `data.parquet`.}}

{{This runs:}}
{{}}
{code:java}
read_parquet(file = 
"s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}

{{}}


> [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
> --
>
> Key: ARROW-18116
> URL: https://issues.apache.org/jira/browse/ARROW-18116
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Major
>
> {{The S3 file paths don't run:}}
>  
> {code:java}
> > library(arrow) 
> > read_parquet(file = 
> > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
> Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
> {{It looks like the file names are `part-0.parquet` not `data.parquet`.}}
> {{This runs:}}
> {code:java}
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette

2022-10-20 Thread Stephanie Hazlitt (Jira)
Stephanie Hazlitt created ARROW-18116:
-

 Summary: [R][Doc] correct paths for the read_parquet examples in 
cloud storage vignette
 Key: ARROW-18116
 URL: https://issues.apache.org/jira/browse/ARROW-18116
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, R
Reporter: Stephanie Hazlitt


{{The S3 file paths don't run:}}

{{}}
{code:java}
> library(arrow) 
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
Error in url(file, open = "rb") : URL scheme unsupported by this method{code}

{{It looks like the file names are `part-0.parquet` not `data.parquet`.}}

{{This runs:}}
{{}}
{code:java}
read_parquet(file = 
"s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}

{{}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18115) [C++] Acero buffer alignment

2022-10-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621395#comment-17621395
 ] 

Weston Pace commented on ARROW-18115:
-

CC [~apitrou][~sakras][~michalno][~marsupialtail][~bkietz] and maybe 
[~rtpsw][~icexelloss] would be interested.

> [C++] Acero buffer alignment
> 
>
> Key: ARROW-18115
> URL: https://issues.apache.org/jira/browse/ARROW-18115
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This is a general JIRA to centralize some discussion / proposal on a buffer 
> alignment strategy for Acero based on discussions that have happened in a few 
> different contexts.  Any actual work will probably span multiple JIRAs, some 
> of which are already in progress.
> Motivations:
> * Incoming data may not be aligned at all.  Some kernel functions and parts 
> of Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g. 
> intrinsics) and will fail / cause corruption if care is not taken to use 
> unaligned loads (e.g. ARROW-17783).  There are parts of the code that assume 
> fixed arrays with size T are at least T aligned.  This is generally a safe 
> assumption for data generated by arrow-c++ (which allocates all buffers with 
> 64 byte alignment) but then leads to errors when processing data from flight.
> * Dataset writes and mid-plan spilling both can benefit from direct I/O, less 
> for performance reasons and more for memory management reasons.  However, in 
> order to use direct I/O a buffer needs to be aligned, typically to 512 bytes. 
>  This is larger than the current minimum alignment requirements.
> Proposal:
>  * Allow the minimum alignment of a memory pool to be configurable.  This is 
> similar to the proposal of ARROW-17836 but does not require much of an API 
> change (at the expense of being slightly less flexible).
>  * Add a capability to realign a buffer/array/batch/table to a target 
> alignment.  This would only modify buffers that are not already aligned.  
> Basically, given an arrow object, and a target memory pool, migrate a buffer 
> to the target memory pool if its alignment is insufficient.
>  * Acero, in the source node, forces all buffers to be 64 byte aligned.  This 
> way the internals of Acero don't have to worry about this case.  This 
> introduces a performance penalty when buffers are not aligned but is much 
> simpler to maintain and test than trying to support any random buffer.  To 
> avoid this penalty it would be simpler to avoid the unaligned buffers in the 
> first place.
>  * Acero requires a memory pool that has 512-byte alignment so that 
> Acero-allocated buffers can use direct I/O.  If the default memory pool does 
> not have 512-byte alignment then Acero can use a per-plan pool.  This covers 
> the common case for spilling and dataset writing which is that we are 
> partitioning prior to the write and so we are writing Acero-allocated buffers 
> anyways.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18115) [C++] Acero buffer alignment

2022-10-20 Thread Weston Pace (Jira)
Weston Pace created ARROW-18115:
---

 Summary: [C++] Acero buffer alignment
 Key: ARROW-18115
 URL: https://issues.apache.org/jira/browse/ARROW-18115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


This is a general JIRA to centralize some discussion / proposal on a buffer 
alignment strategy for Acero based on discussions that have happened in a few 
different contexts.  Any actual work will probably span multiple JIRAs, some of 
which are already in progress.

Motivations:

* Incoming data may not be aligned at all.  Some kernel functions and parts of 
Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g. intrinsics) 
and will fail / cause corruption if care is not taken to use unaligned loads 
(e.g. ARROW-17783).  There are parts of the code that assume fixed arrays with 
size T are at least T aligned.  This is generally a safe assumption for data 
generated by arrow-c++ (which allocates all buffers with 64 byte alignment) but 
then leads to errors when processing data from flight.
* Dataset writes and mid-plan spilling both can benefit from direct I/O, less 
for performance reasons and more for memory management reasons.  However, in 
order to use direct I/O a buffer needs to be aligned, typically to 512 bytes.  
This is larger than the current minimum alignment requirements.

Proposal:
 * Allow the minimum alignment of a memory pool to be configurable.  This is 
similar to the proposal of ARROW-17836 but does not require much of an API 
change (at the expense of being slightly less flexible).
 * Add a capability to realign a buffer/array/batch/table to a target 
alignment.  This would only modify buffers that are not already aligned.  
Basically, given an arrow object, and a target memory pool, migrate a buffer to 
the target memory pool if its alignment is insufficient (see the sketch after 
this list).
 * Acero, in the source node, forces all buffers to be 64 byte aligned.  This 
way the internals of Acero don't have to worry about this case.  This 
introduces a performance penalty when buffers are not aligned but is much 
simpler to maintain and test than trying to support any random buffer.  To 
avoid this penalty it would be simpler to avoid the unaligned buffers in the 
first place.
 * Acero requires a memory pool that has 512-byte alignment so that 
Acero-allocated buffers can use direct I/O.  If the default memory pool does 
not have 512-byte alignment then Acero can use a per-plan pool.  This covers 
the common case for spilling and dataset writing which is that we are 
partitioning prior to the write and so we are writing Acero-allocated buffers 
anyways.
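
A minimal sketch of that realignment capability, under stated assumptions: 
plain std::aligned_alloc (C++17) stands in for a target memory pool, and 
IsAligned / RealignIfNeeded are hypothetical names, not Arrow's API:

{code:cpp}
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>

// True if ptr is aligned to `alignment` bytes (alignment: a power of two).
bool IsAligned(const void* ptr, std::uintptr_t alignment) {
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Return the input buffer unchanged when it is already sufficiently aligned
// (zero-copy), otherwise copy it into a freshly allocated aligned buffer.
std::shared_ptr<uint8_t> RealignIfNeeded(std::shared_ptr<uint8_t> data,
                                         std::size_t size,
                                         std::size_t alignment) {
  if (IsAligned(data.get(), alignment)) {
    return data;  // common case: no copy, no penalty
  }
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  std::size_t padded = (size + alignment - 1) / alignment * alignment;
  auto* raw = static_cast<uint8_t*>(std::aligned_alloc(alignment, padded));
  std::memcpy(raw, data.get(), size);
  return std::shared_ptr<uint8_t>(raw, [](uint8_t* p) { std::free(p); });
}
{code}

The zero-copy early return is what keeps the common, already-aligned case 
free of any penalty.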





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17599) [C++] ReadRangeCache should not retain data after read

2022-10-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621368#comment-17621368
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-17599:


Another follow-up/related ticket:

https://issues.apache.org/jira/browse/ARROW-18113

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>  Labels: good-second-issue, pull-request-available
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.
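
One possible shape for that reference counting, sketched with standard 
containers rather than Arrow's types (entry ids, the buffer type, and both 
method names are hypothetical, not the actual design):

{code:cpp}
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// A cached coalesced read; `pending_consumers` counts the requested ranges
// that still need to read from this entry's buffer.
struct CacheEntry {
  std::shared_ptr<std::vector<uint8_t>> buffer;  // stand-in for arrow::Buffer
  int pending_consumers;
};

class RefCountedRangeCache {
 public:
  // Register a coalesced entry shared by `consumers` requested ranges.
  void Register(int64_t entry_id, int consumers,
                std::shared_ptr<std::vector<uint8_t>> buffer) {
    entries_[entry_id] = CacheEntry{std::move(buffer), consumers};
  }

  // Called after one requested range has been fully read. Once the last
  // consumer finishes, the buffer is dropped immediately rather than being
  // held until the whole file reader is destroyed.
  void MarkConsumed(int64_t entry_id) {
    auto it = entries_.find(entry_id);
    if (it != entries_.end() && --it->second.pending_consumers == 0) {
      entries_.erase(it);
    }
  }

 private:
  std::map<int64_t, CacheEntry> entries_;
};
{code}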



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621309#comment-17621309
 ] 

David Li commented on ARROW-18113:
--

(Of course, that's only if you choose to go with an explicit API, vs Weston's 
suggestion of possibly doing it 'automagically')

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to it and keep improving the reading process (that would be part of 
> another set of follow-up tickets).  Similar ideas were mentioned here: 
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621308#comment-17621308
 ] 

David Li commented on ARROW-18113:
--

I'd like to raise my comments in ARROW-17913 and ARROW-17917 as well, 
especially 
https://issues.apache.org/jira/browse/ARROW-17913?focusedCommentId=17614155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17614155
 . Note Weston offers suggestions there too about how we might handle this.

The API would probably be an extension of RandomAccessFile::ReadAsync. The file 
system would come into play by returning a subclass of RAF that overrides the 
new method and does coalescing appropriate to the underlying device.

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part).  Once we have that new capability, we can port the parquet and IPC 
> readers to it and keep improving the reading process (that would be part of 
> another set of follow-up tickets).  Similar ideas were mentioned here: 
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times

2022-10-20 Thread Carl Boettiger (Jira)
Carl Boettiger created ARROW-18114:
--

 Summary: [R] unify_schemas=FALSE does not improve open_dataset() 
read times
 Key: ARROW-18114
 URL: https://issues.apache.org/jira/browse/ARROW-18114
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Carl Boettiger


open_dataset() provides the very helpful optional argument to set 
unify_schemas=FALSE, which should allow arrow to inspect a single parquet file 
instead of touching potentially thousands or more parquet files to determine a 
consistent unified schema.  This ought to provide a substantial performance 
increase in contexts where the schema is known in advance. 

Unfortunately, in my tests it seems to have no impact on performance.  Consider 
the following reprexes:

default, unify_schemas=TRUE:
{code:R}
library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org", anonymous = TRUE)

bench::bench_time({
  open_dataset(ex)
})
{code}
about 32 seconds for me.

manual, unify_schemas=FALSE:
{code:R}
bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})
{code}
takes about 32 seconds as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18113) Implement a read range process without caching

2022-10-20 Thread Jira
Percy Camilo Triveño Aucahuasi created ARROW-18113:
--

 Summary: Implement a read range process without caching
 Key: ARROW-18113
 URL: https://issues.apache.org/jira/browse/ARROW-18113
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Percy Camilo Triveño Aucahuasi
Assignee: Percy Camilo Triveño Aucahuasi


The current 
[ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
 mixes caching with coalescing, which makes it difficult to implement readers 
that can truly perform concurrent reads on coalesced data (see this [github 
comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
additional context); for instance, right now the prebuffering feature of those 
readers cannot handle concurrent invocations.

The goal for this ticket is to implement a component similar to ReadRangeCache 
that performs non-cached reads (doing only the coalescing part).  Once we have 
that new capability, we can port the parquet and IPC readers to it and keep 
improving the reading process (that would be part of another set of follow-up 
tickets).  Similar ideas were mentioned here: 
https://issues.apache.org/jira/browse/ARROW-17599

A good place to implement this new capability may be inside the file system 
abstraction (as part of a dedicated method to read coalesced data), where 
the abstract file system can provide a default implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15641) [C++][Python] UDF Aggregate Function Implementation

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621290#comment-17621290
 ] 

Apache Arrow JIRA Bot commented on ARROW-15641:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Python] UDF Aggregate Function Implementation
> ---
>
> Key: ARROW-15641
> URL: https://issues.apache.org/jira/browse/ARROW-15641
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> Here we will be implementing the UDF support for aggregate functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16041) [C++][Python] Include UDFOptions

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621291#comment-17621291
 ] 

Apache Arrow JIRA Bot commented on ARROW-16041:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Python] Include UDFOptions
> 
>
> Key: ARROW-16041
> URL: https://issues.apache.org/jira/browse/ARROW-16041
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In the first stage of the development, we do not support function options to 
> be taken from Python UDFs. But this is a feature that is required for 
> advanced users. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15646) [C++][Python] UDF Vector Function Implementation

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621288#comment-17621288
 ] 

Apache Arrow JIRA Bot commented on ARROW-15646:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Python] UDF Vector Function Implementation 
> -
>
> Key: ARROW-15646
> URL: https://issues.apache.org/jira/browse/ARROW-15646
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> Here we will implement the vector functions for UDFs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15641) [C++][Python] UDF Aggregate Function Implementation

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-15641:
-

Assignee: (was: Vibhatha Lakmal Abeykoon)

> [C++][Python] UDF Aggregate Function Implementation
> ---
>
> Key: ARROW-15641
> URL: https://issues.apache.org/jira/browse/ARROW-15641
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> Here we will be implementing the UDF support for aggregate functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15637) [C++][Python] UDF Optimizations

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-15637:
-

Assignee: (was: Vibhatha Lakmal Abeykoon)

> [C++][Python] UDF Optimizations
> ---
>
> Key: ARROW-15637
> URL: https://issues.apache.org/jira/browse/ARROW-15637
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Minor
>
> Need an interface to evaluate the memory footprint, execution time and health 
> of the UDFs and return a meaningful status, e.g. 
> `Status::HighMemoryUsageException()`, `Status::TimeLimitException()`
> Note: This is also aligned with resource monitoring in the parallel execution 
> space. 
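
A rough sketch of the kind of guard this asks for, with a purely illustrative 
Status type (not Arrow's arrow::Status) and wall-clock measurement only:

{code:cpp}
#include <chrono>
#include <functional>
#include <string>

// Purely illustrative Status type; Arrow's real arrow::Status differs.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status TimeLimitExceeded(double seconds) {
    return {false, "UDF exceeded time limit after " +
                       std::to_string(seconds) + "s"};
  }
};

// Run a UDF and report whether it stayed within a wall-clock budget.
// A real interface would also need cooperative cancellation and memory
// accounting (for the HighMemoryUsageException case) via the memory pool.
Status RunWithTimeLimit(const std::function<void()>& udf, double limit_secs) {
  auto start = std::chrono::steady_clock::now();
  udf();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  if (elapsed.count() > limit_secs) {
    return Status::TimeLimitExceeded(elapsed.count());
  }
  return Status::OK();
}
{code}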



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16041) [C++][Python] Include UDFOptions

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-16041:
-

Assignee: (was: Vibhatha Lakmal Abeykoon)

> [C++][Python] Include UDFOptions
> 
>
> Key: ARROW-16041
> URL: https://issues.apache.org/jira/browse/ARROW-16041
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In the first stage of the development, we do not support function options to 
> be taken from Python UDFs. But this is a feature that is required for 
> advanced users. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15646) [C++][Python] UDF Vector Function Implementation

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-15646:
-

Assignee: (was: Vibhatha Lakmal Abeykoon)

> [C++][Python] UDF Vector Function Implementation 
> -
>
> Key: ARROW-15646
> URL: https://issues.apache.org/jira/browse/ARROW-15646
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> Here we will implement the vector functions for UDFs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15637) [C++][Python] UDF Optimizations

2022-10-20 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621289#comment-17621289
 ] 

Apache Arrow JIRA Bot commented on ARROW-15637:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Python] UDF Optimizations
> ---
>
> Key: ARROW-15637
> URL: https://issues.apache.org/jira/browse/ARROW-15637
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Minor
>
> Need an interface to evaluate the memory footprint, execution time and health 
> of the UDFs and return a meaningful status, e.g. 
> `Status::HighMemoryUsageException()`, `Status::TimeLimitException()`
> Note: This is also aligned with resource monitoring in the parallel execution 
> space. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17711) [Go] RLE Arrays To/From JSON

2022-10-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol closed ARROW-17711.
-
Resolution: Duplicate

> [Go] RLE Arrays To/From JSON
> 
>
> Key: ARROW-17711
> URL: https://issues.apache.org/jira/browse/ARROW-17711
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621277#comment-17621277
 ] 

Ben Harkins commented on ARROW-18106:
-

That is indeed unexpected... especially since it comes back as a plain string 
in the first case. I suspect it's an issue with timestamps specifically (or 
potentially any non-string type with a json string representation). Test 
coverage seems to be lacking in this area.

I'll take a look at it.

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a result with a string column 
> as result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be this is specific to timestamps, I don't directly see a similar 
> issue with eg {{"column": "A"}} and setting the schema to "column" being 
> int64.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18112) [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. rounding, log/ln, etc.)

2022-10-20 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18112:
-

 Summary: [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. 
rounding, log/ln, etc.)
 Key: ARROW-18112
 URL: https://issues.apache.org/jira/browse/ARROW-18112
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18110) [Go] Scalar Comparisons

2022-10-20 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18110:
-

 Summary: [Go] Scalar Comparisons
 Key: ARROW-18110
 URL: https://issues.apache.org/jira/browse/ARROW-18110
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18111) [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)

2022-10-20 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18111:
-

 Summary: [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)
 Key: ARROW-18111
 URL: https://issues.apache.org/jira/browse/ARROW-18111
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18109) [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign)

2022-10-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-18109:
--
Summary: [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign)  (was: [Go] 
Scalar Unary Arithmetic)

> [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign)
> ---
>
> Key: ARROW-18109
> URL: https://issues.apache.org/jira/browse/ARROW-18109
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18109) [Go] Scalar Unary Arithmetic

2022-10-20 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18109:
-

 Summary: [Go] Scalar Unary Arithmetic
 Key: ARROW-18109
 URL: https://issues.apache.org/jira/browse/ARROW-18109
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18108) [Go] More Scalar Binary Arithmetic (Multiply & Divide)

2022-10-20 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18108:
-

 Summary: [Go] More Scalar Binary Arithmetic (Multiply & Divide)
 Key: ARROW-18108
 URL: https://issues.apache.org/jira/browse/ARROW-18108
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-18106:
---

Assignee: Ben Harkins

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a result with a string column 
> as result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be this is specific to timestamps, I don't directly see a similar 
> issue with eg {{"column": "A"}} and setting the schema to "column" being 
> int64.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16503) [C++] Can't concatenate extension arrays

2022-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16503:
---
Labels: good-second-issue pull-request-available  (was: good-second-issue)

> [C++] Can't concatenate extension arrays
> 
>
> Key: ARROW-16503
> URL: https://issues.apache.org/jira/browse/ARROW-16503
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Dewey Dunnington
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: good-second-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It looks like Arrays with an extension type can't be concatenated. From the R 
> bindings:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> arr <- vctrs_extension_array(1:10)
> concat_arrays(arr, arr)
> #> Error: NotImplemented: concatenation of integer(0)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> {code}
> This shows up more practically when using the query engine:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> table <- arrow_table(
>   group = rep(c("a", "b"), 5),
>   col1 = 1:10,
>   col2 = vctrs_extension_array(1:10)
> )
> tf <- tempfile()
> table |> dplyr::group_by(group) |> write_dataset(tf)
> open_dataset(tf) |>
>   dplyr::arrange(col1) |> 
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! NotImplemented: concatenation of extension
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025
>   Concatenate(values.chunks(), ctx->memory_pool())
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084
>   TakeCA(*table.column(j), indices, options, ctx)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527
>   impl_->DoFinish()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467
>   iterator_.Next()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337
>   ReadNext()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351
>   ToRecordBatches()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18081) [Go] Add Scalar Boolean Functions

2022-10-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-18081.
---
Resolution: Fixed

Issue resolved by pull request 14442
[https://github.com/apache/arrow/pull/14442]

> [Go] Add Scalar Boolean Functions
> -
>
> Key: ARROW-18081
> URL: https://issues.apache.org/jira/browse/ARROW-18081
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17954) [R] Update News for 10.0.0

2022-10-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17954:

Fix Version/s: 10.0.0
   (was: 11.0.0)

> [R] Update News for 10.0.0
> --
>
> Key: ARROW-17954
> URL: https://issues.apache.org/jira/browse/ARROW-17954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17954) [R] Update News for 10.0.0

2022-10-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-17954.
-
Fix Version/s: 11.0.0
   (was: 10.0.0)
   Resolution: Fixed

Issue resolved by pull request 14337
[https://github.com/apache/arrow/pull/14337]

> [R] Update News for 10.0.0
> --
>
> Key: ARROW-17954
> URL: https://issues.apache.org/jira/browse/ARROW-17954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17871) [Go] Implement Initial Scalar Binary Arithmetic Infrastructure

2022-10-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-17871:
--
Fix Version/s: 11.0.0

> [Go] Implement Initial Scalar Binary Arithmetic Infrastructure
> --
>
> Key: ARROW-17871
> URL: https://issues.apache.org/jira/browse/ARROW-17871
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> Uses add, add_checked, sub, and sub_checked as the initial implementation, 
> only for integral types and float32/float64.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17871) [Go] Implement Initial Scalar Binary Arithmetic Infrastructure

2022-10-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17871.
---
Resolution: Fixed

Issue resolved by pull request 14255
[https://github.com/apache/arrow/pull/14255]

> [Go] Implement Initial Scalar Binary Arithmetic Infrastructure
> --
>
> Key: ARROW-17871
> URL: https://issues.apache.org/jira/browse/ARROW-17871
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Uses add, add_checked, sub, and sub_checked as the initial implementation, 
> only for integral types and float32/float64.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17192) [Python] .to_pandas can't read_feather if a date column contains dates before 1677 and after 2262

2022-10-20 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621174#comment-17621174
 ] 

Alenka Frim commented on ARROW-17192:
-

This is a known issue: pandas, for now, only supports the {{datetime64}} data 
type in nanosecond resolution. So when you write to a feather file, the pandas 
dataframe gets converted to an arrow table, and that conversion infers a 
microsecond-resolution timestamp type. 

As a workaround you can use {{feather.read_table}} to read the feather file 
into an Arrow table and then use {{to_pandas}} to convert it into a pandas 
dataframe, but you will have to add the {{timestamp_as_object=True}} keyword so 
that PyArrow doesn't try to convert the timestamp to {{datetime64[ns]}}:
{code:python}
>>> feather.read_table("to_trash.feather").to_pandas(timestamp_as_object=True)
  date
0  1654-01-01 00:00:00
1  1920-01-01 00:00:00
{code}
But I think we should still pass through {{**kwargs}} in {{read_feather}} to 
{{to_pandas()}} so that one could specify the {{timestamp_as_object=True}} 
keyword there as well. So I am keeping the Jira open and will try to make a PR 
for it in the following week. Contributions are also welcome; I can help if 
needed.

> [Python] .to_pandas  can't read_feather if a date column contains dates 
> before 1677 and after 2262
> --
>
> Key: ARROW-17192
> URL: https://issues.apache.org/jira/browse/ARROW-17192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Any environment
>Reporter: Adrien Pacifico
>Priority: Major
>
> A feather file with a column containing dates earlier than 1677 or later than 
> 2262 cannot be read with pandas, due to the `.to_pandas` method.
> To reproduce the issue:
> {code:java}
> ### create feather file
> import pandas as pd
> from datetime import datetime
> df = pd.DataFrame({"date": [
> datetime.fromisoformat("1654-01-01"),
> datetime.fromisoformat("1920-01-01"),
> ],})
> df.to_feather("to_trash.feather")
> ### read feather file      
> from pyarrow.feather import read_feather
> read_feather("to_trash.feather")
> {code}
>  
> I think that the expected behavior would be to have an object column 
> containing datetime objects.
> I think that the problem comes from the _array_like_to_pandas method: 
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584]
> or from `_to_pandas()`
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742]
> or from `to_pandas`:
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621158#comment-17621158
 ] 

Joris Van den Bossche commented on ARROW-18106:
---

cc [~benpharkins]

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a result with a string column 
> as result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be this is specific to timestamps, I don't directly see a similar 
> issue with eg {{"column": "A"}} and setting the schema to "column" being 
> int64.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16503) [C++] Can't concatenate extension arrays

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-16503:
-

Assignee: Joris Van den Bossche

> [C++] Can't concatenate extension arrays
> 
>
> Key: ARROW-16503
> URL: https://issues.apache.org/jira/browse/ARROW-16503
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Dewey Dunnington
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: good-second-issue
>
> It looks like Arrays with an extension type can't be concatenated. From the R 
> bindings:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> arr <- vctrs_extension_array(1:10)
> concat_arrays(arr, arr)
> #> Error: NotImplemented: concatenation of integer(0)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> {code}
> This shows up more practically when using the query engine:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> table <- arrow_table(
>   group = rep(c("a", "b"), 5),
>   col1 = 1:10,
>   col2 = vctrs_extension_array(1:10)
> )
> tf <- tempfile()
> table |> dplyr::group_by(group) |> write_dataset(tf)
> open_dataset(tf) |>
>   dplyr::arrange(col1) |> 
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! NotImplemented: concatenation of extension
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025
>   Concatenate(values.chunks(), ctx->memory_pool())
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084
>   TakeCA(*table.column(j), indices, options, ctx)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527
>   impl_->DoFinish()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467
>   iterator_.Next()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337
>   ReadNext()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351
>   ToRecordBatches()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16503) [C++] Can't concatenate extension arrays

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-16503:
--
Fix Version/s: 11.0.0

> [C++] Can't concatenate extension arrays
> 
>
> Key: ARROW-16503
> URL: https://issues.apache.org/jira/browse/ARROW-16503
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Dewey Dunnington
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: good-second-issue
> Fix For: 11.0.0
>
>
> It looks like Arrays with an extension type can't be concatenated. From the R 
> bindings:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> arr <- vctrs_extension_array(1:10)
> concat_arrays(arr, arr)
> #> Error: NotImplemented: concatenation of integer(0)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> {code}
> This shows up more practically when using the query engine:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> table <- arrow_table(
>   group = rep(c("a", "b"), 5),
>   col1 = 1:10,
>   col2 = vctrs_extension_array(1:10)
> )
> tf <- tempfile()
> table |> dplyr::group_by(group) |> write_dataset(tf)
> open_dataset(tf) |>
>   dplyr::arrange(col1) |> 
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! NotImplemented: concatenation of extension
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
>   VisitTypeInline(*out_->type, this)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
>   ConcatenateImpl(data, pool).Concatenate(_data)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025
>   Concatenate(values.chunks(), ctx->memory_pool())
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084
>   TakeCA(*table.column(j), indices, options, ctx)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527
>   impl_->DoFinish()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467
>   iterator_.Next()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337
>   ReadNext()
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351
>   ToRecordBatches()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0

2022-10-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621118#comment-17621118
 ] 

Neal Richardson commented on ARROW-18102:
-

We could work around this in R but it seems reasonable that this should be 
fixed in C++. What do you think [~westonpace]?
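
For reference, the aggregate kernels already disagree on empty input on the 
Python side; a minimal sketch (assuming the dplyr translation bottoms out in a 
null-producing aggregate, which I haven't verified):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

empty = pa.array([], type=pa.float64())

pc.count(empty)             # Int64Scalar of 0
pc.sum(empty)               # null scalar, because min_count=1 by default
pc.sum(empty, min_count=0)  # 0.0
{code}

If the count query ends up in a sum-like kernel, forcing {{min_count=0}} (or 
coalescing the null to 0) would be the C++-side fix.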

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ---
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
>Reporter: Adam Black
>Priority: Major
>
> I'm using dplyr with FileSystemDataset objects. The expected behavior is 
> similar to (or the same as) dataframe behavior. When the FileSystemDataset has 
> zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would 
> expect the result to be 0.
>  
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> path <- tempfile(fileext = ".feather")
> zero_row_dataset <- cars %>% filter(dist < 0)
> # expected behavior
> zero_row_dataset %>% 
>   count()
> #>   n
> #> 1 0
> zero_row_dataset %>% 
>   tally()
> #>   n
> #> 1 0
> nrow(zero_row_dataset)
> #> [1] 0
> # now test behavior with a FileSystemDataset
> write_feather(zero_row_dataset, path)
> ds <- open_dataset(path, format = "feather")
> ds
> #> FileSystemDataset with 1 Feather file
> #> speed: double
> #> dist: double
> #> 
> #> See $metadata for additional Schema metadata
> # actual behavior
> ds %>% 
>   count() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> ds %>% 
>   tally() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> nrow(ds) # works as expected
> #> [1] 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17954) [R] Update News for 10.0.0

2022-10-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621117#comment-17621117
 ] 

Neal Richardson commented on ARROW-17954:
-

It's ready to merge now.

> [R] Update News for 10.0.0
> --
>
> Key: ARROW-17954
> URL: https://issues.apache.org/jira/browse/ARROW-17954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays

2022-10-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621107#comment-17621107
 ] 

Joris Van den Bossche commented on ARROW-18013:
---

This should indeed certainly work (and shouldn't be difficult: it should "just" 
concatenate the storage arrays). It seems we already have another issue about 
this (using an R example): ARROW-16503. So I will close this one as a duplicate 
(and will also take a look at fixing it).

> [C++][Python] Cannot concatenate extension arrays
> -
>
> Key: ARROW-18013
> URL: https://issues.apache.org/jira/browse/ARROW-18013
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 9.0.0
>Reporter: Chang She
>Priority: Major
>
> `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raises exception for 
> extension arrays.
> https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440
> Quick example:
> ```
> In [1]: import pyarrow as pa
> In [2]: class LabelType(pa.ExtensionType):
>...: 
>...:  def __init__(self):
>...:  super(LabelType, self).__init__(pa.string(), "label")
>...: 
>...:  def __arrow_ext_serialize__(self):
>...:  return b""
>...: 
>...:  @classmethod
>...:  def __arrow_ext_deserialize__(cls, storage_type, serialized):
>...:  return LabelType()
>...: 
> In [3]: import numpy as np
> In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('a', 1000)))
> In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('b', 1000)))
> In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks()
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
> Cell In [6], line 1
> > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in 
> pyarrow.lib.ChunkedArray.combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in 
> pyarrow.lib.concat_arrays()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in 
> pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
> pyarrow.lib.check_status()
> ArrowNotImplementedError: concatenation of extension>
> ```
> Would it be possible to concatenate the storage and the "re-box" to the 
> ExtensionType?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18013:
--
Labels: extension-type  (was: )

> [C++][Python] Cannot concatenate extension arrays
> -
>
> Key: ARROW-18013
> URL: https://issues.apache.org/jira/browse/ARROW-18013
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 9.0.0
>Reporter: Chang She
>Priority: Major
>  Labels: extension-type
>
> `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raises exception for 
> extension arrays.
> https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440
> Quick example:
> ```
> In [1]: import pyarrow as pa
> In [2]: class LabelType(pa.ExtensionType):
>...: 
>...:  def __init__(self):
>...:  super(LabelType, self).__init__(pa.string(), "label")
>...: 
>...:  def __arrow_ext_serialize__(self):
>...:  return b""
>...: 
>...:  @classmethod
>...:  def __arrow_ext_deserialize__(cls, storage_type, serialized):
>...:  return LabelType()
>...: 
> In [3]: import numpy as np
> In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('a', 1000)))
> In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('b', 1000)))
> In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks()
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
> Cell In [6], line 1
> > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in 
> pyarrow.lib.ChunkedArray.combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in 
> pyarrow.lib.concat_arrays()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in 
> pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
> pyarrow.lib.check_status()
> ArrowNotImplementedError: concatenation of extension>
> ```
> Would it be possible to concatenate the storage and the "re-box" to the 
> ExtensionType?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-18013.
-
Resolution: Duplicate

> [C++][Python] Cannot concatenate extension arrays
> -
>
> Key: ARROW-18013
> URL: https://issues.apache.org/jira/browse/ARROW-18013
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 9.0.0
>Reporter: Chang She
>Priority: Major
>  Labels: extension-type
>
> `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raises exception for 
> extension arrays.
> https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440
> Quick example:
> ```
> In [1]: import pyarrow as pa
> In [2]: class LabelType(pa.ExtensionType):
>...: 
>...:  def __init__(self):
>...:  super(LabelType, self).__init__(pa.string(), "label")
>...: 
>...:  def __arrow_ext_serialize__(self):
>...:  return b""
>...: 
>...:  @classmethod
>...:  def __arrow_ext_deserialize__(cls, storage_type, serialized):
>...:  return LabelType()
>...: 
> In [3]: import numpy as np
> In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('a', 1000)))
> In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), 
> pa.array(np.repeat('b', 1000)))
> In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks()
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
> Cell In [6], line 1
> > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in 
> pyarrow.lib.ChunkedArray.combine_chunks()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in 
> pyarrow.lib.concat_arrays()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in 
> pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
> pyarrow.lib.check_status()
> ArrowNotImplementedError: concatenation of extension>
> ```
> Would it be possible to concatenate the storage and the "re-box" to the 
> ExtensionType?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17317) [Release][Docs] Normalize previous document version directory

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17317.
---
Resolution: Fixed

Issue resolved by pull request 14457
[https://github.com/apache/arrow/pull/14457]

> [Release][Docs] Normalize previous document version directory
> -
>
> Key: ARROW-17317
> URL: https://issues.apache.org/jira/browse/ARROW-17317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We should use X.Y instead of X.Y.Z (e.g.: 8.0 not 8.0.1) for previous version 
> document directory.
> See also: 
> https://github.com/apache/arrow/blob/apache-arrow-9.0.0/dev/release/post-08-docs.sh#L84
> The script should accept X.Y.Z such as 8.0.1 and normalize it to X.Y. It'll 
> reduce human error.
> See also:
> * https://github.com/apache/arrow-site/pull/228#issuecomment-1205997067
> * https://github.com/apache/arrow-site/pull/228#issuecomment-1206085602
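
The normalization itself is a one-liner; a sketch in Python for illustration 
(the release script itself is shell):

{code:python}
def normalize_docs_version(version: str) -> str:
    # "8.0.1" -> "8.0"; "8.0" stays "8.0"
    return ".".join(version.split(".")[:2])
{code}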



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17317) [Release][Docs] Normalize previous document version directory

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17317:
--
Fix Version/s: 10.0.0
   (was: 11.0.0)

> [Release][Docs] Normalize previous document version directory
> -
>
> Key: ARROW-17317
> URL: https://issues.apache.org/jira/browse/ARROW-17317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We should use X.Y instead of X.Y.Z (e.g.: 8.0 not 8.0.1) for previous version 
> document directory.
> See also: 
> https://github.com/apache/arrow/blob/apache-arrow-9.0.0/dev/release/post-08-docs.sh#L84
> The script should accept X.Y.Z such as 8.0.1 and normalize it to X.Y. It'll 
> reduce human error.
> See also:
> * https://github.com/apache/arrow-site/pull/228#issuecomment-1205997067
> * https://github.com/apache/arrow-site/pull/228#issuecomment-1206085602



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments

2022-10-20 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-17966:


Assignee: Ben Kietzman  (was: Weston Pace)

> [C++] Adjust to new format for Substrait optional arguments
> ---
>
> Key: ARROW-17966
> URL: https://issues.apache.org/jira/browse/ARROW-17966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Substrait is presumably going to change how it defines optional arguments in 
> https://github.com/substrait-io/substrait/pull/342 .
> This change will require a corresponding change in Acero (this should also 
> bring Acero in line with Ibis & Isthmus).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments

2022-10-20 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-17966:


Assignee: Weston Pace  (was: Ben Kietzman)

> [C++] Adjust to new format for Substrait optional arguments
> ---
>
> Key: ARROW-17966
> URL: https://issues.apache.org/jira/browse/ARROW-17966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Substrait is presumably going to change how it defines optional arguments in 
> https://github.com/substrait-io/substrait/pull/342 .
> This change will require a corresponding change in Acero (this should also 
> bring Acero in line with Ibis & Isthmus).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments

2022-10-20 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-17966:


Assignee: Weston Pace

> [C++] Adjust to new format for Substrait optional arguments
> ---
>
> Key: ARROW-17966
> URL: https://issues.apache.org/jira/browse/ARROW-17966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Substrait is presumably going to change how it defines optional arguments in 
> https://github.com/substrait-io/substrait/pull/342 .
> This change will require a corresponding change in Acero (this should also 
> bring Acero in line with Ibis & Isthmus).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18107) [C++] Provide more informative error when (CSV/JSON) parsing fails

2022-10-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18107:
-

 Summary: [C++] Provide more informative error when (CSV/JSON) 
parsing fails
 Key: ARROW-18107
 URL: https://issues.apache.org/jira/browse/ARROW-18107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Related to ARROW-18106 (and derived from 
https://stackoverflow.com/questions/74138746/why-i-cant-parse-timestamp-in-pyarrow).
 

Assume you have the following code to read a JSON file with timestamps. The 
timestamp strings have a sub-second part, which fails to parse if you specify 
the column as a second-resolution timestamp:

{code:python}
import io
import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""

opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
pa.timestamp("s"))]), unexpected_field_behavior="ignore")
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

gives:

{code}
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
parse:2022-09-05T08:08:46.000
{code}

This error is expected, but I think it could be more informative about the 
reason why it failed parsing (because at first sight it looks like a proper 
timestamp string, so you might be left wondering why this is failing). 

(this might not be that straightforward, though, since there can be many 
reasons why the parsing is failing)
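
For context, the failure is only about the unit: the same string converts 
cleanly once the explicit type can represent the sub-second part (a sketch of 
that check, reusing the data above):

{code:python}
opts = json.ParseOptions(explicit_schema=pa.schema([("column",
pa.timestamp("ms"))]), unexpected_field_behavior="ignore")
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
# -> column: timestamp[ms]; it parses, so the error above is purely the "s" unit
{code}

A more informative message could therefore at least mention the target unit 
that the string did not fit.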







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17192) [Python] .to_pandas can't read_feather if a date column contains dates before 1677 and after 2262

2022-10-20 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-17192:

Description: 
A feather file with a column containing dates before 1677 or after 2262 
cannot be read with pandas, due to the `.to_pandas` method.

To reproduce the issue:
{code:python}
### create feather file
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"date": [
datetime.fromisoformat("1654-01-01"),
datetime.fromisoformat("1920-01-01"),
],})
df.to_feather("to_trash.feather")

### read feather file      
from pyarrow.feather import read_feather
read_feather("to_trash.feather")
{code}
 

I think that the expected behavior would be to have an object column containing 
datetime objects.

I think that the problem comes from the _array_like_to_pandas method: 
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584]

or  from `_to_pandas()`
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742]

or from `to_pandas`:
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673]

  was:
A feather file with a column containing dates before 1677 or after 2262 
cannot be read with pandas, due to the `.to_pandas` method.

To reproduce the issue:
{code:python}
### create feather file
import pandas as pd
from datetime import datetime df = pd.DataFrame({"date": [
datetime.fromisoformat("1654-01-01"),
datetime.fromisoformat("1920-01-01"),
],})
df.to_feather("to_trash.feather")

### read feather file      
from pyarrow.feather import read_feather
read_feather("to_trash.feather")
{code}
 

I think that the expected behavior would be to have an object column containing 
datetime objects.

I think that the problem comes from the _array_like_to_pandas method: 
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584]

or  from `_to_pandas()`
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742]

or from `to_pandas`:
[https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673]


> [Python] .to_pandas  can't read_feather if a date column contains dates 
> before 1677 and after 2262
> --
>
> Key: ARROW-17192
> URL: https://issues.apache.org/jira/browse/ARROW-17192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Any environment
>Reporter: Adrien Pacifico
>Priority: Major
>
> A feather file with a column containing dates before 1677 or after 2262 
> cannot be read with pandas, due to the `.to_pandas` method.
> To reproduce the issue:
> {code:python}
> ### create feather file
> import pandas as pd
> from datetime import datetime
> df = pd.DataFrame({"date": [
> datetime.fromisoformat("1654-01-01"),
> datetime.fromisoformat("1920-01-01"),
> ],})
> df.to_feather("to_trash.feather")
> ### read feather file      
> from pyarrow.feather import read_feather
> read_feather("to_trash.feather")
> {code}
>  
> I think that the expected behavior would be to have an object column 
> containing datetime objects.
> I think that the problem comes from the _array_like_to_pandas method: 
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584]
> or  from `_to_pandas()`
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742]
> or from `to_pandas`:
> [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673]
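
A possible interim workaround (a sketch, using the file from the reproduction 
above): read the file with pyarrow directly and ask {{to_pandas}} for Python 
{{datetime}} objects via {{timestamp_as_object}}:

{code:python}
import pyarrow.feather as feather

table = feather.read_table("to_trash.feather")
# timestamp_as_object=True yields datetime.datetime objects instead of
# overflowing datetime64[ns]
df = table.to_pandas(timestamp_as_object=True)
{code}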



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18106:
-

 Summary: [C++] JSON reader ignores explicit schema with default 
unexpected_field_behavior="infer"
 Key: ARROW-18106
 URL: https://issues.apache.org/jira/browse/ARROW-18106
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
between two options.

By default, when reading json, we _infer_ the data type of columns, and when 
specifying an explicit schema, we _also_ by default infer the type of columns 
that are not specified in the explicit schema. The docs for 
{{unexpected_field_behavior}}:

> How JSON fields outside of explicit_schema (if given) are treated

But it seems that if you specify a schema, and the parsing of one of the 
columns fails according to that schema, we still fall back to this default of 
inferring the data type (while I would have expected an error, since we should 
only infer for columns _not_ in the schema).

Example code using pyarrow:

{code:python}
import io
import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""

opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
pa.timestamp("s"))]))
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

The parsing fails here because there are milliseconds and the type is "s", but 
the explicit schema is ignored, and we get a result with a string column 
instead:

{code}
pyarrow.Table
column: string

column: [["2022-09-05T08:08:46.000"]]
{code}

But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
expected parse error:

{code:python}
opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
pa.timestamp("s"))]), unexpected_field_behavior="ignore")
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

gives

{code}
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
parse:2022-09-05T08:08:46.000
{code}


It might be that this is specific to timestamps; I don't directly see a similar 
issue with e.g. {{"column": "A"}} and setting the schema to "column" being int64.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17308) ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API

2022-10-20 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim closed ARROW-17308.
---
Resolution: Duplicate

> ValueError: Keyword 'validate_schema' is not yet supported with the new 
> Dataset API
> ---
>
> Key: ARROW-17308
> URL: https://issues.apache.org/jira/browse/ARROW-17308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Abderrahmane Jaidi
>Priority: Major
>  Labels: dataset-parquet-legacy, dataset-parquet-read
>
> Documentation for PyArrow 6.x and 7.x both indicate that the 
> `validate_schema` argument is supported in the `ParquetDataset` class. Yet 
> passing that argument to an instance results in:
> ValueError: Keyword 'validate_schema' is not yet supported with the new 
> Dataset API
> Code:
> {code:python}
> parquet_dataset = pyarrow.parquet.ParquetDataset(
>     path_or_paths=paths,
>     validate_schema=validate_schema,
>     filesystem=filesystem,
>     partitioning=partitioning,
>     use_legacy_dataset=False,
> ){code}
> Docs link:
> [https://arrow.apache.org/docs/6.0/python/generated/pyarrow.parquet.ParquetDataset.html]
> [https://arrow.apache.org/docs/7.0/python/generated/pyarrow.parquet.ParquetDataset.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17308) ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API

2022-10-20 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621081#comment-17621081
 ] 

Alenka Frim commented on ARROW-17308:
-

Maybe it is worth mentioning that even if the {{validate_schema}} argument is 
not supplied, the Arrow C++ implementation validates the schema for all the 
fragments of the dataset and returns a validation error in case of mismatch.

> ValueError: Keyword 'validate_schema' is not yet supported with the new 
> Dataset API
> ---
>
> Key: ARROW-17308
> URL: https://issues.apache.org/jira/browse/ARROW-17308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Abderrahmane Jaidi
>Priority: Major
>  Labels: dataset-parquet-legacy, dataset-parquet-read
>
> Documentation for PyArrow 6.x and 7.x both indicate that the 
> `validate_schema` argument is supported in the `ParquetDataset` class. Yet 
> passing that argument to an instance results in:
> ValueError: Keyword 'validate_schema' is not yet supported with the new 
> Dataset API
> Code:
> {code:python}
> parquet_dataset = pyarrow.parquet.ParquetDataset(
>     path_or_paths=paths,
>     validate_schema=validate_schema,
>     filesystem=filesystem,
>     partitioning=partitioning,
>     use_legacy_dataset=False,
> ){code}
> Docs link:
> [https://arrow.apache.org/docs/6.0/python/generated/pyarrow.parquet.ParquetDataset.html]
> [https://arrow.apache.org/docs/7.0/python/generated/pyarrow.parquet.ParquetDataset.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17200) [Python][Parquet] support partitioning by Pandas DataFrame index

2022-10-20 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim closed ARROW-17200.
---
Resolution: Invalid

> [Python][Parquet] support partitioning by Pandas DataFrame index
> 
>
> Key: ARROW-17200
> URL: https://issues.apache.org/jira/browse/ARROW-17200
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Parquet, Python
>Reporter: Gregory Werbin
>Priority: Minor
>
> In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" 
> index level, one might want to partition by that index level when saving the 
> data frame to Parquet format. This is currently not possible; you need to 
> manually reset the index before writing, and re-add the index after reading. 
> It would be very useful if you could supply the name of an index level to 
> {{partition_cols}} instead of (or ideally in addition to) a data column name.
> I originally posted this on the Pandas issue tracker 
> ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke 
> looked at the code and figured out that the partitioning functionality was 
> implemented entirely in PyArrow, and that the change would need to happen 
> within PyArrow itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17200) [Python][Parquet] support partitioning by Pandas DataFrame index

2022-10-20 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621068#comment-17621068
 ] 

Alenka Frim commented on ARROW-17200:
-

This should be possible.

When transforming a pandas DataFrame into an Arrow table, the multi-index is 
converted into columns. These columns can then be used as {{partition_cols}} 
for writing Parquet files into partitions. Also, looking at the code in the 
pandas codebase, the correct method is selected if {{partition_cols}} are 
supplied:

[https://github.com/pandas-dev/pandas/blob/56d82a9bd654e91d14596e82e4d9c82215fa5bc8/pandas/io/parquet.py#L195-L209]

which is {{write_to_dataset}}. A working example:

{code:python}
import pandas as pd
import numpy as np

# Creating a dataframe with MultiIndex
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(data={'randn':  np.random.randn(8)}, index=index)

# writing to a partitioned dataset
df.to_parquet(path='dataset_name', partition_cols=["first", "second"])

# inspecting the pieces
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dataset_name', use_legacy_dataset=False)
dataset.fragments
# [<pyarrow.dataset.ParquetFileFragment ...>,
#  ... (one fragment per first=... / second=... partition directory)]

{code}

> [Python][Parquet] support partitioning by Pandas DataFrame index
> 
>
> Key: ARROW-17200
> URL: https://issues.apache.org/jira/browse/ARROW-17200
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Parquet, Python
>Reporter: Gregory Werbin
>Priority: Minor
>
> In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" 
> index level, one might want to partition by that index level when saving the 
> data frame to Parquet format. This is currently not possible; you need to 
> manually reset the index before writing, and re-add the index after reading. 
> It would be very useful if you could supply the name of an index level to 
> {{partition_cols}} instead of (or ideally in addition to) a data column name.
> I originally posted this on the Pandas issue tracker 
> ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke 
> looked at the code and figured out that the partitioning functionality was 
> implemented entirely in PyArrow, and that the change would need to happen 
> within PyArrow itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18064) [Python] Error of wrong number of rows read from Parquet file

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18064:
--
Summary: [Python] Error of wrong number of rows read from Parquet file  
(was: [Python] Error of wrong number of rows read from file)

> [Python] Error of wrong number of rows read from Parquet file
> -
>
> Key: ARROW-18064
> URL: https://issues.apache.org/jira/browse/ARROW-18064
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 7.0.0, 7.0.1, 8.0.0, 8.0.1, 9.0.0
> Environment: Python Info
> 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit 
> (AMD64)]
> Pyarrow Info
> 6.0.1
> Platform Info
> Windows-10-10.0.19042-SP0
> Windows
> 10
> 10.0.19042
> 19042
> AMD64
>Reporter: Blake erickson
>Priority: Major
> Attachments: badplug.parquet, readBadParquet.py, screenshot-1.png
>
>
> On versions greater than 6.0.1, reading tables fails with an error saying 
> expected length n, got n=1 rows.
>  
> Tables can be read fine column by column, or with a fixed number of rows 
> matching the metadata. The file reads correctly in version 6.0.1.
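
A sketch of the column-by-column workaround mentioned in the report (assuming 
the attached {{badplug.parquet}}):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

f = pq.ParquetFile("badplug.parquet")
names = f.schema_arrow.names
# Reading one column at a time sidesteps the full-table length check
columns = [f.read(columns=[name]).column(0) for name in names]
table = pa.table(dict(zip(names, columns)))
{code}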



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns

2022-10-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621059#comment-17621059
 ] 

Joris Van den Bossche commented on ARROW-17360:
---

For reference, we had a similar issue for Feather, where the underlying C++ 
reader always follows the order of the schema (ARROW-8641). And there we solved 
this by reordering the columns on the Python side in 
{{pyarrow.feather.read_table}} (as Alenka linked above).
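
The same client-side fix would be a one-line reorder after the read; a sketch, 
assuming {{Table.select}} (which returns columns in the requested order):

{code:python}
import pyarrow.orc as orc

orc_file = orc.ORCFile("abc")
table = orc_file.read(columns=["b", "a"]).select(["b", "a"])
# table.schema is now b: string, a: int64
{code}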

> [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
> ---
>
> Key: ARROW-17360
> URL: https://issues.apache.org/jira/browse/ARROW-17360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.1
>Reporter: Matthew Roeschke
>Priority: Major
>  Labels: orc
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>  
> {code:python}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> # reordered to a, b
> In [7]: orc_file.read(columns=['b', 'a'])
> Out[7]:
> pyarrow.Table
> a: int64
> b: string
> 
> a: [[1,2,3]]
> b: [["a","b","c"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17540) [Python] Can not refer to field in a list of structs

2022-10-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17540:
--
Fix Version/s: 11.0.0

> [Python] Can not refer to field in a list of structs 
> -
>
> Key: ARROW-17540
> URL: https://issues.apache.org/jira/browse/ARROW-17540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Lei (Eddy) Xu
>Priority: Major
> Fix For: 11.0.0
>
>
> When the dataset has nested structs, "list",  we cannot use 
> `pyarrow.field(..)` to get a reference to the sub-field of the struct.
>  
> For example
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pandas as pd
> schema = pa.schema(
> [
> pa.field(
> "objects",
> pa.list_(
> pa.struct(
> [
> pa.field("name", pa.utf8()),
> pa.field("attr1", pa.float32()),
> pa.field("attr2", pa.int32()),
> ]
> )
> ),
> )
> ]
> )
> table = pa.Table.from_pandas(
> pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
> )
> print(table)
> dataset = ds.dataset(table)
> print(dataset)
> dataset.scanner(columns=["objects.attr2"]).to_table()
> {code}
> which throws exception:
> {noformat}
> Traceback (most recent call last):
>   File "foo.py", line 31, in 
> dataset.scanner(columns=["objects.attr2"]).to_table()
>   File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
>   File "pyarrow/_dataset.pyx", line 2356, in 
> pyarrow._dataset.Scanner.from_dataset
>   File "pyarrow/_dataset.pyx", line 2202, in 
> pyarrow._dataset._populate_builder
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in 
> objects: list>
> __fragment_index: int32
> __batch_index: int32
> __last_in_fragment: bool
> __filename: string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18105) Arrow Flight SegFault

2022-10-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621048#comment-17621048
 ] 

David Li commented on ARROW-18105:
--

Duplicate of ARROW-17822?

> Arrow Flight SegFault
> -
>
> Key: ARROW-18105
> URL: https://issues.apache.org/jira/browse/ARROW-18105
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC
>Affects Versions: 9.0.0
>Reporter: Ziheng Wang
>Priority: Minor
>
> A typo in the gRPC endpoint results in a segfault. It should probably result 
> in a warning instead.
> ziheng@ziheng:~$ python3
> Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
> [GCC 7.3.0] :: Anaconda, Inc. on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow.flight
> >>> flight_client = pyarrow.flight.connect("grcp://0.0.0.0:5005")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17611) [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow

2022-10-20 Thread Miles Granger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger updated ARROW-17611:
--
Summary: [Rust] Boolean column data saved with V2 from arrow-rs unreadable 
by pyarrow  (was: Boolean column data saved with V2 from arrow-rs unreadable by 
pyarrow)

> [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
> 
>
> Key: ARROW-17611
> URL: https://issues.apache.org/jira/browse/ARROW-17611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 9.0.0
> Environment: Rust:
> "arrow" = "21.0.0"
> "parquet" = "21.0.0"
> Python:
> parquet-tools 0.2.11
> pyarrow  9.0.0
>Reporter: Kamil Skalski
>Priority: Minor
> Attachments: arrow_boolean.tar.gz, main.rs, x.parquet
>
>
> I'm generating Parquet V2 files with boolean column, but when trying to read 
> them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting 
> {code:java}
> OSError: Unknown encoding type. {code}
> To reproduce run following Rust program:
> {code:java}
> use arrow::json;
> use std::fs::File;
> const DATA: &'static str = r#"
>    {"x": 1, "y": false}
> "#;
> fn main() -> anyhow::Result<()> {
>    let mut json = json::ReaderBuilder::new().infer_schema(Some(2))
>       .build(std::io::Cursor::new(DATA.as_bytes()))?;
>let batch = json.next()?.unwrap();   
>let out_file = File::create("x.parquet")?;
>    let props = parquet::file::properties::WriterProperties::builder()
>   .set_writer_version(
>   parquet::file::properties::WriterVersion::PARQUET_2_0)
>       .build();
>    let mut writer = parquet::arrow::ArrowWriter::try_new(
>   out_file, batch.schema(), Some(props))?;
>    writer.write(&batch)?;
>    writer.close()?;
>    Ok(())
> } {code}
> and try to show the output _x.parquet_ file
> {code:java}
> $ cargo run
> $ parquet-tools show x.parquet 
> Traceback (most recent call last):
>   File "/home/nazgul/.local/bin/parquet-tools", line 8, in 
>     sys.exit(main())
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line 
> 26, in main
>     args.handler(args)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py",
>  line 59, in _cli
>     with get_datafame_from_objs(pfs, args.head) as df:
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 190, in get_datafame_from_objs
>     df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
>   File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context
>     result = _cm_type.__enter__(cm)
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 71, in get_dataframe
>     yield pq.read_table(local_path).to_pandas()
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2827, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2473, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> OSError: Unknown encoding type. {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17611) Boolean column data saved with V2 from arrow-rs unreadable by pyarrow

2022-10-20 Thread Miles Granger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621039#comment-17621039
 ] 

Miles Granger edited comment on ARROW-17611 at 10/20/22 11:42 AM:
--

It seems `arrow-rs` defaults to RLE for the boolean array, but that can be 
changed to a different encoding through 
[set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding]
 on the Builder.

Other good news is that pyarrow 10.x should read the given file (it does work, 
checked on master), thanks to ARROW-18031 being closed.

---

EDIT: If this answers the question/issue, can we close this?


was (Author: JIRAUSER293894):
It seems `arrow-rs` defaults to RLE for the boolean array, but that can be 
changed to a different encoding through 
[set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding]
 on the Builder.

Other good news is that pyarrow 10.x should read the given file (it does work, 
checked on master), thanks to ARROW-18031 being closed.

> Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
> -
>
> Key: ARROW-17611
> URL: https://issues.apache.org/jira/browse/ARROW-17611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 9.0.0
> Environment: Rust:
> "arrow" = "21.0.0"
> "parquet" = "21.0.0"
> Python:
> parquet-tools 0.2.11
> pyarrow  9.0.0
>Reporter: Kamil Skalski
>Priority: Minor
> Attachments: arrow_boolean.tar.gz, main.rs, x.parquet
>
>
> I'm generating Parquet V2 files with boolean column, but when trying to read 
> them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting 
> {code:java}
> OSError: Unknown encoding type. {code}
> To reproduce run following Rust program:
> {code:java}
> use arrow::json;
> use std::fs::File;
> const DATA: &'static str = r#"
>    {"x": 1, "y": false}
> "#;
> fn main() -> anyhow::Result<()> {
>    let mut json = json::ReaderBuilder::new().infer_schema(Some(2))
>       .build(std::io::Cursor::new(DATA.as_bytes()))?;
>let batch = json.next()?.unwrap();   
>let out_file = File::create("x.parquet")?;
>    let props = parquet::file::properties::WriterProperties::builder()
>   .set_writer_version(
>   parquet::file::properties::WriterVersion::PARQUET_2_0)
>       .build();
>    let mut writer = parquet::arrow::ArrowWriter::try_new(
>   out_file, batch.schema(), Some(props))?;
>    writer.write(&batch)?;
>    writer.close()?;
>    Ok(())
> } {code}
> and try to show the output _x.parquet_ file
> {code:java}
> $ cargo run
> $ parquet-tools show x.parquet 
> Traceback (most recent call last):
>   File "/home/nazgul/.local/bin/parquet-tools", line 8, in 
>     sys.exit(main())
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line 
> 26, in main
>     args.handler(args)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py",
>  line 59, in _cli
>     with get_datafame_from_objs(pfs, args.head) as df:
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 190, in get_datafame_from_objs
>     df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
>   File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context
>     result = _cm_type.__enter__(cm)
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 71, in get_dataframe
>     yield pq.read_table(local_path).to_pandas()
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2827, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2473, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> OSError: Unknown encoding type. {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17611) Boolean column data saved with V2 from arrow-rs unreadable by pyarrow

2022-10-20 Thread Miles Granger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621039#comment-17621039
 ] 

Miles Granger commented on ARROW-17611:
---

It seems `arrow-rs` defaults to RLE for the boolean array, but that can be 
changed to a different encoding through 
[set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding]
 on the Builder.

Other good news is that pyarrow 10.x should read the given file (it does work, 
checked on master), thanks to ARROW-18031 being closed.

> Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
> -
>
> Key: ARROW-17611
> URL: https://issues.apache.org/jira/browse/ARROW-17611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 9.0.0
> Environment: Rust:
> "arrow" = "21.0.0"
> "parquet" = "21.0.0"
> Python:
> parquet-tools 0.2.11
> pyarrow  9.0.0
>Reporter: Kamil Skalski
>Priority: Minor
> Attachments: arrow_boolean.tar.gz, main.rs, x.parquet
>
>
> I'm generating Parquet V2 files with boolean column, but when trying to read 
> them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting 
> {code:java}
> OSError: Unknown encoding type. {code}
> To reproduce run following Rust program:
> {code:java}
> use arrow::json;
> use std::fs::File;
> const DATA: &'static str = r#"
>    {"x": 1, "y": false}
> "#;
> fn main() -> anyhow::Result<()> {
>    let mut json = json::ReaderBuilder::new().infer_schema(Some(2))
>       .build(std::io::Cursor::new(DATA.as_bytes()))?;
>let batch = json.next()?.unwrap();   
>let out_file = File::create("x.parquet")?;
>    let props = parquet::file::properties::WriterProperties::builder()
>   .set_writer_version(
>   parquet::file::properties::WriterVersion::PARQUET_2_0)
>       .build();
>    let mut writer = parquet::arrow::ArrowWriter::try_new(
>   out_file, batch.schema(), Some(props))?;
>    writer.write(&batch)?;
>    writer.close()?;
>    Ok(())
> } {code}
> and try to show the output _x.parquet_ file
> {code:java}
> $ cargo run
> $ parquet-tools show x.parquet 
> Traceback (most recent call last):
>   File "/home/nazgul/.local/bin/parquet-tools", line 8, in 
>     sys.exit(main())
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line 
> 26, in main
>     args.handler(args)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py",
>  line 59, in _cli
>     with get_datafame_from_objs(pfs, args.head) as df:
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 190, in get_datafame_from_objs
>     df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
>   File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context
>     result = _cm_type.__enter__(cm)
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py",
>  line 71, in get_dataframe
>     yield pq.read_table(local_path).to_pandas()
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2827, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
>  line 2473, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> OSError: Unknown encoding type. {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17540) [Python] Can not refer to field in a list of structs

2022-10-20 Thread Miles Granger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621037#comment-17621037
 ] 

Miles Granger commented on ARROW-17540:
---

I should also mention that if you are only after a single list element, you can 
use the following, albeit ugly, bit of code until it's properly fixed:

{code:python}
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset.to_table(columns={
    'attr2': pc.struct_field(
        pc.list_element(ds.field("objects"), ds.scalar(0)),
        [2])  # field index of "attr2" within the struct
})
{code}

> [Python] Can not refer to field in a list of structs 
> -
>
> Key: ARROW-17540
> URL: https://issues.apache.org/jira/browse/ARROW-17540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Lei (Eddy) Xu
>Priority: Major
>
> When the dataset has nested structs, "list",  we cannot use 
> `pyarrow.field(..)` to get a reference to the sub-field of the struct.
>  
> For example
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pandas as pd
> schema = pa.schema(
> [
> pa.field(
> "objects",
> pa.list_(
> pa.struct(
> [
> pa.field("name", pa.utf8()),
> pa.field("attr1", pa.float32()),
> pa.field("attr2", pa.int32()),
> ]
> )
> ),
> )
> ]
> )
> table = pa.Table.from_pandas(
> pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
> )
> print(table)
> dataset = ds.dataset(table)
> print(dataset)
> dataset.scanner(columns=["objects.attr2"]).to_table()
> {code}
> which throws exception:
> {noformat}
> Traceback (most recent call last):
>   File "foo.py", line 31, in 
> dataset.scanner(columns=["objects.attr2"]).to_table()
>   File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
>   File "pyarrow/_dataset.pyx", line 2356, in 
> pyarrow._dataset.Scanner.from_dataset
>   File "pyarrow/_dataset.pyx", line 2202, in 
> pyarrow._dataset._populate_builder
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in 
> objects: list>
> __fragment_index: int32
> __batch_index: int32
> __last_in_fragment: bool
> __filename: string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns

2022-10-20 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621017#comment-17621017
 ] 

Alenka Frim edited comment on ARROW-17360 at 10/20/22 11:21 AM:


Thank you for reporting!

I would say this is not the expected behaviour. If we look at the {{parquet}} 
or {{feather}} formats, the {{read}} methods preserve the ordering of the 
selected columns:
{code:python}
import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}
From what I understand, looking at the code in 
[pyarrow/_orc.pyx|https://github.com/apache/arrow/blob/962121062e4b13c148f24a6d4fa4b1a2f1be0d88/python/pyarrow/_orc.pyx#L379-L382]
 and 
[arrow/adapters/orc/adapter.cc|https://github.com/apache/arrow/blob/183517c8baad039c0100687c8a405bd4d8b404a7/cpp/src/arrow/adapters/orc/adapter.cc#L336-L341],
 I think the behaviour comes from [Apache 
ORC|https://github.com/apache/orc/blob/7f7362bdcecfd48e5ff9f4a3255100e3ea724f6f/c%2B%2B/include/orc/Reader.hh#L158-L165]
 and can therefore be opened as an issue there (about following the order in 
the original schema).

Nevertheless, there are two options for making this work correctly:
 * add a re-ordering in {{pyarrow}}, as is done in the [feather 
implementation|https://github.com/apache/arrow/blob/0f91e684ddda3dfd11d376c2755bbc3071c3099d/python/pyarrow/feather.py#L280-L281] (a sketch of this follows after the code block below).
 * Even better would be if {{pandas}} used the new {{dataset}} API to read 
{{orc}} files, like so:
{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}
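
A minimal sketch of the first option, re-ordering after the read the way the 
feather reader does (the {{read_orc_ordered}} helper and its name are mine, 
for illustration only):
{code:python}
import pyarrow.orc as orc

def read_orc_ordered(path, columns=None):
    table = orc.ORCFile(path).read(columns=columns)
    if columns is not None:
        # Table.select returns the columns in the requested order,
        # regardless of the order the ORC reader produced them in.
        table = table.select(columns)
    return table

read_orc_ordered("example.orc", columns=['b', 'a'])
{code}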


was (Author: alenkaf):
Thank you for reporting!

I would say this is not the expected behaviour. If we look at the {{parquet}} 
or {{feather}} formats, the {{read}} methods preserve the ordering of the 
selected columns:
{code:python}
import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}
From what I understand, looking at the code in 
[pyarrow/_orc.pyx|https://github.com/apache/arrow/blob/962121062e4b13c148f24a6d4fa4b1a2f1be0d88/python/pyarrow/_orc.pyx#L379-L382]
 and 
[arrow/adapters/orc/adapter.cc|https://github.com/apache/arrow/blob/183517c8baad039c0100687c8a405bd4d8b404a7/cpp/src/arrow/adapters/orc/adapter.cc#L336-L341],
 I think the behaviour comes from [Apache 
ORC|https://github.com/apache/orc/blob/7f7362bdcecfd48e5ff9f4a3255100e3ea724f6f/c%2B%2B/include/orc/Reader.hh#L158-L165]
 and can therefore be opened as an issue there.

Nevertheless, there are two options for making this work correctly:
 * add a re-ordering in {{pyarrow}}, as is done in the [feather 
implementation|https://github.com/apache/arrow/blob/0f91e684ddda3dfd11d376c2755bbc3071c3099d/python/pyarrow/feather.py#L280-L281].
 * Even better would be if {{pandas}} used the new {{dataset}} API to read 
{{orc}} files, like so:
{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}

> [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
> ---
>
> Key: ARROW-17360
> URL: https://issues.apache.org/jira/browse/ARROW-17360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.1
>Reporter: Matthew Roeschke
>Priority: Major
>  Labels: orc
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>  
> {code:python}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>

[jira] [Commented] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns

2022-10-20 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621017#comment-17621017
 ] 

Alenka Frim commented on ARROW-17360:
-

Thank you for reporting!

I would say this is not the expected behaviour. If we look at the {{parquet}} 
or {{feather}} formats, the {{read}} methods preserve the ordering of the 
selected columns:
{code:python}
import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}
From what I understand, looking at the code in 
[pyarrow/_orc.pyx|https://github.com/apache/arrow/blob/962121062e4b13c148f24a6d4fa4b1a2f1be0d88/python/pyarrow/_orc.pyx#L379-L382]
 and 
[arrow/adapters/orc/adapter.cc|https://github.com/apache/arrow/blob/183517c8baad039c0100687c8a405bd4d8b404a7/cpp/src/arrow/adapters/orc/adapter.cc#L336-L341],
 I think the behaviour comes from [Apache 
ORC|https://github.com/apache/orc/blob/7f7362bdcecfd48e5ff9f4a3255100e3ea724f6f/c%2B%2B/include/orc/Reader.hh#L158-L165]
 and can therefore be opened as an issue there.

Nevertheless, there are two options for making this work correctly:
 * add a re-ordering in {{pyarrow}}, as is done in the [feather 
implementation|https://github.com/apache/arrow/blob/0f91e684ddda3dfd11d376c2755bbc3071c3099d/python/pyarrow/feather.py#L280-L281].
 * Even better would be if {{pandas}} used the new {{dataset}} API to read 
{{orc}} files, like so:
{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# 
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}

> [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
> ---
>
> Key: ARROW-17360
> URL: https://issues.apache.org/jira/browse/ARROW-17360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.1
>Reporter: Matthew Roeschke
>Priority: Major
>  Labels: orc
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>  
> {code:python}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> # reordered to a, b
> In [7]: orc_file.read(columns=['b', 'a'])
> Out[7]:
> pyarrow.Table
> a: int64
> b: string
> 
> a: [[1,2,3]]
> b: [["a","b","c"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18064) [Python] Error of wrong number of rows read from file

2022-10-20 Thread Miles Granger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger updated ARROW-18064:
--
Summary: [Python] Error of wrong number of rows read from file  (was: Error 
of wrong number of rows read from file)

> [Python] Error of wrong number of rows read from file
> -
>
> Key: ARROW-18064
> URL: https://issues.apache.org/jira/browse/ARROW-18064
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 7.0.0, 7.0.1, 8.0.0, 8.0.1, 9.0.0
> Environment: Python Info
> 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit 
> (AMD64)]
> Pyarrow Info
> 6.0.1
> Platform Info
> Windows-10-10.0.19042-SP0
> Windows
> 10
> 10.0.19042
> 19042
> AMD64
>Reporter: Blake erickson
>Priority: Major
> Attachments: badplug.parquet, readBadParquet.py, screenshot-1.png
>
>
> On versions greater than 6.0.1, reading the table fails with an error saying 
> expected length n, got n=1 rows.
>  
> The table can be read fine column by column, or with a fixed number of rows 
> matching the metadata (see the sketch below). It reads correctly in version 
> 6.0.1.
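>
> A sketch of the column-by-column workaround described above (illustrative 
> only, not part of the original report; it assumes the attached 
> {{badplug.parquet}} file):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("badplug.parquet")
> # Reading one column at a time reportedly succeeds even when a full
> # read fails with the row-count error.
> columns = {
>     name: pf.read(columns=[name]).column(0)
>     for name in pf.schema_arrow.names
> }
> table = pa.table(columns)
> {code}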



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18099) [Python] Cannot create pandas categorical from table only with nulls

2022-10-20 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-18099:

Summary: [Python] Cannot create pandas categorical from table only with 
nulls  (was: Cannot create pandas categorical from table only with nulls)

> [Python] Cannot create pandas categorical from table only with nulls
> 
>
> Key: ARROW-18099
> URL: https://issues.apache.org/jira/browse/ARROW-18099
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: OSX 12.6
> M1 silicon
>Reporter: Damian Barabonkov
>Priority: Minor
>
> A pyarrow Table containing only null values cannot be converted to a pandas 
> DataFrame with that column as a category. However, pandas does support 
> "empty" categoricals, so a simple workaround is to load the pa.Table column 
> as object dtype first and then, once in pandas, convert it to a categorical, 
> which will be empty. That does not solve the pyarrow bug at its root, though.
>  
> Sample reproducible example
> {code:python}
> import pyarrow as pa
> pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, 
> '__index_level_0__': 3}]
> tbl = pa.Table.from_pylist(pylist)
>  
> # Errors
> df_broken = tbl.to_pandas(categories=["x"])
>  
> # Works
> df_works = tbl.to_pandas()
> df_works = df_works.astype({"x": "category"}) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0

2022-10-20 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620931#comment-17620931
 ] 

Nicola Crane commented on ARROW-18102:
--

Thanks for opening this ticket [~adam.black].  I've tried this with the dev 
version of Arrow, and can confirm this bug still exists there too.

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ---
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
>Reporter: Adam Black
>Priority: Minor
>
> I'm using dplyr with FileSystemDataset objects. The expected behavior is 
> similar (or the same as) dataframe behavior. When the FileSystemDataset has 
> zero rows dplyr::count and dplyr::tally return NA instead of 0. I would 
> expect the result to be 0.
>  
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> path <- tempfile(fileext = ".feather")
> zero_row_dataset <- cars %>% filter(dist < 0)
> # expected behavior
> zero_row_dataset %>% 
>   count()
> #>   n
> #> 1 0
> zero_row_dataset %>% 
>   tally()
> #>   n
> #> 1 0
> nrow(zero_row_dataset)
> #> [1] 0
> # now test behavior with a FileSystemDataset
> write_feather(zero_row_dataset, path)
> ds <- open_dataset(path, format = "feather")
> ds
> #> FileSystemDataset with 1 Feather file
> #> speed: double
> #> dist: double
> #> 
> #> See $metadata for additional Schema metadata
> # actual behavior
> ds %>% 
>   count() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> ds %>% 
>   tally() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> nrow(ds) # works as expected
> #> [1] 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0

2022-10-20 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18102:
-
Priority: Major  (was: Minor)

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ---
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
>Reporter: Adam Black
>Priority: Major
>
> I'm using dplyr with FileSystemDataset objects. The expected behavior is 
> similar (or the same as) dataframe behavior. When the FileSystemDataset has 
> zero rows dplyr::count and dplyr::tally return NA instead of 0. I would 
> expect the result to be 0.
>  
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> path <- tempfile(fileext = ".feather")
> zero_row_dataset <- cars %>% filter(dist < 0)
> # expected behavior
> zero_row_dataset %>% 
>   count()
> #>   n
> #> 1 0
> zero_row_dataset %>% 
>   tally()
> #>   n
> #> 1 0
> nrow(zero_row_dataset)
> #> [1] 0
> # now test behavior with a FileSystemDataset
> write_feather(zero_row_dataset, path)
> ds <- open_dataset(path, format = "feather")
> ds
> #> FileSystemDataset with 1 Feather file
> #> speed: double
> #> dist: double
> #> 
> #> See $metadata for additional Schema metadata
> # actual behavior
> ds %>% 
>   count() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> ds %>% 
>   tally() %>% 
>   collect() # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #>   
> #> 1    NA
> nrow(ds) # works as expected
> #> [1] 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   >