[jira] [Commented] (ARROW-16769) [C++] Add Status::Warn()

2022-06-07 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551392#comment-17551392
 ] 

Weston Pace commented on ARROW-16769:
-

So would the usage look something like this?

{noformat}
~RecordBatchReader() {
  Status st = Close();
  if (!st.ok()) {
    st.Warn();
  }
}
{noformat}

Would we maybe want a macro similar to {{ABORT_NOT_OK}}?

{noformat}
~RecordBatchReader() {
  ARROW_WARN_NOT_OK(Close());
}
{noformat}
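
Something along these lines, perhaps (just a sketch; the macro name and expansion are assumptions mirroring the {{ABORT_NOT_OK}} pattern):

{noformat}
// Hypothetical macro: evaluate the expression once and, on failure, print
// a warning via the proposed Status::Warn() instead of aborting.
#define ARROW_WARN_NOT_OK(expr)   \
  do {                            \
    ::arrow::Status _st = (expr); \
    if (!_st.ok()) {              \
      _st.Warn();                 \
    }                             \
  } while (false)
{noformat}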

> [C++] Add Status::Warn()
> 
>
> Key: ARROW-16769
> URL: https://issues.apache.org/jira/browse/ARROW-16769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: good-first-issue
> Fix For: 9.0.0
>
>
> We currently have {{Status::Abort()}} which gives an easy way to abort the 
> process with a meaningful message and detail.
> We should similarly add {{Status::Warn()}} that would simply print a warning 
> message for the error. A possible example use is at 
> https://github.com/apache/arrow/pull/13315/files#diff-1256864b34a1b43082596ab5b16881702881ad06be8e1c157b47e1e6ac9ff5d2R160-R164
>  (together with {{StatusFromErrno}}).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16761) [C++][Python] Track bytes_written on FileWriter / WrittenFile

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16761:
---
Labels: pull-request-available  (was: )

> [C++][Python] Track bytes_written on FileWriter / WrittenFile
> -
>
> Key: ARROW-16761
> URL: https://issues.apache.org/jira/browse/ARROW-16761
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For Apache Iceberg and Delta Lake tables, we need to be able to get the size 
> of the files written in bytes. In Iceberg, this is the required field 
> {{file_size_in_bytes}} ([docs|https://iceberg.apache.org/spec/#manifests]). 
> In Delta, this is the required field {{size}} as part of the Add action.
> I think this could be exposed on 
> [FileWriter|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/cpp/src/arrow/dataset/file_base.h#L305]
>  and then through that 
> [WrittenFile|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/python/pyarrow/_dataset.pyx#L766-L769].
>  But at a lower level than that, I'm not yet sure. {{FileWriter}} owns its 
> {{OutputStream}}; would {{OutputStream::Tell()}} give the correct count?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16786) [Docs] Update "closed without merge" in pull request note

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16786:
-
Description: 
https://arrow.apache.org/docs/developers/overview.html#pull-request-and-review

{quote}
A side effect of this way of merging is that your pull request will appear in 
the GitHub interface to have been “closed without merge”. Do not be alarmed: if 
you look at the bottom, you will see a message that says @user closed this in 
$COMMIT. In the commit message of that commit, the merge tool adds the pull 
request description, a link back to the pull request, and attribution to the 
contributor and any co-authors.
{quote}

In ARROW-16602, we changed to using GitHub's "merge" feature so that pull 
requests show a "merged" mark instead of a "closed without merge" mark on 
GitHub. We should update the note.

[~alenkaf] Do you want to work on this?

  was:
https://arrow.apache.org/docs/developers/overview.html#pull-request-and-review

{noformat}
A side effect of this way of merging is that your pull request will appear in 
the GitHub interface to have been “closed without merge”. Do not be alarmed: if 
you look at the bottom, you will see a message that says @user closed this in 
$COMMIT. In the commit message of that commit, the merge tool adds the pull 
request description, a link back to the pull request, and attribution to the 
contributor and any co-authors.
{noformat}

In ARROW-16602, we changed to using GitHub's "merge" feature so that pull 
requests show a "merged" mark instead of a "closed without merge" mark on 
GitHub. We should update the note.

[~alenkaf] Do you want to work on this?


> [Docs] Update "closed without merge" in pull request note
> -
>
> Key: ARROW-16786
> URL: https://issues.apache.org/jira/browse/ARROW-16786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Kouhei Sutou
>Priority: Major
>
> https://arrow.apache.org/docs/developers/overview.html#pull-request-and-review
> {quote}
> A side effect of this way of merging is that your pull request will appear in 
> the GitHub interface to have been “closed without merge”. Do not be alarmed: 
> if you look at the bottom, you will see a message that says @user closed this 
> in $COMMIT. In the commit message of that commit, the merge tool adds the 
> pull request description, a link back to the pull request, and attribution to 
> the contributor and any co-authors.
> {quote}
> In ARROW-16602, we changed to using GitHub's "merge" feature so that pull 
> requests show a "merged" mark instead of a "closed without merge" mark on 
> GitHub. We should update the note.
> [~alenkaf] Do you want to work on this?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16786) [Docs] Update "closed without merge" in pull request note

2022-06-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16786:


 Summary: [Docs] Update "closed without merge" in pull request note
 Key: ARROW-16786
 URL: https://issues.apache.org/jira/browse/ARROW-16786
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Kouhei Sutou


https://arrow.apache.org/docs/developers/overview.html#pull-request-and-review

{noformat}
A side effect of this way of merging is that your pull request will appear in 
the GitHub interface to have been “closed without merge”. Do not be alarmed: if 
you look at the bottom, you will see a message that says @user closed this in 
$COMMIT. In the commit message of that commit, the merge tool adds the pull 
request description, a link back to the pull request, and attribution to the 
contributor and any co-authors.
{noformat}

In ARROW-16602, we changed to using GitHub's "merge" feature so that pull 
requests show a "merged" mark instead of a "closed without merge" mark on 
GitHub. We should update the note.

[~alenkaf] Do you want to work on this?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16761) [C++][Python] Track bytes_written on FileWriter / WrittenFile

2022-06-07 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-16761:
--

Assignee: Will Jones

> [C++][Python] Track bytes_written on FileWriter / WrittenFile
> -
>
> Key: ARROW-16761
> URL: https://issues.apache.org/jira/browse/ARROW-16761
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> For Apache Iceberg and Delta Lake tables, we need to be able to get the size 
> of the files written in bytes. In Iceberg, this is the required field 
> {{file_size_in_bytes}} ([docs|https://iceberg.apache.org/spec/#manifests]). 
> In Delta, this is the required field {{size}} as part of the Add action.
> I think this could be exposed on 
> [FileWriter|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/cpp/src/arrow/dataset/file_base.h#L305]
>  and then through that 
> [WrittenFile|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/python/pyarrow/_dataset.pyx#L766-L769].
>  But at a lower level than that, I'm not yet sure. {{FileWriter}} owns its 
> {{OutputStream}}; would {{OutputStream::Tell()}} give the correct count?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16729) [C++] Bump version of bundled gRPC, Abseil and c-ares

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16729.
--
Resolution: Fixed

Issue resolved by pull request 13315
[https://github.com/apache/arrow/pull/13315]

> [C++] Bump version of bundled gRPC, Abseil and c-ares
> -
>
> Key: ARROW-16729
> URL: https://issues.apache.org/jira/browse/ARROW-16729
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16729) [C++] Bump version of bundled gRPC, Abseil and c-ares

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16729:


Assignee: David Li

> [C++] Bump version of bundled gRPC, Abseil and c-ares
> -
>
> Key: ARROW-16729
> URL: https://issues.apache.org/jira/browse/ARROW-16729
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16609.
--
Resolution: Fixed

Issue resolved by pull request 13282
[https://github.com/apache/arrow/pull/13282]

> [C++] xxhash not installed into dist/lib/include when building C++
> --
>
> Key: ARROW-16609
> URL: https://issues.apache.org/jira/browse/ARROW-16609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} 
> but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module 
> was installed was in November 2021.
> As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} -> 
> {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}}, this 
> module is needed in order to build the Python C++ API separately from C++ 
> (ARROW-16340).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16785) [Packaging][Linux] Add FindThrift.cmake

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16785:
---
Labels: pull-request-available  (was: )

> [Packaging][Linux] Add FindThrift.cmake 
> 
>
> Key: ARROW-16785
> URL: https://issues.apache.org/jira/browse/ARROW-16785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up of ARROW-1672.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16785) [Packaging][Linux] Add FindThrift.cmake

2022-06-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16785:


 Summary: [Packaging][Linux] Add FindThrift.cmake 
 Key: ARROW-16785
 URL: https://issues.apache.org/jira/browse/ARROW-16785
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This is a follow-up of ARROW-1672.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16778:
-
Description: 
When specifying Win32 as a platform and building with MSVC, the build fails 
with the following compile errors:

{noformat}
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error C3861: 
'__popcnt64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error C3861: 
'_BitScanReverse64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error C3861: 
'_BitScanForward64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
{noformat}
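
A possible direction (a sketch only, not a tested fix): these errors come from 
64-bit-only MSVC intrinsics, so the calls could be guarded on the target 
architecture with portable 32-bit fallbacks, e.g.:

{noformat}
#include <cstdint>
#include <intrin.h>

// Sketch (x86/x64 MSVC only): __popcnt64 does not exist on 32-bit MSVC,
// so split the value into two halves and use the 32-bit intrinsic.
static inline int PopCount64(uint64_t x) {
#if defined(_M_X64)
  return static_cast<int>(__popcnt64(x));
#else
  return static_cast<int>(__popcnt(static_cast<uint32_t>(x)) +
                          __popcnt(static_cast<uint32_t>(x >> 32)));
#endif
}
{noformat}

The same treatment would apply to {{_BitScanReverse64}} and 
{{_BitScanForward64}}, for which 32-bit MSVC only provides the 32-bit variants.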

  was:
When specifying Win32 as a platform and building with MSVC, the build fails 
with the following compile errors:

C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error C3861: 
'__popcnt64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error C3861: 
'_BitScanReverse64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error C3861: 
'_BitScanForward64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 


> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform and building with MSVC, the build fails 
> with the following compile errors:
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16767) [Archery] Refactor archery.release submodule to its own subpackage

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16767.
--
Resolution: Fixed

Issue resolved by pull request 13326
[https://github.com/apache/arrow/pull/13326]

> [Archery] Refactor archery.release submodule to its own subpackage
> --
>
> Key: ARROW-16767
> URL: https://issues.apache.org/jira/browse/ARROW-16767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread Andy Teucher (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551312#comment-17551312
 ] 

Andy Teucher commented on ARROW-16783:
--

Great idea to push the check down... I opened a PR 
[here|https://github.com/apache/arrow/pull/13336] where I added the check to 
{{arrow_dplyr_query}}. I also added {{RecordBatchReader}} to the list of 
supported classes that {{write_dataset()}} and {{arrow_dplyr_query()}} can be 
called on.

> [R] write_dataset fails with an uninformative message when duplicated column 
> names
> --
>
> Key: ARROW-16783
> URL: https://issues.apache.org/jira/browse/ARROW-16783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Andy Teucher
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{write_dataset()}} fails when the object being written has duplicated column 
> names. This is probably reasonable behaviour, but the error message is 
> misleading:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> df <- data.frame(
>   id = c("a", "b", "c"),
>   x = 1:3, 
>   x = 4:6,
>   check.names = FALSE
> )
> write_dataset(df, "df")
> #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
> or data.frame, not "data.frame"
> {code}
> [{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
> statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
>  so any error from {{as_adq()}} is swallowed and the error emitted is about 
> the class of the object.
> The real error comes from here:
> {code:r}
> arrow:::as_adq(df)
> #> Error in `arrow_dplyr_query()`:
> #> ! Duplicated field names
> #> ✖ The following field names were found more than once in the data: "x"
> {code}
> I'm not sure what your preferred fix is here... two options that come to mind 
> are:
> 1. Explicitly check for compatible classes before calling {{as_adq()}} 
> instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
> OR
> 2. Check for duplicate column names before the {{tryCatch}} block
> My thought is that option 1 is better, as option 2 means that checking for 
> duplicates would happen twice (once inside {{write_dataset()}} and once again 
> inside {{as_adq()}}).
> I'm happy to work on a fix if you like!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16783:
---
Labels: pull-request-available  (was: )

> [R] write_dataset fails with an uninformative message when duplicated column 
> names
> --
>
> Key: ARROW-16783
> URL: https://issues.apache.org/jira/browse/ARROW-16783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Andy Teucher
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{write_dataset()}} fails when the object being written has duplicated column 
> names. This is probably reasonable behaviour, but the error message is 
> misleading:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> df <- data.frame(
>   id = c("a", "b", "c"),
>   x = 1:3, 
>   x = 4:6,
>   check.names = FALSE
> )
> write_dataset(df, "df")
> #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
> or data.frame, not "data.frame"
> {code}
> [{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
> statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
>  so any error from {{as_adq()}} is swallowed and the error emitted is about 
> the class of the object.
> The real error comes from here:
> {code:r}
> arrow:::as_adq(df)
> #> Error in `arrow_dplyr_query()`:
> #> ! Duplicated field names
> #> ✖ The following field names were found more than once in the data: "x"
> {code}
> I'm not sure what your preferred fix is here... two options that come to mind 
> are:
> 1. Explicitly check for compatible classes before calling {{as_adq()}} 
> instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
> OR
> 2. Check for duplicate column names before the {{tryCatch}} block
> My thought is that option 1 is better, as option 2 means that checking for 
> duplicates would happen twice (once inside {{write_dataset()}} and once again 
> inside {{as_adq()}}).
> I'm happy to work on a fix if you like!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551292#comment-17551292
 ] 

Will Jones commented on ARROW-16783:


I agree that (1) is the better option. Also, I wonder if we can push the check 
for compatible classes down into {{as_adq()}} or {{arrow_dplyr_query()}} as 
well. What do you think of that, [~thisisnic] [~npr]?

> [R] write_dataset fails with an uninformative message when duplicated column 
> names
> --
>
> Key: ARROW-16783
> URL: https://issues.apache.org/jira/browse/ARROW-16783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Andy Teucher
>Priority: Major
>
> {{write_dataset()}} fails when the object being written has duplicated column 
> names. This is probably reasonable behaviour, but the error message is 
> misleading:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> df <- data.frame(
>   id = c("a", "b", "c"),
>   x = 1:3, 
>   x = 4:6,
>   check.names = FALSE
> )
> write_dataset(df, "df")
> #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
> or data.frame, not "data.frame"
> {code}
> [{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
> statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
>  so any error from {{as_adq()}} is swallowed and the error emitted is about 
> the class of the object.
> The real error comes from here:
> {code:r}
> arrow:::as_adq(df)
> #> Error in `arrow_dplyr_query()`:
> #> ! Duplicated field names
> #> ✖ The following field names were found more than once in the data: "x"
> {code}
> I'm not sure what your preferred fix is here... two options that come to mind 
> are:
> 1. Explicitly check for compatible classes before calling {{as_adq()}} 
> instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
> OR
> 2. Check for duplicate column names before the {{tryCatch}} block
> My thought is that option 1 is better, as option 2 means that checking for 
> duplicates would happen twice (once inside {{write_dataset()}} and once again 
> inside {{as_adq()}}).
> I'm happy to work on a fix if you like!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16784) [C++][Gandiva] Add alias to Upper and Lower

2022-06-07 Thread Vinicius Souza Roque (Jira)
Vinicius Souza Roque created ARROW-16784:


 Summary: [C++][Gandiva] Add alias to Upper and Lower
 Key: ARROW-16784
 URL: https://issues.apache.org/jira/browse/ARROW-16784
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Vinicius Souza Roque


Add aliases to the Upper and Lower functions.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread Andy Teucher (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Teucher updated ARROW-16783:
-
Description: 
{{write_dataset()}} fails when the object being written has duplicated column 
names. This is probably reasonable behaviour, but the error message is 
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
 so any error from {{as_adq()}} is swallowed and the error emitted is about the 
class of the object.

The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind 
are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead 
of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.

OR

2. Check for duplicate column names before the {{tryCatch}} block

My thought is that option 1 is better, as option 2 means that checking for 
duplicates would happen twice (once inside {{write_dataset()}} and once again 
inside {{as_adq()}}).

I'm happy to work on a fix if you like!

  was:
{{write_dataset()}} fails when the object being written has duplicated column 
names. This is probably reasonable behaviour, but the error message is 
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
 so any error from {{as_adq()}} is swallowed and the error emitted is about the 
class of the object.

The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind 
are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead 
of using {{tryCatch()}}

OR

2. Check for duplicate column names before the {{tryCatch}} block

My thought is that option 1 is better, as option 2 means that checking for 
duplicates would happen twice (once inside {{write_dataset()}} and once again 
inside {{as_adq()}}).

I'm happy to work on a fix if you like!


> [R] write_dataset fails with an uninformative message when duplicated column 
> names
> --
>
> Key: ARROW-16783
> URL: https://issues.apache.org/jira/browse/ARROW-16783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Andy Teucher
>Priority: Major
>
> {{write_dataset()}} fails when the object being written has duplicated column 
> names. This is probably reasonable behaviour, but the error message is 
> misleading:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> df <- data.frame(
>   id = c("a", "b", "c"),
>   x = 1:3, 
>   x = 4:6,
>   check.names = FALSE
> )
> write_dataset(df, "df")
> #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
> or data.frame, not "data.frame"
> {code}
> [{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
> statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
>  so any error from {{as_adq()}} is swallowed and the error emitted is about 
> the class of the object.
> The real error comes from here:
> {code:r}
> arrow:::as_adq(df)
> #> Error in `arrow_dplyr_query()`:
> #> ! Duplicated field names
> #> ✖ The following field names were found more than once in the data: "x"
> {code}
> I'm not sure what your preferred fix is here... two options that come to mind 
> are:
> 1. Explicitly check for compatible classes before calling {{as_adq()}} 
> instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
> OR
> 2. Check for duplicate column names before the {{tryCatch}} block
> My thought is that option 1 is better, as option 2 means that checking for 
> duplicates would happen twice (once 

[jira] [Updated] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread Andy Teucher (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Teucher updated ARROW-16783:
-
Description: 
{{write_dataset()}} fails when the object being written has duplicated column 
names. This is probably reasonable behaviour, but the error message is 
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
 so any error from {{as_adq()}} is swallowed and the error emitted is about the 
class of the object.

The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind 
are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead 
of using {{tryCatch()}}

OR

2. Check for duplicate column names before the {{tryCatch}} block

My thought is that option 1 is better, as option 2 means that checking for 
duplicates would happen twice (once inside {{write_dataset()}} and once again 
inside {{as_adq()}}).

I'm happy to work on a fix if you like!

  was:
{{write_dataset()}} fails when the object being written has duplicated column 
names. This is probably reasonable behaviour, but the error message is 
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df.parquet")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
 so any error from {{as_adq()}} is swallowed and the error emitted is about the 
class of the object.

The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind 
are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead 
of using {{tryCatch()}}

OR

2. Check for duplicate column names before the {{tryCatch}} block

My thought is that option 1 is better, as option 2 means that checking for 
duplicates would happen twice (once inside {{write_dataset()}} and once again 
inside {{as_adq()}}).

I'm happy to work on a fix if you like!


> [R] write_dataset fails with an uninformative message when duplicated column 
> names
> --
>
> Key: ARROW-16783
> URL: https://issues.apache.org/jira/browse/ARROW-16783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Andy Teucher
>Priority: Major
>
> {{write_dataset()}} fails when the object being written has duplicated column 
> names. This is probably reasonable behaviour, but the error message is 
> misleading:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> df <- data.frame(
>   id = c("a", "b", "c"),
>   x = 1:3, 
>   x = 4:6,
>   check.names = FALSE
> )
> write_dataset(df, "df")
> #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
> or data.frame, not "data.frame"
> {code}
> [{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
> statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
>  so any error from {{as_adq()}} is swallowed and the error emitted is about 
> the class of the object.
> The real error comes from here:
> {code:r}
> arrow:::as_adq(df)
> #> Error in `arrow_dplyr_query()`:
> #> ! Duplicated field names
> #> ✖ The following field names were found more than once in the data: "x"
> {code}
> I'm not sure what your preferred fix is here... two options that come to mind 
> are:
> 1. Explicitly check for compatible classes before calling {{as_adq()}} 
> instead of using {{tryCatch()}}
> OR
> 2. Check for duplicate column names before the {{tryCatch}} block
> My thought is that option 1 is better, as option 2 means that checking for 
> duplicates would happen twice (once inside {{write_dataset()}} and once again 
> inside {{as_adq()}}).
> I'm happy to 

[jira] [Created] (ARROW-16783) [R] write_dataset fails with an uninformative message when duplicated column names

2022-06-07 Thread Andy Teucher (Jira)
Andy Teucher created ARROW-16783:


 Summary: [R] write_dataset fails with an uninformative message 
when duplicated column names
 Key: ARROW-16783
 URL: https://issues.apache.org/jira/browse/ARROW-16783
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Andy Teucher


{{write_dataset()}} fails when the object being written has duplicated column 
names. This is probably reasonable behaviour, but the error message is 
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df.parquet")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, 
or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} 
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
 so any error from {{as_adq()}} is swallowed and the error emitted is about the 
class of the object.

The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind 
are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead 
of using {{tryCatch()}}

OR

2. Check for duplicate column names before the {{tryCatch}} block

My thought is that option 1 is better, as option 2 means that checking for 
duplicates would happen twice (once inside {{write_dataset()}} and once again 
inside {{as_adq()}}).

I'm happy to work on a fix if you like!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2022-06-07 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551259#comment-17551259
 ] 

Will Jones commented on ARROW-9773:
---

I've looked through the code and I think there are three related issues. I'll 
try to describe them here. If you think I am missing some case, let me know. 
Otherwise, I'll open three sub-tasks and start work on those.
h2. Problem 1: We concatenate when we shouldn't need to

This fails:
{code:python}
arr = pa.chunked_array([["a" * 2**30]] * 2)
arr.take([0,1])
# Traceback (most recent call last):
#   File "", line 1, in 
#   File "pyarrow/table.pxi", line 998, in pyarrow.lib.ChunkedArray.take
#   File 
"/Users/willjones/Documents/test-env/venv/lib/python3.9/site-packages/pyarrow/compute.py",
 line 457, in take
# return call_function('take', [data, indices], options, memory_pool)
#   File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
#   File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
#   File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
{code}
 because we concatenate input values 
[here|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/cpp/src/arrow/compute/kernels/vector_selection.cc#L2024-L2025].
 If that were corrected, it would then fail on the concatenation 
[here|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/cpp/src/arrow/compute/kernels/vector_selection.cc#L2046-L2047]
 if the indices were a chunked array.

The first concatenation could be avoided somewhat easily in special cases 
(indices sorted / falling in the same chunk), which was [partially implemented 
in 
R|https://github.com/apache/arrow/blob/6f2c9041137001f7a9212f244b51bc004efc29af/r/src/compute.cpp#L123-L151].
 For the general case, we'd need to address this within the kernel rather than 
within pre-processing (see Problem 3).

The second concatenation shouldn't always be eliminated, but we might want to 
add a check to validate that there is enough room in the offset buffers of the 
arrays being concatenated. TBD whether there is an efficient way to test that.
h2. Problem 2: the take_array kernel doesn't handle the offset overflow case

This is what Antoine was pointing out:
{code:python}
import pyarrow as pa
arr = pa.array(["x" * (1<<20)])
t = arr.take(pa.array([0]*((1<<12) + 1), type=pa.int8()))
t.validate(full=True)
# Traceback (most recent call last):
#   [...]
# ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 2048: 
-2147483648 < 2146435072
{code}
To solve this, I think we'd either have to:
 # (optionally?) promote arrays to the Large variants of the type. The problem 
is we'd need to do this cast consistently across chunks.
 # Switch to returning chunked arrays, and create new chunks as needed. (TBD: 
Could we do that in some cases (String, Binary, List types) and not others?)

h2. Problem 3: there isn't a take_array kernel that handles ChunkedArrays

Finally, for sorting chunked arrays of type string/binary/list (that is, the 
case for take where the indices are out-of-order), I think we need to implement 
kernels specialized for chunked arrays. IIUC, everything but string/binary/list 
types could simply do the concatenation we do now; it's just those three types 
that need special logic to chunk as necessary to avoid offset overflows.

 

 

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Will Jones
>Priority: Major
>  Labels: kernel
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in 

[jira] [Updated] (ARROW-14314) [C++] Sorting dictionary array not implemented

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14314:
---
Labels: kernel pull-request-available  (was: kernel)

> [C++] Sorting dictionary array not implemented
> --
>
> Key: ARROW-14314
> URL: https://issues.apache.org/jira/browse/ARROW-14314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Ariana Villegas
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From R, taking the stock {{mtcars}} dataset and giving it a dictionary type 
> column:
> {code}
> mtcars %>% 
>   mutate(cyl = as.factor(cyl)) %>% 
>   Table$create() %>% 
>   arrange(cyl) %>% 
>   collect()
> Error: Type error: Sorting not supported for type dictionary<values=string, indices=int8, ordered=0>
> ../src/arrow/compute/kernels/vector_array_sort.cc:427  VisitTypeInline(type, 
> this)
> ../src/arrow/compute/kernels/vector_sort.cc:148  
> GetArraySorter(*physical_type_)
> ../src/arrow/compute/kernels/vector_sort.cc:1206  sorter.Sort()
> ../src/arrow/compute/api_vector.cc:259  CallFunction("sort_indices", {datum}, 
> , ctx)
> ../src/arrow/compute/exec/order_by_impl.cc:53  SortIndices(table, options_, 
> ctx_)
> ../src/arrow/compute/exec/sink_node.cc:292  impl_->DoFinish()
> ../src/arrow/compute/exec/exec_plan.cc:297  iterator_.Next()
> ../src/arrow/record_batch.cc:318  ReadNext()
> ../src/arrow/record_batch.cc:329  ReadAll()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16782) [Format] Add RLE definitions to FlatBuffers

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16782:
--

 Summary: [Format] Add RLE definitions to FlatBuffers
 Key: ARROW-16782
 URL: https://issues.apache.org/jira/browse/ARROW-16782
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Format
Reporter: Tobias Zagorni






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16773) [Docs][Format] Document Run-Length encoding in Arrow columnar format

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16773:
---
Labels: pull-request-available  (was: )

> [Docs][Format] Document Run-Length encoding in Arrow columnar format
> 
>
> Key: ARROW-16773
> URL: https://issues.apache.org/jira/browse/ARROW-16773
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Format
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16781) [C++] Complete RunLengthEncoded type

2022-06-07 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-16781:
---
Description: 
 

WIP list of things that should be implemented for a RunLengthEncoded type
 * There should be a corresponding Array type (+Builder?)
 * make_array() should work
 * Validation should pass
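
To make the list concrete: the layout under discussion is conceptually a pair 
of child arrays, run ends plus values. A rough sketch (the field names are 
assumptions, not the final Arrow layout):

{code:cpp}
#include <cstdint>
#include <vector>

// Conceptual model of a run-length encoded array: values[i] covers the
// logical positions [run_ends[i-1], run_ends[i]).
struct RunLengthEncodedInt32 {
  std::vector<int32_t> run_ends;  // cumulative logical end of each run
  std::vector<int32_t> values;    // one value per run

  int32_t ValueAt(int32_t logical_index) const {
    // Linear scan for clarity; real code would binary-search run_ends.
    for (size_t i = 0; i < run_ends.size(); ++i) {
      if (logical_index < run_ends[i]) return values[i];
    }
    return values.back();  // out of range; real code would validate
  }
};
{code}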

  was:
 

WIP list of things that should be implemented for a RunLengthEncoded type
 * There should be a corresponding Array type (+Builder?)
 * Validation should pass


> [C++] Complete RunLengthEncoded type
> 
>
> Key: ARROW-16781
> URL: https://issues.apache.org/jira/browse/ARROW-16781
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Priority: Major
>
>  
> WIP list of things that should be implemented for a RunLengthEncoded type
>  * There should be a corresponding Array type (+Builder?)
>  * make_array() should work
>  * Validation should pass



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16781) [C++] Complete RunLengthEncoded type

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16781:
--

 Summary: [C++] Complete RunLengthEncoded type
 Key: ARROW-16781
 URL: https://issues.apache.org/jira/browse/ARROW-16781
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Tobias Zagorni


 

WIP list of things that should be implemented for a RunLengthEncoded type
 * There should be a corresponding Array type (+Builder?)
 * Validation should pass



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16713) [C++] Pull join accumulation outside of HashJoinImpl

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16713:
---
Labels: pull-request-available  (was: )

> [C++] Pull join accumulation outside of HashJoinImpl
> 
>
> Key: ARROW-16713
> URL: https://issues.apache.org/jira/browse/ARROW-16713
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is part of the preparatory refactoring for spilling (ARROW-16389)
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16780) [CI] Add automatic PR label for docs PRs

2022-06-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551196#comment-17551196
 ] 

Antoine Pitrou commented on ARROW-16780:


cc [~raulcd] [~assignUser]

> [CI] Add automatic PR label for docs PRs
> 
>
> Key: ARROW-16780
> URL: https://issues.apache.org/jira/browse/ARROW-16780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Our auto-labeller for PRs supports adding several labels to GitHub PRs based 
> on the files modified in the PR. It misses a label for doc changes, though.
> See {{.github/workflows/dev_pr/labeler.yml}} for the implementation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16780) [CI] Add automatic PR label for docs PRs

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16780:
---
Priority: Minor  (was: Major)

> [CI] Add automatic PR label for docs PRs
> 
>
> Key: ARROW-16780
> URL: https://issues.apache.org/jira/browse/ARROW-16780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Our auto-labeller for PRs supports adding several labels to GitHub PRs based 
> on the files modified in the PR. It misses a label for doc changes, though.
> See {{.github/workflows/dev_pr/labeler.yml}} for the implementation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16780) [CI] Add automatic PR label for docs PRs

2022-06-07 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16780:
--

 Summary: [CI] Add automatic PR label for docs PRs
 Key: ARROW-16780
 URL: https://issues.apache.org/jira/browse/ARROW-16780
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Antoine Pitrou


Our auto-labeller for PRs supports adding several labels to GitHub PRs based on 
the files modified in the PR. It misses a label for doc changes, though.

See {{.github/workflows/dev_pr/labeler.yml}} for the implementation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16779) Request for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel

2022-06-07 Thread Ajay Kanagala (Jira)
Ajay Kanagala created ARROW-16779:
-

 Summary: Request for Pyarrow Flight to be shipped in arm64 MacOS 
version of the wheel
 Key: ARROW-16779
 URL: https://issues.apache.org/jira/browse/ARROW-16779
 Project: Apache Arrow
  Issue Type: New Feature
  Components: FlightRPC
 Environment: Mac M1 OS, Python,
Reporter: Ajay Kanagala


This ticket is a continuation of the previous ticket ARROW-13657: 
https://issues.apache.org/jira/browse/ARROW-13657

Flight is not shipped in all versions of the wheel. You will also get an 
import error if you attempt to import pyarrow.gandiva, which is likewise an 
optional feature. It is turned off for arm64 macOS here:

[https://github.com/apache/arrow/blob/8f0ddc785dd72e950b570f3bc380deb15c124c45/dev/tasks/python-wheels/github.osx.arm64.yml#L26]

 

Our team uses the Mac M1 processor to work on the Dremio driver and needs 
access to the pyarrow package.

 

Can you please add it to the wheel?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16725) [C++] Compilation warnings on gcc in release mode

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16725.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13293
[https://github.com/apache/arrow/pull/13293]

> [C++] Compilation warnings on gcc in release mode
> -
>
> Key: ARROW-16725
> URL: https://issues.apache.org/jira/browse/ARROW-16725
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> See for example on MinGW CI builds:
> [https://github.com/apache/arrow/runs/6705674960?check_suite_focus=true#step:7:624]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16778) 32 bit MSVC doesn't build

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16778:
---
Priority: Major  (was: Blocker)

> 32 bit MSVC doesn't build
> -
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform and building with MSVC, the build fails 
> with the following compile errors:
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16778:
---
Summary: [C++] 32 bit MSVC doesn't build  (was: 32 bit MSVC doesn't build)

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform and building with MSVC, the build fails 
> with the following compile errors:
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16741) Add Benchmarks for Binary Temporal Operations

2022-06-07 Thread Ivan Chau (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551161#comment-17551161
 ] 

Ivan Chau commented on ARROW-16741:
---

[~westonpace], I also have an open PR here that needs to be reviewed; this one 
is more of an incremental change compared to the other benchmarks. [~rokm] left 
some comments, but I think I need some more clarification and was wondering if 
you had any thoughts. 

Thank you all in advance!

> Add Benchmarks for Binary Temporal Operations
> -
>
> Key: ARROW-16741
> URL: https://issues.apache.org/jira/browse/ARROW-16741
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking, C++
>Reporter: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16778) 32 bit MSVC doesn't build

2022-06-07 Thread Arkadiy Vertleyb (Jira)
Arkadiy Vertleyb created ARROW-16778:


 Summary: 32 bit MSVC doesn't build
 Key: ARROW-16778
 URL: https://issues.apache.org/jira/browse/ARROW-16778
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: Win32, MSVC
Reporter: Arkadiy Vertleyb


When specifying Win32 as a platform and building with MSVC, the build fails 
with the following compile errors:

C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error C3861: 
'__popcnt64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error C3861: 
'_BitScanReverse64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error C3861: 
'_BitScanForward64': identifier not found 
[C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16760) [Docs] Mention PYARROW_PARALLEL in Python dev docs

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16760.

Resolution: Fixed

Issue resolved by pull request 13324
[https://github.com/apache/arrow/pull/13324]

> [Docs] Mention PYARROW_PARALLEL in Python dev docs
> --
>
> Key: ARROW-16760
> URL: https://issues.apache.org/jira/browse/ARROW-16760
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We should include {{PYARROW_PARALLEL}} in the Python developer docs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16772) [C++] Implement encode and decode functions for Run-Length encoding

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16772:
---
Labels: pull-request-available  (was: )

> [C++] Implement encode and decode functions for Run-Length encoding
> ---
>
> Key: ARROW-16772
> URL: https://issues.apache.org/jira/browse/ARROW-16772
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16567) [Doc][Python] Sphinx Copybutton should ignore IPython prompt text

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16567:
---
Labels: beginner good-first-issue pull-request-available  (was: beginner 
good-first-issue)

> [Doc][Python] Sphinx Copybutton should ignore IPython prompt text 
> --
>
> Key: ARROW-16567
> URL: https://issues.apache.org/jira/browse/ARROW-16567
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: beginner, good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Sphinx Copybutton configuration set in 
> [docs/source/conf.py|https://github.com/apache/arrow/blob/90aac16761b7dbf5fe931bc8837cad5116939270/docs/source/conf.py#L94-L96]
>  should be updated to ignore IPython prompt text in examples that use the 
> IPython directive, for example:
> [https://arrow.apache.org/docs/python/dataset.html|https://arrow.apache.org/docs/python/dataset.html#dataset-discovery]
> See: 
> [https://sphinx-copybutton.readthedocs.io/en/latest/use.html#using-regexp-prompt-identifiers]
> This can be changed once the work on ARROW-13159 is done.
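
For reference, the sphinx-copybutton documentation linked above suggests a 
regexp-based prompt configuration along these lines (a sketch of the eventual 
conf.py change, not the final patch):
{code:python}
# docs/source/conf.py -- strip ">>> ", "... ", "$ " and IPython's
# "In [1]:" prompts (plus continuation prompts) when copying snippets.
copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
copybutton_prompt_is_regexp = True
{code}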



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero

2022-06-07 Thread Ivan Chau (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550653#comment-17550653
 ] 

Ivan Chau edited comment on ARROW-16716 at 6/7/22 2:30 PM:
---

Hi [~icexelloss], I wrote this part of the code corresponding to Weston's 
review and have been running into some segfaults, which I think pertain to the 
`ExecNode* input` parameter. Do you see any smells / errors in the logic here?

*This is fixed now; it just requires the addition of a sink.*
{code:java}
ASSERT_OK_AND_ASSIGN(auto sn, MakeExecNode("source", plan.get(), {},
    SourceNodeOptions{data.schema,
                      data.gen(/*parallel=*/true, /*slow=*/false)}));
// does the source node need input?
ASSERT_OK_AND_ASSIGN(auto pn, MakeExecNode("project", plan.get(), {sn},
    ProjectNodeOptions{{expr}}));  // {sn}, as the input is the source node
state.ResumeTiming();
pn->InputFinished(sn, num_batches);  // segfault occurs on this line
for (auto b : data.batches) {
  pn->InputReceived(sn, b);  // segfault occurs here too, if the other faulting line is removed
}
pn->finished();
{code}


was (Author: JIRAUSER290345):
Hi [~icexelloss], I wrote this part of the code corresponding to Weston's 
review and have been running into some segfaults, which I think pertain to the 
`ExecNode* input` parameter. Do you see any smells / errors in the logic here?

 
{code:java}
ASSERT_OK_AND_ASSIGN(auto sn, MakeExecNode("source", plan.get(), {},
    SourceNodeOptions{data.schema,
                      data.gen(/*parallel=*/true, /*slow=*/false)}));
// does the source node need input?
ASSERT_OK_AND_ASSIGN(auto pn, MakeExecNode("project", plan.get(), {sn},
    ProjectNodeOptions{{expr}}));  // {sn}, as the input is the source node
state.ResumeTiming();
pn->InputFinished(sn, num_batches);  // segfault occurs on this line
for (auto b : data.batches) {
  pn->InputReceived(sn, b);  // segfault occurs here too, if the other faulting line is removed
}
pn->finished();
{code}

> [Benchmarks] Create Projection benchmark for Acero
> --
>
> Key: ARROW-16716
> URL: https://issues.apache.org/jira/browse/ARROW-16716
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Attachments: out, out_expression
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2022-06-07 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-9773:
-

Assignee: Will Jones  (was: Percy Camilo Triveño Aucahuasi)

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Will Jones
>Priority: Major
>  Labels: kernel
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in take
> return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
> In this example, it would be useful if Take() or a higher-level wrapper could 
> generate multiple record batches as output.
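
One possible shape for such a higher-level wrapper, as a minimal Python sketch 
(`chunked_take` is a hypothetical name, not the library API, and taking one row 
per index is deliberately naive):
{code:python}
import bisect
import pyarrow as pa

def chunked_take(table: pa.Table, indices) -> pa.Table:
    # Avoid concatenating chunks: locate the record batch that holds each
    # requested row and slice that single row out of it, so no offset
    # overflow can occur. Output preserves the order of `indices`.
    batches = table.to_batches()
    starts = [0]  # starts[k] is the global row number where batch k begins
    for b in batches:
        starts.append(starts[-1] + b.num_rows)
    rows = []
    for i in indices:
        k = bisect.bisect_right(starts, i) - 1  # batch that holds row i
        rows.append(batches[k].slice(i - starts[k], 1))
    return pa.Table.from_batches(rows, schema=table.schema)
{code}
A real implementation would group consecutive indices that fall in the same 
chunk instead of emitting one-row batches, but the key point is the same: the 
output is itself chunked, so it never needs a single contiguous array.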



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16689) [CI] Improve R Nightly Workflow

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-16689.
-
Resolution: Fixed

Issue resolved by pull request 13266
[https://github.com/apache/arrow/pull/13266]

> [CI] Improve R Nightly Workflow
> ---
>
> Key: ARROW-16689
> URL: https://issues.apache.org/jira/browse/ARROW-16689
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Only upload if all tests succeed, improve overall polish, and improve 
> documentation.
> Add ubuntu 22.04 binary (see ARROW-16678).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16772) [C++] Implement encode and decode functions for Run-Length encoding

2022-06-07 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-16772:
---
Summary: [C++] Implement encode and decode functions for Run-Length 
encoding  (was: [C++] Implement encode and deccode functions for Run-Length 
encoding)

> [C++] Implement encode and decode functions for Run-Length encoding
> ---
>
> Key: ARROW-16772
> URL: https://issues.apache.org/jira/browse/ARROW-16772
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16772) [C++] Implement encode and deccode functions for Run-Length encoding

2022-06-07 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-16772:
---
Summary: [C++] Implement encode and deccode functions for Run-Length 
encoding  (was: [C++] Implement encode and doecode functions for Run-Length 
encoding)

> [C++] Implement encode and deccode functions for Run-Length encoding
> 
>
> Key: ARROW-16772
> URL: https://issues.apache.org/jira/browse/ARROW-16772
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16776) [R] dpylr::glimpse method for arrow table/datasets on disk

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16776:

Summary: [R] dpylr::glimpse method for arrow table/datasets on disk  (was: 
dpylr::glimpse method for arrow table/datasets on disk)

> [R] dpylr::glimpse method for arrow table/datasets on disk
> --
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Thomas Mock
>Priority: Minor
> Fix For: 9.0.0
>
>
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> I can perform exploratory data analysis on large out-of-memory datasets via 
> Arrow + dplyr but in order to print the returned values I have to collect() 
> into memory or send to_duckdb().
>  * compute() - returns number of rows/columns, but no data
>  * collect() - returns data fully into memory, can be combined with head()
>  * to_duckdb() - keeps data out of memory, always returns top 10 rows and all 
> columns, optionally increase/decrease number of printed rows
> While to_duckdb() gives me the ability to do true EDA, it seems 
> counterintuitive to need to send the arrow table over to a duckdb database 
> just to see the glimpse()/head() equivalent.
> My feature request is for a dplyr::glimpse() method that lazily prints the 
> first few values of a table/dataset. The expected output would be something 
> like the below.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg   21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp  160.0, 160.0, 108.0, 258.0, 36…
> $ hp110, 110, 93, 110, 175, 105, 2…
> $ drat  3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec  16.46, 17.02, 18.61, 19.44, 17…
> $ vs0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
> Currently glimpse() returns a list output where the majority of the 
> output is extraneous to the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding 
> #>  $ cyl :List of 11
> #>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     cast: function (to_type, safe = TRUE, ...) 
> #>     clone: function (deep = FALSE) 
> #>     Equals: function (other, ...) 
> #>     field_name: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: Schema, ArrowObject, R6
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: function (schema = self$schema) 
> #>     type_id: function (schema = self$schema)  
> #>   ..$ cyl :Classes 'Expression', 

[jira] [Updated] (ARROW-16777) printing data in Table/RecordBatch print method

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16777:

Fix Version/s: 9.0.0

> printing data in Table/RecordBatch print method
> ---
>
> Key: ARROW-16777
> URL: https://issues.apache.org/jira/browse/ARROW-16777
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Thomas Mock
>Priority: Minor
> Fix For: 9.0.0
>
>
> Related to ARROW-16776; after a brief discussion, Neal Richardson requested 
> that I split the improvement request into separate issues.
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> It would be ideal to lazily print some data with the Table/RecordBatch print 
> methods; however, currently the print methods return the schema without data. 
> For example:
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds
> #> FileSystemDataset with 1 Parquet file
> #> mpg: double
> #> cyl: double
> #> disp: double
> #> hp: double
> #> drat: double
> #> wt: double
> #> qsec: double
> #> vs: double
> #> am: double
> #> gear: double
> #> carb: double
> #> 
> #> See $metadata for additional Schema metadata
> car_ds %>%
>   compute()
> #> Table
> #> 32 rows x 11 columns
> #> $mpg 
> #> $cyl 
> #> $disp 
> #> $hp 
> #> $drat 
> #> $wt 
> #> $qsec 
> #> $vs 
> #> $am 
> #> $gear 
> #> $carb 
> #> 
> #> See $metadata for additional Schema metadata
> ```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16777) [R] printing data in Table/RecordBatch print method

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16777:

Summary: [R] printing data in Table/RecordBatch print method  (was: 
printing data in Table/RecordBatch print method)

> [R] printing data in Table/RecordBatch print method
> ---
>
> Key: ARROW-16777
> URL: https://issues.apache.org/jira/browse/ARROW-16777
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Thomas Mock
>Priority: Minor
> Fix For: 9.0.0
>
>
> Related to ARROW-16776; after a brief discussion, Neal Richardson requested 
> that I split the improvement request into separate issues.
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> It would be ideal to lazily print some data with the Table/RecordBatch print 
> methods; however, currently the print methods return the schema without data. 
> For example:
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds
> #> FileSystemDataset with 1 Parquet file
> #> mpg: double
> #> cyl: double
> #> disp: double
> #> hp: double
> #> drat: double
> #> wt: double
> #> qsec: double
> #> vs: double
> #> am: double
> #> gear: double
> #> carb: double
> #> 
> #> See $metadata for additional Schema metadata
> car_ds %>%
>   compute()
> #> Table
> #> 32 rows x 11 columns
> #> $mpg 
> #> $cyl 
> #> $disp 
> #> $hp 
> #> $drat 
> #> $wt 
> #> $qsec 
> #> $vs 
> #> $am 
> #> $gear 
> #> $carb 
> #> 
> #> See $metadata for additional Schema metadata
> ```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16776) [R] dplyr::glimpse method for arrow table/datasets on disk

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16776:

Summary: [R] dplyr::glimpse method for arrow table/datasets on disk  (was: 
[R] dpylr::glimpse method for arrow table/datasets on disk)

> [R] dplyr::glimpse method for arrow table/datasets on disk
> --
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Thomas Mock
>Priority: Minor
> Fix For: 9.0.0
>
>
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> I can perform exploratory data analysis on large out-of-memory datasets via 
> Arrow + dplyr but in order to print the returned values I have to collect() 
> into memory or send to_duckdb().
>  * compute() - returns number of rows/columns, but no data
>  * collect() - returns data fully into memory, can be combined with head()
>  * to_duckdb() - keeps data out of memory, always returns top 10 rows and all 
> columns, optionally increase/decrease number of printed rows
> While to_duckdb() gives me the ability to do true EDA, it seems 
> counterintuitive to need to send the arrow table over to a duckdb database 
> just to see the glimpse()/head() equivalent.
> My feature request is for a dplyr::glimpse() method that lazily prints the 
> first few values of a table/dataset. The expected output would be something 
> like the below.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg   21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp  160.0, 160.0, 108.0, 258.0, 36…
> $ hp110, 110, 93, 110, 175, 105, 2…
> $ drat  3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec  16.46, 17.02, 18.61, 19.44, 17…
> $ vs0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
> Currently glimpse() returns a list output where the majority of the 
> output is extraneous to the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding 
> #>  $ cyl :List of 11
> #>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     cast: function (to_type, safe = TRUE, ...) 
> #>     clone: function (deep = FALSE) 
> #>     Equals: function (other, ...) 
> #>     field_name: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: Schema, ArrowObject, R6
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: function (schema = self$schema) 
> #>     type_id: function (schema = self$schema)  
> #>   ..$ cyl :Classes 'Expression', 

[jira] [Updated] (ARROW-16776) dpylr::glimpse method for arrow table/datasets on disk

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16776:

Fix Version/s: 9.0.0

> dpylr::glimpse method for arrow table/datasets on disk
> --
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Thomas Mock
>Priority: Minor
> Fix For: 9.0.0
>
>
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> I can perform exploratory data analysis on large out-of-memory datasets via 
> Arrow + dplyr but in order to print the returned values I have to collect() 
> into memory or send to_duckdb().
>  * compute() - returns number of rows/columns, but no data
>  * collect() - returns data fully into memory, can be combined with head()
>  * to_duckdb() - keeps data out of memory, always returns top 10 rows and all 
> columns, optionally increase/decrease number of printed rows
> While to_duckdb() gives me the ability to do true EDA, it seems 
> counterintuitive to need to send the arrow table over to a duckdb database 
> just to see the glimpse()/head() equivalent.
> My feature request is for a dplyr::glimpse() method that lazily prints the 
> first few values of a table/dataset. The expected output would be something 
> like the below.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg   21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp  160.0, 160.0, 108.0, 258.0, 36…
> $ hp110, 110, 93, 110, 175, 105, 2…
> $ drat  3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec  16.46, 17.02, 18.61, 19.44, 17…
> $ vs0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
> Currently glimpse() returns a list output where the majority of the 
> output is extraneous to the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding 
> #>  $ cyl :List of 11
> #>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     cast: function (to_type, safe = TRUE, ...) 
> #>     clone: function (deep = FALSE) 
> #>     Equals: function (other, ...) 
> #>     field_name: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: Schema, ArrowObject, R6
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: function (schema = self$schema) 
> #>     type_id: function (schema = self$schema)  
> #>   ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     cast: function (to_type, safe = 

[jira] [Created] (ARROW-16777) printing data in Table/RecordBatch print method

2022-06-07 Thread Thomas Mock (Jira)
Thomas Mock created ARROW-16777:
---

 Summary: printing data in Table/RecordBatch print method
 Key: ARROW-16777
 URL: https://issues.apache.org/jira/browse/ARROW-16777
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python, R
Reporter: Thomas Mock


Related to ARROW-16776; after a brief discussion, Neal Richardson requested 
that I split the improvement request into separate issues.

When working with Arrow datasets/tables, I often find myself wanting to 
interactively print or "see" the results of a query or the first few rows of 
the data without having to fully collect into memory. 

It would be ideal to lazily print some data with the Table/RecordBatch print 
methods; however, currently the print methods return the schema without data. 

For example:

``` r
library(dplyr)
library(arrow)

mtcars %>% arrow::write_parquet("mtcars.parquet")
car_ds <- arrow::open_dataset("mtcars.parquet")

car_ds
#> FileSystemDataset with 1 Parquet file
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> 
#> See $metadata for additional Schema metadata

car_ds %>%
  compute()
#> Table
#> 32 rows x 11 columns
#> $mpg 
#> $cyl 
#> $disp 
#> $hp 
#> $drat 
#> $wt 
#> $qsec 
#> $vs 
#> $am 
#> $gear 
#> $carb 
#> 
#> See $metadata for additional Schema metadata
```





--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16567) [Doc][Python] Sphinx Copybutton should ignore IPython prompt text

2022-06-07 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-16567:
---

Assignee: Alenka Frim

> [Doc][Python] Sphinx Copybutton should ignore IPython prompt text 
> --
>
> Key: ARROW-16567
> URL: https://issues.apache.org/jira/browse/ARROW-16567
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: beginner, good-first-issue
> Fix For: 9.0.0
>
>
> Sphinx Copybutton configuration set in 
> [docs/source/conf.py|https://github.com/apache/arrow/blob/90aac16761b7dbf5fe931bc8837cad5116939270/docs/source/conf.py#L94-L96]
>  should be updated to ignore IPython prompt text in examples that use the 
> IPython directive, for example:
> [https://arrow.apache.org/docs/python/dataset.html|https://arrow.apache.org/docs/python/dataset.html#dataset-discovery]
> See: 
> [https://sphinx-copybutton.readthedocs.io/en/latest/use.html#using-regexp-prompt-identifiers]
> This can be changed once the work on ARROW-13159 is done.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16776) dpylr::glimpse method for arrow table/datasets on disk

2022-06-07 Thread Thomas Mock (Jira)
Thomas Mock created ARROW-16776:
---

 Summary: dpylr::glimpse method for arrow table/datasets on disk
 Key: ARROW-16776
 URL: https://issues.apache.org/jira/browse/ARROW-16776
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Thomas Mock


When working with Arrow datasets/tables, I often find myself wanting to 
interactively print or "see" the results of a query or the first few rows of 
the data without having to fully collect into memory. 

I can perform exploratory data analysis on large out-of-memory datasets via 
Arrow + dplyr but in order to print the returned values I have to collect() 
into memory or send to_duckdb().
 * compute() - returns number of rows/columns, but no data
 * collect() - returns data fully into memory, can be combined with head()
 * to_duckdb() - keeps data out of memory, always returns top 10 rows and all 
columns, optionally increase/decrease number of printed rows

While to_duckdb() gives me the ability to do true EDA, it seems 
counterintuitive to need to send the arrow table over to a duckdb database just 
to see the glimpse()/head() equivalent.

My feature request is for a dplyr::glimpse() method that lazily prints the 
first few values of a table/dataset. The expected output would be something 
like the below.

``` r
library(dplyr)
library(arrow)

mtcars %>% arrow::write_parquet("mtcars.parquet")
car_ds <- arrow::open_dataset("mtcars.parquet")

car_ds %>% 
  glimpse()

Rows: ??
Columns: 11
$ mpg   21.0, 21.0, 22.8, 21.4, 18.7, …
$ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
$ disp  160.0, 160.0, 108.0, 258.0, 36…
$ hp110, 110, 93, 110, 175, 105, 2…
$ drat  3.90, 3.90, 3.85, 3.08, 3.15, …
$ wt2.620, 2.875, 2.320, 3.215, 3.…
$ qsec  16.46, 17.02, 18.61, 19.44, 17…
$ vs0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
$ am1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
$ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
$ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
```

Currently glimpse() returns a list output where the majority of the output 
is extraneous to the actual data/values.

``` r
library(dplyr)
library(arrow)

mtcars %>% arrow::write_parquet("mtcars.parquet")
car_ds <- arrow::open_dataset("mtcars.parquet")

car_ds %>% 
  glimpse()
#> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 

#>   Inherits from: 
#>   Public:
#>     .:xp:.: externalptr
#>     .class_title: function () 
#>     clone: function (deep = FALSE) 
#>     files: active binding
#>     filesystem: active binding
#>     format: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     metadata: active binding
#>     NewScan: function () 
#>     num_cols: active binding
#>     num_rows: active binding
#>     pointer: function () 
#>     print: function (...) 
#>     schema: active binding
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: active binding

car_ds %>%
  filter(cyl == 6) %>%
  glimpse()
#> List of 7
#>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 

#>   Inherits from: 
#>   Public:
#>     .:xp:.: externalptr
#>     .class_title: function () 
#>     clone: function (deep = FALSE) 
#>     files: active binding
#>     filesystem: active binding
#>     format: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     metadata: active binding
#>     NewScan: function () 
#>     num_cols: active binding
#>     num_rows: active binding
#>     pointer: function () 
#>     print: function (...) 
#>     schema: active binding
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: active binding 
#>  $ cyl :List of 11
#>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' 
#>   Inherits from: 
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' 
#>   Inherits from: 
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ disp:Classes 'Expression', 

[jira] [Created] (ARROW-16775) pyarrow's read_table is way slower than iter_batches

2022-06-07 Thread Satoshi Nakamoto (Jira)
Satoshi Nakamoto created ARROW-16775:


 Summary: pyarrow's read_table is way slower than iter_batches
 Key: ARROW-16775
 URL: https://issues.apache.org/jira/browse/ARROW-16775
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet, Python
Affects Versions: 8.0.0
 Environment: pyarrow 8.0.0
pandas 1.4.2
numpy 1.22.4
python 3.9

I reproduced this behaviour on two machines: 
* macbook pro with m1 max 32 gb and cpython 3.9.12 from conda miniforge
* pytorch docker container on standard linux machine
Reporter: Satoshi Nakamoto


Hi!

Loading a table created from a DataFrame with `pyarrow.parquet.read_table()` is 
taking 3x as much time as loading it as batches:

 
{code:python}
pyarrow.Table.from_batches(
    list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
){code}
 
h4. Minimal example

 
{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        "a": np.random.random(10**9), 
        "b": np.random.random(10**9)
    }
)

df.to_parquet("file.parquet")

table_of_whole_file = pq.read_table("file.parquet")

table_of_batches = pa.Table.from_batches(
    list(
        pq.ParquetFile("file.parquet").iter_batches()
    )
)

table_of_one_batch = pa.Table.from_batches(
    [
        next(pq.ParquetFile("file.parquet")
        .iter_batches(batch_size=10**9))
    ]
){code}
 

_table_of_batches_ reading time is 11.5 seconds; _table_of_whole_file_ read 
time is 33.2 s.

Also, loading the table as one batch (_table_of_one_batch_) is slightly faster: 9.8 s.
h4. Parquet file metadata

 
{code:java}

  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 2
  num_rows: 1000000000
  num_row_groups: 15
  format_version: 1.0
  serialized_size: 5680 {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16774) [C++] Create and example Kernel that works on RLE data

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16774:
--

 Summary: [C++] Create and example Kernel that works on RLE data
 Key: ARROW-16774
 URL: https://issues.apache.org/jira/browse/ARROW-16774
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Tobias Zagorni






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16773) [Docs][Format] Document Run-Length encoding in Arrow columnar format

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16773:
--

 Summary: [Docs][Format] Document Run-Length encoding in Arrow 
columnar format
 Key: ARROW-16773
 URL: https://issues.apache.org/jira/browse/ARROW-16773
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation, Format
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16772) [C++] Implement encode and doecode functions for Run-Length encoding

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16772:
--

 Summary: [C++] Implement encode and doecode functions for 
Run-Length encoding
 Key: ARROW-16772
 URL: https://issues.apache.org/jira/browse/ARROW-16772
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni
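
The issue body is empty, so as a point of reference, here is a minimal Python 
sketch of what run-length encode/decode means operationally (the actual task is 
a C++ kernel, the eventual Arrow layout may store run ends rather than run 
lengths, and the names here are illustrative only):
{code:python}
def rle_encode(values):
    # Collapse consecutive equal values into (run_lengths, run_values).
    run_lengths, run_values = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_lengths[-1] += 1
        else:
            run_values.append(v)
            run_lengths.append(1)
    return run_lengths, run_values

def rle_decode(run_lengths, run_values):
    # Expand each (length, value) pair back into a flat list.
    out = []
    for n, v in zip(run_lengths, run_values):
        out.extend([v] * n)
    return out

assert rle_decode(*rle_encode([1, 1, 2, 3, 3, 3])) == [1, 1, 2, 3, 3, 3]
{code}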






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16658) [Python] Support arithmetic on arrays and scalars

2022-06-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551075#comment-17551075
 ] 

Joris Van den Bossche commented on ARROW-16658:
---

For the options, or checked vs. unchecked, we do need to pick some default, 
but I think that's OK (and I would also go for the checked versions).

I think _if_ we add those, we should also change the behaviour of {{__eq__}}, 
although that is something that will require a long deprecation cycle.

> [Python] Support arithmetic on arrays and scalars
> -
>
> Key: ARROW-16658
> URL: https://issues.apache.org/jira/browse/ARROW-16658
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was surprised to find you can't use standard arithmetic operators on 
> PyArrow arrays and scalars. Instead, one must use the compute functions:
> {code:Python}
> import pyarrow as pa
> import pyarrow.compute as pc
> arr = pa.array([1, 2, 3])
> pc.add(arr, 2)
> # Doesn't work:
> # arr + 2
> # arr + pa.scalar(2)
> # arr + arr
> pc.multiply(arr, 20)
> # Doesn't work:
> # 20 * arr
> # pa.scalar(20) * arr
> {code}
> Is it intentional we don't support this?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16771) [Format][C++] Adding Run-Length encoding to Arrow

2022-06-07 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16771:
--

 Summary: [Format][C++] Adding Run-Length encoding to Arrow
 Key: ARROW-16771
 URL: https://issues.apache.org/jira/browse/ARROW-16771
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Format
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni


As discussed here:

[https://lists.apache.org/thread/djy8xn28p264vhj8y5rqbgkgwss6oyo1]

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16768) [R] Factor levels cannot contain NA

2022-06-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16768:

Summary: [R] Factor levels cannot contain NA  (was: Factor variables in R 
with missing values cause an error for write_parquet)

> [R] Factor levels cannot contain NA
> ---
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Kieran Martin
>Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16768) Factor variables in R with missing values cause an error for write_parquet

2022-06-07 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550987#comment-17550987
 ] 

Neal Richardson commented on ARROW-16768:
-

Thanks for the report. A couple things to note:

1. factors can have missing values in the data. The issue is that in your 
example, you've put an NA into the "labels" argument of {{factor()}}. 

{code}
> factor(1, 2, NA)
[1] <NA>
Levels: <NA>
{code}

Assuming you meant all of the arguments passed to {{factor()}} to be data 
values, there is no problem because R puts the NA in the data and not in the 
levels:

{code}
> factor(c(1, 2, NA))
[1] 1    2    <NA>
Levels: 1 2
{code}

So {{data.frame(A = factor(c(1, 2, NA)))}} writes just fine. 

2. The error comes from conversion to Arrow types, prior to sending to the 
Parquet writer

{code}
> Array$create(factor(1, labels=NA))
Error: Invalid: Cannot insert dictionary values containing nulls
{code}

raised from here: 
https://github.com/apache/arrow/blob/91e3ac53e2e21736ce6291d73fc37da6fa21259d/cpp/src/arrow/array/builder_dict.cc#L81

If there is a real use case where you could get an NA in the factor levels, we 
would need to handle that in R.

> Factor variables in R with missing values cause an error for write_parquet
> --
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Kieran Martin
>Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16658) [Python] Support arithmetic on arrays and scalars

2022-06-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550984#comment-17550984
 ] 

Antoine Pitrou commented on ARROW-16658:


There are other questions, such as which options we pass. We probably don't 
want to issue unchecked arithmetic by default, for example.
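
To make the checked-by-default idea concrete, a minimal Python sketch (not the 
pyarrow API; `_ArithmeticArray` is a hypothetical wrapper) of dispatching the 
operators to the checked kernels:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

class _ArithmeticArray:
    """Hypothetical wrapper: Python operators dispatch to *checked*
    kernels, so integer overflow raises instead of wrapping."""

    def __init__(self, arr):
        self._arr = arr

    def __add__(self, other):
        return pc.add_checked(self._arr, other)

    def __mul__(self, other):
        return pc.multiply_checked(self._arr, other)

    # Reflected variants make `2 + arr` and `20 * arr` work too.
    __radd__ = __add__
    __rmul__ = __mul__

arr = _ArithmeticArray(pa.array([1, 2, 3]))
print(arr + 2)   # -> [3, 4, 5] via pc.add_checked
print(20 * arr)  # -> [20, 40, 60] via __rmul__
{code}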

> [Python] Support arithmetic on arrays and scalars
> -
>
> Key: ARROW-16658
> URL: https://issues.apache.org/jira/browse/ARROW-16658
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was surprised to find you can't use standard arithmetic operators on 
> PyArrow arrays and scalars. Instead, one must use the compute functions:
> {code:Python}
> import pyarrow as pa
> import pyarrow.compute as pc
> arr = pa.array([1, 2, 3])
> pc.add(arr, 2)
> # Doesn't work:
> # arr + 2
> # arr + pa.scalar(2)
> # arr + arr
> pc.multiply(arr, 20)
> # Doesn't work:
> # 20 * arr
> # pa.scalar(20) * arr
> {code}
> Is it intentional we don't support this?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16706) [Python] Expose RankOptions

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16706:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [Python] Expose RankOptions
> ---
>
> Key: ARROW-16706
> URL: https://issues.apache.org/jira/browse/ARROW-16706
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Critical
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Followup to ARROW-16234



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16770) Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16770:

Description: 
I built Arrow using the instructions in the Python development page, under the 
pyarrow-dev environment, and found that `arrow-substrait-substrait-test` fails 
with SIGSEGV - see gdb session below. The same Arrow builds and runs correctly 
on my system, outside of pyarrow-dev. I suspect this is due to something 
different about gtest 1.11.0 as compared to gtest 1.10.0 based on the following 
observations:
 # The backtrace in the gdb session shows gtest 1.11.0 is used.
 # The backtrace also shows the error is deep inside gtest, working on an 
`UnorderedElementsAre` expectation.
 # My system, outside pyarrow-dev, uses gtest 1.10.0.

 
{noformat}
$ gdb --args ./release/arrow-substrait-substrait-test 
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
    .
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./release/arrow-substrait-substrait-test...
(gdb) run
Starting program: 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x741ff700 (LWP 115128)]
Running main() from 
/home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
[==] Running 33 tests from 3 test suites.
[--] Global test environment set-up.
[--] 4 tests from ExtensionIdRegistryTest
[ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
[       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
[       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
[       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
[       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
[--] 4 tests from ExtensionIdRegistryTest (0 ms total)
[--] 21 tests from Substrait
[ RUN      ] Substrait.SupportedTypes
[       OK ] Substrait.SupportedTypes (0 ms)
[ RUN      ] Substrait.SupportedExtensionTypes
[       OK ] Substrait.SupportedExtensionTypes (0 ms)
[ RUN      ] Substrait.NamedStruct
[       OK ] Substrait.NamedStruct (0 ms)
[ RUN      ] Substrait.NoEquivalentArrowType
[       OK ] Substrait.NoEquivalentArrowType (0 ms)
[ RUN      ] Substrait.NoEquivalentSubstraitType
[       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
[ RUN      ] Substrait.SupportedLiterals
[       OK ] Substrait.SupportedLiterals (1 ms)
[ RUN      ] Substrait.CannotDeserializeLiteral
[       OK ] Substrait.CannotDeserializeLiteral (0 ms)
[ RUN      ] Substrait.FieldRefRoundTrip
[       OK ] Substrait.FieldRefRoundTrip (1 ms)
[ RUN      ] Substrait.RecursiveFieldRef
[       OK ] Substrait.RecursiveFieldRef (0 ms)
[ RUN      ] Substrait.FieldRefsInExpressions
[       OK ] Substrait.FieldRefsInExpressions (0 ms)
[ RUN      ] Substrait.CallSpecialCaseRoundTrip
[       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
[ RUN      ] Substrait.CallExtensionFunction
[       OK ] Substrait.CallExtensionFunction (0 ms)
[ RUN      ] Substrait.ReadRel
Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
1324          get() const noexcept
(gdb) bt
#0  0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
#1  
testing::internal::UnorderedElementsAreMatcherImpl, std::allocator >, 
std::allocator, 
std::allocator > > > 
const&>::AnalyzeElements<__gnu_cxx::__normal_iterator, std::allocator > const*, 
std::vector, 
std::allocator >, std::allocator, std::allocator > > > > > 
(listener=0x7fffb640, 
    

[jira] [Created] (ARROW-16770) Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16770:
---

 Summary: Arrow Substrait test fails with SIGSEGV, possibly due to 
gtest 1.11.0
 Key: ARROW-16770
 URL: https://issues.apache.org/jira/browse/ARROW-16770
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili


I built Arrow using the instructions in the Python development page, under the 
pyarrow-dev environment, and found that `arrow-substrait-substrait-test` fails 
with SIGSEGV - see gdb session below. The same Arrow builds and runs correctly 
on my system, outside of pyarrow-dev. I suspect this is due to something 
different about gtest 1.11.0 as compared to gtest 1.10.0 based on the following 
observations:
 # The backtrace in the gdb session shows gtest 1.11.0 is used.
 # The backtrace also shows the error is deep inside gtest, working on an 
`UnorderedElementsAre` expectation.
 # My system, outside pyarrow-dev, uses gtest 1.10.0.

 
{noformat}
$ gdb --args ./release/arrow-substrait-substrait-test 
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
    .
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./release/arrow-substrait-substrait-test...
(gdb) run
Starting program: 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x741ff700 (LWP 115128)]
Running main() from 
/home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
[==] Running 33 tests from 3 test suites.
[--] Global test environment set-up.
[--] 4 tests from ExtensionIdRegistryTest
[ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
[       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
[       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
[       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
[       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
[--] 4 tests from ExtensionIdRegistryTest (0 ms total)
[--] 21 tests from Substrait
[ RUN      ] Substrait.SupportedTypes
[       OK ] Substrait.SupportedTypes (0 ms)
[ RUN      ] Substrait.SupportedExtensionTypes
[       OK ] Substrait.SupportedExtensionTypes (0 ms)
[ RUN      ] Substrait.NamedStruct
[       OK ] Substrait.NamedStruct (0 ms)
[ RUN      ] Substrait.NoEquivalentArrowType
[       OK ] Substrait.NoEquivalentArrowType (0 ms)
[ RUN      ] Substrait.NoEquivalentSubstraitType
[       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
[ RUN      ] Substrait.SupportedLiterals
[       OK ] Substrait.SupportedLiterals (1 ms)
[ RUN      ] Substrait.CannotDeserializeLiteral
[       OK ] Substrait.CannotDeserializeLiteral (0 ms)
[ RUN      ] Substrait.FieldRefRoundTrip
[       OK ] Substrait.FieldRefRoundTrip (1 ms)
[ RUN      ] Substrait.RecursiveFieldRef
[       OK ] Substrait.RecursiveFieldRef (0 ms)
[ RUN      ] Substrait.FieldRefsInExpressions
[       OK ] Substrait.FieldRefsInExpressions (0 ms)
[ RUN      ] Substrait.CallSpecialCaseRoundTrip
[       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
[ RUN      ] Substrait.CallExtensionFunction
[       OK ] Substrait.CallExtensionFunction (0 ms)
[ RUN      ] Substrait.ReadRel
Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
1324          get() const noexcept
(gdb) bt
#0  0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
#1  
testing::internal::UnorderedElementsAreMatcherImpl, std::allocator >, 
std::allocator, 
std::allocator > > > 

[jira] [Commented] (ARROW-16658) [Python] Support arithmetic on arrays and scalars

2022-06-07 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550979#comment-17550979
 ] 

Alessandro Molina commented on ARROW-16658:
---

I think it's reasonable to add those operators to Arrays. In the end, we already 
provide those features; it's not as if we would be implementing something new.
Relying on operators to manipulate arrays is a widespread enough pattern in 
Python that it's a good idea to support it.
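
For illustration, a minimal Python sketch of how the delegation could look. This 
is not the eventual implementation (which would presumably live on 
{{pyarrow.Array}} itself); the {{ArithmeticArray}} wrapper is hypothetical, and 
only the {{pyarrow.compute}} calls are real:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

class ArithmeticArray:
    # Hypothetical wrapper: each operator simply delegates to the
    # corresponding pyarrow.compute function.
    def __init__(self, arr):
        self.arr = arr

    def _unwrap(self, other):
        return other.arr if isinstance(other, ArithmeticArray) else other

    def __add__(self, other):
        return ArithmeticArray(pc.add(self.arr, self._unwrap(other)))

    __radd__ = __add__

    def __mul__(self, other):
        return ArithmeticArray(pc.multiply(self.arr, self._unwrap(other)))

    __rmul__ = __mul__

arr = ArithmeticArray(pa.array([1, 2, 3]))
print((arr + 2).arr)    # same as pc.add(arr, 2)       -> [3, 4, 5]
print((20 * arr).arr)   # same as pc.multiply(arr, 20) -> [20, 40, 60]
{code}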

> [Python] Support arithmetic on arrays and scalars
> -
>
> Key: ARROW-16658
> URL: https://issues.apache.org/jira/browse/ARROW-16658
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was surprised to find you can't use standard arithmetic operators on 
> PyArrow arrays and scalars. Instead, one must use the compute functions:
> {code:Python}
> import pyarrow as pa
> import pyarrow.compute as pc
> arr = pa.array([1, 2, 3])
> pc.add(arr, 2)
> # Doesn't work:
> # arr + 2
> # arr + pa.scalar(2)
> # arr + arr
> pc.multiply(arr, 20)
> # Doesn't work:
> # 20 * arr
> # pa.scalar(20) * arr
> {code}
> Is it intentional we don't support this?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-13370) [R] More special handling for known errors in arrow_eval

2022-06-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13370:
---
Description: We have special handling in arrow_eval that looks for the "not 
supported in Arrow" error, and when that's found it shows the error message 
rather than swallowing it in an "Expression not supported" message. But we have 
other error messages we raise in nse_funcs that are worth showing--bad input 
etc. Use a sentinel error message that we can also detect and subclass as 
"arrow-try-error" like the others, or (better) raise a classed exception (if 
that's supported in all versions of R we support).   (was: We have special 
handling in arrow_eval that looks for the "not supported in Arrow" error, and 
when that's found it shows the error message rather than swallowing it in an 
"Expression not supported" message. But we have other error messages we raise 
in nse_funcs that are worth showing--bad input etc. Use a sentinel error 
message that we can also detect and subclass as "arrow-try-error" like the 
others, or (better) raised a classed exception (if that's supported in all 
versions of R we support). )

> [R] More special handling for known errors in arrow_eval
> 
>
> Key: ARROW-13370
> URL: https://issues.apache.org/jira/browse/ARROW-13370
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.0
>
>
> We have special handling in arrow_eval that looks for the "not supported in 
> Arrow" error, and when that's found it shows the error message rather than 
> swallowing it in an "Expression not supported" message. But we have other 
> error messages we raise in nse_funcs that are worth showing--bad input etc. 
> Use a sentinel error message that we can also detect and subclass as 
> "arrow-try-error" like the others, or (better) raise a classed exception (if 
> that's supported in all versions of R we support). 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16243) [C++][Python] Remove Parquet ReadSchemaField method

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16243:
--

Assignee:  Alexandre de Siqueira

> [C++][Python] Remove Parquet ReadSchemaField method
> ---
>
> Key: ARROW-16243
> URL: https://issues.apache.org/jira/browse/ARROW-16243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 7.0.0
>Reporter: Will Jones
>Assignee:  Alexandre de Siqueira
>Priority: Minor
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It doesn't seem like the experimental {{ReadSchemaField()}} method does 
> anything different from {{ReadColumn()}} at this point. We should remove it 
> and its corresponding Python method.
> https://github.com/apache/arrow/blob/cedb4f8112b9c622dad88e0b6e8e0600f7e52746/cpp/src/parquet/arrow/reader.h#L143-L156



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16243) [C++][Python] Remove Parquet ReadSchemaField method

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16243.

Resolution: Fixed

Issue resolved by pull request 13060
[https://github.com/apache/arrow/pull/13060]

> [C++][Python] Remove Parquet ReadSchemaField method
> ---
>
> Key: ARROW-16243
> URL: https://issues.apache.org/jira/browse/ARROW-16243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 7.0.0
>Reporter: Will Jones
>Priority: Minor
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It doesn't seem like the experimental {{ReadSchemaField()}} method does 
> anything different from {{ReadColumn()}} at this point. We should remove it 
> and its corresponding Python method.
> https://github.com/apache/arrow/blob/cedb4f8112b9c622dad88e0b6e8e0600f7e52746/cpp/src/parquet/arrow/reader.h#L143-L156



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16415) [R] Update strptime bindings to use tz

2022-06-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16415.

Resolution: Fixed

Issue resolved by pull request 13190
[https://github.com/apache/arrow/pull/13190]

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> {{strptime}} mentions it does not support {{tz}}, the timezone argument. 
> ARROW-12820 has been addressed, so the binding definition needs updating.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16769) [C++] Add Status::Warn()

2022-06-07 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550961#comment-17550961
 ] 

David Li commented on ARROW-16769:
--

This would also be useful with the new RecordBatchReader::Close, cc [~vibhatha] 
/ [~westonpace] 

> [C++] Add Status::Warn()
> 
>
> Key: ARROW-16769
> URL: https://issues.apache.org/jira/browse/ARROW-16769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: good-first-issue
> Fix For: 9.0.0
>
>
> We currently have {{Status::Abort()}} which gives an easy way to abort the 
> process with a meaningful message and detail.
> We should similarly add {{Status::Warn()}} that would simply print a warning 
> message describing the error. Possible example use at 
> https://github.com/apache/arrow/pull/13315/files#diff-1256864b34a1b43082596ab5b16881702881ad06be8e1c157b47e1e6ac9ff5d2R160-R164
>  (together with {{StatusFromErrno}}).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16769) [C++] Add Status::Warn()

2022-06-07 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16769:
--

 Summary: [C++] Add Status::Warn()
 Key: ARROW-16769
 URL: https://issues.apache.org/jira/browse/ARROW-16769
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 9.0.0


We currently have {{Status::Abort()}} which gives an easy way to abort the 
process with a meaningful message and detail.

We should similarly add {{Status::Warn()}} that would simply print a warning 
message describing the error. Possible example use at 
https://github.com/apache/arrow/pull/13315/files#diff-1256864b34a1b43082596ab5b16881702881ad06be8e1c157b47e1e6ac9ff5d2R160-R164
 (together with {{StatusFromErrno}}).




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16769) [C++] Add Status::Warn()

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16769:
---
Labels: good-first-issue  (was: )

> [C++] Add Status::Warn()
> 
>
> Key: ARROW-16769
> URL: https://issues.apache.org/jira/browse/ARROW-16769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: good-first-issue
> Fix For: 9.0.0
>
>
> We currently have {{Status::Abort()}} which gives an easy way to abort the 
> process with a meaningful message and detail.
> We should similarly add {{Status::Warn()}} that would simply print a warning 
> message describing the error. Possible example use at 
> https://github.com/apache/arrow/pull/13315/files#diff-1256864b34a1b43082596ab5b16881702881ad06be8e1c157b47e1e6ac9ff5d2R160-R164
>  (together with {{StatusFromErrno}}).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16726) [Python] Setuptools warnings about installing packages as data

2022-06-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-16726.
---
Resolution: Fixed

Issue resolved by pull request 13309
[https://github.com/apache/arrow/pull/13309]

> [Python] Setuptools warnings about installing packages as data
> --
>
> Key: ARROW-16726
> URL: https://issues.apache.org/jira/browse/ARROW-16726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> These warnings have started appearing in some builds (such as when running 
> {{archery docker run conda-python-docs}}):
> {code}
> SetuptoolsDeprecationWarning: Installing 'pyarrow.includes' as data is 
> deprecated, please list it in `packages`.
>   !!
>   
>   # Package would be ignored #
>   
>   Python recognizes 'pyarrow.includes' as an importable package, however 
> it is
>   included in the distribution as "data".
>   This behavior is likely to change in future versions of setuptools (and
>   therefore is considered deprecated).
>   Please make sure that 'pyarrow.includes' is included as a package by 
> using
>   setuptools' `packages` configuration field or the proper discovery 
> methods
>   (for example by using `find_namespace_packages(...)`/`find_namespace:`
>   instead of `find_packages(...)`/`find:`).
>   You can read more about "package discovery" and "data files" on 
> setuptools
>   documentation page.
> {code}
> We should probably fix them before something really breaks.
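
For context, a minimal sketch of the change the warning itself suggests. The 
arguments below just mirror the warning's advice and are not necessarily the 
exact change made in the pull request:

{code:python}
# setup.py (sketch): let setuptools discover pyarrow.includes as a real
# package instead of shipping it as "data".
from setuptools import setup, find_namespace_packages

setup(
    name="pyarrow",
    packages=find_namespace_packages(include=["pyarrow", "pyarrow.*"]),
)
{code}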



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16546) [Python] Pyarrow fails to loads parquet file with long column names

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16546.

Resolution: Fixed

Issue resolved by pull request 13275
[https://github.com/apache/arrow/pull/13275]

> [Python] Pyarrow fails to loads parquet file with long column names
> ---
>
> Key: ARROW-16546
> URL: https://issues.apache.org/jira/browse/ARROW-16546
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 8.0.0
> Environment: Ubuntu 20.04, pandas 1.4.2
>Reporter: Boris Urman
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
> Attachments: Screenshot from 2022-05-12 16-59-10.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When loading the parquet file, "OSError: Couldn't deserialize thrift: 
> TProtocolException: Exceeded size limit" is raised. This seems to be related 
> to the memory usage of the table header, and the issue may be coming from the 
> C++ code. Also, pyarrow 0.16 is able to read that parquet file.
> Below is a code snippet to reproduce the issue. A screenshot of a Jupyter 
> notebook with more details is attached.
> The snippet creates two pandas dataframes that differ only in their column 
> names. The one with short column names is stored and read without problem, 
> while the one with long column names is stored successfully but raises an 
> exception during reading.
> {code:python}
> import pandas as pd
> import numpy as np
>
> data = np.random.randn(10, 25)
> index = range(10)
> short_column_names = [f"col_{i}" for i in range(25)]
> long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(25)]
>
> # Identical dataframes, only the column names are different
> df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
> df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)
>
> # Storing the dataframe with long column names works OK but reading fails
> df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
> df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16706) [Python] Expose RankOptions

2022-06-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-16706:
-

Assignee: Raúl Cumplido

> [Python] Expose RankOptions
> ---
>
> Key: ARROW-16706
> URL: https://issues.apache.org/jira/browse/ARROW-16706
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Critical
>  Labels: good-first-issue
> Fix For: 9.0.0
>
>
> Follow-up to ARROW-16234



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16768) Factor variables in R with missing values cause an error for write_parquet

2022-06-07 Thread Kieran Martin (Jira)
Kieran Martin created ARROW-16768:
-

 Summary: Factor variables in R with missing values cause an error 
for write_parquet
 Key: ARROW-16768
 URL: https://issues.apache.org/jira/browse/ARROW-16768
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 7.0.0
Reporter: Kieran Martin


If you try to write a data frame containing a factor with a missing value to 
parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
containing nulls".

This seems likely to be due to how the metadata for factors is currently 
captured in parquet files. Reprex follows:

library(arrow)

bad_data <- data.frame(A = factor(c(1, 2, NA)))

write_parquet(bad_data, tempfile())



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16767) [Archery] Refactor archery.release submodule to its own subpackage

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16767:
---
Labels: pull-request-available  (was: )

> [Archery] Refactor archery.release submodule to its own subpackage
> --
>
> Key: ARROW-16767
> URL: https://issues.apache.org/jira/browse/ARROW-16767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16767) [Archery] Refactor archery.release submodule to its own subpackage

2022-06-07 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-16767:
---

 Summary: [Archery] Refactor archery.release submodule to its own 
subpackage
 Key: ARROW-16767
 URL: https://issues.apache.org/jira/browse/ARROW-16767
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery
Reporter: Krisztian Szucs
 Fix For: 9.0.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16767) [Archery] Refactor archery.release submodule to its own subpackage

2022-06-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-16767:
---

Assignee: Krisztian Szucs

> [Archery] Refactor archery.release submodule to its own subpackage
> --
>
> Key: ARROW-16767
> URL: https://issues.apache.org/jira/browse/ARROW-16767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 9.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16663) [Release][Dev] Add flag to archery release curate to only show minimal information

2022-06-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-16663.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

> [Release][Dev] Add flag to archery release curate to only show minimal 
> information
> --
>
> Key: ARROW-16663
> URL: https://issues.apache.org/jira/browse/ARROW-16663
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Currently {{archery release curate}} shows a lot of information that is not 
> relevant, such as the tickets that are correctly assigned. Add a new flag to 
> show only the information that requires manual fixing.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-13160) [CI][C++] Use binary caching for vcpkg builds

2022-06-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-13160:


Assignee: Jacob Wujciak-Jens  (was: Kouhei Sutou)

> [CI][C++] Use binary caching for vcpkg builds
> -
>
> Key: ARROW-13160
> URL: https://issues.apache.org/jira/browse/ARROW-13160
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>
> Currently, the vcpkg CI builds ({{test-build-vcpkg-win}}) take 2 hours.
> We should try to enable binary caching: 
> https://github.com/microsoft/vcpkg/blob/master/docs/users/binarycaching.md



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16721) [C++] Drop support for bundled Thrift < 0.13

2022-06-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16721.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13292
[https://github.com/apache/arrow/pull/13292]

> [C++] Drop support for bundled Thrift < 0.13
> 
>
> Key: ARROW-16721
> URL: https://issues.apache.org/jira/browse/ARROW-16721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We bundle Thrift 0.16.0. Users can use another version, but 0.13 or earlier 
> will no longer be supported.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16709) [Docs][Python] Add how to run doctests to the developer guide

2022-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16709:
---
Labels: pull-request-available  (was: )

> [Docs][Python] Add how to run doctests to the developer guide
> -
>
> Key: ARROW-16709
> URL: https://issues.apache.org/jira/browse/ARROW-16709
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Raúl Cumplido
>Assignee: Alenka Frim
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have had some doctest failures on CI and had to search through the 
> docker-compose file to find out how to run doctests:
> {code}
> --doctest-modules --doctest-cython
> {code}
> It would be a nice addition to our Python developer guide to explain how to 
> run these tests.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16762) [C++] Implement saturation arithmetic kernels

2022-06-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550835#comment-17550835
 ] 

Antoine Pitrou commented on ARROW-16762:


Saturated arithmetic would probably use dedicated kernels for performance.
I agree it does not make sense to add them unless there's a real-world use case.
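
For illustration, the semantics could be emulated today from Python by computing 
in a wider type and clamping to the narrow type's range; a dedicated kernel 
would avoid the extra allocations and casts. The helper name below is 
hypothetical, but the {{pyarrow.compute}} functions are real:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

def add_saturated_int8(a, b):
    # Sketch: add in int16 (cannot overflow for int8 inputs), then clamp
    # to the int8 range [-128, 127] before casting back.
    wide = pc.add(pc.cast(a, pa.int16()), pc.cast(b, pa.int16()))
    clamped = pc.min_element_wise(pc.max_element_wise(wide, -128), 127)
    return pc.cast(clamped, pa.int8())

a = pa.array([120, -120, 5], type=pa.int8())
b = pa.array([10, -10, 5], type=pa.int8())
print(add_saturated_int8(a, b))  # -> [127, -128, 10]
{code}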

> [C++] Implement saturation arithmetic kernels
> -
>
> Key: ARROW-16762
> URL: https://issues.apache.org/jira/browse/ARROW-16762
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Priority: Minor
>
> The Arrow C++ library currently lacks saturation arithmetic kernels. I do not 
> think it is especially important to add them, but I am curious how 
> straightforward it might be. For example could they be implemented by 
> extending the {{_checked}} arithmetic kernels to include an option to emit 
> the min/max of the type when overflow is detected (in each position of the 
> array)?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16761) [C++][Python] Track bytes_written on FileWriter / WrittenFile

2022-06-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550834#comment-17550834
 ] 

Antoine Pitrou commented on ARROW-16761:


{{OutputStream::Tell()}} would work, yes.
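
A rough Python sketch of why that works, using an in-memory stream; the same 
position bookkeeping would back the proposed {{bytes_written}} field:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
sink = pa.BufferOutputStream()
writer = pq.ParquetWriter(sink, table.schema)
writer.write_table(table)
print(sink.tell())           # bytes written so far (footer not yet written)
writer.close()
print(sink.getvalue().size)  # final file size in bytes, including the footer
{code}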

> [C++][Python] Track bytes_written on FileWriter / WrittenFile
> -
>
> Key: ARROW-16761
> URL: https://issues.apache.org/jira/browse/ARROW-16761
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> For Apache Iceberg and Delta Lake tables, we need to be able to get the size 
> of the files written in bytes. In Iceberg, this is the required field 
> {{file_size_in_bytes}} ([docs|https://iceberg.apache.org/spec/#manifests]). 
> In Delta, this is the required field {{size}} as part of the Add action.
> I think this could be exposed on 
> [FileWriter|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/cpp/src/arrow/dataset/file_base.h#L305]
>  and then through that 
> [WrittenFile|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/python/pyarrow/_dataset.pyx#L766-L769].
>  But lower-level than that I'm not yet sure. {{FileWriter}} owns its 
> {{OutputStream}}; would {{OutputStream::Tell()}} give the correct count?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)