[jira] [Commented] (ARROW-8679) [Python] supporting pandas sparse series in pyarrow

2022-04-26 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528587#comment-17528587
 ] 

Joris Van den Bossche commented on ARROW-8679:
--

With the current Arrow data types, we don't really have support for sparse 
data, so there is no direct way to support conversion from/to pandas sparse 
Series (except for converting to dense).

 There has been some discussion in the past about extending the Arrow spec to 
cover sparse/compressed data (e.g. run-length encoding), but no one has yet 
started on a full proposal.

> [Python] supporting pandas sparse series in pyarrow
> ---
>
> Key: ARROW-8679
> URL: https://issues.apache.org/jira/browse/ARROW-8679
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
> Environment: ubuntu 16/18
>Reporter: Michael Novitsky
>Priority: Major
> Fix For: 0.17.0
>
>
> I've seen that pandas sparse Series were not supported in pyarrow since they 
> were planned to be deprecated. In pandas 1.0.1 a stable version of sparse 
> arrays was released, and as far as I know deprecation is no longer planned. 
> Are you planning to support sparse Series in future versions of pyarrow?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-602) [C++] Provide iterator access to primitive elements inside a Column/ChunkedArray

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-602:
-
Labels: beginner good-first-issue newbie pull-request-available  (was: 
beginner good-first-issue newbie)

> [C++] Provide iterator access to primitive elements inside a 
> Column/ChunkedArray
> 
>
> Key: ARROW-602
> URL: https://issues.apache.org/jira/browse/ARROW-602
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe Korn
>Assignee: Alvin Chunga Mamani
>Priority: Major
>  Labels: beginner, good-first-issue, newbie, 
> pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Given a ChunkedArray, an Arrow user must currently iterate over all its 
> chunks and then cast them to their types to extract the primitive memory 
> regions to access the values. A convenient way to access the underlying 
> values would be to offer a function that takes a ChunkedArray and returns a 
> C++ iterator over all elements.
> While this may not be the most performant way to access the underlying data, 
> it should have sufficient performance and adds a convenience layer for new 
> users.





[jira] [Commented] (ARROW-16355) [Dev] verify-release-build.sh uses only one thread to build cpp source

2022-04-26 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528514#comment-17528514
 ] 

Yibo Cai commented on ARROW-16355:
--

Or do we prefer setting the {{CMAKE_BUILD_PARALLEL_LEVEL}} environment variable?

> [Dev] verify-release-build.sh uses only one thread to build cpp source
> --
>
> Key: ARROW-16355
> URL: https://issues.apache.org/jira/browse/ARROW-16355
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{verify-release-build.sh}} uses only one thread to build the C++ source, 
> which is very slow.





[jira] [Commented] (ARROW-16320) Dataset re-partitioning consumes considerable amount of memory

2022-04-26 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528511#comment-17528511
 ] 

Weston Pace commented on ARROW-16320:
-

What do you get from the following?

{noformat}
a = arrow::read_parquet(here::here("db", "large_parquet", "part-0.parquet"))
a$nbytes()
{noformat}

The {{nbytes}} function should print a pretty decent approximation of the C 
memory referenced by {{a}}.

{{lobstr::obj_size}} prints only the R memory used (I think).

{{fs::file_size}} is going to give you the size of the file, which is possibly 
encoded and compressed.  Some parquet files can be much larger in memory than 
they are on disk.  So it is not unheard of for a 620MB parquet file to end up 
occupying gigabytes in memory (11 GB seems a little extreme, but within the 
realm of possibility).

> Dataset re-partitioning consumes considerable amount of memory
> --
>
> Key: ARROW-16320
> URL: https://issues.apache.org/jira/browse/ARROW-16320
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 7.0.0
>Reporter: Zsolt Kegyes-Brassai
>Priority: Minor
> Attachments: Rgui_mem.jpg, Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv 
> files (a couple of hundred). In a first step the csv files were parsed and 
> saved to parquet files because there were many inconsistencies between them. 
> In a subsequent step the dataset was re-partitioned using one column 
> (code_key).
>  
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for a 
> dataset re-partitioning? (it’s important when scaling up this approach)
> The drawback is that this memory space is not freed up after the 
> re-partitioning (I am using RStudio). 
> {{gc()}} is useless in this situation, and there is no associated object 
> (tied to the re-partitioning) in the {{R}} environment that could be removed 
> from memory with the {{rm()}} function.
>  * How one can regain this memory space used by re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding 
> is correct, appending is not supported in the current arrow version when 
> writing parquet files/datasets. (The original csv files were partly 
> partitioned according to a different variable.)
> Can you recommend any better approach?





[jira] [Updated] (ARROW-16355) [Dev] verify-release-build.sh uses only one thread to build cpp source

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16355:
---
Labels: pull-request-available  (was: )

> [Dev] verify-release-build.sh uses only one thread to build cpp source
> --
>
> Key: ARROW-16355
> URL: https://issues.apache.org/jira/browse/ARROW-16355
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{verify-release-build.sh}} uses only one thread to build the C++ source, 
> which is very slow.





[jira] [Created] (ARROW-16355) [Dev] verify-release-build.sh uses only one thread to build cpp source

2022-04-26 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-16355:


 Summary: [Dev] verify-release-build.sh uses only one thread to 
build cpp source
 Key: ARROW-16355
 URL: https://issues.apache.org/jira/browse/ARROW-16355
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Yibo Cai
Assignee: Yibo Cai


{{verify-release-build.sh}} uses only one thread to build the C++ source, which 
is very slow.





[jira] [Resolved] (ARROW-16349) [Release][Packaging][RPM] Remove ed25519 keys from KEYS

2022-04-26 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16349.
--
Resolution: Fixed

Issue resolved by pull request 13002
[https://github.com/apache/arrow/pull/13002]

> [Release][Packaging][RPM] Remove ed25519 keys from KEYS
> ---
>
> Key: ARROW-16349
> URL: https://issues.apache.org/jira/browse/ARROW-16349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It seems that AlmaLinux 8 doesn't support ed25519. If we have ed25519 keys, 
> the GPG key import fails:
> https://github.com/ursacomputing/crossbow/runs/6177081003?check_suite_focus=true#step:7:355
> {noformat}
> Importing GPG key 0x717D3FB2:
>  Userid : "Neville Dipale "
>  Fingerprint: 3905 F254 F9E5 04B4 0FFF 6CF6 0004 88D7 717D 3FB2
>  From   : /etc/pki/rpm-gpg/RPM-GPG-KEY-Apache-Arrow
> Extra Packages for Enterprise Linux 8 - x86_64  1.6 MB/s | 1.6 kB 00:00   
>  
> Importing GPG key 0x2F86D6A1:
> {noformat}
> "Key imported successfully" isn't shown for "Importing GPG key 0x717D3FB2:".





[jira] [Resolved] (ARROW-16352) [GLib] enums.h are installed in wrong location

2022-04-26 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16352.
--
Resolution: Fixed

Issue resolved by pull request 13006
[https://github.com/apache/arrow/pull/13006]

> [GLib] enums.h are installed in wrong location
> --
>
> Key: ARROW-16352
> URL: https://issues.apache.org/jira/browse/ARROW-16352
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-16351) [C++][Python] Implement seek() for BufferedInputStream

2022-04-26 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528489#comment-17528489
 ] 

Yibo Cai commented on ARROW-16351:
--

BufferedInputStream wraps an InputStream, which implements only the Readable 
interface, not Seekable. In general I think that's reasonable, as 
BufferedInputStream is only suitable for sequential reads, not random access.
cc [~apitrou]

> [C++][Python] Implement seek() for BufferedInputStream
> --
>
> Key: ARROW-16351
> URL: https://issues.apache.org/jira/browse/ARROW-16351
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 7.0.0
>Reporter: Frank Luan
>Priority: Major
>
> I would like to use seek() in a buffered input stream for the following usage 
> scenario:
>  * Open an S3 file (e.g. 1 GB)
>  * Jump to an offset (e.g. skip 500MB)
>  * Do a bunch of small (8 bytes) reads
> This way I get the performance of buffered input by avoiding lots of small 
> reads (which are expensive and slow on S3) while still being able to seek to 
> a position.
> Currently I need to hack around it using a mix of RandomAccessFile and 
> BufferedInputStream, like:
> {code:python}
> with _fs.open_input_file(url) as f:
>     f.seek(offset)
>     f = fs._wrap_input_stream(f, url, None, self._buffer_size)
>     x = f.read(8)
> {code}
> I'm wondering if there is any fundamental reason why seek() is not 
> implemented for the buffered input stream? It looks like .NET implements it: 
> [https://docs.microsoft.com/en-us/dotnet/api/system.io.bufferedstream.seek?view=net-6.0]
> Or, what I actually need is to open an S3 file at an offset. Would this be 
> easier to do, or is it already supported in the current API?





[jira] [Updated] (ARROW-16354) [Packaging][RPM] Artifacts pattern list is outdated

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16354:
---
Labels: pull-request-available  (was: )

> [Packaging][RPM] Artifacts pattern list is outdated
> ---
>
> Key: ARROW-16354
> URL: https://issues.apache.org/jira/browse/ARROW-16354
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow up of ARROW-15631.





[jira] [Created] (ARROW-16354) [Packaging][RPM] Artifacts pattern list is outdated

2022-04-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16354:


 Summary: [Packaging][RPM] Artifacts pattern list is outdated
 Key: ARROW-16354
 URL: https://issues.apache.org/jira/browse/ARROW-16354
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This is a follow up of ARROW-15631.





[jira] [Assigned] (ARROW-16329) [Java][C++] Keep more context when marshalling errors through JNI

2022-04-26 Thread Hongze Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongze Zhang reassigned ARROW-16329:


Assignee: Hongze Zhang

> [Java][C++] Keep more context when marshalling errors through JNI
> -
>
> Key: ARROW-16329
> URL: https://issues.apache.org/jira/browse/ARROW-16329
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Antoine Pitrou
>Assignee: Hongze Zhang
>Priority: Major
> Fix For: 9.0.0
>
>
> When errors are propagated through the JNI barrier, two mechanisms are 
> involved:
> * the {{Status CheckException(JNIEnv* env)}} function for Java-to-C++ error 
> translation
> * the {{JniAssertOkOrThrow(arrow::Status status)}} and {{T 
> JniGetOrThrow(arrow::Result result)}} functions for C++-to-Java error 
> translation
> Currently, both mechanisms lose most context about the original error, such 
> as its type and any additional state, such as the optional {{StatusDetail}} 
> in C++ or any properties in Java (which I'm sure exist on some exception 
> classes).
> We should improve these mechanisms to retain as much context as possible. For 
> example, in a hypothetical Java-to-C++-to-Java error propagation scenario, 
> the original Java exception from inner code should ideally be re-thrown in 
> the outer Java context (we already support this in Python btw).





[jira] [Updated] (ARROW-16353) [Wiki] Release verification howto is obsolete

2022-04-26 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-16353:
-
Issue Type: Improvement  (was: Bug)

> [Wiki] Release verification howto is obsolete
> -
>
> Key: ARROW-16353
> URL: https://issues.apache.org/jira/browse/ARROW-16353
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Yibo Cai
>Priority: Major
>
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates#HowtoVerifyReleaseCandidates-LinuxandmacOS
> The example commands are wrong.
> E.g., {code:bash}TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1 
> verify-release-candidate.sh source $VERSION $RC_NUM{code}
> should be changed to
> {code:bash}TEST_DEFAULT=0 TEST_CPP=1 verify-release-candidate.sh $VERSION 
> $RC_NUM{code}





[jira] [Updated] (ARROW-15461) [C++] arrow-utility-test fails with clang-12 (TestCopyAndReverseBitmapPreAllocated)

2022-04-26 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-15461:
-
Issue Type: Bug  (was: Improvement)

> [C++] arrow-utility-test fails with clang-12 
> (TestCopyAndReverseBitmapPreAllocated)
> ---
>
> Key: ARROW-15461
> URL: https://issues.apache.org/jira/browse/ARROW-15461
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Unit test {{BitUtilTests.TestCopyAndReverseBitmapPreAllocated}} fails when 
> Arrow is built in release mode with clang-12, on both x86 and Arm.
> From my debugging, it's related to the {{GetReversedBlock}} function [1], 
> when right-shifting a uint8 value by 8 bits.
> I think it's a compiler bug. For the test code [2], clang-12 returns 1, 
> which is wrong; clang-11 and clang-13 both return 2, the correct answer. It 
> looks like clang-12 over-optimized the code; there should be no UB in the 
> code (uint8 is promoted to int before the shift).
> A workaround is to treat shifting by 8 bits as a special case. Or we can 
> simply ignore this error if the compiler bug is confirmed (I didn't find a 
> clang bug report).
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.cc#L101
> [2] https://godbolt.org/z/TzYWfcP1E





[jira] [Updated] (ARROW-15461) [C++] arrow-utility-test fails with clang-12 (TestCopyAndReverseBitmapPreAllocated)

2022-04-26 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-15461:
-
Issue Type: Improvement  (was: Bug)

> [C++] arrow-utility-test fails with clang-12 
> (TestCopyAndReverseBitmapPreAllocated)
> ---
>
> Key: ARROW-15461
> URL: https://issues.apache.org/jira/browse/ARROW-15461
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Unit test {{BitUtilTests.TestCopyAndReverseBitmapPreAllocated}} fails when 
> Arrow is built in release mode with clang-12, on both x86 and Arm.
> From my debugging, it's related to the {{GetReversedBlock}} function [1], 
> when right-shifting a uint8 value by 8 bits.
> I think it's a compiler bug. For the test code [2], clang-12 returns 1, 
> which is wrong; clang-11 and clang-13 both return 2, the correct answer. It 
> looks like clang-12 over-optimized the code; there should be no UB in the 
> code (uint8 is promoted to int before the shift).
> A workaround is to treat shifting by 8 bits as a special case. Or we can 
> simply ignore this error if the compiler bug is confirmed (I didn't find a 
> clang bug report).
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.cc#L101
> [2] https://godbolt.org/z/TzYWfcP1E





[jira] [Created] (ARROW-16353) [Wiki] Release verification howto is obsolete

2022-04-26 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-16353:


 Summary: [Wiki] Release verification howto is obsolete
 Key: ARROW-16353
 URL: https://issues.apache.org/jira/browse/ARROW-16353
 Project: Apache Arrow
  Issue Type: Bug
  Components: Wiki
Reporter: Yibo Cai


https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates#HowtoVerifyReleaseCandidates-LinuxandmacOS

The example commands are wrong.
E.g., {code:bash}TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1 
verify-release-candidate.sh source $VERSION $RC_NUM{code}
should be changed to
{code:bash}TEST_DEFAULT=0 TEST_CPP=1 verify-release-candidate.sh $VERSION 
$RC_NUM{code}





[jira] [Updated] (ARROW-16352) [GLib] enums.h are installed in wrong location

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16352:
---
Labels: pull-request-available  (was: )

> [GLib] enums.h are installed in wrong location
> --
>
> Key: ARROW-16352
> URL: https://issues.apache.org/jira/browse/ARROW-16352
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-16351) [C++][Python] Implement seek() for BufferedInputStream

2022-04-26 Thread Frank Luan (Jira)
Frank Luan created ARROW-16351:
--

 Summary: [C++][Python] Implement seek() for BufferedInputStream
 Key: ARROW-16351
 URL: https://issues.apache.org/jira/browse/ARROW-16351
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Affects Versions: 7.0.0
Reporter: Frank Luan


I would like to use seek() in a buffered input stream for the following usage 
scenario:
 * Open an S3 file (e.g. 1 GB)
 * Jump to an offset (e.g. skip 500MB)
 * Do a bunch of small (8 bytes) reads

This way I get the performance of buffered input by avoiding lots of small 
reads (which are expensive and slow on S3) while still being able to seek to a 
position.

Currently I need to hack around it using a mix of RandomAccessFile and 
BufferedInputStream, like:

{code:python}
with _fs.open_input_file(url) as f:
    f.seek(offset)
    f = fs._wrap_input_stream(f, url, None, self._buffer_size)
    x = f.read(8)
{code}

I'm wondering if there is any fundamental reason why seek() is not implemented 
for the buffered input stream? It looks like .NET implements it: 
[https://docs.microsoft.com/en-us/dotnet/api/system.io.bufferedstream.seek?view=net-6.0]

Or, what I actually need is to open an S3 file at an offset. Would this be 
easier to do, or is it already supported in the current API?





[jira] [Created] (ARROW-16352) [GLib] enums.h are installed in wrong location

2022-04-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16352:


 Summary: [GLib] enums.h are installed in wrong location
 Key: ARROW-16352
 URL: https://issues.apache.org/jira/browse/ARROW-16352
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 8.0.0








[jira] [Updated] (ARROW-16350) [Dev][Archery] Add missing newline in error message comment

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16350:
---
Labels: pull-request-available  (was: )

> [Dev][Archery] Add missing newline in error message comment
> ---
>
> Key: ARROW-16350
> URL: https://issues.apache.org/jira/browse/ARROW-16350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-16350) [Dev][Archery] Add missing newline in error message comment

2022-04-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16350:


 Summary: [Dev][Archery] Add missing newline in error message 
comment
 Key: ARROW-16350
 URL: https://issues.apache.org/jira/browse/ARROW-16350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Updated] (ARROW-16276) [R] Release News

2022-04-26 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-16276:
---
Component/s: R

> [R] Release News
> 
>
> Key: ARROW-16276
> URL: https://issues.apache.org/jira/browse/ARROW-16276
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I typically use a command like:
> {code}
> git log fcab481 --grep=".*\[R\].*" --format="%s"
> {code}
> This finds all the commits tagged {{[R]}} since commit fcab481. I found 
> commit fcab481 by going to the 7.0.0 release branch and then finding the last 
> commit that is in the master branch as well as in the 7.0.0 release. 





[jira] [Updated] (ARROW-16276) [R] Release News

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16276:
---
Labels: pull-request-available  (was: )

> [R] Release News
> 
>
> Key: ARROW-16276
> URL: https://issues.apache.org/jira/browse/ARROW-16276
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jonathan Keane
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I typically use a command like:
> {code}
> git log fcab481 --grep=".*\[R\].*" --format="%s"
> {code}
> This finds all the commits tagged {{[R]}} since commit fcab481. I found 
> commit fcab481 by going to the 7.0.0 release branch and then finding the last 
> commit that is in the master branch as well as in the 7.0.0 release. 





[jira] [Updated] (ARROW-16272) [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16272:
---
Labels: S3FileSystem csv pandas pull-request-available s3  (was: 
S3FileSystem csv pandas s3)

> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used 
> with `pd.read_csv`
> 
>
> Key: ARROW-16272
> URL: https://issues.apache.org/jira/browse/ARROW-16272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 4.0.1, 5.0.0, 7.0.0
> Environment: MacOS 12.1
> MacBook Pro
> Intel x86
>Reporter: Sahil Gupta
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: S3FileSystem, csv, pandas, pull-request-available, s3
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> `pyarrow.fs.S3FileSystem.open_input_file` and 
> `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used 
> with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     
> "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         
> "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0003612041473388672
> Time to create fhandler:  0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> Getting similar performance with `fs.open_input_stream` as well (commented 
> out in the code).
> {code}
> Running...
> Time to create fs:  0.0002570152282714844
> Time to create fhandler:  0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's 
> much faster:
> {code:python}
> import pandas as pd
> import time
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         
> "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> read time: 1.1012001037597656
> total time: 1.101264238357544
> {code}
> Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches s3fs 
> performance:
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> from fsspec.implementations.arrow import ArrowFSWrapper
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = ArrowFSWrapper(
>         S3FileSystem(
>             anonymous=True,
>             region="us-east-2",
>             endpoint_override=None,
>             proxy_options=None,
>         )
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     fhandler = fs._open(
>         
> "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0002467632293701172
> Time to create fhandler:  0.1858382225036621
> read time: 0.13701486587524414
> total time: 0.3232450485229492
> {code}
> Packages:
> {code}
> pyarrow=7.0.0
> pandas : 1.4.2
> numpy : 1.20.3
> {code}
> I tested it with 4.0.1, 5.0.0 as well and saw similar results.





[jira] [Updated] (ARROW-16332) [Release] Java jars verification pass despite binaries not being uploaded

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16332:
---
Labels: pull-request-available  (was: )

> [Release] Java jars verification pass despite binaries not being uploaded
> -
>
> Key: ARROW-16332
> URL: https://issues.apache.org/jira/browse/ARROW-16332
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See results at 
> https://github.com/apache/arrow/pull/12991#issuecomment-1109525407





[jira] [Assigned] (ARROW-16332) [Release] Java jars verification pass despite binaries not being uploaded

2022-04-26 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16332:


Assignee: Kouhei Sutou

> [Release] Java jars verification pass despite binaries not being uploaded
> -
>
> Key: ARROW-16332
> URL: https://issues.apache.org/jira/browse/ARROW-16332
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 9.0.0
>
>
> See results at 
> https://github.com/apache/arrow/pull/12991#issuecomment-1109525407



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16349) [Release][Packaging][RPM] Remove ed25519 keys from KEYS

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16349:
---
Labels: pull-request-available  (was: )

> [Release][Packaging][RPM] Remove ed25519 keys from KEYS
> ---
>
> Key: ARROW-16349
> URL: https://issues.apache.org/jira/browse/ARROW-16349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that AlmaLinux 8 doesn't support ed25519. If the KEYS file contains 
> an ed25519 key, the GPG key import fails:
> https://github.com/ursacomputing/crossbow/runs/6177081003?check_suite_focus=true#step:7:355
> {noformat}
> Importing GPG key 0x717D3FB2:
>  Userid : "Neville Dipale "
>  Fingerprint: 3905 F254 F9E5 04B4 0FFF 6CF6 0004 88D7 717D 3FB2
>  From   : /etc/pki/rpm-gpg/RPM-GPG-KEY-Apache-Arrow
> Extra Packages for Enterprise Linux 8 - x86_64  1.6 MB/s | 1.6 kB 00:00   
>  
> Importing GPG key 0x2F86D6A1:
> {noformat}
> "Key imported successfully" isn't shown for "Importing GPG key 0x717D3FB2:".



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16349) [Release][Packaging][RPM] Remove ed25519 keys from KEYS

2022-04-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16349:


 Summary: [Release][Packaging][RPM] Remove ed25519 keys from KEYS
 Key: ARROW-16349
 URL: https://issues.apache.org/jira/browse/ARROW-16349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 8.0.0


It seems that AlmaLinux 8 doesn't support ed25519. If the KEYS file contains an 
ed25519 key, the GPG key import fails:

https://github.com/ursacomputing/crossbow/runs/6177081003?check_suite_focus=true#step:7:355

{noformat}
Importing GPG key 0x717D3FB2:
 Userid : "Neville Dipale "
 Fingerprint: 3905 F254 F9E5 04B4 0FFF 6CF6 0004 88D7 717D 3FB2
 From   : /etc/pki/rpm-gpg/RPM-GPG-KEY-Apache-Arrow
Extra Packages for Enterprise Linux 8 - x86_64  1.6 MB/s | 1.6 kB 00:00
Importing GPG key 0x2F86D6A1:
{noformat}

"Key imported successfully" isn't shown for "Importing GPG key 0x717D3FB2:".
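The offending keys can also be spotted mechanically. A minimal sketch: in practice the listing would come from `gpg --show-keys --with-colons KEYS`, but a two-key sample is inlined here so the filter can run standalone (the second key ID is made up; field positions follow GnuPG's colon-listing format, where field 5 is the key ID and field 17 the ECC curve name).

```shell
# Filter a GPG colon listing down to ed25519 key IDs.
# Sample listing stands in for: gpg --show-keys --with-colons KEYS
listing='pub:u:255:22:000488D7717D3FB2::::::::::::ed25519:
pub:u:4096:1:AAAABBBBCCCCDDDD:::::::::::::'
ed25519_ids=$(printf '%s\n' "$listing" |
  awk -F: '$1 == "pub" && $17 == "ed25519" { print $5 }')
echo "$ed25519_ids"
```

Keys whose IDs show up here would be the candidates to remove from KEYS for AlmaLinux 8.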





--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-15959) [Java][Docs] Fix IntelliJ IDE setup instructions

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528365#comment-17528365
 ] 

David Li edited comment on ARROW-15959 at 4/26/22 8:46 PM:
---

Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

That seems to be related to https://youtrack.jetbrains.com/issue/IDEA-201168. I 
disabled the option specified there and the build continued. Now the build 
fails in TestExpandableByteBuf. Honestly, I think this file is in the wrong 
package…I moved it into arrow-memory-netty.

Continuing on, the build fails because it can't find the auto-generated 
sources. So then I manually invoked {{mvn compile}}. That inexplicably failed 
from within IntelliJ, so I switched to the CLI and ran {{mvn compile}} 
manually, which seemed to work fine. That generated 
{{arrow-vector/target/generated-sources}}, so I found that in the IntelliJ 
Project pane and right-click > "Mark Directory As" > "Generated Sources Root". 
Then I restarted the build. This finally completed successfully!

So there's a couple things to do here:

# Document the Maven issue
# Submit an upgrade of os-maven-plugin
# Document the compiler issue
# Submit moving the test file (will have to check if we're currently 
accidentally not running those tests…)
# Document that you have to Maven compile first (will retry and see if tweaking 
pom.xml lets us avoid at least manually marking the sources root)

Ah, the manual Maven build fails because of this: 
https://youtrack.jetbrains.com/issue/IDEA-278903

Disabling fork indeed fixes it in IntelliJ. Not sure how to set the other 
options mentioned.


was (Author: lidavidm):
Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

That seems to be related to https://youtrack.jetbrains.com/issue/IDEA-201168. I 
disabled the option specified there and the build continued. Now the build 
fails in TestExpandableByteBuf. Honestly, I think this file is in the wrong 
package…I moved it into arrow-memory-netty.

Continuing on, the build fails because it can't find the auto-generated 
sources. So then I manually invoked {{mvn compile}}. That inexplicably failed 
from within IntelliJ, so I switched to the CLI and ran {{mvn compile}} 
manually, which seemed to work fine. That generated 
{{arrow-vector/target/generated-sources}}, so I found that in the IntelliJ 
Project pane and right-click > "Mark Directory As" > "Generated Sources Root". 
Then I restarted the build. This finally completed successfully!

So there's a couple things to do here:

# Document the Maven issue
# Submit an upgrade of os-maven-plugin
# Document the com

[jira] [Commented] (ARROW-16276) [R] Release News

2022-04-26 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528383#comment-17528383
 ] 

Will Jones commented on ARROW-16276:


I think that command gives the changes up until that commit. The changes after 
that commit would instead be:

{code:bash}
git log fcab481..HEAD --grep=".*\[R\].*" --format="%s" > r-changes.txt
{code}

> [R] Release News
> 
>
> Key: ARROW-16276
> URL: https://issues.apache.org/jira/browse/ARROW-16276
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jonathan Keane
>Assignee: Will Jones
>Priority: Major
>
> I typically use a command like:
> {code}
> git log fcab481 --grep=".*\[R\].*" --format="%s"
> {code}
> Which will find all the commits with {{[R]}}, since commit fcab481. I found 
> commit fcab481 by going to the 7.0.0 release branch and then finding the last 
> commit that is in the master branch as well as the 7.0.0 release. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16348) ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back

2022-04-26 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-16348:


 Summary: ParquetWriter use_compliant_nested_type=True does not 
preserve ExtensionArray when reading back
 Key: ARROW-16348
 URL: https://issues.apache.org/jira/browse/ARROW-16348
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0
 Environment: pyarrow 7.0.0 installed via pip.
Reporter: Jim Pivarski


I've been happily making ExtensionArrays, but recently noticed that they aren't 
preserved by round-trips through Parquet files when 
{{use_compliant_nested_type=True}}.

Consider this writer.py:

 
{code:python}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
array = pa.Array.from_buffers(
    AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
    3,
    [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
    children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
)
table = pa.table({"": array})
print(table)
pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
{code}
And this reader.py:

 
{code:python}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
table = pq.read_table("tmp.parquet")
print(table)
{code}
(The AnnotatedType is the same; I wrote it twice for explicitness.)

When the writer.py has {{use_compliant_nested_type=False}}, the output is
{code:none}
% python writer.py 
pyarrow.Table
: extension>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: extension>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
In other words, the AnnotatedType is preserved. When 
{{use_compliant_nested_type=True}}, however,
{code:none}
% rm tmp.parquet
rm: remove regular file 'tmp.parquet'? y
% python writer.py 
pyarrow.Table
: extension>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: list
  child 0, element: double

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
The issue doesn't seem to be in the writing, but in the reading: regardless of 
whether {{use_compliant_nested_type}} is {{True}} or {{False}}, I can see 
the extension metadata in the Parquet → Arrow converted schema.
{code:python}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list
  child 0, item: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
versus
{code:python}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list
  child 0, element: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
Note that the first has "{{item: double}}" and the second has 
"{{element: double}}".

(I'm also rather surprised that {{use_compliant_nested_type=False}} is an 
option. Wouldn't you want the Parquet files to always be written with compliant 
lists? I noticed this when I was having trouble getting the data into BigQuery.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-15959) [Java][Docs] Fix IntelliJ IDE setup instructions

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528365#comment-17528365
 ] 

David Li edited comment on ARROW-15959 at 4/26/22 7:27 PM:
---

Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

That seems to be related to https://youtrack.jetbrains.com/issue/IDEA-201168. I 
disabled the option specified there and the build continued. Now the build 
fails in TestExpandableByteBuf. Honestly, I think this file is in the wrong 
package…I moved it into arrow-memory-netty.

Continuing on, the build fails because it can't find the auto-generated 
sources. So then I manually invoked {{mvn compile}}. That inexplicably failed 
from within IntelliJ, so I switched to the CLI and ran {{mvn compile}} 
manually, which seemed to work fine. That generated 
{{arrow-vector/target/generated-sources}}, so I found that in the IntelliJ 
Project pane and right-click > "Mark Directory As" > "Generated Sources Root". 
Then I restarted the build. This finally completed successfully!

So there's a couple things to do here:

# Document the Maven issue
# Submit an upgrade of os-maven-plugin
# Document the compiler issue
# Submit moving the test file (will have to check if we're currently 
accidentally not running those tests…)
# Document that you have to Maven compile first (will retry and see if tweaking 
pom.xml lets us avoid at least manually marking the sources root)


was (Author: lidavidm):
Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

That seems to be related to https://youtrack.jetbrains.com/issue/IDEA-201168. I 
disabled the option specified there and the build continued. Now the build 
fails in TestExpandableByteBuf. Honestly, I think this file is in the wrong 
package…I moved it into arrow-memory-netty.

Continuing on, the build fails because it can't find the auto-generated 
sources. So then I manually invoked {{mvn compile}}. That inexplicably failed 
from within IntelliJ, so I switched to the CLI and ran {{mvn compile}} 
manually, which seemed to work fine. That generated 
{{arrow-vector/target/generated-sources}}, so I found that in the IntelliJ 
Project pane and right-click > "Mark Directory As" > "Generated Sources Root". 
Then I restarted the build.

> [Java][Docs] Fix IntelliJ IDE setup instructions
> 
>
> Key: ARROW-15959
> URL: https://issues.apache.org/jira/browse/ARROW-15959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li

[jira] [Comment Edited] (ARROW-15959) [Java][Docs] Fix IntelliJ IDE setup instructions

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528365#comment-17528365
 ] 

David Li edited comment on ARROW-15959 at 4/26/22 7:25 PM:
---

Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

That seems to be related to https://youtrack.jetbrains.com/issue/IDEA-201168. I 
disabled the option specified there and the build continued. Now the build 
fails in TestExpandableByteBuf. Honestly, I think this file is in the wrong 
package…I moved it into arrow-memory-netty.

Continuing on, the build fails because it can't find the auto-generated 
sources. So then I manually invoked {{mvn compile}}. That inexplicably failed 
from within IntelliJ, so I switched to the CLI and ran {{mvn compile}} 
manually, which seemed to work fine. That generated 
{{arrow-vector/target/generated-sources}}, so I found that in the IntelliJ 
Project pane and right-click > "Mark Directory As" > "Generated Sources Root". 
Then I restarted the build.


was (Author: lidavidm):
Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

> [Java][Docs] Fix IntelliJ IDE setup instructions
> 
>
> Key: ARROW-15959
> URL: https://issues.apache.org/jira/browse/ARROW-15959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> A few more things need to be documented to get debugging working in IntelliJ, 
> at least in my experience. This is probably because instead of using the 
> Maven build, I'm using IntelliJ's native build, which lets me one-click run a 
> particular class or test, but needs some extra configuration.
>  * Must unset "Use --release option for cross compilation" in compiler 
> settings
>  * Must build once with Maven and mark the 
> arrow-vector/target/generated-sources directory as a generated sources root



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-15959) [Java][Docs] Fix IntelliJ IDE setup instructions

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528365#comment-17528365
 ] 

David Li commented on ARROW-15959:
--

Tracing my steps: I started with a fresh IntelliJ 2022.1 installation. I 
installed IdeaVim and CheckStyle-IDEA for reference.

I then opened the {{java}} subdirectory of a fresh Arrow checkout, and waited 
for the Maven sync. That ran into 

{noformat}
Could not find artifact 
io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.72.Final
 in central (https://repo.maven.apache.org/maven2)
{noformat}

This seems to be related to 
https://github.com/trustin/os-maven-plugin/issues/19. I updated the plugin and 
re-opened the IDE, which seemed to fix it mostly, except in 
flight-integration-tests which had

{noformat}
Unresolved dependency: 
'io.netty:netty-transport-native-unix-common:jar:4.1.72.Final'
{noformat}

Adding the os-maven-plugin didn't help here. Deactivating the 
linux-netty-native profile in the Maven pane did. I think this might be because 
IntelliJ isn't substituting the os properties when dependencies come from a 
profile. 

Now Maven syncs. I tried to build the project and had to set an SDK. I chose 
JDK11 and set the language level to 8, then tried building again. That led to 

{noformat}
package sun.misc does not exist
{noformat}

> [Java][Docs] Fix IntelliJ IDE setup instructions
> 
>
> Key: ARROW-15959
> URL: https://issues.apache.org/jira/browse/ARROW-15959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> A few more things need to be documented to get debugging working in IntelliJ, 
> at least in my experience. This is probably because instead of using the 
> Maven build, I'm using IntelliJ's native build, which lets me one-click run a 
> particular class or test, but needs some extra configuration.
>  * Must unset "Use --release option for cross compilation" in compiler 
> settings
>  * Must build once with Maven and mark the 
> arrow-vector/target/generated-sources directory as a generated sources root



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-15959) [Java][Docs] Fix IntelliJ IDE setup instructions

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-15959:


Assignee: David Li

> [Java][Docs] Fix IntelliJ IDE setup instructions
> 
>
> Key: ARROW-15959
> URL: https://issues.apache.org/jira/browse/ARROW-15959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> A few more things need to be documented to get debugging working in IntelliJ, 
> at least in my experience. This is probably because instead of using the 
> Maven build, I'm using IntelliJ's native build, which lets me one-click run a 
> particular class or test, but needs some extra configuration.
>  * Must unset "Use --release option for cross compilation" in compiler 
> settings
>  * Must build once with Maven and mark the 
> arrow-vector/target/generated-sources directory as a generated sources root



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16347) [Packaging] verify-release-candidate fails oddly if a Conda environment is active

2022-04-26 Thread David Li (Jira)
David Li created ARROW-16347:


 Summary: [Packaging] verify-release-candidate fails oddly if a 
Conda environment is active
 Key: ARROW-16347
 URL: https://issues.apache.org/jira/browse/ARROW-16347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Affects Versions: 8.0.0
Reporter: David Li


{noformat}
Conda environment is active despite that USE_CONDA is set to 0.

CommandNotFoundError: No command 'conda deactive'.
Did you mean 'conda deactivate'?
{noformat}

The next line is {{echo "Deactivate the environment using `conda deactive` 
before running the verification script."}} but this tries to _evaluate_ "conda 
deactive" which of course fails. The typo should be fixed, but also the 
backticks should be escaped.
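The quoting pitfall can be demonstrated in isolation (a sketch; `RAN-A-COMMAND` is a stand-in for whatever the backticked command prints):

```shell
# Inside double quotes, backticks are command substitution, so the
# broken message *runs* the quoted command instead of displaying it.
broken_msg="Deactivate the environment using `echo RAN-A-COMMAND` first."
# Escaping the backticks keeps them as literal characters.
fixed_msg="Deactivate the environment using \`conda deactivate\` first."
echo "$broken_msg"   # the backticked command was executed at assignment
echo "$fixed_msg"    # backticks survive verbatim
```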



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-12885) [C++] Error: template with C linkage template

2022-04-26 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528304#comment-17528304
 ] 

Kazuaki Ishizaki commented on ARROW-12885:
--

I got an AIX instance for a few weeks. I will work to fix errors on AIX.

> [C++] Error: template with C linkage template 
> ---
>
> Key: ARROW-12885
> URL: https://issues.apache.org/jira/browse/ARROW-12885
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: IBM i | AS400 | AIX
>Reporter: Menno
>Priority: Major
> Attachments: 2021-05-26 16_31_09-Window.png, thrift_ep-build-err.log
>
>
> When installing arrow on IBM i it fails the install at the thrift dependency 
> install with the following output:
> !2021-05-26 16_31_09-Window.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-8679) [Python] supporting pandas sparse series in pyarrow

2022-04-26 Thread Prabhant Singh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528302#comment-17528302
 ] 

Prabhant Singh commented on ARROW-8679:
---

Hi,
Is there any progress on this issue, or will it ever be supported?

Otherwise, is there a recommended way to deal with sparse data in the meantime, 
i.e. convert to dense data or something else?

> [Python] supporting pandas sparse series in pyarrow
> ---
>
> Key: ARROW-8679
> URL: https://issues.apache.org/jira/browse/ARROW-8679
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
> Environment: ubuntu 16/18
>Reporter: Michael Novitsky
>Priority: Major
> Fix For: 0.17.0
>
>
> I've seen that Pandas sparse series was not supported in pyarrow since it was 
> planned to be deprecated.  In Pandas 1.0.1 they released a stable version of 
> sparse array and as far as I know it is not planned to be deprecated anymore. 
> Are you planning to support sparse series in next versions of pyarrow ?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16346) [Python] Add a migration path for external packages due to Python code being moved to PyArrow

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16346:
---

 Summary: [Python] Add a migration path for external packages due 
to Python code being moved to PyArrow
 Key: ARROW-16346
 URL: https://issues.apache.org/jira/browse/ARROW-16346
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


Try to find a viable migration path so that external packages can use Python 
C++ API as before.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16345) [Python] Make changes to the C++ build setup due to moving Python C++ API to PyArrow

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16345:
---

 Summary: [Python] Make changes to the C++ build setup due to moving 
Python C++ API to PyArrow
 Key: ARROW-16345
 URL: https://issues.apache.org/jira/browse/ARROW-16345
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


As the C++ build setup no longer needs to build the Python C++ API, make sure 
that the CMake configuration is corrected and the C++ build no longer depends 
on Python.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16344) [Python] Finalize Pyarrow build setup changes

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16344:

Summary: [Python] Finalize Pyarrow build setup changes  (was: [Python] 
Pyarrow and C++ build setup changes)

> [Python] Finalize Pyarrow build setup changes
> -
>
> Key: ARROW-16344
> URL: https://issues.apache.org/jira/browse/ARROW-16344
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 9.0.0
>
>
> Finalize the CMake changes to integrate C++ compilation for the 
> {{arrow/python}} folder as part of the PyArrow build setup.





[jira] [Created] (ARROW-16344) [Python] Pyarrow and C++ build setup changes

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16344:
---

 Summary: [Python] Pyarrow and C++ build setup changes
 Key: ARROW-16344
 URL: https://issues.apache.org/jira/browse/ARROW-16344
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


Finalize the CMake changes to integrate C++ compilation for the 
{{arrow/python}} folder as part of the PyArrow build setup.





[jira] [Created] (ARROW-16343) [Python] Refine the first draft of the PyArrow build setup changes

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16343:
---

 Summary: [Python] Refine the first draft of the PyArrow build setup 
changes
 Key: ARROW-16343
 URL: https://issues.apache.org/jira/browse/ARROW-16343
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


Refine the draft of the CMake changes to integrate C++ compilation for the 
{{arrow/python}} folder as part of the PyArrow build setup.





[jira] [Created] (ARROW-16342) [Python] First draft of the PyArrow build setup changes

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16342:
---

 Summary: [Python] First draft of the PyArrow build setup changes
 Key: ARROW-16342
 URL: https://issues.apache.org/jira/browse/ARROW-16342
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


From the list of tasks needed to make the CMake changes that integrate C++ 
compilation for the {{arrow/python}} folder into the PyArrow build setup, 
make a first draft.





[jira] [Created] (ARROW-16341) [Python] Research CMake of C++ vs PyArrow

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16341:
---

 Summary: [Python] Research CMake of C++ vs PyArrow
 Key: ARROW-16341
 URL: https://issues.apache.org/jira/browse/ARROW-16341
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


Research two different build processes (C++ and PyArrow separately) and list 
the tasks that will need to be done in order to do the CMake changes to 
integrate C++ compilation for the {{arrow/python}} folder as part of the 
PyArrow build setup.





[jira] [Created] (ARROW-16340) [Python] Move all Python related code into PyArrow

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16340:
---

 Summary: [Python] Move all Python related code into PyArrow
 Key: ARROW-16340
 URL: https://issues.apache.org/jira/browse/ARROW-16340
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to 
build it.

More details can be found on this thread:

https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2





[jira] [Resolved] (ARROW-7272) [C++][Java][Dataset] JNI bridge between RecordBatch and VectorSchemaRoot

2022-04-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7272.
---
Fix Version/s: 8.0.0
   (was: 9.0.0)
   Resolution: Fixed

Issue resolved by pull request 10883
[https://github.com/apache/arrow/pull/10883]

> [C++][Java][Dataset] JNI bridge between RecordBatch and VectorSchemaRoot
> 
>
> Key: ARROW-7272
> URL: https://issues.apache.org/jira/browse/ARROW-7272
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Francois Saint-Jacques
>Assignee: Hongze Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> Given a C++ std::shared_ptr, retrieve it in Java as a 
> VectorSchemaRoot. Gandiva already offers a similar facility, but with raw 
> buffers. It would be convenient if users could call C++ code that yields a 
> RecordBatch and retrieve it in a seamless fashion.
> This would remove one roadblock to using the C++ dataset facility in Java.





[jira] [Commented] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

2022-04-26 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528253#comment-17528253
 ] 

Joris Van den Bossche commented on ARROW-16339:
---

Note: I recently created an issue about _field-level_ metadata (ARROW-15548), 
whereas this issue is about schema-level (for Arrow) / file-level (for Parquet) 
metadata.

The above is a long description with examples; trying to summarize the 
findings and the questions to answer:

- Do we generally want to map the schema-level metadata from Arrow with Parquet 
file-level metadata? (I think the answer is yes?)
- When _reading_, and the file metadata does not contain a "ARROW:schema" key, 
we actually already do map the Parquet file metadata to resulting Arrow schema 
metadata (this is OK)
- When _writing_, the {{store_schema}} flag seems to also influence whether we 
store schema metadata key/values in the Parquet file. This might be a bug? (or 
at least unintended behaviour?)
- When _reading_, and the file metadata does contain both an "ARROW:schema" key 
and other keys, we ignore the other keys. Should we merge the keys from the 
metadata in the serialized "ARROW:schema" schema with the other keys in the 
Parquet FileMetaData key_value_metadata? (those could of course be duplicative 
/ conflicting)

cc [~apitrou] [~emkornfield] 
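The third question above (merging the keys recovered from the serialized "ARROW:schema" with the other Parquet key_value_metadata entries) can be pictured with a small sketch. This is purely illustrative, not Arrow code; the function name and the conflict policy are assumptions, and the policy for conflicting keys is exactly what the question leaves open:

```python
def merge_metadata(arrow_schema_meta, parquet_kv_meta):
    """Combine metadata recovered from the serialized "ARROW:schema"
    with the other keys in the Parquet FileMetaData key_value_metadata.

    Illustrative sketch only. Here, on conflict, the keys from the
    serialized Arrow schema win; the opposite policy is equally
    defensible, which is the open question in the issue.
    """
    merged = dict(parquet_kv_meta)   # start from the Parquet-level keys
    merged.update(arrow_schema_meta) # Arrow-schema keys override on conflict
    return merged
```

With such a merge, a file carrying both a "geo" key and an "ARROW:schema" key would keep "geo" in the resulting Arrow schema metadata instead of dropping it.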
  

> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to 
> Arrow Schema metadata
> -
>
> Key: ARROW-16339
> URL: https://issues.apache.org/jira/browse/ARROW-16339
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Context: I ran into this issue when reading Parquet files created by GDAL 
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
> writes files that have custom key_value_metadata, but without storing 
> ARROW:schema in those metadata (cc [~paleolimbot])
> —
> Both in reading and writing files, I expected that we would map Arrow 
> {{Schema::metadata}} with Parquet {{{}FileMetaData::key_value_metadata{}}}. 
> But apparently this doesn't (always) happen out of the box, and only happens 
> through the "ARROW:schema" field (which stores the original Arrow schema, and 
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored 
> directly in the Parquet FileMetaData (code below is using branch from 
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet", 
> store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema 
> metadata separately. But if {{store_schema}} is False, we also stop writing 
> those metadata (not fully sure if this is the intended behaviour, and that's 
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...',
>  b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata 
> >>> is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value 
> metadata into schema metadata. We don't have the pyarrow APIs at the moment 
> to create such a file (given the above), but with a small patch I could 
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an 
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the 
> "ARROW:schema" one. 

[jira] [Updated] (ARROW-16338) [CI] Update azure windows image as vs2017-win2016 is retired

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16338:
---
Labels: pull-request-available  (was: )

> [CI] Update azure windows image as vs2017-win2016 is retired
> 
>
> Key: ARROW-16338
> URL: https://issues.apache.org/jira/browse/ARROW-16338
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See 
> [https://devblogs.microsoft.com/devops/hosted-pipelines-image-deprecation/#windows]
> We should replace it with windows-2019.
> I am not sure whether this will require other changes in the setup; I will 
> open a PR to test it. cc [~raulcd] 





[jira] [Created] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16339:
-

 Summary: [C++][Parquet] Parquet FileMetaData key_value_metadata 
not always mapped to Arrow Schema metadata
 Key: ARROW-16339
 URL: https://issues.apache.org/jira/browse/ARROW-16339
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet, Python
Reporter: Joris Van den Bossche


Context: I ran into this issue when reading Parquet files created by GDAL 
(using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
writes files that have custom key_value_metadata, but without storing 
ARROW:schema in those metadata (cc [~paleolimbot])

—

Both in reading and writing files, I expected that we would map Arrow 
{{Schema::metadata}} with Parquet {{{}FileMetaData::key_value_metadata{}}}. But 
apparently this doesn't (always) happen out of the box, and only happens 
through the "ARROW:schema" field (which stores the original Arrow schema, and 
thus the metadata stored in this schema).

For example, when writing a Table with schema metadata, this is not stored 
directly in the Parquet FileMetaData (code below is using branch from 
ARROW-16337 to have the {{store_schema}} keyword):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", 
store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'
# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64
{code}
It seems that if we store the ARROW:schema, we _also_ store the schema metadata 
separately. But if {{store_schema}} is False, we also stop writing those 
metadata (not fully sure if this is the intended behaviour, and that's the 
reason for the above output):
{code:python}
# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...',
 b'key': b'value'}
# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is 
>>> None
True
{code}
On the reading side, it seems that we generally do read custom key/value 
metadata into schema metadata. We don't have the pyarrow APIs at the moment to 
create such a file (given the above), but with a small patch I could create 
such a file:
{code:python}
# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
{code}
But if you have a file that has both custom key/value metadata and an 
"ARROW:schema" key, we actually ignore the custom keys, and only look at the 
"ARROW:schema" one. 
This was the case that I ran into with GDAL, where I have a file with both 
keys, but where the custom "geo" key is not also included in the serialized 
arrow schema in the "ARROW:schema" key:
{code:python}
# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/3gBAAAQ...'}
# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
{code}





[jira] [Assigned] (ARROW-16338) [CI] Update azure windows image as vs2017-win2016 is retired

2022-04-26 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens reassigned ARROW-16338:
--

Assignee: Jacob Wujciak-Jens

> [CI] Update azure windows image as vs2017-win2016 is retired
> 
>
> Key: ARROW-16338
> URL: https://issues.apache.org/jira/browse/ARROW-16338
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>
> See 
> [https://devblogs.microsoft.com/devops/hosted-pipelines-image-deprecation/#windows]
> We should replace it with windows-2019.
> I am not sure whether this will require other changes in the setup; I will 
> open a PR to test it. cc [~raulcd] 





[jira] [Created] (ARROW-16338) [CI] Update azure windows image as vs2017-win2016 is retired

2022-04-26 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16338:
--

 Summary: [CI] Update azure windows image as vs2017-win2016 is 
retired
 Key: ARROW-16338
 URL: https://issues.apache.org/jira/browse/ARROW-16338
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Jacob Wujciak-Jens


See 
[https://devblogs.microsoft.com/devops/hosted-pipelines-image-deprecation/#windows]

We should replace it with windows-2019.

I am not sure whether this will require other changes in the setup; I will open 
a PR to test it. cc [~raulcd] 





[jira] [Commented] (ARROW-16328) [Java] POC Arrow Modular

2022-04-26 Thread David Dali Susanibar Arce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528221#comment-17528221
 ] 

David Dali Susanibar Arce commented on ARROW-16328:
---

Yes, something like that. Currently Arrow Java is configured as a multi-module 
build, which helps us a lot at compile time; we still need to analyze how Arrow 
Java could take advantage of Java modules at runtime.

> [Java] POC Arrow Modular
> 
>
> Key: ARROW-16328
> URL: https://issues.apache.org/jira/browse/ARROW-16328
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> POC to move the Arrow Java build to single- or multi-module mode.
> Currently we support Arrow Java on JSE 1.8 so that it can be consumed by 
> JSE 11/17/18 in *legacy mode*, which is enabled when the compilation environment 
> is defined by the {{--source}} and {{--target}} flags.
> This POC is to validate the changes needed in case Arrow Java decides to 
> implement "{*}single module mode{*}" or "{*}multi-module mode{*}"





[jira] [Updated] (ARROW-16328) [Java] POC Arrow Modular

2022-04-26 Thread David Dali Susanibar Arce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Dali Susanibar Arce updated ARROW-16328:
--
Summary: [Java] POC Arrow Modular  (was: [Java] POC Arrow Modular (format 
module for example))

> [Java] POC Arrow Modular
> 
>
> Key: ARROW-16328
> URL: https://issues.apache.org/jira/browse/ARROW-16328
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> POC to move the Arrow Java build to single- or multi-module mode.
> Currently we support Arrow Java on JSE 1.8 so that it can be consumed by 
> JSE 11/17/18 in *legacy mode*, which is enabled when the compilation environment 
> is defined by the {{--source}} and {{--target}} flags.
> This POC is to validate the changes needed in case Arrow Java decides to 
> implement "{*}single module mode{*}" or "{*}multi-module mode{*}"





[jira] [Updated] (ARROW-16334) [Release][Archery][CI] Change the links on the email to point to the job run instead of the branch

2022-04-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16334:
--
Labels: Nightly  (was: )

> [Release][Archery][CI] Change the links on the email to point to the job run 
> instead of the branch
> --
>
> Key: ARROW-16334
> URL: https://issues.apache.org/jira/browse/ARROW-16334
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Archery, Continuous Integration
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: Nightly
> Fix For: 9.0.0
>
>
> The link we currently send in the nightly email report points to the branch. 
> When we want to understand a failure, we want to go to the build itself, 
> not the branch.
> This ticket aims to update the link in the reports to point to the build 
> instead of the branch.





[jira] [Updated] (ARROW-16216) [Python][FlightRPC] Fix test_flight.py when flight is not available

2022-04-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16216:
--
Labels: pull-request-available  (was: nig pull-request-available)

> [Python][FlightRPC] Fix test_flight.py when flight is not available
> ---
>
> Key: ARROW-16216
> URL: https://issues.apache.org/jira/browse/ARROW-16216
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Kouhei Sutou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/pull/12749#discussion_r851671770
> {{flight}} is {{None}} when Flight is not built, so don't use the module at 
> module level
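A minimal sketch of the guarded-import pattern described above (the helper name below is illustrative, not the actual patch):

```python
# Guard the optional Flight import: when pyarrow is built without
# Flight, the submodule is unavailable, so keep a None sentinel and
# only dereference it inside functions, never at module level.
try:
    import pyarrow.flight as flight
except ImportError:
    flight = None


def get_flight_server_class():
    """Return the Flight server base class, or raise if Flight is absent."""
    if flight is None:
        raise ImportError("pyarrow was built without Flight support")
    return flight.FlightServerBase
```

Deferring the attribute access into the function body is what keeps the module importable even when {{flight}} is {{None}}.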





[jira] [Updated] (ARROW-16216) [Python][FlightRPC] Fix test_flight.py when flight is not available

2022-04-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16216:
--
Labels: nig pull-request-available  (was: pull-request-available)

> [Python][FlightRPC] Fix test_flight.py when flight is not available
> ---
>
> Key: ARROW-16216
> URL: https://issues.apache.org/jira/browse/ARROW-16216
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Kouhei Sutou
>Assignee: David Li
>Priority: Major
>  Labels: nig, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/pull/12749#discussion_r851671770
> {{flight}} is {{None}} when Flight is not built, so don't use the module at 
> module level





[jira] [Updated] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16337:
---
Labels: pull-request-available  (was: )

> [Python] Expose parameter that determines to store Arrow schema in Parquet 
> metadata in Python
> -
>
> Key: ARROW-16337
> URL: https://issues.apache.org/jira/browse/ARROW-16337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a {{store_schema}} flag that determines whether we store the Arrow 
> schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This 
> is exposed in the C++ interface, but not in the Python one. It would be good to 
> also expose this in the Python layer, to more easily experiment with this (e.g. 
> to check the impact of having the schema available or not when reading a file).





[jira] [Created] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16337:
-

 Summary: [Python] Expose parameter that determines to store Arrow 
schema in Parquet metadata in Python
 Key: ARROW-16337
 URL: https://issues.apache.org/jira/browse/ARROW-16337
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 9.0.0


There is a {{store_schema}} flag that determines whether we store the Arrow 
schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This is 
exposed in the C++ interface, but not in the Python one. It would be good to also 
expose this in the Python layer, to more easily experiment with this (e.g. to 
check the impact of having the schema available or not when reading a file).





[jira] [Assigned] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python

2022-04-26 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-16337:
-

Assignee: Joris Van den Bossche

> [Python] Expose parameter that determines to store Arrow schema in Parquet 
> metadata in Python
> -
>
> Key: ARROW-16337
> URL: https://issues.apache.org/jira/browse/ARROW-16337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 9.0.0
>
>
> There is a {{store_schema}} flag that determines whether we store the Arrow 
> schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This 
> is exposed in the C++ interface, but not in the Python one. It would be good to 
> also expose this in the Python layer, to more easily experiment with this (e.g. 
> to check the impact of having the schema available or not when reading a file).





[jira] [Commented] (ARROW-16328) [Java] POC Arrow Modular (format module for example)

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528184#comment-17528184
 ] 

David Li commented on ARROW-16328:
--

Presumably we'll want multiple modules since some of our packages require 
several dependencies that people may not necessarily want? For instance Flight 
should probably be in a separate module, anything using JNI should be in a 
separate module, and possibly adapters like arrow-avro and arrow-jdbc should be 
in their own modules?

> [Java] POC Arrow Modular (format module for example)
> 
>
> Key: ARROW-16328
> URL: https://issues.apache.org/jira/browse/ARROW-16328
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> POC to move the Arrow Java build to single- or multi-module mode.
> Currently we support Arrow Java on JSE 1.8 so that it can be consumed by 
> JSE 11/17/18 in *legacy mode*, which is enabled when the compilation environment 
> is defined by the {{--source}} and {{--target}} flags.
> This POC is to validate the changes needed in case Arrow Java decides to 
> implement "{*}single module mode{*}" or "{*}multi-module mode{*}"





[jira] [Commented] (ARROW-15582) [C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions)

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528177#comment-17528177
 ] 

David Li commented on ARROW-15582:
--

I agree we're going to need some sort of mapping. I think someone will just 
have to sit down and look through all the functions to determine how best to 
structure/maintain this, though; maybe most are fairly trivial, some patterns 
exist like for arithmetic, and a few are just special cases (if_else etc. 
perhaps).

> [C++] Add support for registering tricky functions with the Substrait 
> consumer (or add a bunch of substrait meta functions)
> ---
>
> Key: ARROW-15582
> URL: https://issues.apache.org/jira/browse/ARROW-15582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: substrait
>
> Sometimes one Substrait function will map to multiple Arrow functions.  For 
> example, the Substrait {{add}} function might be referring to Arrow's {{add}} 
> or {{add_checked}}.  We need to figure out how to register this correctly 
> (e.g. one possible approach would be a {{substrait_add}} meta function).
> Other times a substrait function will encode something Arrow considers an 
> "option" as a function argument.  For example, the is_in Arrow function is 
> unary with an option for the lookup set.  The substrait function is binary 
> but the second argument must be constant and be the lookup set.  Neither of 
> which is to be confused with a truly binary is_in function which takes in a 
> different set at every row.
> It's possible there is no work to do here other than adding a bunch of 
> substrait_ meta functions in Arrow.  In that case all the work will be done 
> in other JIRAs.  Or, it is possible that there is some kind of extension we 
> can make to the function registry that bypasses the need for the meta 
> functions.  I'm leaving this JIRA open so future contributors can consider 
> this second option.





[jira] [Assigned] (ARROW-16336) [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset)

2022-04-26 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-16336:
-

Assignee: Joris Van den Bossche

> [Python] Hide internal (common_)metadata related warnings from the user 
> (ParquetDataset)
> 
>
> Key: ARROW-16336
> URL: https://issues.apache.org/jira/browse/ARROW-16336
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Small follow-up on ARROW-16121, we missed a few cases where we are internally 
> using those attributes (in the {{equals}} method)





[jira] [Updated] (ARROW-16336) [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset)

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16336:
---
Labels: pull-request-available  (was: )

> [Python] Hide internal (common_)metadata related warnings from the user 
> (ParquetDataset)
> 
>
> Key: ARROW-16336
> URL: https://issues.apache.org/jira/browse/ARROW-16336
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Small follow-up on ARROW-16121, we missed a few cases where we are internally 
> using those attributes (in the {{equals}} method)





[jira] [Created] (ARROW-16336) [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset)

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16336:
-

 Summary: [Python] Hide internal (common_)metadata related warnings 
from the user (ParquetDataset)
 Key: ARROW-16336
 URL: https://issues.apache.org/jira/browse/ARROW-16336
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


Small follow-up on ARROW-16121, we missed a few cases where we are internally 
using those attributes (in the {{equals}} method)





[jira] [Updated] (ARROW-16292) [Java][Doc] Upgrade java documentation for JSE17

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-16292:
-
Summary: [Java][Doc] Upgrade java documentation for JSE17  (was: 
[Java][Doc]: Upgrade java documentation for JSE17)

> [Java][Doc] Upgrade java documentation for JSE17
> 
>
> Key: ARROW-16292
> URL: https://issues.apache.org/jira/browse/ARROW-16292
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Java
>Affects Versions: 9.0.0
>Reporter: David Dali Susanibar Arce
>Assignee: David Dali Susanibar Arce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Document changes needed to support JSE17:
>  # Changes on the Arrow side: changes related to {{--add-exports}} are needed 
> to keep supporting errorProne on JSE11+ ([installation 
> doc|https://errorprone.info/docs/installation]). These changes are not needed 
> if you build the Arrow Java code without errorProne validation 
> ({{mvn clean install -P-error-prone-jdk11+}}).
>  # Changes as a user of Arrow: users planning to run Arrow on JSE17 need to 
> open the required modules. For example, running the IO cookbook 
> [https://arrow.apache.org/cookbook/java/io.html] fails with {{Unable to make 
> field long java.nio.Buffer.address accessible: module java.base does not 
> "opens java.nio" to unnamed module}}; adding the VM arguments {{-ea 
> --add-opens=java.base/java.nio=ALL-UNNAMED}} (a user-side change, not an 
> Arrow change) makes it finish without errors.
>  
> This ticket is related to 
> https://github.com/apache/arrow/pull/12941#pullrequestreview-950090643



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16292) [Java][Doc]: Upgrade java documentation for JSE17

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-16292.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 12990
[https://github.com/apache/arrow/pull/12990]

> [Java][Doc]: Upgrade java documentation for JSE17
> -
>
> Key: ARROW-16292
> URL: https://issues.apache.org/jira/browse/ARROW-16292
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Java
>Affects Versions: 9.0.0
>Reporter: David Dali Susanibar Arce
>Assignee: David Dali Susanibar Arce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Document changes needed to support JSE17:
>  # Changes on the Arrow side: changes related to {{--add-exports}} are needed 
> to keep supporting errorProne on JSE11+ ([installation 
> doc|https://errorprone.info/docs/installation]). These changes are not needed 
> if you build the Arrow Java code without errorProne validation 
> ({{mvn clean install -P-error-prone-jdk11+}}).
>  # Changes as a user of Arrow: users planning to run Arrow on JSE17 need to 
> open the required modules. For example, running the IO cookbook 
> [https://arrow.apache.org/cookbook/java/io.html] fails with {{Unable to make 
> field long java.nio.Buffer.address accessible: module java.base does not 
> "opens java.nio" to unnamed module}}; adding the VM arguments {{-ea 
> --add-opens=java.base/java.nio=ALL-UNNAMED}} (a user-side change, not an 
> Arrow change) makes it finish without errors.
>  
> This ticket is related to 
> https://github.com/apache/arrow/pull/12941#pullrequestreview-950090643



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16335) [Release][C++] Windows source verification runs C++ tests on a single thread

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528152#comment-17528152
 ] 

Antoine Pitrou commented on ARROW-16335:


[~raulcd]  [~assignUser]

> [Release][C++] Windows source verification runs C++ tests on a single thread
> 
>
> Key: ARROW-16335
> URL: https://issues.apache.org/jira/browse/ARROW-16335
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> {{verify-release-candidate.bat}} uses the following command to run the C++ 
> tests:
> {code}
> ctest -VV
> {code}
> This has two problems:
> * output is verbose even for successful tests, making it difficult to find 
> errors in the log
> * tests are run serially even on a many-core machine
> I would suggest instead something like:
> {code}
> ctest -jN --output-on-failure
> {code}
> (where N is the available number of hardware threads / CPU cores)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16335) [Release][C++] Windows source verification runs C++ tests on a single thread

2022-04-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16335:
--

 Summary: [Release][C++] Windows source verification runs C++ tests 
on a single thread
 Key: ARROW-16335
 URL: https://issues.apache.org/jira/browse/ARROW-16335
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Developer Tools
Reporter: Antoine Pitrou
 Fix For: 9.0.0


{{verify-release-candidate.bat}} uses the following command to run the C++ 
tests:
{code}
ctest -VV
{code}

This has two problems:
* output is verbose even for successful tests, making it difficult to find 
errors in the log
* tests are run serially even on a many-core machine

I would suggest instead something like:
{code}
ctest -jN --output-on-failure
{code}

(where N is the available number of hardware threads / CPU cores)
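The suggested invocation can be sketched as a small helper that builds the parallel ctest command line (a sketch only; the actual {{verify-release-candidate.bat}} fix would likely use something like {{%NUMBER_OF_PROCESSORS%}} directly in batch):

```python
import os

def ctest_command(extra_args=()):
    """Build the suggested ctest invocation: parallel, quiet on success."""
    # Use every available hardware thread; fall back to 1 if undetectable.
    n = os.cpu_count() or 1
    return ["ctest", f"-j{n}", "--output-on-failure", *extra_args]

print(ctest_command())
```

With {{--output-on-failure}}, ctest stays quiet for passing tests and prints full output only for failing ones, which addresses both problems at once.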



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16330) [Release][C++] Windows source verification compiles on a single thread

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528150#comment-17528150
 ] 

Antoine Pitrou commented on ARROW-16330:


See https://gitlab.kitware.com/cmake/cmake/-/issues/20564#note_730853

> [Release][C++] Windows source verification compiles on a single thread
> --
>
> Key: ARROW-16330
> URL: https://issues.apache.org/jira/browse/ARROW-16330
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> {{verify-release-candidate.bat}} runs Arrow C++ compilation on a single 
> thread, even on a many-core machine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16311) [Java] FlightSqlExample does not always return correct schema for CommandGetTables

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-16311:
-
Summary: [Java] FlightSqlExample does not always return correct schema for 
CommandGetTables  (was: [JAVA] FlightSqlExample does not always return correct 
schema for CommandGetTables)

> [Java] FlightSqlExample does not always return correct schema for 
> CommandGetTables
> --
>
> Key: ARROW-16311
> URL: https://issues.apache.org/jira/browse/ARROW-16311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Tim Van Wassenhove
>Assignee: Tim Van Wassenhove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, getFlightInfoTables does not consider the "include_schema" value 
> in CommandGetTables.
>  
> This means that, when include_schema is set to false, the returned schema 
> still contains a column (table_schema) that the data does not actually 
> include.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16311) [JAVA] FlightSqlExample does not always return correct schema for CommandGetTables

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-16311.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 12982
[https://github.com/apache/arrow/pull/12982]

> [JAVA] FlightSqlExample does not always return correct schema for 
> CommandGetTables
> --
>
> Key: ARROW-16311
> URL: https://issues.apache.org/jira/browse/ARROW-16311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Tim Van Wassenhove
>Assignee: Tim Van Wassenhove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, getFlightInfoTables does not consider the "include_schema" value 
> in CommandGetTables.
>  
> This means that, when include_schema is set to false, the returned schema 
> still contains a column (table_schema) that the data does not actually 
> include.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16311) [JAVA] FlightSqlExample does not always return correct schema for CommandGetTables

2022-04-26 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-16311:


Assignee: Tim Van Wassenhove

> [JAVA] FlightSqlExample does not always return correct schema for 
> CommandGetTables
> --
>
> Key: ARROW-16311
> URL: https://issues.apache.org/jira/browse/ARROW-16311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Tim Van Wassenhove
>Assignee: Tim Van Wassenhove
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, getFlightInfoTables does not consider the "include_schema" value 
> in CommandGetTables.
>  
> This means that, when include_schema is set to false, the returned schema 
> still contains a column (table_schema) that the data does not actually 
> include.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16329) [Java][C++] Keep more context when marshalling errors through JNI

2022-04-26 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528125#comment-17528125
 ] 

David Li commented on ARROW-16329:
--

CC [~zhztheplayer]

> [Java][C++] Keep more context when marshalling errors through JNI
> -
>
> Key: ARROW-16329
> URL: https://issues.apache.org/jira/browse/ARROW-16329
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> When errors are propagated through the JNI barrier, two mechanisms are 
> involved:
> * the {{Status CheckException(JNIEnv* env)}} function for Java-to-C++ error 
> translation
> * the {{JniAssertOkOrThrow(arrow::Status status)}} and {{T 
> JniGetOrThrow(arrow::Result result)}} functions for C++-to-Java error 
> translation
> Currently, both mechanisms lose most context about the original error, such 
> as its type and any additional state, such as the optional {{StatusDetail}} 
> in C++ or any properties in Java (which I'm sure exist on some exception 
> classes).
> We should improve these mechanisms to retain as much context as possible. For 
> example, in a hypothetical Java-to-C++-to-Java error propagation scenario, 
> the original Java exception from inner code should ideally be re-thrown in 
> the outer Java context (we already support this in Python btw).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16325) [R] Add task for R package with gcc12

2022-04-26 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington reassigned ARROW-16325:


Assignee: Dewey Dunnington

> [R] Add task for R package with gcc12
> -
>
> Key: ARROW-16325
> URL: https://issues.apache.org/jira/browse/ARROW-16325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We now have a check for gcc11; however, gcc11 has been the default on 
> debian/testing for some time. The CRAN debian image now uses gcc12, so we 
> should update the gcc11 task to use gcc12 here: 
> https://github.com/apache/arrow/blob/0e03af446c328d0ef963510c3292cb14e092b917/dev/tasks/tasks.yml#L1319-L1328



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16173) [C++] Add benchmarks for temporal functions/kernels

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16173:
---
Labels: good-second-issue kernel pull-request-available  (was: 
good-second-issue kernel)

> [C++] Add benchmarks for temporal functions/kernels
> ---
>
> Key: ARROW-16173
> URL: https://issues.apache.org/jira/browse/ARROW-16173
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: good-second-issue, kernel, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See ML: https://lists.apache.org/thread/bp2f036sgfj72o46yqmglnx20zfc6tfq



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-15940) [Gandiva][C++] Add NEGATIVE function for decimal data type

2022-04-26 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-15940.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12581
[https://github.com/apache/arrow/pull/12581]

> [Gandiva][C++] Add NEGATIVE function for decimal data type
> --
>
> Key: ARROW-15940
> URL: https://issues.apache.org/jira/browse/ARROW-15940
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Johnnathan Rodrigo Pego de Almeida
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This PR implements the NEGATIVE function for the decimal data type.
> The function receives a decimal128() and returns a negated decimal128().



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16334) [Release][Archery][CI] Change the links on the email to point to the job run instead of the branch

2022-04-26 Thread Jira
Raúl Cumplido created ARROW-16334:
-

 Summary: [Release][Archery][CI] Change the links on the email to 
point to the job run instead of the branch
 Key: ARROW-16334
 URL: https://issues.apache.org/jira/browse/ARROW-16334
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Archery, Continuous Integration
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 9.0.0


The current link in the nightly email report points to the branch. When we 
want to understand a failure, we want to go to the build itself, not the 
branch.

This ticket updates the links in the reports to point to the build instead of 
the branch.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16333) [Release] Improve Nightly Reports

2022-04-26 Thread Jira
Raúl Cumplido created ARROW-16333:
-

 Summary: [Release] Improve Nightly Reports
 Key: ARROW-16333
 URL: https://issues.apache.org/jira/browse/ARROW-16333
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery, Continuous Integration
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 9.0.0


This initiative addresses some of the issues we currently have with our 
nightly reports, to give a clearer picture of the status of our nightly 
builds.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16332) [Release] Java jars verification pass despite binaries not being uploaded

2022-04-26 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-16332:
---

 Summary: [Release] Java jars verification pass despite binaries 
not being uploaded
 Key: ARROW-16332
 URL: https://issues.apache.org/jira/browse/ARROW-16332
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 9.0.0


See results at 
https://github.com/apache/arrow/pull/12991#issuecomment-1109525407



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-14651) [Release] "archery crossbow download-artifacts" raises read timeout error

2022-04-26 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-14651.
-
Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12996
[https://github.com/apache/arrow/pull/12996]

> [Release] "archery crossbow download-artifacts" raises read timeout error
> -
>
> Key: ARROW-14651
> URL: https://issues.apache.org/jira/browse/ARROW-14651
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I couldn't download all artifacts for 6.0.1 RC1...
> {noformat}
> Downloading release-6.0.1-rc0-0's artifacts.
> Destination directory is 
> /home/kou/work/cpp/arrow.kou/packages/release-6.0.1-rc0-0
> [  state] Task / Branch   
> Artifacts
> ---
> [SUCCESS] debian-bookworm-amd64uploaded 70 / 
> 70
>  └ 
> https://github.com/ursacomputing/crossbow/runs/4111265571?check_suite_focus=true
> apache-arrow-apt-source_6.0.1-1.debian.tar.xz [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1.dsc [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1_all.deb [ 
> OK]
> apache-arrow-apt-source_6.0.1.orig.tar.gz [ 
> OK]
>apache-arrow_6.0.1-1.debian.tar.xz [ 
> OK]
>  apache-arrow_6.0.1-1.dsc [ 
> OK]
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 438, in 
> _error_catcher
> yield
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 519, in read
> data = self._fp.read(amt) if not fp_closed else b""
>   File "/usr/lib/python3.9/http/client.py", line 462, in read
> n = self.readinto(b)
>   File "/usr/lib/python3.9/http/client.py", line 506, in readinto
> n = self.fp.readinto(b)
>   File "/usr/lib/python3.9/socket.py", line 704, in readinto
> return self._sock.recv_into(b)
>   File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
> return self.read(nbytes, buffer)
>   File "/usr/lib/python3.9/ssl.py", line 1099, in read
> return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/requests/models.py", line 753, in 
> generate
> for chunk in self.raw.stream(chunk_size, decode_content=True):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 576, in 
> stream
> data = self.read(amt=amt, decode_content=decode_content)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 541, in read
> raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
>   File "/usr/lib/python3.9/contextlib.py", line 137, in __exit__
> self.gen.throw(typ, value, traceback)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 443, in 
> _error_catcher
> raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: 
> HTTPSConnectionPool(host='github-releases.githubusercontent.com', port=443): 
> Read timed out.
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/bin/archery", line 33, in 
> sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 829, in 
> __call__
> return self.main(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 782, in 
> main
> rv = self.invoke(ctx)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1066, in 
> invoke
> return ctx.invoke(self.callback, **ctx.params)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 610, in 
> invoke
> return callback(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/decorators.py", line 33, 
> in new_func
> return f(get_current_context().obj, *args, **kwargs)
>   File "/home

[jira] [Assigned] (ARROW-16018) [Doc][Python] Run doctests on Python docstring examples

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-16018:
---

Assignee: Alenka Frim

> [Doc][Python] Run doctests on Python docstring examples
> ---
>
> Key: ARROW-16018
> URL: https://issues.apache.org/jira/browse/ARROW-16018
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Python
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
>
> We are starting to add more and more examples to the docstrings of Python 
> methods (ARROW-15367), so we could use the doctest functionality to ensure 
> that those examples are actually correct (and stay correct).
> Pytest has integration for doctests 
> (https://docs.pytest.org/en/6.2.x/doctest.html), so you can do:
> {code}
> pytest python/pyarrow --doctest-modules
> {code}
> This currently fails for me because pyarrow.cuda is not available, so we will 
> need some way to automatically skip those parts when they are not available. 
> Normally, that should be possible by adding a {{conftest.py}} file in the 
> main {{pyarrow}} directory, where we can influence which files are collected 
> by defining {{pytest_runtest_setup}} or {{pytest_collection_modifyitems}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16097) [Python] Address docstrings in Streams and File Access (FileSystems)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16097:

Parent: ARROW-16331
Issue Type: Sub-task  (was: Improvement)

> [Python] Address docstrings in Streams and File Access (FileSystems)
> 
>
> Key: ARROW-16097
> URL: https://issues.apache.org/jira/browse/ARROW-16097
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - File Systems- 
> have an {{Examples}} section:
> [https://arrow.apache.org/docs/python/generated/pyarrow.LocalFileSystem.html#pyarrow.LocalFileSystem]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16097) [Python] Address docstrings in Streams and File Access (FileSystems)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16097:

Parent: (was: ARROW-16091)
Issue Type: Improvement  (was: Sub-task)

> [Python] Address docstrings in Streams and File Access (FileSystems)
> 
>
> Key: ARROW-16097
> URL: https://issues.apache.org/jira/browse/ARROW-16097
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - File Systems- 
> have an {{Examples}} section:
> [https://arrow.apache.org/docs/python/generated/pyarrow.LocalFileSystem.html#pyarrow.LocalFileSystem]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16096) [Python] Address docstrings in Streams and File Access (Stream Classes)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16096:

Parent: ARROW-16331
Issue Type: Sub-task  (was: Improvement)

> [Python] Address docstrings in Streams and File Access  (Stream Classes)
> 
>
> Key: ARROW-16096
> URL: https://issues.apache.org/jira/browse/ARROW-16096
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - Stream Classes 
> - have an {{Examples}} section.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16096) [Python] Address docstrings in Streams and File Access (Stream Classes)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16096:

Parent: (was: ARROW-16091)
Issue Type: Improvement  (was: Sub-task)

> [Python] Address docstrings in Streams and File Access  (Stream Classes)
> 
>
> Key: ARROW-16096
> URL: https://issues.apache.org/jira/browse/ARROW-16096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - Stream Classes 
> - have an {{Examples}} section.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16095) [Python] Address docstrings in Streams and File Access (Factory Functions)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16095:

Parent: ARROW-16331
Issue Type: Sub-task  (was: Bug)

> [Python] Address docstrings in Streams and File Access (Factory Functions)
> --
>
> Key: ARROW-16095
> URL: https://issues.apache.org/jira/browse/ARROW-16095
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - Factory 
> Functions - have an {{Examples}} section:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.input_stream.html#pyarrow.input_stream]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.output_stream.html#pyarrow.output_stream]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.memory_map.html#pyarrow.memory_map]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.create_memory_map.html#pyarrow.create_memory_map]
>  
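A minimal sketch of the kind of {{Examples}} section these issues ask for, written as a plain function docstring (the exact pyarrow.input_stream doctest content here is an assumption, not taken from the published docs):

```python
def input_stream_examples():
    """Sketch of a numpydoc-style Examples section for a factory function.

    Examples
    --------
    >>> import pyarrow as pa
    >>> with pa.input_stream(pa.py_buffer(b"some data")) as stream:
    ...     stream.read(4)
    b'some'
    """
    return input_stream_examples.__doc__

print(input_stream_examples())
```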



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16095) [Python] Address docstrings in Streams and File Access (Factory Functions)

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16095:

Parent: (was: ARROW-16091)
Issue Type: Bug  (was: Sub-task)

> [Python] Address docstrings in Streams and File Access (Factory Functions)
> --
>
> Key: ARROW-16095
> URL: https://issues.apache.org/jira/browse/ARROW-16095
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for [Streams and File 
> Access|https://arrow.apache.org/docs/python/api/files.html] - Factory 
> Functions - have an {{Examples}} section:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.input_stream.html#pyarrow.input_stream]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.output_stream.html#pyarrow.output_stream]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.memory_map.html#pyarrow.memory_map]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.create_memory_map.html#pyarrow.create_memory_map]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16331) [Python] Improving Classes and Methods Docstrings - Streams and File access

2022-04-26 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16331:
---

 Summary: [Python] Improving Classes and Methods Docstrings - 
Streams and File access
 Key: ARROW-16331
 URL: https://issues.apache.org/jira/browse/ARROW-16331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alenka Frim


Continuation of the initiative aimed at improving methods and classes 
docstrings, especially from the point of view of ensuring they have an 
{{Examples}} section.

Covered topic: Streams and File access



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16018) [Doc][Python] Run doctests on Python docstring examples

2022-04-26 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16018:

Parent: ARROW-16091
Issue Type: Sub-task  (was: Test)

> [Doc][Python] Run doctests on Python docstring examples
> ---
>
> Key: ARROW-16018
> URL: https://issues.apache.org/jira/browse/ARROW-16018
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> We are starting to add more and more examples to the docstrings of Python 
> methods (ARROW-15367), so we could use the doctest functionality to ensure 
> that those examples are actually correct (and stay correct).
> Pytest has integration for doctests 
> (https://docs.pytest.org/en/6.2.x/doctest.html), so you can do:
> {code}
> pytest python/pyarrow --doctest-modules
> {code}
> This currently fails for me because pyarrow.cuda is not available, so we will 
> need some way to automatically skip those parts when they are not available. 
> Normally, that should be possible by adding a {{conftest.py}} file in the 
> main {{pyarrow}} directory, where we can influence which files are collected 
> by defining {{pytest_runtest_setup}} or {{pytest_collection_modifyitems}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14651) [Release] "archery crossbow download-artifacts" raises read timeout error

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14651:
---
Labels: pull-request-available  (was: )

> [Release] "archery crossbow download-artifacts" raises read timeout error
> -
>
> Key: ARROW-14651
> URL: https://issues.apache.org/jira/browse/ARROW-14651
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I couldn't download all artifacts for 6.0.1 RC1...
> {noformat}
> Downloading release-6.0.1-rc0-0's artifacts.
> Destination directory is 
> /home/kou/work/cpp/arrow.kou/packages/release-6.0.1-rc0-0
> [  state] Task / Branch   
> Artifacts
> ---
> [SUCCESS] debian-bookworm-amd64uploaded 70 / 
> 70
>  └ 
> https://github.com/ursacomputing/crossbow/runs/4111265571?check_suite_focus=true
> apache-arrow-apt-source_6.0.1-1.debian.tar.xz [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1.dsc [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1_all.deb [ 
> OK]
> apache-arrow-apt-source_6.0.1.orig.tar.gz [ 
> OK]
>apache-arrow_6.0.1-1.debian.tar.xz [ 
> OK]
>  apache-arrow_6.0.1-1.dsc [ 
> OK]
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 438, in 
> _error_catcher
> yield
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 519, in read
> data = self._fp.read(amt) if not fp_closed else b""
>   File "/usr/lib/python3.9/http/client.py", line 462, in read
> n = self.readinto(b)
>   File "/usr/lib/python3.9/http/client.py", line 506, in readinto
> n = self.fp.readinto(b)
>   File "/usr/lib/python3.9/socket.py", line 704, in readinto
> return self._sock.recv_into(b)
>   File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
> return self.read(nbytes, buffer)
>   File "/usr/lib/python3.9/ssl.py", line 1099, in read
> return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/requests/models.py", line 753, in 
> generate
> for chunk in self.raw.stream(chunk_size, decode_content=True):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 576, in 
> stream
> data = self.read(amt=amt, decode_content=decode_content)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 541, in read
> raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
>   File "/usr/lib/python3.9/contextlib.py", line 137, in __exit__
> self.gen.throw(typ, value, traceback)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 443, in 
> _error_catcher
> raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: 
> HTTPSConnectionPool(host='github-releases.githubusercontent.com', port=443): 
> Read timed out.
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/bin/archery", line 33, in 
> sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 829, in 
> __call__
> return self.main(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 782, in 
> main
> rv = self.invoke(ctx)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1066, in 
> invoke
> return ctx.invoke(self.callback, **ctx.params)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 610, in 
> invoke
> return callback(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/decorators.py", line 33, 
> in new_func
> return f(get_current_context().obj, *args, **kwargs)
>   File "/home/kou/work/cpp/arrow.kou/dev/archery/archery/crossbow/cli.py", 
> line 349, in download_artifacts
> report.show(
>   File 
>

[jira] [Assigned] (ARROW-14651) [Release] "archery crossbow download-artifacts" raises read timeout error

2022-04-26 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-14651:


Assignee: Kouhei Sutou

> [Release] "archery crossbow download-artifacts" raises read timeout error
> -
>
> Key: ARROW-14651
> URL: https://issues.apache.org/jira/browse/ARROW-14651
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> I couldn't download all artifacts for 6.0.1 RC1...
> {noformat}
> Downloading release-6.0.1-rc0-0's artifacts.
> Destination directory is 
> /home/kou/work/cpp/arrow.kou/packages/release-6.0.1-rc0-0
> [  state] Task / Branch   
> Artifacts
> ---
> [SUCCESS] debian-bookworm-amd64uploaded 70 / 
> 70
>  └ 
> https://github.com/ursacomputing/crossbow/runs/4111265571?check_suite_focus=true
> apache-arrow-apt-source_6.0.1-1.debian.tar.xz [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1.dsc [ 
> OK]
>   apache-arrow-apt-source_6.0.1-1_all.deb [ 
> OK]
> apache-arrow-apt-source_6.0.1.orig.tar.gz [ 
> OK]
>apache-arrow_6.0.1-1.debian.tar.xz [ 
> OK]
>  apache-arrow_6.0.1-1.dsc [ 
> OK]
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 438, in 
> _error_catcher
> yield
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 519, in read
> data = self._fp.read(amt) if not fp_closed else b""
>   File "/usr/lib/python3.9/http/client.py", line 462, in read
> n = self.readinto(b)
>   File "/usr/lib/python3.9/http/client.py", line 506, in readinto
> n = self.fp.readinto(b)
>   File "/usr/lib/python3.9/socket.py", line 704, in readinto
> return self._sock.recv_into(b)
>   File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
> return self.read(nbytes, buffer)
>   File "/usr/lib/python3.9/ssl.py", line 1099, in read
> return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/requests/models.py", line 753, in 
> generate
> for chunk in self.raw.stream(chunk_size, decode_content=True):
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 576, in 
> stream
> data = self.read(amt=amt, decode_content=decode_content)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 541, in read
> raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
>   File "/usr/lib/python3.9/contextlib.py", line 137, in __exit__
> self.gen.throw(typ, value, traceback)
>   File "/usr/lib/python3/dist-packages/urllib3/response.py", line 443, in 
> _error_catcher
> raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: 
> HTTPSConnectionPool(host='github-releases.githubusercontent.com', port=443): 
> Read timed out.
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/bin/archery", line 33, in 
> sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 829, in 
> __call__
> return self.main(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 782, in 
> main
> rv = self.invoke(ctx)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1259, in 
> invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1066, in 
> invoke
> return ctx.invoke(self.callback, **ctx.params)
>   File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 610, in 
> invoke
> return callback(*args, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/click/decorators.py", line 33, 
> in new_func
> return f(get_current_context().obj, *args, **kwargs)
>   File "/home/kou/work/cpp/arrow.kou/dev/archery/archery/crossbow/cli.py", 
> line 349, in download_artifacts
> report.show(
>   File 
> "/home/kou/work/cpp/arrow.kou/dev/archery/archery/crossbow/reports.py", line 
> 132, in show
> asset_callbac
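The timeout in the traceback above comes from a plain streaming read with no retry. A generic way to make such downloads more resilient is to retry on read timeouts with backoff; this is a sketch of the idea, not archery/crossbow's actual code (the function names are made up):

```python
import time


def retry(func, *, attempts=3, delay=0.1, retry_on=(TimeoutError,)):
    """Call ``func`` until it succeeds or ``attempts`` are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: propagate the original error
            time.sleep(delay * attempt)  # linear backoff between tries


# Example: a download stub that times out twice, then succeeds.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("The read operation timed out")
    return b"artifact-bytes"
```

In the real tool the retried callable would be the per-asset download, and `retry_on` would include the urllib3/requests timeout exception types.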

[jira] [Commented] (ARROW-16330) [Release][C++] Windows source verification compiles on a single thread

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528029#comment-17528029
 ] 

Antoine Pitrou commented on ARROW-16330:


cc [~raulcd] [~assignUser]

> [Release][C++] Windows source verification compiles on a single thread
> --
>
> Key: ARROW-16330
> URL: https://issues.apache.org/jira/browse/ARROW-16330
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> {{verify-release-candidate.bat}} runs Arrow C++ compilation on a single 
> thread, even on a many-core machine.





[jira] [Commented] (ARROW-16304) [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job

2022-04-26 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528030#comment-17528030
 ] 

Weston Pace commented on ARROW-16304:
-

I tried to reproduce this today on Linux without much luck. I'll try 
Windows tomorrow (MSVC 2017 seems designed to expose race conditions :)

> [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job
> --
>
> Key: ARROW-16304
> URL: https://issues.apache.org/jira/browse/ARROW-16304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Weston Pace
>Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/43330908/job/yw3djjni6as253m4
> {code:bash}
> [ RUN  ] 
> TestScan/TestParquetFileFormatScan.ScanRecordBatchReaderProjected/0Threaded16b1024r
> C:/projects/arrow/cpp/src/arrow/util/future.cc:323:  Check failed: 
> !IsFutureFinished(state_) Future already marked finished
> {code}
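The check failure above means a future's result was set a second time, typically from a race between two completion paths. A minimal illustration of the invariant being enforced, as a generic Python sketch rather than Arrow's actual C++ {{Future}}:

```python
import threading


class OneShotFuture:
    """A tiny future that, like Arrow's, refuses to be finished twice."""

    def __init__(self):
        self._lock = threading.Lock()
        self._finished = False
        self._value = None

    def mark_finished(self, value):
        with self._lock:  # the check-and-set must be atomic
            if self._finished:
                raise RuntimeError("Future already marked finished")
            self._finished = True
            self._value = value

    def value(self):
        with self._lock:
            if not self._finished:
                raise RuntimeError("Future not finished yet")
            return self._value
```

A race that lets two threads both reach `mark_finished` is exactly what the `Check failed: !IsFutureFinished(state_)` assertion catches.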





[jira] [Created] (ARROW-16330) [Release][C++] Windows source verification compiles on a single thread

2022-04-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16330:
--

 Summary: [Release][C++] Windows source verification compiles on a 
single thread
 Key: ARROW-16330
 URL: https://issues.apache.org/jira/browse/ARROW-16330
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Reporter: Antoine Pitrou
 Fix For: 9.0.0


{{verify-release-candidate.bat}} runs Arrow C++ compilation on a single thread, 
even on a many-core machine.





[jira] [Commented] (ARROW-15754) [Java] ORC JNI bridge should use the C data interface

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527979#comment-17527979
 ] 

Antoine Pitrou commented on ARROW-15754:


cc [~zhztheplayer]

> [Java] ORC JNI bridge should use the C data interface
> -
>
> Key: ARROW-15754
> URL: https://issues.apache.org/jira/browse/ARROW-15754
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Antoine Pitrou
>Priority: Major
>
> Right now the ORC JNI bridge uses some custom buffer passing which only seems 
> to handle primitive arrays correctly (child array buffers and dictionaries 
> are not considered):
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L263-L265
> Instead, it should use the C data interface, which is now implemented in Java.
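For context, the C data interface mentioned above passes data across language boundaries via two small C structs ({{ArrowSchema}} and {{ArrowArray}}) that carry children and dictionaries explicitly, which is why it avoids the limitation described. A minimal ctypes mirror of {{ArrowSchema}}, following the field layout in the published spec (the helper function is illustrative only):

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirror of the ArrowSchema struct from the Arrow C data interface."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]


def make_int32_schema():
    """Build a schema describing an int32 array (format string "i")."""
    s = ArrowSchema()
    # Note: c_char_p fields store raw pointers; the bytes literals here
    # stay alive as constants of this function for the program's lifetime.
    s.format = b"i"  # "i" = int32 in the C data interface format strings
    s.name = b""
    s.n_children = 0
    return s
```

Because `children` and `dictionary` are first-class fields, nested and dictionary-encoded arrays travel through the interface without any custom buffer-passing convention.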





[jira] [Commented] (ARROW-15174) [Java] Consolidate JNI compilation

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527978#comment-17527978
 ] 

Antoine Pitrou commented on ARROW-15174:


Note that moving the dataset JNI bridge to use the C data interface is being 
handled in ARROW-7272.

> [Java] Consolidate JNI compilation
> --
>
> Key: ARROW-15174
> URL: https://issues.apache.org/jira/browse/ARROW-15174
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Alessandro Molina
>Assignee: Larry White
>Priority: Major
> Fix For: 9.0.0
>
>
> *Umbrella ticket for the initiative to consolidate Java JNI compilation*
> It seems we have spread the JNI code across the {{cpp}} and {{java}} 
> directories. As for other bindings (Python), we already discussed that it 
> would be great to consolidate and move all C++ code related to Python into 
> PyArrow; we should do something equivalent for Java and move all C++ code 
> specific to Java into the Java project.
> At the moment there are two JNI-related directories:
>  * [https://github.com/apache/arrow/tree/master/java/c]
>  * [https://github.com/apache/arrow/tree/master/cpp/src/jni]
> Let's also research the best method to build those. The {{java/c}} 
> directory already seems to be integrated with the Java build process; let's 
> check whether that approach is something we can reuse for the {{dataset}} 
> directory too.





[jira] [Closed] (ARROW-15753) [Java] Dataset JNI bridge should use the C data interface

2022-04-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-15753.
--
Resolution: Duplicate

> [Java] Dataset JNI bridge should use the C data interface
> -
>
> Key: ARROW-15753
> URL: https://issues.apache.org/jira/browse/ARROW-15753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Antoine Pitrou
>Priority: Major
>
> Right now the datasets JNI bridge uses some custom buffer passing which only 
> seems to handle primitive arrays correctly (child array buffers and 
> dictionaries are not considered):
> https://github.com/apache/arrow/blob/master/cpp/src/jni/dataset/jni_wrapper.cc#L498-L500
> Instead, it should use the C data interface, which is now implemented in Java.





[jira] [Commented] (ARROW-15753) [Java] Dataset JNI bridge should use the C data interface

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527975#comment-17527975
 ] 

Antoine Pitrou commented on ARROW-15753:


It seems this is being done in ARROW-7272

> [Java] Dataset JNI bridge should use the C data interface
> -
>
> Key: ARROW-15753
> URL: https://issues.apache.org/jira/browse/ARROW-15753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Antoine Pitrou
>Priority: Major
>
> Right now the datasets JNI bridge uses some custom buffer passing which only 
> seems to handle primitive arrays correctly (child array buffers and 
> dictionaries are not considered):
> https://github.com/apache/arrow/blob/master/cpp/src/jni/dataset/jni_wrapper.cc#L498-L500
> Instead, it should use the C data interface, which is now implemented in Java.





[jira] [Comment Edited] (ARROW-16329) [Java][C++] Keep more context when marshalling errors through JNI

2022-04-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527974#comment-17527974
 ] 

Antoine Pitrou edited comment on ARROW-16329 at 4/26/22 7:45 AM:
-

cc [~lidavidm] [~ljw1001]


was (Author: pitrou):
cc [~lidavidm] @larry whi

> [Java][C++] Keep more context when marshalling errors through JNI
> -
>
> Key: ARROW-16329
> URL: https://issues.apache.org/jira/browse/ARROW-16329
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> When errors are propagated through the JNI barrier, two mechanisms are 
> involved:
> * the {{Status CheckException(JNIEnv* env)}} function for Java-to-C++ error 
> translation
> * the {{JniAssertOkOrThrow(arrow::Status status)}} and {{T 
> JniGetOrThrow(arrow::Result result)}} functions for C++-to-Java error 
> translation
> Currently, both mechanisms lose most context about the original error, such 
> as its type and any additional state, such as the optional {{StatusDetail}} 
> in C++ or any properties in Java (which I'm sure exist on some exception 
> classes).
> We should improve these mechanisms to retain as much context as possible. For 
> example, in a hypothetical Java-to-C++-to-Java error propagation scenario, 
> the original Java exception from inner code should ideally be re-thrown in 
> the outer Java context (we already support this in Python btw).
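The last point, re-raising the original inner error in the outer context, can be illustrated with Python's exception chaining, which is the behavior the ticket says the Python bindings already provide. A generic sketch, not Arrow's actual marshalling code (the class and function names are made up):

```python
class CrossBoundaryError(RuntimeError):
    """Stand-in for an error translated at a language boundary."""


def inner():
    # The "inner" side raises a rich, specific error.
    raise KeyError("original error with full context")


def boundary_call():
    # The boundary translates the error but keeps the original attached
    # via exception chaining, so the outer side can still inspect it
    # (or re-raise the original type instead of a generic wrapper).
    try:
        inner()
    except Exception as exc:
        raise CrossBoundaryError("error crossing the boundary") from exc
```

The equivalent for JNI would be stashing the original exception (or `StatusDetail`) alongside the translated error so the outer layer can restore it.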




