[jira] [Comment Edited] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597510#comment-17597510
 ] 

Antoine Pitrou edited comment on ARROW-17079 at 8/30/22 7:32 AM:
-

The first part (associating which S3 operation caused the error) has been 
merged in [https://github.com/apache/arrow/pull/13979].

I'm leaving this Jira ticket open since I'm planning to improve the error 
messages next and show the actual error instead of the error code.


was (Author: pcmoritz):
The first part (associating which S3 operation caused the error) has been 
merged in [https://github.com/apache/arrow/pull/13979].

I'm leaving this Jira ticket open since I'm planning to improve the error 
messages next and show the actual error instead of the error code.

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Dear all,
> I'd like to see if there is interest in improving the error messages that 
> originate from the AWS SDK. Especially when loading datasets from S3, there 
> are many things that can go wrong, and the error messages that (Py)Arrow gives 
> are not always the most actionable, especially if the call involves many 
> different SDK functions. In particular, it would be great to have the 
> following attached to each error message:
>  * A machine-parseable status code from the AWS SDK
>  * Information as to exactly which AWS SDK call failed, so it can be 
> disambiguated for Arrow API calls that use multiple AWS SDK calls
> In the ideal case, as a developer I could reconstruct the AWS SDK call that 
> failed from the error message (e.g. in a form that allows me to run the API 
> call via the "aws" CLI program) so I can debug errors and see how they relate 
> to my AWS infrastructure. Any progress in this direction would be super 
> helpful.
>  
> For context: I was recently debugging some permission issues in S3 based 
> on the current error codes, and it was pretty hard to figure out what was 
> going on (see 
> [https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602]).
>  
> I'm happy to take a stab at this problem but might need some help. Is 
> implementing a custom StatusDetail class for AWS errors and propagating 
> errors that way the right approach here?
> [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110]
>  
> All the best,
> Philipp.
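
For illustration, a minimal sketch of the StatusDetail approach asked about 
above, using the {{StatusDetail}} interface from arrow/status.h 
({{type_id()}} / {{ToString()}}). The {{AwsErrorDetail}} class and its fields 
are hypothetical names for this sketch, not the implementation that was merged:

{code:cpp}
// Sketch only: a StatusDetail subclass carrying a machine-readable AWS error
// code plus the SDK operation that failed. "AwsErrorDetail" is a hypothetical
// name, not Arrow's actual implementation.
#include <memory>
#include <sstream>
#include <string>

#include "arrow/status.h"

class AwsErrorDetail : public arrow::StatusDetail {
 public:
  AwsErrorDetail(int error_code, std::string operation)
      : error_code_(error_code), operation_(std::move(operation)) {}

  // Stable identifier so callers can recognize this detail type before casting.
  const char* type_id() const override { return "aws-error-detail"; }

  std::string ToString() const override {
    std::ostringstream ss;
    ss << "AWS SDK call '" << operation_ << "' failed with error code "
       << error_code_;
    return ss.str();
  }

  int error_code() const { return error_code_; }
  const std::string& operation() const { return operation_; }

 private:
  int error_code_;         // machine-parseable AWS SDK error code
  std::string operation_;  // e.g. "GetObject", to disambiguate multi-call APIs
};

// Attach the detail to a Status so both humans and programs can inspect it.
arrow::Status AwsError(int error_code, const std::string& operation,
                       const std::string& message) {
  return arrow::Status(arrow::StatusCode::IOError,
                       "S3 " + operation + " failed: " + message,
                       std::make_shared<AwsErrorDetail>(error_code, operation));
}
{code}

A caller could then inspect {{status.detail()}}, compare {{type_id()}}, and 
downcast to recover the error code instead of string-matching the message.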



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597615#comment-17597615
 ] 

Antoine Pitrou commented on ARROW-17079:


Thanks [~pcmoritz]!

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Dear all,
> I'd like to see if there is interest in improving the error messages that 
> originate from the AWS SDK. Especially when loading datasets from S3, there 
> are many things that can go wrong, and the error messages that (Py)Arrow gives 
> are not always the most actionable, especially if the call involves many 
> different SDK functions. In particular, it would be great to have the 
> following attached to each error message:
>  * A machine-parseable status code from the AWS SDK
>  * Information as to exactly which AWS SDK call failed, so it can be 
> disambiguated for Arrow API calls that use multiple AWS SDK calls
> In the ideal case, as a developer I could reconstruct the AWS SDK call that 
> failed from the error message (e.g. in a form that allows me to run the API 
> call via the "aws" CLI program) so I can debug errors and see how they relate 
> to my AWS infrastructure. Any progress in this direction would be super 
> helpful.
>  
> For context: I was recently debugging some permission issues in S3 based 
> on the current error codes, and it was pretty hard to figure out what was 
> going on (see 
> [https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602]).
>  
> I'm happy to take a stab at this problem but might need some help. Is 
> implementing a custom StatusDetail class for AWS errors and propagating 
> errors that way the right approach here?
> [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110]
>  
> All the best,
> Philipp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17562) [C++][Acero] Add Window Functions exec node

2022-08-30 Thread Michal Nowakiewicz (Jira)
Michal Nowakiewicz created ARROW-17562:
--

 Summary: [C++][Acero] Add Window Functions exec node
 Key: ARROW-17562
 URL: https://issues.apache.org/jira/browse/ARROW-17562
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 10.0.0
Reporter: Michal Nowakiewicz
Assignee: Michal Nowakiewicz
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17563) [C++][Acero] Window Functions add helper classes for quantiles

2022-08-30 Thread Michal Nowakiewicz (Jira)
Michal Nowakiewicz created ARROW-17563:
--

 Summary: [C++][Acero] Window Functions add helper classes for 
quantiles
 Key: ARROW-17563
 URL: https://issues.apache.org/jira/browse/ARROW-17563
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 10.0.0
Reporter: Michal Nowakiewicz
Assignee: Michal Nowakiewicz
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17564) [C++][Acero] Window Functions add helper classes for window aggregates and distinct aggregates

2022-08-30 Thread Michal Nowakiewicz (Jira)
Michal Nowakiewicz created ARROW-17564:
--

 Summary: [C++][Acero] Window Functions add helper classes for 
window aggregates and distinct aggregates
 Key: ARROW-17564
 URL: https://issues.apache.org/jira/browse/ARROW-17564
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 10.0.0
Reporter: Michal Nowakiewicz
Assignee: Michal Nowakiewicz
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-30 Thread Philipp Moritz (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597649#comment-17597649
 ] 

Philipp Moritz commented on ARROW-17079:


PR for human readable errors instead of error codes: 
https://github.com/apache/arrow/pull/14001

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Dear all,
> I'd like to see if there is interest in improving the error messages that 
> originate from the AWS SDK. Especially when loading datasets from S3, there 
> are many things that can go wrong, and the error messages that (Py)Arrow gives 
> are not always the most actionable, especially if the call involves many 
> different SDK functions. In particular, it would be great to have the 
> following attached to each error message:
>  * A machine-parseable status code from the AWS SDK
>  * Information as to exactly which AWS SDK call failed, so it can be 
> disambiguated for Arrow API calls that use multiple AWS SDK calls
> In the ideal case, as a developer I could reconstruct the AWS SDK call that 
> failed from the error message (e.g. in a form that allows me to run the API 
> call via the "aws" CLI program) so I can debug errors and see how they relate 
> to my AWS infrastructure. Any progress in this direction would be super 
> helpful.
>  
> For context: I was recently debugging some permission issues in S3 based 
> on the current error codes, and it was pretty hard to figure out what was 
> going on (see 
> [https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602]).
>  
> I'm happy to take a stab at this problem but might need some help. Is 
> implementing a custom StatusDetail class for AWS errors and propagating 
> errors that way the right approach here?
> [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110]
>  
> All the best,
> Philipp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14742) [C++] Allow ParquetWriter to take a RecordBatchReader as input

2022-08-30 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597674#comment-17597674
 ] 

Nicola Crane commented on ARROW-14742:
--

I'll admit I don't understand this bit of the codebase super well, and so will 
have to take a look later to see if this means that ARROW-14428 is currently 
possible.

> [C++] Allow ParquetWriter to take a RecordBatchReader as input
> --
>
> Key: ARROW-14742
> URL: https://issues.apache.org/jira/browse/ARROW-14742
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Alvin Chunga Mamani
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Please could we extend the Parquet writer to take not only a Table or 
> RecordBatch as input, but also a RecordBatchReader?  This would be 
> super-helpful for opening data as a dataset and writing it to a single file.
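
Until the writer accepts a reader natively, one workaround is to drain the 
reader batch by batch. A rough C++ sketch, assuming the long-standing 
{{parquet::arrow::FileWriter::Open}} / {{WriteTable}} signatures (one row 
group per batch, so small batches would produce many small row groups):

{code:cpp}
// Sketch only: stream the contents of a RecordBatchReader into a single
// Parquet file, one row group per batch.
#include <memory>

#include "arrow/api.h"
#include "arrow/io/api.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

arrow::Status WriteReaderToParquet(
    arrow::RecordBatchReader* reader,
    std::shared_ptr<arrow::io::OutputStream> sink) {
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *reader->schema(), arrow::default_memory_pool(), std::move(sink),
      parquet::default_writer_properties(), &writer));
  while (true) {
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Wrap the single batch in a Table; WriteTable emits it as one row group.
    ARROW_ASSIGN_OR_RAISE(
        auto table, arrow::Table::FromRecordBatches(batch->schema(), {batch}));
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table, batch->num_rows()));
  }
  return writer->Close();
}
{code}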



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17564) [C++][Acero] Window Functions add helper classes for window aggregates and distinct aggregates

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17564:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [C++][Acero] Window Functions add helper classes for window aggregates and 
> distinct aggregates
> --
>
> Key: ARROW-17564
> URL: https://issues.apache.org/jira/browse/ARROW-17564
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 10.0.0
>Reporter: Michal Nowakiewicz
>Assignee: Michal Nowakiewicz
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17565) [C++] Backward compatible ${PACKAGE}_shared CMake target isn't provided

2022-08-30 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17565:


 Summary: [C++] Backward compatible ${PACKAGE}_shared CMake target 
isn't provided
 Key: ARROW-17565
 URL: https://issues.apache.org/jira/browse/ARROW-17565
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This is a follow-up of ARROW-12175.

We introduced a {{${PACKAGE}::}} prefix for all exported CMake targets, such as 
{{Arrow::arrow_shared}} and {{Arrow::arrow_static}}, and we also provide 
non-namespaced CMake targets such as {{arrow_shared}} and {{arrow_static}} as 
aliases of the namespaced targets. However, this backward compatibility feature 
doesn't work for the {{_shared}} targets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17544) [C++/Python] Add support for S3 Bucket Versioning

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597688#comment-17597688
 ] 

Antoine Pitrou commented on ARROW-17544:


{quote}If we're waiting for C++17, what's the timeline for that?{quote}

It's being done in https://github.com/apache/arrow/pull/13991

{quote}I think if we're to implement InputStreamOptions with specific types per 
FileSystem type, that likely seems to be very reasonable and less prone to 
error by users.{quote}

Probably, but it also means that options are not portable (for example, passing 
a version id would be different depending on the actual filesystem type, even 
though several different filesystem types would implement a version id).

We also must think about how to expose those options or parameters to Python.

{quote}What do you think about using ReadMetadata() to retrieve the uploaded S3 
key information after the stream is closed?  That's also in the PR.{quote}

That sounds a bit ad hoc. We should try to think more generally about which 
kind of information may be available when closing a file.

> [C++/Python] Add support for S3 Bucket Versioning
> -
>
> Key: ARROW-17544
> URL: https://issues.apache.org/jira/browse/ARROW-17544
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Rusty Conover
>Assignee: Rusty Conover
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3 
> Buckets that have versioning enabled.  For information about what S3 bucket 
> versioning is, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can 
> be S3 keys that have multiple versions of content stored utilizing the same 
> key name.  At the present moment, Arrow does not have the ability to:
>  # Access versions of an S3 key rather than just the latest version of an S3 
> key.  There is no ability to specify the VersionId parameter of S3's 
> GetObject API.
>  # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> There are a few shortcomings of the Filesystem interface to support remote 
> file systems that support versioning:
> 1. The signatures of open_input_stream() and open_input_file() do not easily 
> lend themselves to an additional "version" parameter, because the parameter 
> would have to be added to all other filesystem implementations.  Most other 
> file systems don't actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an 
> S3FileSystem output stream), there is not currently a way for the user to 
> determine the VersionId or ETag of the S3 key that was created.  This is 
> important to know because if there are multiple concurrent writers to S3, it 
> should be possible to identify the written S3 key.
> Proposed solutions to enable S3 Bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend 
> only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both are like their namesakes from the normal FileSystem interface but take 
> an additional parameter of a "version," which is a string representation of 
> the VersionId returned by S3 when the S3 Key is created.  If these functions 
> are called with an empty string for the specified version, the latest version 
> of the S3 key will be returned.
> I'm a bit reluctant to create these specialized functions just on the 
> S3FileSystem interface, but I also don't think it is appropriate to change 
> open_input_stream() and open_input_file()'s parameter list for all 
> filesystems just for functionality that is only implemented by a small number 
> of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to 
> retrieve the metadata about the S3 key that has been written after the stream 
> has been closed.  The metadata will likely include both a VersionId and a 
> value for ETag.
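
To make proposal 1 concrete, a hypothetical C++ sketch of the proposed S3-only 
entry points (nothing below exists in Arrow; the names mirror the proposal, 
and an empty version id means "latest"):

{code:cpp}
// Sketch only: proposed S3-specific openers that accept an S3 VersionId.
// These free functions are hypothetical and do not exist in Arrow.
#include <memory>
#include <string>

#include "arrow/filesystem/s3fs.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

arrow::Result<std::shared_ptr<arrow::io::InputStream>>
OpenInputStreamWithVersion(arrow::fs::S3FileSystem* fs, const std::string& path,
                           const std::string& version_id);

arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>>
OpenInputFileWithVersion(arrow::fs::S3FileSystem* fs, const std::string& path,
                         const std::string& version_id);

// Intended call site: pin a read to one object version so that a concurrent
// overwrite of the same key cannot change what this reader sees.
arrow::Status ReadPinnedVersion(arrow::fs::S3FileSystem* fs,
                                const std::string& version_id) {
  ARROW_ASSIGN_OR_RAISE(
      auto stream,
      OpenInputStreamWithVersion(fs, "my-bucket/data.parquet", version_id));
  // ... consume the stream as usual ...
  return stream->Close();
}
{code}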



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17565) [C++] Backward compatible ${PACKAGE}_shared CMake target isn't provided

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17565:
---
Labels: pull-request-available  (was: )

> [C++] Backward compatible ${PACKAGE}_shared CMake target isn't provided
> ---
>
> Key: ARROW-17565
> URL: https://issues.apache.org/jira/browse/ARROW-17565
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up of ARROW-12175.
> We introduced a {{${PACKAGE}::}} prefix for all exported CMake targets, such as 
> {{Arrow::arrow_shared}} and {{Arrow::arrow_static}}, and we also provide 
> non-namespaced CMake targets such as {{arrow_shared}} and {{arrow_static}} as 
> aliases of the namespaced targets. However, this backward compatibility feature 
> doesn't work for the {{_shared}} targets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)
Jin Shang created ARROW-17567:
-

 Summary: [C++][Compute]Compiler error with gcc7 and c++17
 Key: ARROW-17567
 URL: https://issues.apache.org/jira/browse/ARROW-17567
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 9.0.0
 Environment: gcc6/7
c++14/17
Reporter: Jin Shang
Assignee: Jin Shang


When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree

[jira] [Created] (ARROW-17566) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)
Jin Shang created ARROW-17566:
-

 Summary: [C++][Compute]Compiler error with gcc7 and c++17
 Key: ARROW-17566
 URL: https://issues.apache.org/jira/browse/ARROW-17566
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 9.0.0
 Environment: gcc6/7
c++14/17
Reporter: Jin Shang
Assignee: Jin Shang


When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree

[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang updated ARROW-17567:
--
Description: 
When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d4aae tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15845
Please submit a full bug report,
with preprocessed source if appropriate.
Please i

[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang updated ARROW-17567:
--
Description: 
When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d4aae tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15845
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete bac

[jira] [Closed] (ARROW-17566) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-17566.

Resolution: Duplicate

> [C++][Compute]Compiler error with gcc7 and c++17
> 
>
> Key: ARROW-17566
> URL: https://issues.apache.org/jira/browse/ARROW-17566
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
> Environment: gcc6/7
> c++14/17
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: easyfix
>
> When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
> internal compiler errors are triggered at 
> compute/kernels/aggregate_internal.h:176:24 and at various places in 
> compute/kernels/scalar_set_lookup.cc:
> {code:java}
> cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
> error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> DCHECK_LT(cur_level, levels);
> ~^~~~
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8234 cp_fold_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
> 0x6c8234 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x59bb8a convert_like_real
> ../../gcc-7.5.0/gcc/cp/call.c:7053
> 0x59de12 build_over_call
> ../../gcc-7.5.0/gcc/cp/call.c:7869
> 0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
> ../../gcc-7.5.0/gcc/cp/call.c:4272
> 0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
> ../../gcc-7.5.0/gcc/cp/semantics.c:2501
> 0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17508
> 0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17544
> 0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16732
> 0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16613
> 0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15874
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> 0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16036
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions. {code}
> {code:java}
> cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:5208
> 0x5a689c build_new_op_1
> ../../gcc-7.5.0/gcc/cp/call.c:5978
> 0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/call.c:6022
> 0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:3941
> 0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17001
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17312
> 0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17544
> 0x5d6c47 tsubst_copy_and_build(tree_no

[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang updated ARROW-17567:
--
Description: 
When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d4aae tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15845
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete bac

[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang updated ARROW-17567:
--
Description: 
When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc:
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d4aae tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15845
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete bac

[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17567:
---
Labels: easyfix pull-request-available  (was: easyfix)

> [C++][Compute]Compiler error with gcc7 and c++17
> 
>
> Key: ARROW-17567
> URL: https://issues.apache.org/jira/browse/ARROW-17567
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
> Environment: gcc6/7
> c++14/17
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: easyfix, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
> internal compiler errors are triggered at 
> compute/kernels/aggregate_internal.h:176:24 and at various places in 
> compute/kernels/scalar_set_lookup.cc:
> {code:java}
> cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
> error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> DCHECK_LT(cur_level, levels);
> ~^~~~
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8234 cp_fold_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
> 0x6c8234 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x59bb8a convert_like_real
> ../../gcc-7.5.0/gcc/cp/call.c:7053
> 0x59de12 build_over_call
> ../../gcc-7.5.0/gcc/cp/call.c:7869
> 0x5a3c2f build_new_function_call(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, int)
> ../../gcc-7.5.0/gcc/cp/call.c:4272
> 0x685601 finish_call_expr(tree_node*, vec<tree_node*, va_gc, vl_embed>**, bool, bool, int)
> ../../gcc-7.5.0/gcc/cp/semantics.c:2501
> 0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17508
> 0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17544
> 0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16732
> 0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16613
> 0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15874
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> 0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16036
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions. {code}
> {code:java}
> cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
> error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
> int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:5208
> 0x5a689c build_new_op_1
> ../../gcc-7.5.0/gcc/cp/call.c:5978
> 0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
> tree_node*, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/call.c:6022
> 0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
> tree_node*, tree_code, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:3941
> 0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17001
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17312
> 0x5e121b tsubst_copy_and

[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-30 Thread Gianluca Ficarelli (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597832#comment-17597832
 ] 

Gianluca Ficarelli commented on ARROW-17399:


I tried with a fresh virtualenv on both Linux and Mac (Intel):

Linux (Ubuntu 20.04, 32 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      90.8
  1 time:       3.0 rss:    1205.7
  2 time:       4.6 rss:    1212.6
  3 time:       4.8 rss:     710.0
  4 time:       8.0 rss:     708.2
  5 time:      14.6 rss:   16652.9
  6 time:      17.6 rss:   16242.9
  7 time:      17.7 rss:   15743.5
  8 time:      20.7 rss:     866.2
{code}
Mac (Monterey 12.5, 16 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      64.0
  1 time:       4.0 rss:    1075.0
  2 time:       6.2 rss:    1136.6
  3 time:       6.8 rss:     671.8
  4 time:       9.8 rss:     671.8
  5 time:      22.9 rss:    2477.4
  6 time:      25.9 rss:    2423.4
  7 time:      27.1 rss:     180.6
  8 time:      30.1 rss:     180.6
 {code}
but when the same script is rerun there is some variability on Mac in lines 5 
and 6 (I observed from 1261 to 4140 MB), while on Linux it is always the same.

So it seems that the RSS memory usage is high on Linux only.
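
As a small diagnostic sketch (my own addition, not part of the script above), it may help to check whether the extra RSS is still tracked by Arrow's memory pool or has already been handed back to the allocator:
{code:python}
import pyarrow as pa

# Bytes currently held by live Arrow buffers in the default memory pool.
print("arrow allocated:", pa.total_allocated_bytes())

# Ask the pool to return unused memory to the OS, where the backend supports it.
pa.default_memory_pool().release_unused()
{code}
If {{total_allocated_bytes()}} is small while RSS stays high, the memory is being cached by the allocator (e.g. jemalloc on Linux) rather than retained by live Arrow buffers.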

> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
>
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
>
>
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
>
>
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
>
>
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
>
>
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 500
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
>   0 time:   0.0 rss: 135.4
>   1 time:   4.9 rss:1252.2
>   2 time:   7.1 rss:1265.0
>   3 time:   7.5 rss: 760.2
>   4 time:  10.7 rss: 758.9
>   5 time:  19.6 rss:   16745.4
>   6 time:  22.6 rss:   16335.4
>   7 time:  22.9 rss:   15833.0
>   8 time:  25.9 rss: 955.0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-30 Thread Gianluca Ficarelli (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597832#comment-17597832
 ] 

Gianluca Ficarelli edited comment on ARROW-17399 at 8/30/22 12:14 PM:
--

I tried with a fresh virtualenv on both Linux and Mac (Intel):

Linux (Ubuntu 20.04, 32 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      90.8
  1 time:       3.0 rss:    1205.7
  2 time:       4.6 rss:    1212.6
  3 time:       4.8 rss:     710.0
  4 time:       8.0 rss:     708.2
  5 time:      14.6 rss:   16652.9
  6 time:      17.6 rss:   16242.9
  7 time:      17.7 rss:   15743.5
  8 time:      20.7 rss:     866.2
{code}
Mac (Monterey 12.5, 16 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      64.0
  1 time:       4.0 rss:    1075.0
  2 time:       6.2 rss:    1136.6
  3 time:       6.8 rss:     671.8
  4 time:       9.8 rss:     671.8
  5 time:      22.9 rss:    2477.4
  6 time:      25.9 rss:    2423.4
  7 time:      27.1 rss:     180.6
  8 time:      30.1 rss:     180.6
 {code}
but when the same script is rerun there is some variability on Mac in lines 5 
and 6 (I observed from 1261 to 4140 MB), while on Linux it is always the same 
(around 16 GB in lines 5, 6, 7).

So it seems that the RSS memory usage is high on Linux only.


was (Author: JIRAUSER294344):
I tried with a fresh virtualenv on both Linux and Mac (Intel):

Linux (Ubuntu 20.04, 32 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      90.8
  1 time:       3.0 rss:    1205.7
  2 time:       4.6 rss:    1212.6
  3 time:       4.8 rss:     710.0
  4 time:       8.0 rss:     708.2
  5 time:      14.6 rss:   16652.9
  6 time:      17.6 rss:   16242.9
  7 time:      17.7 rss:   15743.5
  8 time:      20.7 rss:     866.2
{code}
Mac (Monterey 12.5, 16 GB):
{code:java}
$ python -V
Python 3.9.9

$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

$ python test_pyarrow.py 
  0 time:       0.0 rss:      64.0
  1 time:       4.0 rss:    1075.0
  2 time:       6.2 rss:    1136.6
  3 time:       6.8 rss:     671.8
  4 time:       9.8 rss:     671.8
  5 time:      22.9 rss:    2477.4
  6 time:      25.9 rss:    2423.4
  7 time:      27.1 rss:     180.6
  8 time:      30.1 rss:     180.6
 {code}
but when the same script is rerun there is some variability on Mac in lines 5 
and 6 (I observed from 1261 to 4140 MB), while on Linux it is always the same.

So it seems that the RSS memory usage is high on Linux only.

> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
>
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
>
>
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
>
>
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
>
>
> def print_mem(msg, start_time=time.monotonic(), proc

[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597846#comment-17597846
 ] 

Arthur Passos commented on ARROW-17459:
---

[~willjones127] Thank you for sharing this!

 

While your `GetRecordBatchReader` suggestion works for the use case I shared, 
it won't work for this one. Are there any docs I could read to understand the 
internals of the arrow lib in order to implement it? Any tips would be 
appreciated. The only thing that comes to mind right now is to somehow build a 
giant array with all the chunks, but that certainly has a set of implications.
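
For readers following along, the Python analogue of that earlier record-batch-reader suggestion is a streaming read. This is only an illustration ({{process}} is a hypothetical callback and "data.parquet" a stand-in for the file with the map column):
{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
for batch in pf.iter_batches(batch_size=64 * 1024):
    # Each RecordBatch arrives as contiguous arrays, so the nested
    # map/string column never has to be stitched into one chunked array.
    process(batch)  # hypothetical application logic
{code}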

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang updated ARROW-17567:
--
Description: 
When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
internal compiler errors are triggered at 
compute/kernels/aggregate_internal.h:176:24 and at various places in 
compute/kernels/scalar_set_lookup.cc
{code:java}
cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
DCHECK_LT(cur_level, levels);
~^~~~
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8234 cp_fold_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
0x6c8234 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x59bb8a convert_like_real
../../gcc-7.5.0/gcc/cp/call.c:7053
0x59de12 build_over_call
../../gcc-7.5.0/gcc/cp/call.c:7869
0x5a3c2f build_new_function_call(tree_node*, vec**, bool, int)
../../gcc-7.5.0/gcc/cp/call.c:4272
0x685601 finish_call_expr(tree_node*, vec**, bool, 
bool, int)
../../gcc-7.5.0/gcc/cp/semantics.c:2501
0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17508
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16036
0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15860
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions. {code}
{code:java}
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
0x683399 maybe_undo_parenthesized_ref(tree_node*)
../../gcc-7.5.0/gcc/cp/semantics.c:1739
0x6c8638 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
0x6c949c cp_fold_maybe_rvalue
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
0x6c8346 cp_fold
../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
../../gcc-7.5.0/gcc/cp/cvt.c:640
0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
int)
../../gcc-7.5.0/gcc/cp/typeck.c:5208
0x5a689c build_new_op_1
../../gcc-7.5.0/gcc/cp/call.c:5978
0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
tree_node*, tree_node**, int)
../../gcc-7.5.0/gcc/cp/call.c:6022
0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
tree_node*, tree_code, tree_node**, int)
../../gcc-7.5.0/gcc/cp/typeck.c:3941
0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17001
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16940
0x5e1676 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17312
0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:17544
0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
bool)
../../gcc-7.5.0/gcc/cp/pt.c:16732
0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16613
0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15874
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d696b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:16090
0x5d4aae tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../gcc-7.5.0/gcc/cp/pt.c:15845
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete bac

[jira] [Created] (ARROW-17568) [FlightRPC][Integration] Ensure all RPC methods are covered by integration testing

2022-08-30 Thread David Li (Jira)
David Li created ARROW-17568:


 Summary: [FlightRPC][Integration] Ensure all RPC methods are 
covered by integration testing
 Key: ARROW-17568
 URL: https://issues.apache.org/jira/browse/ARROW-17568
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC, Go, Integration, Java
Reporter: David Li


This would help catch issues like https://github.com/apache/arrow/issues/13853



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597879#comment-17597879
 ] 

Nicola Crane commented on ARROW-17497:
--

Hi [~p.r], thanks for reporting this. From the error message above, it looks 
like there was a failure to download a needed external library, Boost.  
Above, there's a reference to the file 
{{/tmp/Rtmp64sUMd/file553b4faa5f21/boost_ep-prefix/src/boost_ep-stamp/boost_ep-download-*.log}}
 which should contain more information - I don't suppose you'd be able to try 
again and send us the contents of that file?

Also, out of interest, what OS are you on?

Alternatively...

The installation above is trying to build libarrow (i.e. the Arrow C++ 
library) from source.  If you set the environment variable {{LIBARROW_BINARY}} 
to {{TRUE}}, it'll instead download a precompiled binary of libarrow (depending 
on your OS, but there are binaries for the vast majority of cases), which 
avoids having to worry about Boost at all.
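
For example (a minimal sketch reusing the commands from the log above; the variable must be set before the install call):
{code}
Sys.setenv(LIBARROW_BINARY = "TRUE")  # download a prebuilt libarrow instead of building it
install.packages("arrow")
{code}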

> [R] installation failure on R Studio Server
> ---
>
> Key: ARROW-17497
> URL: https://issues.apache.org/jira/browse/ARROW-17497
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp
>Priority: Major
>
> Hi,
> I am trying to install arrow on an RStudio Server. I tried different approaches 
> but always ran into the same problem. Do you have any idea what goes wrong? Thanks.
> {code:java}
> > Sys.setenv(ARROW_R_DEV=TRUE)
> > install.packages("arrow")
> Installing package into 
> ‘/home/philipp.roechner/R/x86_64-pc-linux-gnu-library/4.2’
> (as ‘lib’ is not specified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
> For build options and troubleshooting, see the install vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2 
>  cmake: /usr/bin/cmake
>  arrow with SOURCE_DIR='tools/cpp' 
> BUILD_DIR='/tmp/Rtmp64sUMd/file553b4faa5f21' DEST_DIR='libarrow/arrow-9.0.0' 
> CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc' CXX='g++ -std=gnu++11' 
> LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
> ++ pwd
> + : /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + : tools/cpp
> + : /tmp/Rtmp64sUMd/file553b4faa5f21
> + : libarrow/arrow-9.0.0
> + : /usr/bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-9.0.0
> ++ cd libarrow/arrow-9.0.0
> ++ pwd
> + DEST_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/Rtmp64sUMd/file553b4faa5f21
> + pushd /tmp/Rtmp64sUMd/file553b4faa5f21
> /tmp/Rtmp64sUMd/file553b4faa5f21 /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
> -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
> -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
> -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
> -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
> -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
> -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
> -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
> -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
> -DCMAKE_INSTALL_LIBDIR=lib 
> -DCMAKE_INSTALL_PREFIX=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
>  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
> -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
> -Dxsimd_SOURCE= -G 'Unix Makefiles' 
> /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
> -- Building using CMake version: 3.13.4
> -- The C compiler identification is GNU 8.3.0
> -- The CXX compiler identification is GNU 8.3.0
> -- Check for working C compiler: /usr/bin/gcc
> -- Check for working C compiler: /usr/bin/gcc -- works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler: /usr/bin/g++
> -- Check for working CXX compiler: /usr/bin/g++ -- works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 9.0

[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-30 Thread Gianluca Ficarelli (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597904#comment-17597904
 ] 

Gianluca Ficarelli commented on ARROW-17399:


I tested the previous versions of pyarrow on Linux; here are the results:
 * pyarrow==9.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      90.6
  1 time:       2.8 rss:    1205.4
  2 time:       4.4 rss:    1212.3
  3 time:       4.7 rss:     709.7
  4 time:       7.8 rss:     707.9
  5 time:      14.4 rss:   16656.0
  6 time:      17.4 rss:   16246.0
  7 time:      17.5 rss:   15743.6
  8 time:      20.5 rss:     866.3{code}
 * pyarrow==8.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      86.2
  1 time:       2.8 rss:    1200.9
  2 time:       4.3 rss:    2266.2
  3 time:       4.6 rss:    1443.6
  4 time:       7.7 rss:     703.5
  5 time:      14.3 rss:   16648.0
  6 time:      17.3 rss:   16238.0
  7 time:      17.4 rss:   15738.6
  8 time:      20.4 rss:     861.3{code}
 * pyarrow==7.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      84.3
  1 time:       2.8 rss:    1199.1
  2 time:       4.4 rss:    2263.7
  3 time:       4.6 rss:    1441.8
  4 time:       7.7 rss:     701.6
  5 time:       9.8 rss:    3679.9
  6 time:      12.8 rss:    3268.3
  7 time:      12.9 rss:    2766.6
  8 time:      15.9 rss:     859.2
 {code}
 * pyarrow==6.0.1

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.9
  1 time:       2.9 rss:    1196.8
  2 time:       4.5 rss:    2261.4
  3 time:       4.7 rss:    1439.0
  4 time:       7.8 rss:     698.9
  5 time:       9.2 rss:    2224.0
  6 time:      12.2 rss:    1740.4
  7 time:      12.3 rss:    1238.1
  8 time:      15.3 rss:     856.6{code}
 * pyarrow==6.0.0

{code:java}
 pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.7
  1 time:       2.9 rss:    1196.6
  2 time:       4.5 rss:    2261.1
  3 time:       4.7 rss:    1438.5
  4 time:       7.8 rss:     698.4
  5 time:       9.2 rss:    2224.9
  6 time:      12.2 rss:    1740.1
  7 time:      12.3 rss:    1237.7
  8 time:      15.3 rss:     856.2{code}
 * pyarrow==5.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      79.2
  1 time:       2.8 rss:    1194.0
  2 time:       4.3 rss:    2258.3
  3 time:       4.5 rss:    1436.2
  4 time:       7.7 rss:     696.1
  5 time:       9.1 rss:    2221.1
  6 time:      12.1 rss:    1736.3
  7 time:      12.2 rss:    1235.0
  8 time:      15.3 rss:     853.5{code}
So:
 * with pyarrow 9.0.0 and 8.0.0 the results are similar
 * with pyarrow 7.0.0 the memory used seems lower
 * with pyarrow 6.0.0, 6.0.1 and 5.0.0 the memory used is even lower
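
One possible explanation worth testing (my own hypothesis, not established in this thread): on Linux wheels the default pyarrow allocator is typically jemalloc, which can keep freed pages around for a while before returning them to the OS, and its tuning has changed across releases. pyarrow exposes a knob for this:
{code:python}
import pyarrow as pa

# Which allocator backend is in use (e.g. "jemalloc" on Linux wheels).
print(pa.default_memory_pool().backend_name)

# Tell jemalloc to return dirty pages to the OS immediately
# (0 disables the decay delay).
pa.jemalloc_set_decay_ms(0)
{code}
If RSS drops with this setting, the growth is allocator caching rather than a leak.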

> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
> import numpy as np
> impo

[jira] [Comment Edited] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-30 Thread Gianluca Ficarelli (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597904#comment-17597904
 ] 

Gianluca Ficarelli edited comment on ARROW-17399 at 8/30/22 2:32 PM:
-

I tested the previous versions of pyarrow on Linux; here are the results:
 * pyarrow==9.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      90.6
  1 time:       2.8 rss:    1205.4
  2 time:       4.4 rss:    1212.3
  3 time:       4.7 rss:     709.7
  4 time:       7.8 rss:     707.9
  5 time:      14.4 rss:   16656.0
  6 time:      17.4 rss:   16246.0
  7 time:      17.5 rss:   15743.6
  8 time:      20.5 rss:     866.3{code}
 * pyarrow==8.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      86.2
  1 time:       2.8 rss:    1200.9
  2 time:       4.3 rss:    2266.2
  3 time:       4.6 rss:    1443.6
  4 time:       7.7 rss:     703.5
  5 time:      14.3 rss:   16648.0
  6 time:      17.3 rss:   16238.0
  7 time:      17.4 rss:   15738.6
  8 time:      20.4 rss:     861.3{code}
 * pyarrow==7.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      84.3
  1 time:       2.8 rss:    1199.1
  2 time:       4.4 rss:    2263.7
  3 time:       4.6 rss:    1441.8
  4 time:       7.7 rss:     701.6
  5 time:       9.8 rss:    3679.9
  6 time:      12.8 rss:    3268.3
  7 time:      12.9 rss:    2766.6
  8 time:      15.9 rss:     859.2
 {code}
 * pyarrow==6.0.1

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.9
  1 time:       2.9 rss:    1196.8
  2 time:       4.5 rss:    2261.4
  3 time:       4.7 rss:    1439.0
  4 time:       7.8 rss:     698.9
  5 time:       9.2 rss:    2224.0
  6 time:      12.2 rss:    1740.4
  7 time:      12.3 rss:    1238.1
  8 time:      15.3 rss:     856.6{code}
 * pyarrow==6.0.0

{code:java}
 pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.7
  1 time:       2.9 rss:    1196.6
  2 time:       4.5 rss:    2261.1
  3 time:       4.7 rss:    1438.5
  4 time:       7.8 rss:     698.4
  5 time:       9.2 rss:    2224.9
  6 time:      12.2 rss:    1740.1
  7 time:      12.3 rss:    1237.7
  8 time:      15.3 rss:     856.2{code}
 * pyarrow==5.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      79.2
  1 time:       2.8 rss:    1194.0
  2 time:       4.3 rss:    2258.3
  3 time:       4.5 rss:    1436.2
  4 time:       7.7 rss:     696.1
  5 time:       9.1 rss:    2221.1
  6 time:      12.1 rss:    1736.3
  7 time:      12.2 rss:    1235.0
  8 time:      15.3 rss:     853.5{code}
So:
 * with pyarrow 9.0.0 and 8.0.0 the results are similar
 * with pyarrow 7.0.0 the memory used seems lower (and the execution time is 
shorter)
 * with pyarrow 6.0.0, 6.0.1 and 5.0.0 the memory used is even lower


was (Author: JIRAUSER294344):
I tested the previous versions of pyarrow on Linux; here are the results:
 * pyarrow==9.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      90.6
  1 time:       2.8 rss:    1205.4
  2 time:       4.4 rss:    1212.3
  3 time:       4.7 rss:     709.7
  4 time:       7.8 rss:     707.9
  5 time:      14.4 rss:   16656.0
  6 time:      17.4 rss:   16246.0
  7 time:      17.5 rss:   15743.6
  8 time:      20.5 rss:     866.3{code}
 * pyarrow==8.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      86.2
  1 time:       2.8 rss:    1200.9
  2 time:       4.3 rss:    2266.2
  3 time:       4.6 rss:    1443.6
  4 time:       7.7 rss:     703.5
  5 time:      14.3 rss:   16648.0
  6 time:      17.3 rss:   16238.0
  7 time:      17.4 rss:   15738.6
  8 time:      20.4 rss:     861.3{code}
 * pyarrow==7.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      84.3
  1 time:       2.8 rss:    1199.1
  2 time:       4.4 rss:    2263.7
  3

[jira] [Comment Edited] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-30 Thread Gianluca Ficarelli (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597904#comment-17597904
 ] 

Gianluca Ficarelli edited comment on ARROW-17399 at 8/30/22 2:32 PM:
-

I tested the previous versions of pyarrow on Linux; here are the results:
 * pyarrow==9.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      90.6
  1 time:       2.8 rss:    1205.4
  2 time:       4.4 rss:    1212.3
  3 time:       4.7 rss:     709.7
  4 time:       7.8 rss:     707.9
  5 time:      14.4 rss:   16656.0
  6 time:      17.4 rss:   16246.0
  7 time:      17.5 rss:   15743.6
  8 time:      20.5 rss:     866.3{code}
 * pyarrow==8.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      86.2
  1 time:       2.8 rss:    1200.9
  2 time:       4.3 rss:    2266.2
  3 time:       4.6 rss:    1443.6
  4 time:       7.7 rss:     703.5
  5 time:      14.3 rss:   16648.0
  6 time:      17.3 rss:   16238.0
  7 time:      17.4 rss:   15738.6
  8 time:      20.4 rss:     861.3{code}
 * pyarrow==7.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      84.3
  1 time:       2.8 rss:    1199.1
  2 time:       4.4 rss:    2263.7
  3 time:       4.6 rss:    1441.8
  4 time:       7.7 rss:     701.6
  5 time:       9.8 rss:    3679.9
  6 time:      12.8 rss:    3268.3
  7 time:      12.9 rss:    2766.6
  8 time:      15.9 rss:     859.2
 {code}
 * pyarrow==6.0.1

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.9
  1 time:       2.9 rss:    1196.8
  2 time:       4.5 rss:    2261.4
  3 time:       4.7 rss:    1439.0
  4 time:       7.8 rss:     698.9
  5 time:       9.2 rss:    2224.0
  6 time:      12.2 rss:    1740.4
  7 time:      12.3 rss:    1238.1
  8 time:      15.3 rss:     856.6{code}
 * pyarrow==6.0.0

{code:java}
 pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      81.7
  1 time:       2.9 rss:    1196.6
  2 time:       4.5 rss:    2261.1
  3 time:       4.7 rss:    1438.5
  4 time:       7.8 rss:     698.4
  5 time:       9.2 rss:    2224.9
  6 time:      12.2 rss:    1740.1
  7 time:      12.3 rss:    1237.7
  8 time:      15.3 rss:     856.2{code}
 * pyarrow==5.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      79.2
  1 time:       2.8 rss:    1194.0
  2 time:       4.3 rss:    2258.3
  3 time:       4.5 rss:    1436.2
  4 time:       7.7 rss:     696.1
  5 time:       9.1 rss:    2221.1
  6 time:      12.1 rss:    1736.3
  7 time:      12.2 rss:    1235.0
  8 time:      15.3 rss:     853.5{code}
So:
 * with pyarrow 9.0.0 and 8.0.0 the results are similar
 * with pyarrow 7.0.0 the memory used seems lower (and execution is faster)
 * with pyarrow 6.0.0, 6.0.1 and 5.0.0 the memory used is even lower (and execution is faster)


was (Author: JIRAUSER294344):
I tested the previous versions of pyarrow on Linux; here are the results:
 * pyarrow==9.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      90.6
  1 time:       2.8 rss:    1205.4
  2 time:       4.4 rss:    1212.3
  3 time:       4.7 rss:     709.7
  4 time:       7.8 rss:     707.9
  5 time:      14.4 rss:   16656.0
  6 time:      17.4 rss:   16246.0
  7 time:      17.5 rss:   15743.6
  8 time:      20.5 rss:     866.3{code}
 * pyarrow==8.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      86.2
  1 time:       2.8 rss:    1200.9
  2 time:       4.3 rss:    2266.2
  3 time:       4.6 rss:    1443.6
  4 time:       7.7 rss:     703.5
  5 time:      14.3 rss:   16648.0
  6 time:      17.3 rss:   16238.0
  7 time:      17.4 rss:   15738.6
  8 time:      20.4 rss:     861.3{code}
 * pyarrow==7.0.0

{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0

python test_pyarrow_orig.py 
  0 time:       0.0 rss:      84.3
  1 time:       2.8 rss:    1199.1
  2 time:       4.4 rss:    2263.7
  3 time:    

[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597915#comment-17597915
 ] 

Will Jones commented on ARROW-17459:


We have a section of our docs devoted to [developer setup and 
guidelines|https://arrow.apache.org/docs/developers/contributing.html]. And we 
have documentation describing the [Arrow in-memory 
format|https://arrow.apache.org/docs/format/Columnar.html] (it may be worth 
reviewing the structure of nested arrays, for example). For the internals of 
the Parquet arrow code, it's best to read through the source headers at 
{{{}cpp/src/parquet/arrow/{}}}.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17569) Bump xsimd to 9.0.0

2022-08-30 Thread Serge Guelton (Jira)
Serge Guelton created ARROW-17569:
-

 Summary: Bump xsimd to 9.0.0
 Key: ARROW-17569
 URL: https://issues.apache.org/jira/browse/ARROW-17569
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Serge Guelton


xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
match it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-4852) [Go] add shmem allocator

2022-08-30 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol closed ARROW-4852.

Resolution: Abandoned

> [Go] add shmem allocator
> 
>
> Key: ARROW-4852
> URL: https://issues.apache.org/jira/browse/ARROW-4852
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Priority: Major
>
> Go-Arrow doesn't implement the IPC protocol yet.
> In the meantime, to exchange data with other languages, a nice stop-gap 
> solution would be to have a shmem allocator where one would put Arrow arrays.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13742) [C++][Go] Expose Dataset API to Go via C interface

2022-08-30 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol closed ARROW-13742.
-
Resolution: Won't Do

Closing this in favor of a Native Go approach rather than using CGO

> [C++][Go] Expose Dataset API to Go via C interface
> --
>
> Key: ARROW-13742
> URL: https://issues.apache.org/jira/browse/ARROW-13742
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration, Go
>Reporter: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Rather than attempting to reimplement the Compute and Dataset APIs in Go, 
> which would be fairly unmaintainable, we can expose the Compute and Dataset 
> APIs from the libraries using a C interface and then have Go access them via 
> cgo and the C Data Interface. 
> This requires both adding a C API to the Dataset library and a new Go module, 
> which would then need to link against it at build time. As a 
> result, it also requires additions to the build scripts to properly test it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17557) [Go] WASM build fails

2022-08-30 Thread Matthew Topol (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597927#comment-17597927
 ] 

Matthew Topol commented on ARROW-17557:
---

[~tschaub] Can you post the main.go file you were using, and try building 
with `-tags noasm`?
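
For reference, that is the same invocation as in the report with the extra tag (assuming the same main.go):
{code}
GOOS=js GOARCH=wasm go build -tags noasm -o test.wasm ./main.go
{code}
The {{noasm}} tag should make the module fall back to pure-Go implementations of kernels such as the {{TransposeInt...}} functions, which otherwise come from per-architecture assembly that is not generated for {{wasm}}.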

> [Go] WASM build fails
> -
>
> Key: ARROW-17557
> URL: https://issues.apache.org/jira/browse/ARROW-17557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Tim Schaub
>Priority: Major
>
> I see ARROW-4689 and it looks like 
> [https://github.com/apache/arrow/pull/3707] was supposed to add support for 
> building with {{GOOS=js GOARCH=wasm}}.
> When I try to build a wasm binary, I get the following failure
> {code}
> # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go
> # github.com/apache/arrow/go/v9/internal/utils
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4:
>  undefined: TransposeInt8Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4:
>  undefined: TransposeInt8Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4:
>  undefined: TransposeInt8Int32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4:
>  undefined: TransposeInt8Int64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4:
>  undefined: TransposeInt8Uint8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4:
>  undefined: TransposeInt8Uint16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4:
>  undefined: TransposeInt8Uint32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4:
>  undefined: TransposeInt8Uint64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4:
>  undefined: TransposeInt16Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  undefined: TransposeInt16Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  too many errors
> # github.com/apache/thrift/lib/go/thrift
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63:
>  undefined: syscall.MSG_PEEK
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80:
>  undefined: syscall.MSG_DONTWAIT
> {code}
> {code}
> go version go1.18.2 darwin/arm64
> {code}
> {code}
> github.com/apache/arrow/go/v9 v9.0.0
> {code}
> Does additional code need to be generated for the {{wasm}} arch?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17569) [C++] Bump xsimd to 9.0.0

2022-08-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17569:
---
Summary: [C++] Bump xsimd to 9.0.0  (was: Bump xsimd to 9.0.0)

> [C++] Bump xsimd to 9.0.0
> -
>
> Key: ARROW-17569
> URL: https://issues.apache.org/jira/browse/ARROW-17569
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Serge Guelton
>Priority: Minor
>
> xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
> match it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17569) Bump xsimd to 9.0.0

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597929#comment-17597929
 ] 

Antoine Pitrou commented on ARROW-17569:


cc [~yibocai]. But [~serge sans paille], feel free to submit a PR as well.

> Bump xsimd to 9.0.0
> ---
>
> Key: ARROW-17569
> URL: https://issues.apache.org/jira/browse/ARROW-17569
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Serge Guelton
>Priority: Minor
>
> xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
> match it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17557) [Go] WASM build fails

2022-08-30 Thread Matthew Topol (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597933#comment-17597933
 ] 

Matthew Topol commented on ARROW-17557:
---

I can confirm that building with the `noasm` tag produces a successful WASM 
build that I was able to run in my browser. Please comment back and let me know 
if this works for you, [~tschaub].

> [Go] WASM build fails
> -
>
> Key: ARROW-17557
> URL: https://issues.apache.org/jira/browse/ARROW-17557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Tim Schaub
>Priority: Major
>
> I see ARROW-4689 and it looks like 
> [https://github.com/apache/arrow/pull/3707] was supposed to add support for 
> building with {{GOOS=js GOARCH=wasm}}.
> When I try to build a wasm binary, I get the following failure
> {code}
> # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go
> # github.com/apache/arrow/go/v9/internal/utils
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4:
>  undefined: TransposeInt8Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4:
>  undefined: TransposeInt8Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4:
>  undefined: TransposeInt8Int32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4:
>  undefined: TransposeInt8Int64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4:
>  undefined: TransposeInt8Uint8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4:
>  undefined: TransposeInt8Uint16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4:
>  undefined: TransposeInt8Uint32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4:
>  undefined: TransposeInt8Uint64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4:
>  undefined: TransposeInt16Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  undefined: TransposeInt16Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  too many errors
> # github.com/apache/thrift/lib/go/thrift
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63:
>  undefined: syscall.MSG_PEEK
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80:
>  undefined: syscall.MSG_DONTWAIT
> {code}
> {code}
> go version go1.18.2 darwin/arm64
> {code}
> {code}
> github.com/apache/arrow/go/v9 v9.0.0
> {code}
> Does additional code need to be generated for the {{wasm}} arch?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17567) [C++][Compute]Compiler error with gcc7 and c++17

2022-08-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17567.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14004
[https://github.com/apache/arrow/pull/14004]

> [C++][Compute]Compiler error with gcc7 and c++17
> 
>
> Key: ARROW-17567
> URL: https://issues.apache.org/jira/browse/ARROW-17567
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
> Environment: gcc6/7
> c++14/17
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: easyfix, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When compiling the C++ compute component with gcc 6/7 and std=c++14/17, 
> internal compiler errors are triggered at 
> compute/kernels/aggregate_internal.h:176:24 and at various places in 
> compute/kernels/scalar_set_lookup.cc
> {code:java}
> cpp/src/arrow/compute/kernels/aggregate_internal.h:176:24: internal compiler 
> error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> DCHECK_LT(cur_level, levels);
> ~^~~~
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8234 cp_fold_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2063
> 0x6c8234 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2304
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x59bb8a convert_like_real
> ../../gcc-7.5.0/gcc/cp/call.c:7053
> 0x59de12 build_over_call
> ../../gcc-7.5.0/gcc/cp/call.c:7869
> 0x5a3c2f build_new_function_call(tree_node*, vec vl_embed>**, bool, int)
> ../../gcc-7.5.0/gcc/cp/call.c:4272
> 0x685601 finish_call_expr(tree_node*, vec**, 
> bool, bool, int)
> ../../gcc-7.5.0/gcc/cp/semantics.c:2501
> 0x5e1b83 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17508
> 0x5e121b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17544
> 0x5d6c47 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16732
> 0x5d6c47 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16613
> 0x5d6a85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15874
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> 0x5d61de tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16036
> 0x5d6ad5 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:15860
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See  for instructions. {code}
> {code:java}
> cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:70:50: internal compiler 
> error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1740
> auto on_found = [&](int32_t memo_index) { DCHECK_LT(memo_index, memo_size); };
> 0x683399 maybe_undo_parenthesized_ref(tree_node*)
> ../../gcc-7.5.0/gcc/cp/semantics.c:1739
> 0x6c8638 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2180
> 0x6c949c cp_fold_maybe_rvalue
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2042
> 0x6c8346 cp_fold
> ../../gcc-7.5.0/gcc/cp/cp-gimplify.c:2149
> 0x66a037 cp_convert_and_check(tree_node*, tree_node*, int)
> ../../gcc-7.5.0/gcc/cp/cvt.c:640
> 0x65f5d4 cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, 
> int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:5208
> 0x5a689c build_new_op_1
> ../../gcc-7.5.0/gcc/cp/call.c:5978
> 0x5a737e build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, 
> tree_node*, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/call.c:6022
> 0x657a12 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, 
> tree_node*, tree_code, tree_node**, int)
> ../../gcc-7.5.0/gcc/cp/typeck.c:3941
> 0x5e04ff tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:17001
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1120 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, 
> bool)
> ../../gcc-7.5.0/gcc/cp/pt.c:16940
> 0x5e1676 tsubst_copy_and_build(tree_n

[jira] [Updated] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Philipp (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp updated ARROW-17497:

Description: 
Hi,

I am trying to install arrow on an RStudio Server. I tried different approaches 
but always ran into the same problem. Do you have any idea what goes wrong? Thanks.
{code:java}
> Sys.setenv(ARROW_R_DEV=TRUE)
> install.packages("arrow")
Installing package into ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is not specified)
trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
==
downloaded 4.7 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For build options and troubleshooting, see the install vignette:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2 
 cmake: /usr/bin/cmake
 arrow with SOURCE_DIR='tools/cpp' 
BUILD_DIR='/tmp/Rtmp64sUMd/file553b4faa5f21' DEST_DIR='libarrow/arrow-9.0.0' 
CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc' CXX='g++ -std=gnu++11' 
LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
++ pwd
+ : /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
+ : tools/cpp
+ : /tmp/Rtmp64sUMd/file553b4faa5f21
+ : libarrow/arrow-9.0.0
+ : /usr/bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
++ mkdir -p libarrow/arrow-9.0.0
++ cd libarrow/arrow-9.0.0
++ pwd
+ DEST_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /tmp/Rtmp64sUMd/file553b4faa5f21
+ pushd /tmp/Rtmp64sUMd/file553b4faa5f21
/tmp/Rtmp64sUMd/file553b4faa5f21 /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
-DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
-DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
-DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
-DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
-DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
-Dxsimd_SOURCE= -G 'Unix Makefiles' 
/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/gcc
-- Check for working C compiler: /usr/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 9.0.0 (full: '9.0.0')
-- Arrow SO version: 900 (full: 900.0.0)
-- clang-tidy 12 not found
-- clang-format 12 not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
fatal: not a git repository (or any of the parent directories): .git
-- Found Python3: /usr/bin/python3.7 (found version "3.7.3") found components:  
Interpreter 
-- Found cpplint executable at 
/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
-- Using ld linker
-- Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Using AUTO approach to find dependencies
-- ARROW_ABSL_BUILD_VERSION: 20211102.0
-- ARROW_ABSL_BUILD_SHA256_CHECKSUM: 
dcf71b9cba8dc0ca9940c4b316a0c796be8fab42b070bb6b7cab62b48f0e66c4
-- ARROW_AWSSDK_BUILD_VERSION: 1.8.133
-- ARROW_AWSSDK_BUILD_SHA256_CHECKSUM: 
d6c495bc06be5e21dac716571305d77437e7cfd62a2226b8fe48d9ab5785a8d6
-- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.12
-- ARROW_AWS_CHECKSUMS_BUILD_SHA256_CHECKSUM: 
394723034b8

[jira] [Commented] (ARROW-15006) [Python][Doc] Iteratively enable more numpydoc checks

2022-08-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597937#comment-17597937
 ] 

Will Jones commented on ARROW-15006:


Great spreadsheet! The best place for developer discussions is the dev mailing 
list, since it's the most public. Add a link to this ticket and the 
spreadsheet, and you can add a {{[Python][Doc]}} prefix to the subject to help 
recipients know whether it's relevant to them.
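
For anyone picking up the remaining checks: numpydoc ships a {{validate}} 
helper that reports these rule codes directly, so each rule can be tried 
locally before enabling it in CI. A minimal sketch (assuming the numpydoc 
package is installed; {{pyarrow.array}} is just an example target):
{code:java}
# Sketch: run numpydoc validation against one object and print the rule
# codes it trips (the same PR01/SS03/... codes listed in this ticket).
from numpydoc.validate import validate

result = validate("pyarrow.array")  # fully qualified name; example target
for code, message in result["errors"]:
    print(code, message)
{code}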

> [Python][Doc] Iteratively enable more numpydoc checks
> -
>
> Key: ARROW-15006
> URL: https://issues.apache.org/jira/browse/ARROW-15006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Krisztian Szucs
>Assignee: Bryce Mecum
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As of https://github.com/apache/arrow/pull/7732 we're going to have a numpydoc 
> check running on pull requests. There is a single rule enabled at the moment: 
> PR01
> Additional checks we can run:
> {code}
> ERROR_MSGS = {
> "GL01": "Docstring text (summary) should start in the line immediately "
> "after the opening quotes (not in the same line, or leaving a "
> "blank line in between)",
> "GL02": "Closing quotes should be placed in the line after the last text "
> "in the docstring (do not close the quotes in the same line as "
> "the text, or leave a blank line between the last text and the "
> "quotes)",
> "GL03": "Double line break found; please use only one blank line to "
> "separate sections or paragraphs, and do not leave blank lines "
> "at the end of docstrings",
> "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use '
> "whitespace only",
> "GL06": 'Found unknown section "{section}". Allowed sections are: '
> "{allowed_sections}",
> "GL07": "Sections are in the wrong order. Correct order is: 
> {correct_sections}",
> "GL08": "The object does not have a docstring",
> "GL09": "Deprecation warning should precede extended summary",
> "GL10": "reST directives {directives} must be followed by two colons",
> "SS01": "No summary found (a short summary in a single line should be "
> "present at the beginning of the docstring)",
> "SS02": "Summary does not start with a capital letter",
> "SS03": "Summary does not end with a period",
> "SS04": "Summary contains heading whitespaces",
> "SS05": "Summary must start with infinitive verb, not third person "
> '(e.g. use "Generate" instead of "Generates")',
> "SS06": "Summary should fit in a single line",
> "ES01": "No extended summary found",
> "PR01": "Parameters {missing_params} not documented",
> "PR02": "Unknown parameters {unknown_params}",
> "PR03": "Wrong parameters order. Actual: {actual_params}. "
> "Documented: {documented_params}",
> "PR04": 'Parameter "{param_name}" has no type',
> "PR05": 'Parameter "{param_name}" type should not finish with "."',
> "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead '
> 'of "{wrong_type}"',
> "PR07": 'Parameter "{param_name}" has no description',
> "PR08": 'Parameter "{param_name}" description should start with a '
> "capital letter",
> "PR09": 'Parameter "{param_name}" description should finish with "."',
> "PR10": 'Parameter "{param_name}" requires a space before the colon '
> "separating the parameter name and type",
> "RT01": "No Returns section found",
> "RT02": "The first line of the Returns section should contain only the "
> "type, unless multiple values are being returned",
> "RT03": "Return value has no description",
> "RT04": "Return value description should start with a capital letter",
> "RT05": 'Return value description should finish with "."',
> "YD01": "No Yields section found",
> "SA01": "See Also section not found",
> "SA02": "Missing period at end of description for See Also "
> '"{reference_name}" reference',
> "SA03": "Description should be capitalized for See Also "
> '"{reference_name}" reference',
> "SA04": 'Missing description for See Also "{reference_name}" reference',
> "EX01": "No examples section found",
> }
> {code}
> cc [~alenkaf] [~amol-] [~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Philipp (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597938#comment-17597938
 ] 

Philipp commented on ARROW-17497:
-

Hi [~thisisnic],

thank you for your reply. My OS is Debian GNU/Linux 10 (buster).

Following your suggestion, I tried to download the precompiled binary of 
libarrow. It seems like there is no binary for my OS; see below. I will post 
the log files in a separate comment.
{code:java}
// > Sys.getenv("LIBARROW_BINARY")
[1] "TRUE"
> install.packages("arrow")
Installing package into ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
==
downloaded 4.7 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found libcurl and openssl >= 1.0.2
trying URL 
'https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip'
Error in download.file(from_url, to_file, quiet = hush) : 
  cannot open URL 
'https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip'
*** No libarrow binary found for version 9.0.0 (ubuntu-18.04)
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For build options and troubleshooting, see the install vignette:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2 
 cmake: /usr/bin/cmake
 arrow with SOURCE_DIR='tools/cpp' 
BUILD_DIR='/home/*/private-fs04/R/sd-008_auto_tz_prod/log' 
DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' 
CC='gcc' CXX='g++ -std=gnu++11' LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' 
ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
++ pwd
+ : /tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow
+ : tools/cpp
+ : /home/*/private-fs04/R/sd-008_auto_tz_prod/log
+ : libarrow/arrow-9.0.0
+ : /usr/bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp
++ mkdir -p libarrow/arrow-9.0.0
++ cd libarrow/arrow-9.0.0
++ pwd
+ DEST_DIR=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/libarrow/arrow-9.0.0
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /home/*/private-fs04/R/sd-008_auto_tz_prod/log
+ pushd /home/*/private-fs04/R/sd-008_auto_tz_prod/log
~/private-fs04/R/sd-008_auto_tz_prod/log 
/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
-DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
-DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
-DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
-DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
-DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/libarrow/arrow-9.0.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
-Dxsimd_SOURCE= -G 'Unix Makefiles' 
/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp
CMake Error: The source 
"/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp/CMakeLists.txt" does not 
match the source 
"/tmp/RtmpAw7akN/R.INSTALL66f76729afb8/arrow/tools/cpp/CMakeLists.txt" used to 
generate cache.  Re-run cmake with a different source directory.
 Error building Arrow C++.  
- NOTE ---
There was an issue preparing the Arrow C++ libraries.
See https://arrow.apache.org/docs/r/articles/install.html
-
ERROR: configuration failed for package ‘arrow’
* removing ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2/arrow’
Warning in install.packages :
  installation of package ‘arrow’ had non-zero exit status

The downloaded source packages are in 
‘/tmp/RtmpXC0LYV/downloaded_packages’ {code}

> [R] installation failure on R Studio Server
> ---
>
> Key: ARROW-17497
> URL: https://issues.apache.org/jira/browse/ARROW-17497
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp
>Priority: Major
>
> Hi,
> I am trying to install arrow on an RStudio Server. I tried different approaches 
> but always ran into the same problem. Do you have any idea what goes wrong? Thanks.
> {code:java}
> // > Sys.setenv(ARROW_R_DEV=TRUE)
> > install.packages("arrow")
> Installiere Pake

[jira] [Resolved] (ARROW-17523) [C++] Support more substrait function

2022-08-30 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-17523.
-
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13969
[https://github.com/apache/arrow/pull/13969]

> [C++] Support more substrait function
> -
>
> Key: ARROW-17523
> URL: https://issues.apache.org/jira/browse/ARROW-17523
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 10.0.0
>Reporter: Jin Chengcheng
>Assignee: Jin Chengcheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Support the is_null, is_not_null and count functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Philipp (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597938#comment-17597938
 ] 

Philipp edited comment on ARROW-17497 at 8/30/22 3:30 PM:
--

Hi [~thisisnic],

thank you for your reply. My OS is Debian GNU/Linux 10 (buster).

Following your suggestion, I tried to download the precompiled binary of 
libarrow. It seems like there is no binary for my OS; see below. I will post 
the log files in a separate comment.
{code:java}
// > Sys.getenv("LIBARROW_BINARY")
[1] "TRUE"
> install.packages("arrow")
Installing package into ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
==
downloaded 4.7 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found libcurl and openssl >= 1.0.2
trying URL 
'https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip'
Error in download.file(from_url, to_file, quiet = hush) : 
  cannot open URL 
'https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip'
*** No libarrow binary found for version 9.0.0 (ubuntu-18.04)
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For build options and troubleshooting, see the install vignette:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2 
 cmake: /usr/bin/cmake
 arrow with SOURCE_DIR='tools/cpp' 
BUILD_DIR='/home/*/private-fs04/R/sd-008_auto_tz_prod/log' 
DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' 
CC='gcc' CXX='g++ -std=gnu++11' LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' 
ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
++ pwd
+ : /tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow
+ : tools/cpp
+ : /home/*/private-fs04/R/sd-008_auto_tz_prod/log
+ : libarrow/arrow-9.0.0
+ : /usr/bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp
++ mkdir -p libarrow/arrow-9.0.0
++ cd libarrow/arrow-9.0.0
++ pwd
+ DEST_DIR=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/libarrow/arrow-9.0.0
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /home/*/private-fs04/R/sd-008_auto_tz_prod/log
+ pushd /home/*/private-fs04/R/sd-008_auto_tz_prod/log
~/private-fs04/R/sd-008_auto_tz_prod/log 
/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
-DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
-DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
-DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
-DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
-DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/libarrow/arrow-9.0.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
-Dxsimd_SOURCE= -G 'Unix Makefiles' 
/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp
CMake Error: The source 
"/tmp/RtmpUwyJJQ/R.INSTALL6f55305b382c/arrow/tools/cpp/CMakeLists.txt" does not 
match the source 
"/tmp/RtmpAw7akN/R.INSTALL66f76729afb8/arrow/tools/cpp/CMakeLists.txt" used to 
generate cache.  Re-run cmake with a different source directory.
 Error building Arrow C++.  
- NOTE ---
There was an issue preparing the Arrow C++ libraries.
See https://arrow.apache.org/docs/r/articles/install.html
-
ERROR: configuration failed for package ‘arrow’
* removing ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2/arrow’
Warning in install.packages :
  installation of package ‘arrow’ had non-zero exit status

The downloaded source packages are in 
‘/tmp/RtmpXC0LYV/downloaded_packages’ {code}


was (Author: JIRAUSER294767):
Hi [~thisisnic],

thank you for your reply. My OS is Debian GNU/Linux 10 (buster).

Following your suggestion, I tried to download the precompiled binary of 
libarrow. Is seems like there is no binary for my OS, see below. I will post 
the log files in as separate command.
{code:java}
// > Sys.getenv("LIBARROW_BINARY")
[1] "TRUE"
> install.packages("arrow")
Installiere Paket nach ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
(da ‘lib’ nicht spezifiziert)
versuche URL 'https://cloud.r-project.org/src/contrib/a

[jira] [Commented] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Philipp (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597943#comment-17597943
 ] 

Philipp commented on ARROW-17497:
-

Hi [~thisisnic],

here is the content of the log file. Do you have any idea what is going wrong?
{code:java}
// -- Downloading...
   
dst='/home/*/private-fs04/R/sd-008_auto_tz_prod/log/boost_ep-prefix/src/boost_1_75_0.tar.gz'
   timeout='none'
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
-- Retrying...
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
-- Retry after 5 seconds (attempt #2) ...
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
-- Retry after 5 seconds (attempt #3) ...
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
-- Retry after 15 seconds (attempt #4) ...
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
-- Retry after 60 seconds (attempt #5) ...
-- Using 
src='https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz'
-- Using 
src='https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.gz'
-- Using 
src='https://sourceforge.net/projects/boost/files/boost/1.75.0/boost_1_75_0.tar.gz'
 {code}

> [R] installation failure on R Studio Server
> ---
>
> Key: ARROW-17497
> URL: https://issues.apache.org/jira/browse/ARROW-17497
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp
>Priority: Major
>
> Hi,
> I am trying to install arrow on an RStudio Server. I tried different approaches 
> but always ran into the same problem. Do you have any idea what goes wrong? Thanks.
> {code:java}
> // > Sys.setenv(ARROW_R_DEV=TRUE)
> > install.packages("arrow")
> Installing package into ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
> (as ‘lib’ is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
> For build options and troubleshooting, see the install vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2 
>  cmake: /usr/bin/cmake
>  arrow with SOURCE_DIR='tools/cpp' 
> BUILD_DIR='/tmp/Rtmp64sUMd/file553b4faa5f21' DEST_DIR='libarrow/arrow-9.0.0' 
> CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc' CXX='g++ -std=gnu++11' 
> LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
> ++ pwd
> + : /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + : tools/cpp
> + : /tmp/Rtmp64sUMd/file553b4faa5f21
> + : libarrow/arrow-9.0.0
> + : /usr/bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-9.0.0
> ++ cd libarrow/arrow-9.0.0
> ++ pwd
> + DEST_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/Rtmp64sUMd/file553b4faa5f21
> + pushd /tmp/Rtmp64sUMd/file553b4faa5f21
> /tmp/Rtmp64sUMd/file553b4faa5f21 /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
> -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
> -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
> -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
> -DARROW_MIMALLOC=OFF -DARROW_JSO

[jira] [Updated] (ARROW-17551) [Go] Implement temporal cast functions

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17551:
---
Labels: pull-request-available  (was: )

> [Go] Implement temporal cast functions
> --
>
> Key: ARROW-17551
> URL: https://issues.apache.org/jira/browse/ARROW-17551
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17497) [R] installation failure on R Studio Server

2022-08-30 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597946#comment-17597946
 ] 

Neal Richardson commented on ARROW-17497:
-

Given that both 
[https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz]
 and 
[https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip]
 fail to download, I wonder if your server/firewall is blocking traffic to some 
IPs.
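
A quick way to check that from the affected host is to probe the same URLs 
directly; a minimal sketch in Python (assuming Python 3 is available on the 
server):
{code:java}
# Probe the two URLs that failed and report the status or the exact error.
import urllib.request

urls = [
    "https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/boost_1_75_0.tar.gz",
    "https://apache.jfrog.io/artifactory/arrow/r/9.0.0/libarrow/bin/ubuntu-18.04/arrow-9.0.0.zip",
]
for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)  # 200 means reachable
    except Exception as exc:
        print(url, "->", exc)  # DNS/proxy/firewall failures show up here
{code}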

> [R] installation failure on R Studio Server
> ---
>
> Key: ARROW-17497
> URL: https://issues.apache.org/jira/browse/ARROW-17497
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp
>Priority: Major
>
> Hi,
> I am trying to install arrow on an RStudio Server. I tried different approaches 
> but always ran into the same problem. Do you have any idea what goes wrong? Thanks.
> {code:java}
> // > Sys.setenv(ARROW_R_DEV=TRUE)
> > install.packages("arrow")
> Installing package into ‘/home/*/R/x86_64-pc-linux-gnu-library/4.2’
> (as ‘lib’ is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/x-gzip' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
> For build options and troubleshooting, see the install vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2 
>  cmake: /usr/bin/cmake
>  arrow with SOURCE_DIR='tools/cpp' 
> BUILD_DIR='/tmp/Rtmp64sUMd/file553b4faa5f21' DEST_DIR='libarrow/arrow-9.0.0' 
> CMAKE='/usr/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc' CXX='g++ -std=gnu++11' 
> LDFLAGS='-Wl,-z,relro' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' 
> ++ pwd
> + : /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + : tools/cpp
> + : /tmp/Rtmp64sUMd/file553b4faa5f21
> + : libarrow/arrow-9.0.0
> + : /usr/bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-9.0.0
> ++ cd libarrow/arrow-9.0.0
> ++ pwd
> + DEST_DIR=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/Rtmp64sUMd/file553b4faa5f21
> + pushd /tmp/Rtmp64sUMd/file553b4faa5f21
> /tmp/Rtmp64sUMd/file553b4faa5f21 /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow
> + /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
> -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
> -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
> -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
> -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
> -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
> -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
> -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
> -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
> -DCMAKE_INSTALL_LIBDIR=lib 
> -DCMAKE_INSTALL_PREFIX=/tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/libarrow/arrow-9.0.0
>  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
> -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
> -Dxsimd_SOURCE= -G 'Unix Makefiles' 
> /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp
> -- Building using CMake version: 3.13.4
> -- The C compiler identification is GNU 8.3.0
> -- The CXX compiler identification is GNU 8.3.0
> -- Check for working C compiler: /usr/bin/gcc
> -- Check for working C compiler: /usr/bin/gcc -- works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler: /usr/bin/g++
> -- Check for working CXX compiler: /usr/bin/g++ -- works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 9.0.0 (full: '9.0.0')
> -- Arrow SO version: 900 (full: 900.0.0)
> -- clang-tidy 12 not found
> -- clang-format 12 not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
> -- infer not found
> fatal: not a git repository (or any of the parent directories): .git
> -- Found Python3: /usr/bin/python3.7 (found version "3.7.3") found 
> components:  Interpreter 
> -- Found cpplint executable at 
> /tmp/Rtmp1PYvFm/R.INSTALL55147fc8d301/arrow/tools/cpp/build-support/cpplint.py
> -- System processor: x86_64
> -- Performing Test CXX_SUPPO

[jira] [Created] (ARROW-17570) [Java][Documentation] Add JavaDoc to TransferPair interface

2022-08-30 Thread Larry White (Jira)
Larry White created ARROW-17570:
---

 Summary: [Java][Documentation] Add JavaDoc to TransferPair 
interface
 Key: ARROW-17570
 URL: https://issues.apache.org/jira/browse/ARROW-17570
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Java
Reporter: Larry White


The TransferPair interface is important to the Java vector module's memory 
management, but there is only a single-line class comment, and no comments for 
the methods. The implementations of those methods have subtleties that are 
hidden in the method names. For example, the method transferTo() clears the 
memory in the original vector and resets the rowCount to 0, but 
splitAndTransferTo() only copies the values into new memory and leaves the 
original unchanged.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17570) [Java][Documentation] Add JavaDoc to TransferPair interface

2022-08-30 Thread Larry White (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry White reassigned ARROW-17570:
---

Assignee: Larry White

> [Java][Documentation] Add JavaDoc to TransferPair interface
> ---
>
> Key: ARROW-17570
> URL: https://issues.apache.org/jira/browse/ARROW-17570
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Larry White
>Assignee: Larry White
>Priority: Major
>
> The TransferPair interface is important to the Java vector module's memory 
> management, but there is only a single-line class comment, and no comments 
> for the methods. The implementations of those methods have subtleties that 
> are hidden in the method names. For example, the method transferTo() clears 
> the memory in the original vector and resets the rowCount to 0, but 
> splitAndTransferTo() only copies the values into new memory and leaves the 
> original unchanged.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17569) [C++] Bump xsimd to 9.0.0

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17569:
---
Labels: pull-request-available  (was: )

> [C++] Bump xsimd to 9.0.0
> -
>
> Key: ARROW-17569
> URL: https://issues.apache.org/jira/browse/ARROW-17569
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Serge Guelton
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
> match it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17569) [C++] Bump xsimd to 9.0.0

2022-08-30 Thread Serge Guelton (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serge Guelton reassigned ARROW-17569:
-

Assignee: Serge Guelton

> [C++] Bump xsimd to 9.0.0
> -
>
> Key: ARROW-17569
> URL: https://issues.apache.org/jira/browse/ARROW-17569
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Serge Guelton
>Assignee: Serge Guelton
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
> match it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug

2022-08-30 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-17571:
---

 Summary: [Benchmarks] Default build for PyArrow seems to be debug
 Key: ARROW-17571
 URL: https://issues.apache.org/jira/browse/ARROW-17571
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 10.0.0


After a benchmark regression was identified in the [Python refactoring 
PR|https://github.com/apache/arrow/pull/13311], we traced the cause to the 
build script for benchmarks. In the file _dev/conbench_envs/hooks.sh_ the 
script used to build PyArrow is _ci/scripts/python_build.sh_, where the default 
PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ isn't 
defined).

See:
[https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62]
[https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55]
 

I think we need to change the build type to release in 
_dev/conbench_envs/hooks.sh_ (_build_arrow_python()_), or, maybe better, set 
the variable _CMAKE_BUILD_TYPE_ to release in 
_dev/conbench_envs/benchmarks.env_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug

2022-08-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17571:
---
Component/s: Python

> [Benchmarks] Default build for PyArrow seems to be debug
> 
>
> Key: ARROW-17571
> URL: https://issues.apache.org/jira/browse/ARROW-17571
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.0
>
>
> After a benchmark regression was identified in the [Python refactoring 
> PR|https://github.com/apache/arrow/pull/13311], we traced the cause to the 
> build script for benchmarks. In the file _dev/conbench_envs/hooks.sh_ the 
> script used to build PyArrow is _ci/scripts/python_build.sh_, where the 
> default PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ 
> isn't defined).
> See:
> [https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62]
> [https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55]
>  
> I think we need to change the build type to release in 
> _dev/conbench_envs/hooks.sh_ (_build_arrow_python()_), or, maybe better, set 
> the variable _CMAKE_BUILD_TYPE_ to release in 
> _dev/conbench_envs/benchmarks.env_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597959#comment-17597959
 ] 

Antoine Pitrou commented on ARROW-17571:


Nice find :-)

> [Benchmarks] Default build for PyArrow seems to be debug
> 
>
> Key: ARROW-17571
> URL: https://issues.apache.org/jira/browse/ARROW-17571
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.0
>
>
> After a benchmark regression was identified in the [Python refactoring 
> PR|https://github.com/apache/arrow/pull/13311], we traced the cause to the 
> build script for benchmarks. In the file _dev/conbench_envs/hooks.sh_ the 
> script used to build PyArrow is _ci/scripts/python_build.sh_, where the 
> default PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ 
> isn't defined).
> See:
> [https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62]
> [https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55]
>  
> I think we need to change the build type to release in 
> _dev/conbench_envs/hooks.sh_ (_build_arrow_python()_), or, maybe better, set 
> the variable _CMAKE_BUILD_TYPE_ to release in 
> _dev/conbench_envs/benchmarks.env_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15398) Reduce time to build and run integration tests

2022-08-30 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597980#comment-17597980
 ] 

Todd Farmer commented on ARROW-15398:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> Reduce time to build and run integration tests
> --
>
> Key: ARROW-15398
> URL: https://issues.apache.org/jira/browse/ARROW-15398
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Integration
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The tests spend most of the time building each of the implementations, but 
> usually only a small number of them changes.
> The goal here is to see what can be done to reuse/cache the build steps of 
> each implementation so that only the ones that change have to be recompiled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15063) [C++][Tools] Create visualization tool for exec plan tracing logs

2022-08-30 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15063:
---

Assignee: (was: Alvin Chunga Mamani)

> [C++][Tools] Create visualization tool for exec plan tracing logs
> -
>
> Key: ARROW-15063
> URL: https://issues.apache.org/jira/browse/ARROW-15063
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: query-engine
>
> I'm assigning the C++ component because this is only relevant to the C++ 
> streaming execution engine.  However, the tool itself need not be written in 
> C++.
> Ideally, given the exec plan log generated by ARROW-15061, the tool will 
> generate some kind of flame chart.
> A basic description of the process should be added to the execution engine 
> docs or in a readme somewhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15063) [C++][Tools] Create visualization tool for exec plan tracing logs

2022-08-30 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597981#comment-17597981
 ] 

Todd Farmer commented on ARROW-15063:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Tools] Create visualization tool for exec plan tracing logs
> -
>
> Key: ARROW-15063
> URL: https://issues.apache.org/jira/browse/ARROW-15063
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Alvin Chunga Mamani
>Priority: Major
>  Labels: query-engine
>
> I'm assigning the C++ component because this is only relevant to the C++ 
> streaming execution engine.  However, the tool itself need not be written in 
> C++.
> Ideally, given the exec plan log generated by ARROW-15061, the tool will 
> generate some kind of flame chart.
> A basic description of the process should be added to the execution engine 
> docs or in a readme somewhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15398) Reduce time to build and run integration tests

2022-08-30 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15398:
---

Assignee: (was: Jorge Leitão)

> Reduce time to build and run integration tests
> --
>
> Key: ARROW-15398
> URL: https://issues.apache.org/jira/browse/ARROW-15398
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Integration
>Reporter: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The tests spend most of the time building each of the implementations, but 
> usually only a small number of them changes.
> The goal here is to see what can be done to reuse/cache the build steps of 
> each implementation so that only the ones that change have to be recompiled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17538) [C++] Importing an ArrowArrayStream can't handle errors from get_schema

2022-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597982#comment-17597982
 ] 

Antoine Pitrou commented on ARROW-17538:


The main question is whether {{get_schema}} may be costly (i.e. do some I/O). 
If not, then it sounds fine.

> [C++] Importing an ArrowArrayStream can't handle errors from get_schema
> ---
>
> Key: ARROW-17538
> URL: https://issues.apache.org/jira/browse/ARROW-17538
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: David Li
>Priority: Major
>
> As indicated in the code: 
> https://github.com/apache/arrow/blob/cd3c6ead97d584366aafd2f14d99a1cb8ace9ca2/cpp/src/arrow/c/bridge.cc#L1823
>  
> This probably needs a static initializer so we can catch things.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17538) [C++] Importing an ArrowArrayStream can't handle errors from get_schema

2022-08-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17538:
---
Labels: good-first-issue  (was: )

> [C++] Importing an ArrowArrayStream can't handle errors from get_schema
> ---
>
> Key: ARROW-17538
> URL: https://issues.apache.org/jira/browse/ARROW-17538
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: David Li
>Priority: Major
>  Labels: good-first-issue
>
> As indicated in the code: 
> https://github.com/apache/arrow/blob/cd3c6ead97d584366aafd2f14d99a1cb8ace9ca2/cpp/src/arrow/c/bridge.cc#L1823
>  
> This probably needs a static initializer so we can catch things.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17572) [R] Add binding for random() function

2022-08-30 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-17572:
---

 Summary: [R] Add binding for random() function
 Key: ARROW-17572
 URL: https://issues.apache.org/jira/browse/ARROW-17572
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 10.0.0


{{random()}} generates uniformly distributed values. We can probably bind that 
somehow to {{runif()}} in dplyr (we will have to ignore the {{n}} argument). If 
nothing else, it can be used to implement {{slice_sample()}} (see also 
ARROW-13766, ARROW-13767).

FWIW, there's {{sql_random()}} in dbplyr; I'm not sure whether that's actually 
exposed meaningfully or just used to implement slice_sample.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598019#comment-17598019
 ] 

Arthur Passos commented on ARROW-17459:
---

Hi [~emkornfield]. I see you are one of the authors of 
[https://github.com/apache/arrow/pull/8177]. I see the following snippet was 
introduced in that PR:


{code:java}
      // ARROW-3762(wesm): If item reader yields a chunked array, we reject as
      // this is not yet implemented
      return Status::NotImplemented(
          "Nested data conversions not implemented for chunked array 
outputs");{code}
I wonder why this wasn't implemented. Is there a technical limitation, or was 
the approach simply never well defined?

I am pretty new to Parquet and to the {{arrow}} library, so it's very hard for 
me to reason about all of these concepts and code. Off the top of my head, I 
have a couple of ideas:
 # Find a way to convert a ChunkedArray into a single Array. That requires a 
processing step that allocates a contiguous chunk of memory big enough to hold 
all chunks, and there is no obvious interface to do so (but see the sketch 
after this comment).
 # Create a new ChunkedArray class that can hold ChunkedArrays. As of now, it 
can only hold raw Arrays. That would require a LOT of changes in other 
{{arrow}} classes and, of course, it's not guaranteed to work.
 # Make the chunk memory limit configurable (not sure that's feasible).

Do you see any of these as a path forward? If not, what would be the path 
forward?
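
On idea 1, pyarrow does expose at least a partial interface for this; a 
minimal sketch (it hits exactly this ticket's limitation once the combined 
data overflows the type's 32-bit offsets):
{code:java}
# Sketch: concatenate a ChunkedArray's chunks into one contiguous Array.
# This only works while the combined data still fits the 32-bit offsets of
# the string type, which is what the Parquet reader cannot guarantee here.
import pyarrow as pa

chunked = pa.chunked_array([["a", "b"], ["c"]])
single = pa.concat_arrays(chunked.chunks)  # a plain pa.Array of length 3
assert single.equals(pa.array(["a", "b", "c"]))
{code}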

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598038#comment-17598038
 ] 

Micah Kornfield commented on ARROW-17459:
-

1. ChunkedArrays have a Flatten method that will do this, but I don't think it 
will help in this case. IIRC, the challenge here is that Parquet only yields 
chunked arrays if the underlying column data cannot fit into the right Arrow 
structure. For Utf8 arrays this means the sum of bytes across all strings has 
to be less than INT_MAX; otherwise it would need to flatten to LargeUtf8, 
which has implications for schema conversion. Structs and lists always expect 
Arrays as their inner element types, not chunked arrays.
2. This doesn't necessarily seem like the right approach.
3. Per 1, this isn't really the issue, I think. The approach that could work 
(I don't remember all the code paths) is to vary the number of rows read back 
if not all rows are huge.

One way forward here could be to add an option for reading back arrays that 
always uses the Large* variant (or maybe on a per-column basis) to avoid 
chunking.
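
In pyarrow terms, the Large* suggestion maps to the 64-bit-offset types; a 
small illustration of the distinction (not the proposed reader option itself, 
which does not exist yet):
{code:java}
# string stores 32-bit offsets, capping one array at < 2 GiB of character
# data; large_string stores 64-bit offsets and avoids the chunked fallback
# at the cost of wider offset buffers.
import pyarrow as pa

arr = pa.array(["foo", "bar"], type=pa.string())  # int32 offsets
large = arr.cast(pa.large_string())               # int64 offsets
assert large.type == pa.large_string()
{code}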

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17573) [Go] String Binary Builder Leaks Memory When Writing to Parquet

2022-08-30 Thread Sasha Sirovica (Jira)
Sasha Sirovica created ARROW-17573:
--

 Summary: [Go] String Binary Builder Leaks Memory When Writing to 
Parquet
 Key: ARROW-17573
 URL: https://issues.apache.org/jira/browse/ARROW-17573
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 9.0.0
Reporter: Sasha Sirovica


When using {{arrow.BinaryTypes.String}} in a schema, appending multiple 
strings, and then writing a record out to Parquet, the memory of the program 
continuously increases. This also applies to the other {{arrow.BinaryTypes}}.

I took a heap dump midway through the program, and the majority of allocations 
come from {{StringBuilder.Append}}, which is not GC'd. I approached 16 GB of 
RAM before terminating the program.

I was not able to replicate this behavior with just PrimitiveTypes. Another 
interesting point: if the records are created but never written with pqarrow, 
memory does not grow. In the program below, commenting out {{w.Write(rec)}} 
will not cause memory issues.

Example program which causes the memory leak:
{code:java}
package main

import (
	"os"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
	"github.com/apache/arrow/go/v9/parquet"
	"github.com/apache/arrow/go/v9/parquet/compress"
	"github.com/apache/arrow/go/v9/parquet/pqarrow"
)

func main() {
	f, _ := os.Create("/tmp/test.parquet")

	arrowProps := pqarrow.DefaultWriterProps()
	schema := arrow.NewSchema(
		[]arrow.Field{
			{Name: "aString", Type: arrow.BinaryTypes.String},
		},
		nil,
	)
	w, _ := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
		arrowProps)

	builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	for i := 1; i < 5000; i++ {
		builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
		if i%200 == 0 {
			// Flush a row group every 200 appended strings
			rec := builder.NewRecord()
			w.Write(rec)
			rec.Release()
		}
	}
	w.Close()
}
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17532) [Go] Implement Numeric Cast functions

2022-08-30 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17532.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13992
[https://github.com/apache/arrow/pull/13992]

> [Go] Implement Numeric Cast functions
> -
>
> Key: ARROW-17532
> URL: https://issues.apache.org/jira/browse/ARROW-17532
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598050#comment-17598050
 ] 

Arthur Passos commented on ARROW-17459:
---

[~emkornfield] thank you for your answer. Can you clarify what you mean by 
"read back arrays to always use the Large* variant"? I don't know what "back 
array" and "large variant" refer to, though I can speculate about what the 
latter means.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17138) [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`

2022-08-30 Thread hadim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598054#comment-17598054
 ] 

hadim commented on ARROW-17138:
---

I can also confirm the bug, for the same reasons, with pyarrow 6, 7, 8 and 9.

Is there a workaround while waiting for a fix?
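
One possible workaround, sketched under the assumption that the failure only 
appears once a single conversion crosses a 32-bit length limit: convert the 
frame in slices and stitch the resulting tables together (the step size of 
100_000 is an arbitrary choice):
{code:java}
# Sketch: convert a large frame in slices so each chunk stays small, then
# concatenate the per-slice tables into one (multi-chunk) pyarrow.Table.
import pandas as pd
import pyarrow as pa

def frame_to_table(df: pd.DataFrame, step: int = 100_000) -> pa.Table:
    tables = [pa.Table.from_pandas(df.iloc[i:i + step])
              for i in range(0, len(df), step)]
    return pa.concat_tables(tables)
{code}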

> [Python] Converting data frame to Table with large nested column fails 
> `Invalid Struct child array has length smaller than expected`
> 
>
> Key: ARROW-17138
> URL: https://issues.apache.org/jira/browse/ARROW-17138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Simon Weiß
>Priority: Major
>
> Hey, 
> I have a data frame for which one column is a nested struct array. Converting 
> it to a pyarrow.Table fails if the data frame gets too big. I could 
> reproduce the bug with a minimal example with anonymized data that is roughly 
> similar to mine. When I set, e.g., N_ROWS=500_000 or smaller, it works fine.
> {code:java}
> import pandas as pd
> import pyarrow as pa
> N_ROWS = 800_000
> item_record = {
>     "someImportantAssets": [
>         {
>             "square": 
> "https://some.super.loong.link.com/withmany/lorem/upload/ipsum/stilllooonger/lorem/{someparameter}/156fdjjf644984dfdfaera648/specificLink-i15348891"
>         }
>     ],
>     "id": "i15348891",
>     "title": "Some Long Item Title i15348891",
> }
> user_record = {
>     "userId": "faa4648-4964drf-64648fafa648-4648falj",
>     "data": [item_record for _ in range(24)],
> }
> df = pd.DataFrame([user_record for _ in range(N_ROWS)])
> table = pa.Table.from_pandas(df){code}
> {code:java}
> Traceback (most recent call last):
>     table = pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 1658, in pyarrow.lib.Table.from_pandas
>   File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #1 invalid: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (13256071 < 13256072)
> {code}
> The length is always smaller than expected by 1.
>  
> h2. Expected behavior:
> Run without errors or fail with a better error message.
>  
> h2. System Info and Versions:
> Apple M1 Pro, but it also happened on an amd64 Linux machine on AWS
> {code:java}
> arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
> pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
> python                    3.9.7           h54d631c_3_cpython    conda-forge
> {code}
> I could also reproduce with pyarrow 8.0.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17137) [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`

2022-08-30 Thread hadim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598055#comment-17598055
 ] 

hadim commented on ARROW-17137:
---

I can also confirm the bug, with the same symptoms, on pyarrow 6, 7, 8 and 9.

 

Is there a workaround while waiting for a fix?

> [Python] Converting data frame to Table with large nested column fails 
> `Invalid Struct child array has length smaller than expected`
> 
>
> Key: ARROW-17137
> URL: https://issues.apache.org/jira/browse/ARROW-17137
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Simon Weiß
>Priority: Major
>  Labels: python-conversion
>
> Hey, 
> I have a data frame for which one column is a nested struct array. Converting 
> it to a pyarrow.Table fails if the data frame gets too big. I could reproduce 
> the bug with a minimal example with anonymized data that is roughly similar 
> to mine. When I set, e.g., N_ROWS=500_000, or smaller, it is working fine.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> N_ROWS = 800_000
> item_record = {
>     "someImportantAssets": [
>         {
>             "square": 
> "https://some.super.loong.link.com/withmany/lorem/upload/";
>             
> "ipsum/stilllooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
>             "/specificLink-i15348891"
>         }
>     ],
>     "id": "i15348891",
>     "title": "Some Long Item Title i15348891",
> }
> user_record = {
>     "userId": "faa4648-4964drf-64648fafa648-4648falj",
>     "data": [item_record for _ in range(24)],
> }
> df = pd.DataFrame([user_record for _ in range(N_ROWS)])
> table = pa.Table.from_pandas(df){code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "/.../scratch/experiment_pq.py", line 23, in 
>     table = pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
>   File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #1 invalid: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (13338407 < 13338408) {code}
> The length is always smaller than expected by 1.
>  
> h2. Expected behavior:
> Run without errors or fail with a better error message.
>  
> h2. System Info and Versions:
> Apple M1 Pro but also happened on amd64 Linux machine on AWS
>  
> {code:java}
> arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
> pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
> python                    3.9.7           h54d631c_3_cpython    conda-forge
> {code}
> I could also reproduce with
> {noformat}
>  pyarrow 8.0.0{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-13454) [C++][Docs] Tables vs Record Batches

2022-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13454:
---
Labels: pull-request-available  (was: )

> [C++][Docs] Tables vs Record Batches
> 
>
> Key: ARROW-13454
> URL: https://issues.apache.org/jira/browse/ARROW-13454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is not clear what the difference between Tables and Record Batches is, 
> as described on [https://arrow.apache.org/docs/cpp/tables.html#tables]
> _A 
> [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE]
>  is a two-dimensional dataset with chunked arrays for columns_
> _A 
> [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE]
>  is a two-dimensional dataset of a number of contiguous arrays_
> Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can 
> be clarified.
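
A small pyarrow sketch can make the distinction concrete (illustrative only; the corresponding C++ types are {{arrow::Table}} and {{arrow::RecordBatch}}):

{code:python}
import pyarrow as pa

# A RecordBatch column is one contiguous Array; a Table column is a
# ChunkedArray that may span several batches.
b1 = pa.record_batch([pa.array([1, 2])], names=["x"])
b2 = pa.record_batch([pa.array([3, 4])], names=["x"])
table = pa.Table.from_batches([b1, b2])

print(type(b1.column(0)).__name__)   # Int64Array (contiguous)
print(table.column("x").num_chunks)  # 2 (chunked)
{code}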



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-13454) [C++][Docs] Tables vs Record Batches

2022-08-30 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-13454:
--

Assignee: Will Jones

> [C++][Docs] Tables vs Record Batches
> 
>
> Key: ARROW-13454
> URL: https://issues.apache.org/jira/browse/ARROW-13454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is not clear what the difference between Tables and Record Batches is, 
> as described on [https://arrow.apache.org/docs/cpp/tables.html#tables]
> _A 
> [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE]
>  is a two-dimensional dataset with chunked arrays for columns_
> _A 
> [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE]
>  is a two-dimensional dataset of a number of contiguous arrays_
> Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can 
> be clarified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598058#comment-17598058
 ] 

Micah Kornfield commented on ARROW-17459:
-

i.e. LargeBinary, LargeString, and LargeList: these are distinct types that 
use int64s to represent offsets instead of int32.
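
For illustration, a minimal pyarrow sketch of the offset-width difference (the same distinction holds for the C++ types):

{code:python}
import pyarrow as pa

# string/binary/list arrays index their child data with int32 offsets, so a
# single contiguous array tops out near 2 GiB; the Large* variants use int64.
small = pa.array(["a", "bb"], type=pa.string())
large = small.cast(pa.large_string())

print(small.type, small.buffers()[1].size)  # string 12 (3 int32 offsets)
print(large.type, large.buffers()[1].size)  # large_string 24 (3 int64 offsets)
{code}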

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17573) [Go] String Binary Builder Leaks Memory When Writing to Parquet

2022-08-30 Thread Sasha Sirovica (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasha Sirovica updated ARROW-17573:
---
Description: 
When using `arrow.BinaryTypes.String` in a schema, appending multiple strings, 
and then writing a record out to parquet, the memory of the program continuously 
increases. This also applies to the other `arrow.BinaryTypes`.

 

I took a heap dump midway through the program, and the majority of allocations 
come from `StringBuilder.Append` and are never GC'd. I approached 16 GB of RAM 
before terminating the program.

 

I was not able to replicate this behavior with just PrimitiveTypes. Another 
interesting point: if the records are created but never written with pqarrow, 
memory does not grow. In the program below, commenting out `w.Write(rec)` will 
not cause memory issues.

Example program which causes memory to leak:
{code:java}
package main

import (
   "os"

   "github.com/apache/arrow/go/v9/arrow"
   "github.com/apache/arrow/go/v9/arrow/array"
   "github.com/apache/arrow/go/v9/arrow/memory"
   "github.com/apache/arrow/go/v9/parquet"
   "github.com/apache/arrow/go/v9/parquet/compress"
   "github.com/apache/arrow/go/v9/parquet/pqarrow"
)

func main() {
   f, _ := os.Create("/tmp/test.parquet")

   arrowProps := pqarrow.DefaultWriterProps()
   schema := arrow.NewSchema(
  []arrow.Field{
 {Name: "aString", Type: arrow.BinaryTypes.String},
  },
  nil,
   )
   w, _ := pqarrow.NewFileWriter(schema, f, 
parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
arrowProps)

   builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
   for i := 1; i < 5000000; i++ {
      builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
      if i%200 == 0 {
         // Write a row group out every 200 appends
         rec := builder.NewRecord()
         w.Write(rec)
         rec.Release()
      }
   }
   w.Close()
}{code}
 

  was:
When using `arrow.BinaryTypes.String` in a schema, appending multiple strings, 
and then writing a record out to parquet, the memory of the program continuously 
increases. This also applies to the other `arrow.BinaryTypes`.

  

I took a heap dump midway through the program, and the majority of allocations 
come from `StringBuilder.Append` and are never GC'd. I approached 16 GB of RAM 
before terminating the program.

  

I was not able to replicate this behavior with just PrimitiveTypes. Another 
interesting point: if the records are created but never written with pqarrow, 
memory does not grow. In the program below, commenting out `w.Write(rec)` will 
not cause memory issues.

Example program which causes memory to leak: 
{code:java}
package main 

import ( 
   "os" 
   "testing" 

   "github.com/apache/arrow/go/v9/arrow" 
   "github.com/apache/arrow/go/v9/arrow/array" 
   "github.com/apache/arrow/go/v9/arrow/memory" 
   "github.com/apache/arrow/go/v9/parquet" 
   "github.com/apache/arrow/go/v9/parquet/compress" 
   "github.com/apache/arrow/go/v9/parquet/pqarrow" 
) 

func main() { 
   f, _ := os.Create("/tmp/test.parquet") 

   arrowProps := pqarrow.DefaultWriterProps() 
   schema := arrow.NewSchema( 
  []arrow.Field{ 
 {Name: "aString", Type: arrow.BinaryTypes.String}, 
  }, 
  nil, 
   ) 
   w, _ := pqarrow.NewFileWriter(schema, f, 
parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
arrowProps) 

   builder := array.NewRecordBuilder(memory.DefaultAllocator, schema) 
   for i := 1; i < 5000; i++ { 
  builder.Field(0).(*array.StringBuilder).Append("HelloWorld!") 
  if i%200 == 0 { 
 // Write row groups out every 2M times 
 rec := builder.NewRecord() 
 w.Write(rec) 
 rec.Release() 
  } 
   } 
   w.Close() 
}{code}
 


> [Go] String Binary Builder Leaks Memory When Writing to Parquet
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Priority: Major
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet, the memory of the program 
> continuously increases. This also applies to the other `arrow.BinaryTypes`.
>  
> I took a heap dump midway through the program, and the majority of allocations 
> come from `StringBuilder.Append` and are never GC'd. I approached 16 GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point: if the records are created but never written with pqarrow, 
> memory does not grow. In the program below, commenting out `w.Write(rec)` will 
> not cause memory issues.

[jira] [Updated] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-30 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-17541:

Attachment: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · Plotly 
Chart Studio.png

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high (when the whole file is only 4 GB on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-30 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598082#comment-17598082
 ] 

Weston Pace commented on ARROW-17541:
-

Ok.  There is definitely something strange going on here.  I plotted memory 
usage today during a run via R and then made the same run via python.  Notable 
observations:

 * R does not appear to be releasing batches that have been written but the 
python version is doing so :(
 * S3 download speeds, when run through python, are half as fast as the R 
version.

Naively I want to say backpressure is the culprit in both cases.  A lack of 
backpressure leading to excess memory in R and an excess of backpressure 
throttling downloads in python.  However, the R behavior is not what I would 
expect from a lack of backpressure.  Even without backpressure we should be 
freeing data as it is written.

 !Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · Plotly Chart 
Studio.png! 
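
For reference, this is roughly how the Python-side run can be instrumented (a sketch; the S3 path is a placeholder for the reporter's MinIO bucket, not a real endpoint):

{code:python}
import threading
import time

import pyarrow as pa
import pyarrow.dataset as ds

peak = 0
done = False

def sample():
    # Poll Arrow's pool allocation from a background thread.
    global peak
    while not done:
        peak = max(peak, pa.total_allocated_bytes())
        time.sleep(0.1)

t = threading.Thread(target=sample)
t.start()
src = ds.dataset("s3://data/waq_test")  # hypothetical source path
ds.write_dataset(src, "/tmp/out", format="parquet")
done = True
t.join()
print(f"peak Arrow pool usage: {peak / 2**30:.2f} GiB")
{code}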

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Major
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high (when the whole file is only 4 GB on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17574) [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI

2022-08-30 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17574:
--

 Summary: [R] [Docs] [CI] Investigate if we can auto generate Rd 
files in CI
 Key: ARROW-17574
 URL: https://issues.apache.org/jira/browse/ARROW-17574
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Documentation, R
Reporter: Jonathan Keane


Or alternatively, warn + recommend running autotune if they are out of date 
(e.g. any change)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17574) [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI

2022-08-30 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598116#comment-17598116
 ] 

Neal Richardson commented on ARROW-17574:
-

This was discussed before when we added the autotune job, and there was 
resistance to the idea of having CI add commits to your branch, as would be 
needed to have it generate the Rd for you. Maybe that has changed (in which 
case we could revisit the autotune issue too). Otherwise you could add 
something similar to the lint check on the windows R CI job: run roxygenize(), 
and if any files are modified, fail the build.

> [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI
> --
>
> Key: ARROW-17574
> URL: https://issues.apache.org/jira/browse/ARROW-17574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Documentation, R
>Reporter: Jonathan Keane
>Priority: Major
>
> Or alternatively, warn + recommend running autotune if they are out of date 
> (e.g. any change)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-30 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598117#comment-17598117
 ] 

Weston Pace commented on ARROW-17541:
-

So I'm pretty sure that the real problem is that R's garbage collector is 
somehow holding onto pool memory.  This wasn't new in 9.0.0.  As you said 
yourself, it was already using a pretty excessive amount of RAM in 8.0.0.  
Notice that, in python, downloading this file uses less than 1GB of RAM.  I'm 
not going to worry too much about the fact that RAM use doubled between 8.0.0 
and 9.0.0.  I suspect that may just be that we are more aggressive with 
readahead on these sorts of files in 9.0.0 (in 8.0.0 we always read ahead 8 
batches, in 9.0.0 the readahead is based on # of rows which leads to 20 batches 
on this file).

R's garbage collector is not running for two reasons:

1. R is not aware there is any memory pressure, because it doesn't see the RAM 
used by Arrow pool memory.  As far as it is concerned it is only holding onto 
80MB, when in reality it is holding onto multiple GB of RAM.

2. We are executing a single (admittedly long running) C statement with 
{{write_dataset}}.  R's garbage collector will not (I think) run mid-execution.

We could investigate the above two issues but there is a third, more concerning 
problem:

3. R is holding onto memory when it isn't clear to me it should even be able to 
see the memory.

The allocations we are making in Arrow come from the memory pool, they are 
owned by record batch objects, and those record batch objects are never (as far 
as I know) converted to R.  Perhaps they are being converted to R somewhere (we 
are scanning and then writing, do we scan into R before we send to the write 
node?  I wouldn't think so but I could be wrong).  Or perhaps R's memory 
allocator works in some strange way I'm not aware of.

I'm going to have to step back from this investigation as I've hit my limit for 
the week and there is other work I am on the hook for.  So if any R aficionados 
want to investigate I'd be grateful.
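
As a Python analogue of point 1, here is a small sketch showing that pool allocations are invisible to the host language's own heap accounting; the R situation should be similar, with tiny wrapper objects pinning large pool buffers:

{code:python}
import pyarrow as pa

pool = pa.default_memory_pool()
buf = pa.allocate_buffer(1 << 20)  # 1 MiB taken from the Arrow pool
print(pool.bytes_allocated())      # the pool sees the full allocation
print(buf.__sizeof__())            # the Python wrapper object is tiny
del buf                            # memory returns to the pool on release
print(pool.bytes_allocated())
{code}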

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high (when the whole file is only 4 GB on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-30 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598124#comment-17598124
 ] 

Dewey Dunnington commented on ARROW-17541:
--

I'm working on two PRs that touch some of those parts of the code and will 
investigate at some point in the next two weeks. A few thoughts:

> R's garbage collector will not (I think) run mid-execution.

I believe that's true, although in theory R is not allocating any memory 
either. I don't know of any way to allocate R memory without (potentially) 
running the garbage collector.

> R is holding onto memory when it isn't clear to me it should even be able to 
> see the memory.

Would a more precise way to say this be that there is some shared pointer 
(potentially held by an R6 object that is still in scope and not being 
destroyed) that is keeping the record batches from being freed? We do have an R 
reference to the exec plan and the final node of the exec plan (which would be 
the penultimate node in the dataset write, which is probably the scan node). 
(It still makes no sense to me why the batches aren't getting released).
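
That hypothesis is easy to demonstrate from Python (a sketch; sizes are approximate): any surviving view into a batch keeps the parent buffers, and therefore the pool memory, alive.

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})
view = batch.column(0).slice(0, 1)  # one-element view sharing parent buffers
del batch
print(pa.total_allocated_bytes())   # still ~8 MB: the view pins the buffers
del view
print(pa.total_allocated_bytes())   # freed once the last reference is gone
{code}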

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high (when the whole file is only 4 GB on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17575) [C++][Docs] Update build document to follow new CMake package

2022-08-30 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17575:


 Summary: [C++][Docs] Update build document to follow new CMake 
package
 Key: ARROW-17575
 URL: https://issues.apache.org/jira/browse/ARROW-17575
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This is a follow-up of ARROW-12175.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17565) [C++] Backward compatible ${PACKAGE}_shared CMake target isn't provided

2022-08-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17565.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14003
[https://github.com/apache/arrow/pull/14003]

> [C++] Backward compatible ${PACKAGE}_shared CMake target isn't provided
> ---
>
> Key: ARROW-17565
> URL: https://issues.apache.org/jira/browse/ARROW-17565
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up of ARROW-12175.
> We introduced a {{${PACKAGE}::}} prefix for all exported CMake targets such as 
> {{Arrow::arrow_shared}} and {{Arrow::arrow_static}}, but we also provide 
> non-namespaced CMake targets such as {{arrow_shared}} and {{arrow_static}} as 
> aliases of the namespaced targets. However, this backward compatibility 
> feature doesn't work for {{_shared}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset

2022-08-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17089.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13677
[https://github.com/apache/arrow/pull/13677]

> [Python] Use `.arrow` as extension for IPC file dataset
> ---
>
> Key: ARROW-17089
> URL: https://issues.apache.org/jira/browse/ARROW-17089
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Same as ARROW-17088
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the 
> {{pyarrow.dataset.write_dataset}} function, the default extension is 
> {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
> https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151
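
Until the default changes, the extension can be pinned explicitly when writing (a sketch; the path is illustrative):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})
# basename_template controls the generated file names, including the extension
ds.write_dataset(table, "/tmp/ipc_ds", format="ipc",
                 basename_template="part-{i}.arrow")
{code}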



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug

2022-08-30 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598175#comment-17598175
 ] 

Alenka Frim commented on ARROW-17571:
-

Joris ;)

> [Benchmarks] Default build for PyArrow seems to be debug
> 
>
> Key: ARROW-17571
> URL: https://issues.apache.org/jira/browse/ARROW-17571
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.0
>
>
> After a benchmark regression was identified in the [Python refactoring 
> PR|https://github.com/apache/arrow/pull/13311], we found that the cause is in 
> the build script for benchmarks. In the file _dev/conbench_envs/hooks.sh_, the 
> script used to build PyArrow is _ci/scripts/python_build.sh_, where the 
> default PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ 
> isn't defined).
> See:
> [https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62]
> [https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55]
>  
> I think we need to change the build type to release in 
> _dev/conbench_envs/hooks.sh_ (_build_arrow_python()_) or maybe better to set 
> the variable _CMAKE_BUILD_TYPE_ to release in 
> _dev/conbench_envs/benchmarks.env_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)