[jira] [Updated] (ARROW-14204) [C++] Fails to compile Arrow without RE2 due to missing ifdef guard

2021-10-01 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-14204:
--
Description: [*RegexSubstringMatcher* is available only when RE2 is enabled 
as it is guarded with #ifdef 
ARROW_WITH_RE2|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L861-L862]
 but it is [used here without the RE2 
guard|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1113],
 so compilation fails.  (was: [*RegexSubstringMatcher* is available only 
when RE2 is enabled as it is guarded by an #ifdef 
macro|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L861-L862]
 but it is [used here without the RE2 
guard|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1113],
 so compilation fails.)

> [C++] Fails to compile Arrow without RE2 due to missing ifdef guard
> ---
>
> Key: ARROW-14204
> URL: https://issues.apache.org/jira/browse/ARROW-14204
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 6.0.0
>
>
> [*RegexSubstringMatcher* is available only when RE2 is enabled as it is 
> guarded with #ifdef 
> ARROW_WITH_RE2|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L861-L862]
>  but it is [used here without the RE2 
> guard|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1113],
>  so compilation fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14204) [C++] Fails to compile Arrow without RE2 due to missing ifdef guard

2021-10-01 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423467#comment-17423467
 ] 

Eduardo Ponce commented on ARROW-14204:
---

Should we consider this bug as motivation to have at least one CI build without 
RE2?

> [C++] Fails to compile Arrow without RE2 due to missing ifdef guard
> ---
>
> Key: ARROW-14204
> URL: https://issues.apache.org/jira/browse/ARROW-14204
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 6.0.0
>
>
> [*RegexSubstringMatcher* is available only when RE2 is enabled as it is 
> guarded by an #ifdef 
> macro|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L861-L862]
>  but it is [used here without the RE2 
> guard|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1113],
>  so compilation fails.





[jira] [Created] (ARROW-14204) [C++] Fails to compile Arrow without RE2 due to missing ifdef guard

2021-10-01 Thread Eduardo Ponce (Jira)
Eduardo Ponce created ARROW-14204:
-

 Summary: [C++] Fails to compile Arrow without RE2 due to missing 
ifdef guard
 Key: ARROW-14204
 URL: https://issues.apache.org/jira/browse/ARROW-14204
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Eduardo Ponce
Assignee: Eduardo Ponce
 Fix For: 6.0.0


[*RegexSubstringMatcher* is available only when RE2 is enabled as it is guarded 
by an #ifdef 
macro|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L861-L862]
 but it is [used here without the RE2 
guard|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1113],
 so compilation fails.





[jira] [Commented] (ARROW-14188) link error on ubuntu

2021-10-01 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423464#comment-17423464
 ] 

Kouhei Sutou commented on ARROW-14188:
--

Thanks, but could you show the full log?

> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: linkerror.txt
>
>
> I used vcpkg to install Arrow versions 4 and 5. Building my code that uses 
> Parquet fails with undefined-reference link errors.
> The same code works on macOS but fails on Ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Updated] (ARROW-14180) [Packaging] Add support for AlmaLinux 8

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14180:
---
Labels: pull-request-available  (was: )

> [Packaging] Add support for AlmaLinux 8
> ---
>
> Key: ARROW-14180
> URL: https://issues.apache.org/jira/browse/ARROW-14180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-14180) [Packaging] Add support for AlmaLinux 8

2021-10-01 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-14180:
-
Fix Version/s: (was: 7.0.0)

> [Packaging] Add support for AlmaLinux 8
> ---
>
> Key: ARROW-14180
> URL: https://issues.apache.org/jira/browse/ARROW-14180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-14180) [Packaging] Add support for AlmaLinux 8

2021-10-01 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-14180:
-
Fix Version/s: 6.0.0

> [Packaging] Add support for AlmaLinux 8
> ---
>
> Key: ARROW-14180
> URL: https://issues.apache.org/jira/browse/ARROW-14180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-14203) [C++] Fix description of ExecBatch.length for Scalars in aggregate kernels

2021-10-01 Thread Eduardo Ponce (Jira)
Eduardo Ponce created ARROW-14203:
-

 Summary: [C++] Fix description of ExecBatch.length for Scalars in 
aggregate kernels
 Key: ARROW-14203
 URL: https://issues.apache.org/jira/browse/ARROW-14203
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Eduardo Ponce
 Fix For: 6.0.0


[The comment for the *length* data member of 
ExecBatch|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec.h#L202-L211]
 states that it will always have a value of 1 for Scalar types. This is 
misleading/incorrect because for aggregate kernels you could have an ExecBatch 
formed by projecting just the partition columns from a batch, in which case 
you'd have scalar rows with a length > 1.





[jira] [Comment Edited] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-10-01 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423460#comment-17423460
 ] 

Eduardo Ponce edited comment on ARROW-13879 at 10/2/21, 4:03 AM:
-

Well, the issue is that `string_view` is used to encapsulate binary data in 
certain parts and [`string_view.length()` is used to get the size but it stops 
at the first null 
byte|https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/string_view.hpp#L421],
 thus providing an incorrect size.


was (Author: edponce):
Well, the issue is that `string_view` is used to encapsulate binary data in 
certain parts and [`string_view.length()` is used to get the size but it stops 
at the first null 
byte|https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/string_view.hpp#L421],
 thus providing an incorrect size.

> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types but the functions extract_regex, 
> match_substring, match_substring_regex, match_like, starts_with, ends_with, 
> split_pattern, and split_pattern_regex do not.
> They should either all accept binary types or none should.





[jira] [Updated] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-10-01 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-13879:
--
Description: 
The functions count_substring, count_substring_regex, find_substring, and 
find_substring_regex all accept binary types but the functions extract_regex, 
match_substring, match_substring_regex, match_like, starts_with, ends_with, 
split_pattern, and split_pattern_regex do not.

They should either all accept binary types or none should.

  was:
The functions count_substring, count_substring_regex, find_substring, and 
find_substring_regex all accept binary types but the functions extract_regex, 
match_substring, match_substring_regex, match_like, starts_with, ends_with, 
split_pattern, and split_pattern_regex do not.

They should all accept binary types.


> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types but the functions extract_regex, 
> match_substring, match_substring_regex, match_like, starts_with, ends_with, 
> split_pattern, and split_pattern_regex do not.
> They should either all accept binary types or none should.





[jira] [Commented] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-10-01 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423460#comment-17423460
 ] 

Eduardo Ponce commented on ARROW-13879:
---

Well, the issue is that `string_view` is used to encapsulate binary data in 
certain parts and [`string_view.length()` is used to get the size but it stops 
at the first null 
byte|https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/string_view.hpp#L421],
 thus providing an incorrect size.

> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types but the functions extract_regex, 
> match_substring, match_substring_regex, match_like, starts_with, ends_with, 
> split_pattern, and split_pattern_regex do not.
> They should all accept binary types.





[jira] [Updated] (ARROW-14192) [C++][Dataset] Backpressure broken on ordered scans

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14192:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [C++][Dataset] Backpressure broken on ordered scans
> ---
>
> Key: ARROW-14192
> URL: https://issues.apache.org/jira/browse/ARROW-14192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-13611 adds a backpressure mechanism that works for unordered scans. 
> However, this backpressure is not properly applied on ordered (i.e. 
> ScanBatches and not ScanBatchesUnordered) scans.
> The fix will be to modify the merge generator used on ordered scans so that, 
> while it still reads ahead somewhat across several files, it never delivers 
> batches except from the file currently being read.





[jira] [Commented] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-10-01 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423456#comment-17423456
 ] 

David Li commented on ARROW-13879:
--

std::string can contain null bytes. If that's the case, some place is 
constructing it from a bare pointer instead of a pointer + length.

> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types but the functions extract_regex, 
> match_substring, match_substring_regex, match_like, starts_with, ends_with, 
> split_pattern, and split_pattern_regex do not.
> They should all accept binary types.





[jira] [Commented] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-01 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423445#comment-17423445
 ] 

Weston Pace commented on ARROW-13887:
-

I'm happy with "as well as".  If someone takes this JIRA and doesn't want to do 
the C++ part (it might be a bit tricky identifying those conditions) then we 
can file a separate JIRA at that point.

> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, )
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see 
> {color:#1d1c1d}ARROW-11766, and ARROW-12791{color}





[jira] [Commented] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-01 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423443#comment-17423443
 ] 

Nicola Crane commented on ARROW-13887:
--

[~westonpace] How about "as well as" rather than "instead"? I think it'd still 
be helpful to include the R-specific suggestion, as per 
https://style.tidyverse.org/error-messages.html#hints

> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, )
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see 
> {color:#1d1c1d}ARROW-11766, and ARROW-12791{color}





[jira] [Commented] (ARROW-14014) FlightClient.ClientStreamListener not notified on error when parsing invalid trailers

2021-10-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423442#comment-17423442
 ] 

Bryan Cutler commented on ARROW-14014:
--

I've run into this a couple times [~manudebouc], but can't seem to reproduce it 
right now. Do you have some code you could share that would reproduce this?

> FlightClient.ClientStreamListener not notified on error when parsing invalid 
> trailers
> -
>
> Key: ARROW-14014
> URL: https://issues.apache.org/jira/browse/ARROW-14014
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 5.0.0
>Reporter: manudebouc
>Priority: Major
>
> When using FlightClient.startPut combined with an AsyncPutListener, we are 
> sometimes blocked forever on FlightClient.ClientStreamListener.getResult() 
> because we do not receive an error notification.
> Due to an intermediate proxy we sometimes receive 502 or 504 errors and an 
> invalid {{':status'}} header in the trailers that cannot be parsed by 
> {{StatusUtils.parseTrailers}} in {{SetStreamObserver.onError(Throwable t)}}, 
> generating an {{IllegalArgumentException}} that prevents our listener from 
> being notified, blocking forever.
> {{SEVERE: Exception while executing runnable 
> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@de593f34
>  java.lang.IllegalArgumentException: Invalid character ':' in key name 
> ':status'
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:275)
>  at io.grpc.Metadata$Key.validateName(Metadata.java:742)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:750)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:668)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:959)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:954)
>  at io.grpc.Metadata$Key.of(Metadata.java:705)
>  at io.grpc.Metadata$Key.of(Metadata.java:701)
>  at 
> org.apache.arrow.flight.grpc.StatusUtils.parseTrailers(StatusUtils.java:164)
>  at 
> org.apache.arrow.flight.grpc.StatusUtils.fromGrpcStatusAndTrailers(StatusUtils.java:128)
>  at 
> org.apache.arrow.flight.grpc.StatusUtils.fromGrpcRuntimeException(StatusUtils.java:152)
>  at 
> org.apache.arrow.flight.grpc.StatusUtils.fromThrowable(StatusUtils.java:176)
>  at 
> org.apache.arrow.flight.FlightClient$SetStreamObserver.onError(FlightClient.java:440)
>  at 
> io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
>  at 
> io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
>  at 
> io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
>  at 
> io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
>  at 
> org.apache.arrow.flight.grpc.ClientInterceptorAdapter$FlightClientCallListener.onClose(ClientInterceptorAdapter.java:117)
>  at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
>  at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
>  at 
> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
>  at 
> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
>  at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>  at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:823) }}
> It also seems that the same problem exists with FlightClient.getStream() and 
> ClientResponseObserver.onError(Throwable t)





[jira] [Commented] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423429#comment-17423429
 ] 

Rok Mihevc commented on ARROW-14200:


Yeah, what Weston said :).

The casting happens on L717; I've probably made the assumption that R will 
behave the same for a date as it would for a timezoneless timestamp. Shouldn't 
be too hard to match the behavior once you know what it is.

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is the date 1992-01-01 is being interpreted as 
> 1992-01-01 00:00:00 in UTC, and then when {{strftime()}} is being called it's 
> displaying that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC), and then taking the year out of it. If I specify {{tz = 
> "utc"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}





[jira] [Updated] (ARROW-14202) [C++] A more RAM-efficient top-k sink node

2021-10-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14202:
---
Summary: [C++] A more RAM-efficient top-k sink node  (was: A more 
RAM-efficient top-k sink node)

> [C++] A more RAM-efficient top-k sink node
> --
>
> Key: ARROW-14202
> URL: https://issues.apache.org/jira/browse/ARROW-14202
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 7.0.0
>Reporter: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
>
> Mentioned here:
> https://github.com/apache/arrow/pull/11274#pullrequestreview-768267959
> For example, a top-k implementation could periodically (when batches_ has 
> some configurable number of rows) run through and discard data. The way it is 
> written now, it would still require buffering the entire dataset in memory 
> (and/or spilling over).
>  
>  





[jira] [Commented] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423428#comment-17423428
 ] 

Jonathan Keane commented on ARROW-14200:


Thanks for digging (I hadn't had a chance to yet), I've changed the tags

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is the date 1992-01-01 is being interpreted as 
> 1992-01-01 00:00:00 in UTC, and then when {{strftime()}} is being called it's 
> displaying that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC), and then taking the year out of it. If I specify {{tz = 
> "utc"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}





[jira] [Updated] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14200:
---
Component/s: (was: C++)

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is the date 1992-01-01 is being interpreted as 
> 1992-01-01 00:00:00 in UTC, and then when {{strftime()}} is being called it's 
> displaying that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC), and then taking the year out of it. If I specify {{tz = 
> "utc"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}





[jira] [Updated] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14200:
---
Summary: [R] strftime on a date should not use or be confused by timezones  
(was: [R] [C++] strftime on a date should not use or be confused by timezones)

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is that the date 1992-01-01 is being 
> interpreted as 1992-01-01 00:00:00 UTC, and then when {{strftime()}} is 
> called it displays that timestamp as 1991-12-31 ... (since my system is set 
> to a timezone behind UTC), and then takes the year from it. If I specify 
> {{tz = "UTC"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}



--


[jira] [Commented] (ARROW-14200) [R] [C++] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423421#comment-17423421
 ] 

Weston Pace commented on ARROW-14200:
-

I'm pretty sure this is in the R bindings; it is working correctly in Python. 
It appears the R bindings for strftime are casting to a timestamp and always 
specifying some kind of time zone: 
https://github.com/apache/arrow/blob/b5814b6bcc6a242fd2a5be0b44cddb02cb60f088/r/R/dplyr-functions.R#L712
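The failure mode described above is reproducible with the Python standard library alone. This is a hedged illustration of the mechanism (date treated as midnight UTC, then formatted in a local zone), not the Arrow code path itself:

```python
from datetime import datetime, timezone, timedelta

# A date stored as midnight UTC, which is how a date behaves once cast to a
# UTC timestamp...
d = datetime(1992, 1, 1, tzinfo=timezone.utc)

# ...formatted in a zone behind UTC (e.g. US Central, UTC-6) lands on the
# previous calendar day, so "%Y" yields the prior year:
central = timezone(timedelta(hours=-6))
print(d.astimezone(central).strftime("%Y"))  # 1991
print(d.strftime("%Y"))                      # 1992
```

Formatting in UTC (or never casting the date to a timestamp at all) avoids the shift.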


> [R] [C++] strftime on a date should not use or be confused by timezones
> ---
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is that the date 1992-01-01 is being 
> interpreted as 1992-01-01 00:00:00 UTC, and then when {{strftime()}} is 
> called it displays that timestamp as 1991-12-31 ... (since my system is set 
> to a timezone behind UTC), and then takes the year from it. If I specify 
> {{tz = "UTC"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}



--


[jira] [Created] (ARROW-14202) A more RAM-efficient top-k sink node

2021-10-01 Thread Alexander Ocsa (Jira)
Alexander Ocsa created ARROW-14202:
--

 Summary: A more RAM-efficient top-k sink node
 Key: ARROW-14202
 URL: https://issues.apache.org/jira/browse/ARROW-14202
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Alexander Ocsa


Mentioned here:

https://github.com/apache/arrow/pull/11274#pullrequestreview-768267959

For example, a top-k implementation could periodically (when batches_ reaches 
some configurable number of rows) run through and discard data. As written now, 
it would still require buffering the entire dataset in memory (and/or spilling 
over).
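The RAM-bounded behavior being asked for can be sketched in a few lines. This is an illustration of the idea (keep at most k candidates and discard the rest as batches arrive), not Arrow's implementation:

```python
import heapq

class TopKAccumulator:
    """Sketch of a RAM-bounded top-k sink: keep a min-heap of at most k
    values and discard anything smaller as each batch is consumed, so memory
    stays O(k) instead of O(n)."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of the current top-k candidates

    def consume_batch(self, values):
        for v in values:
            if len(self._heap) < self.k:
                heapq.heappush(self._heap, v)
            elif v > self._heap[0]:
                heapq.heapreplace(self._heap, v)  # evict the smallest kept value

    def finish(self):
        return sorted(self._heap, reverse=True)
```

Spillover would only be needed for the k retained rows plus the batch currently being scanned.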

 

 



--


[jira] [Created] (ARROW-14201) RAM-efficient topk sink node

2021-10-01 Thread Alexander Ocsa (Jira)
Alexander Ocsa created ARROW-14201:
--

 Summary: RAM-efficient topk sink node
 Key: ARROW-14201
 URL: https://issues.apache.org/jira/browse/ARROW-14201
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Alexander Ocsa
 Fix For: 6.0.0


https://github.com/apache/arrow/pull/11274#pullrequestreview-768267959



--


[jira] [Updated] (ARROW-14200) [R] [C++] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14200:
---
Issue Type: Bug  (was: New Feature)

> [R] [C++] strftime on a date should not use or be confused by timezones
> ---
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is that the date 1992-01-01 is being 
> interpreted as 1992-01-01 00:00:00 UTC, and then when {{strftime()}} is 
> called it displays that timestamp as 1991-12-31 ... (since my system is set 
> to a timezone behind UTC), and then takes the year from it. If I specify 
> {{tz = "UTC"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}



--


[jira] [Created] (ARROW-14200) [R] [C++] strftime on a date should not use or be confused by timezones

2021-10-01 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14200:
--

 Summary: [R] [C++] strftime on a date should not use or be 
confused by timezones
 Key: ARROW-14200
 URL: https://issues.apache.org/jira/browse/ARROW-14200
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, R
Reporter: Jonathan Keane


When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
assumed.

What I think is going on below is that the date 1992-01-01 is being interpreted 
as 1992-01-01 00:00:00 UTC, and then when {{strftime()}} is called it displays 
that timestamp as 1991-12-31 ... (since my system is set to a timezone behind 
UTC), and then takes the year from it. If I specify {{tz = "UTC"}} in the 
{{strftime()}}, I get the expected result (though that shouldn't be necessary).


Run in the US central timezone:
{code}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
Table$create(
  data.frame(
x = as.Date("1992-01-01")
  )
) %>% 
  mutate(
as_int_strftime = as.integer(strftime(x, "%Y")),
strftime = strftime(x, "%Y"),
as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
strftime_utc = strftime(x, "%Y", tz = "UTC"),
year = year(x)
  ) %>%
  collect()
#>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
#> 1 1992-01-01            1991     1991                1992         1992 1992
{code}



--


[jira] [Updated] (ARROW-13751) [Doc][Cookbook] Searching for values matching a predicate in Arrays - Python

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13751:
---
Labels: github-pullrequest pull-request-available  (was: github-pullrequest)

> [Doc][Cookbook] Searching for values matching a predicate in Arrays - Python
> 
>
> Key: ARROW-13751
> URL: https://issues.apache.org/jira/browse/ARROW-13751
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: github-pullrequest, pull-request-available
>




--


[jira] [Updated] (ARROW-13732) [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr verbs - R

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13732:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr verbs - R
> 
>
> Key: ARROW-13732
> URL: https://issues.apache.org/jira/browse/ARROW-13732
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>




--


[jira] [Commented] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-01 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423401#comment-17423401
 ] 

Weston Pace commented on ARROW-13887:
-

Can we fix the error message in C++ instead?  The criteria could be:

 * If there is a parse error
 * The error happens on the first line
 * A list of column names has been provided to the reader

Then add to the error message: "If the data has a header it must be explicitly 
skipped since column names were provided."
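The proposed criteria amount to a small decision function. The helper below is hypothetical (its name and signature are not from the Arrow codebase); it only illustrates when the extra hint would be appended:

```python
def augment_csv_error(message, row_index, column_names_provided):
    # Hypothetical helper mirroring the criteria above: only annotate a
    # parse/conversion error that occurs on the first line of the file when
    # the caller supplied an explicit list of column names.
    if row_index == 0 and column_names_provided:
        return (message + " If the data has a header it must be explicitly "
                "skipped since column names were provided")
    return message
```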

> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, )
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see 
> {color:#1d1c1d}ARROW-11766, and ARROW-12791{color}



--


[jira] [Closed] (ARROW-13722) [Doc][Cookbook] Specifying Schemas - R

2021-10-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane closed ARROW-13722.

Resolution: Fixed

> [Doc][Cookbook] Specifying Schemas - R
> --
>
> Key: ARROW-13722
> URL: https://issues.apache.org/jira/browse/ARROW-13722
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nicola Crane
>Priority: Major
>




--


[jira] [Updated] (ARROW-13893) [R] Make head/tail lazy on datasets and queries

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13893:

Labels: query-engine  (was: )

> [R] Make head/tail lazy on datasets and queries
> ---
>
> Key: ARROW-13893
> URL: https://issues.apache.org/jira/browse/ARROW-13893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: query-engine
> Fix For: 6.0.0
>
>




--


[jira] [Updated] (ARROW-14063) [R] open_dataset() does not work on CSVs without header rows

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14063:

Fix Version/s: 6.0.0

> [R] open_dataset() does not work on CSVs without header rows
> 
>
> Key: ARROW-14063
> URL: https://issues.apache.org/jira/browse/ARROW-14063
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: sessionInfo()
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.5 LTS
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> locale:
>  [1] LC_CTYPE=C.UTF-8   LC_NUMERIC=C   LC_TIME=C.UTF-8   
>  [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8LC_MESSAGES=C.UTF-8   
>  [7] LC_PAPER=C.UTF-8   LC_NAME=C  LC_ADDRESS=C  
> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> other attached packages:
> [1] arrow_5.0.0.2  dplyr_1.0.5magrittr_2.0.1 targets_0.6.0 
> loaded via a namespace (and not attached):
>  [1] httr_1.4.2  rnaturalearth_0.1.0 sass_0.4.0  tidyr_1.1.3  
>   
>  [5] jsonlite_1.7.2  bit64_4.0.5 bslib_0.2.5.1   
> assertthat_0.2.1   
>  [9] askpass_1.1 sp_1.4-5blob_1.2.1  renv_0.13.2  
>   
> [13] yaml_2.2.1  globals_0.14.0  pillar_1.5.1
> RSQLite_2.2.7  
> [17] lattice_0.20-41 glue_1.4.2  digest_0.6.27   
> htmltools_0.5.1.1  
> [21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0   config_0.3.1 
>   
> [25] purrr_0.3.4 processx_3.5.1  openssl_1.4.3   tibble_3.1.0 
>   
> [29] proxy_0.4-25aws.s3_0.3.21   colourvalues_0.3.7  
> generics_0.1.0 
> [33] ellipsis_0.3.1  cachem_1.0.5withr_2.4.1 furrr_0.2.3  
>   
> [37] cli_2.4.0   crayon_1.4.1memoise_2.0.0   
> evaluate_0.14  
> [41] ps_1.6.0fs_1.5.0future_1.21.0   fansi_0.4.2  
>   
> [45] parallelly_1.25.0   xml2_1.3.2  class_7.3-18
> rsconnect_0.8.18   
> [49] tools_4.0.5 data.table_1.14.0   hms_1.0.0   
> lifecycle_1.0.0
> [53] stringr_1.4.0   callr_3.6.0 jquerylib_0.1.4 
> compiler_4.0.5 
> [57] e1071_1.7-6 rlang_0.4.10classInt_0.4-3  units_0.7-1  
>   
> [61] grid_4.0.5  rstudioapi_0.13 visNetwork_2.0.9
> htmlwidgets_1.5.3  
> [65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6
> base64enc_0.1-3
> [69] rmarkdown_2.7   codetools_0.2-18DBI_1.1.1   curl_4.3 
>   
> [73] R6_2.5.0lubridate_1.7.10knitr_1.31  
> fastmap_1.1.0  
> [77] rgeos_0.5-5 bit_4.0.4   utf8_1.2.1  
> tarchetypes_0.2.1  
> [81] readr_1.4.0 KernSmooth_2.23-18  stringi_1.5.3   
> parallel_4.0.5 
> [85] Rcpp_1.0.6  vctrs_0.3.7 sf_0.9-8
> leaflet_2.0.4.1
> [89] dbplyr_2.1.1tidyselect_1.1.0xfun_0.22
>Reporter: Jared Lander
>Priority: Major
>  Labels: bug
> Fix For: 6.0.0
>
>
> Using {{open_dataset()}} on a CSV without a header row, followed by 
> {{collect()}}, results either in a {{tibble}} of {{NA}}s or an error, 
> depending on whether the first row of data contains duplicated values. This 
> affects reading a single file or a directory of files.
> Here we use the `diamonds` data, where the first row of data does not 
> contain any repeated values.
> {code:java}
> library(arrow)
> library(magrittr)
> data(diamonds, package='ggplot2')
> readr::write_csv(head(diamonds), file='diamonds_with_header.csv', 
> col_names=TRUE)
> readr::write_csv(head(diamonds), file='diamonds_without_header.csv', 
> col_names=FALSE)
> diamond_schema <- schema(
> carat=float32()
> , cut=string()
> , color=string()
> , clarity=string()
> , depth=float32()
> , table=float32()
> , price=float32()
> , x=float32()
> , y=float32()
> , z=float32()
> )
> diamonds_with_headers <- open_dataset('diamonds_with_header.csv', 
> schema=diamond_schema, format='csv')
> diamonds_without_headers <- open_dataset('diamonds_without_header.csv', 
> schema=diamond_schema, format='csv')
> # this works
> diamonds_with_headers %>% collect()
> # A tibble: 6 x 10
>   carat cut     color clarity depth table price     x     y     z
> 1 0.230 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
> 2 0.210 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
> 3 0.230 

[jira] [Updated] (ARROW-14198) [Java] Upgrade Netty and gRPC dependencies

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14198:
---
Labels: pull-request-available  (was: )

> [Java] Upgrade Netty and gRPC dependencies
> --
>
> Key: ARROW-14198
> URL: https://issues.apache.org/jira/browse/ARROW-14198
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Current versions in use are quite old and subject to vulnerabilities.
> See https://www.cvedetails.com/cve/CVE-2021-21409/



--


[jira] [Created] (ARROW-14199) [R] bindings for format where possible

2021-10-01 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14199:
--

 Summary: [R] bindings for format where possible
 Key: ARROW-14199
 URL: https://issues.apache.org/jira/browse/ARROW-14199
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


Now that we have {{strftime}}, we should also be able to make bindings for 
{{format()}}. This might be complicated, and we may need to punt on a number of 
types that {{format()}} accepts but Arrow doesn't (yet) support formatting; 
that's ok.

Some of those might still be wrappable with a handful of kernels stacked 
together: {{format(float)}} might be round + cast to character
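The "round + cast to character" composition can be sketched directly. The helper below is hypothetical, stdlib-only Python; in Arrow terms it would be a round kernel followed by a cast to string:

```python
def format_floats(values, digits=1):
    # Sketch of composing two kernels: round each value, then cast the
    # rounded value to its character representation.
    return [str(round(v, digits)) for v in values]
```

Usage: `format_floats([3.14159], 2)` rounds to 3.14 and then stringifies it.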




--


[jira] [Created] (ARROW-14198) [Java] Upgrade Netty and gRPC dependencies

2021-10-01 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-14198:


 Summary: [Java] Upgrade Netty and gRPC dependencies
 Key: ARROW-14198
 URL: https://issues.apache.org/jira/browse/ARROW-14198
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Current versions in use are quite old and subject to vulnerabilities.

See https://www.cvedetails.com/cve/CVE-2021-21409/



--


[jira] [Resolved] (ARROW-9647) [Java] Cannot install arrow-memory 1.0.0 from maven central

2021-10-01 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-9647.
-
Resolution: Not A Problem

> [Java] Cannot install arrow-memory 1.0.0 from maven central
> ---
>
> Key: ARROW-9647
> URL: https://issues.apache.org/jira/browse/ARROW-9647
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.0
>Reporter: Marius van Niekerk
>Priority: Major
>
> Seems that the jar is missing from 
> [https://mvnrepository.com/artifact/org.apache.arrow/arrow-memory/1.0.0]
>  



--


[jira] [Resolved] (ARROW-13973) [C++] Add a SelectKSinkNode

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13973.
-
Resolution: Fixed

Issue resolved by pull request 11274
[https://github.com/apache/arrow/pull/11274]

> [C++] Add a SelectKSinkNode
> ---
>
> Key: ARROW-13973
> URL: https://issues.apache.org/jira/browse/ARROW-13973
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Hook up the SelectK kernel in ARROW-1565 to the query engine as another type 
> of sink node.



--


[jira] [Updated] (ARROW-12763) [R] Optimize dplyr queries that use head/tail after arrange

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12763:

Labels: query-engine  (was: )

> [R] Optimize dplyr queries that use head/tail after arrange
> ---
>
> Key: ARROW-12763
> URL: https://issues.apache.org/jira/browse/ARROW-12763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Use the Arrow C++ function {{partition_nth_indices}} to optimize dplyr 
> queries like this:
> {code:r}
> iris %>%
>   Table$create() %>% 
>   arrange(desc(Sepal.Length)) %>%
>   head(10) %>%
>   collect()
> {code}
> This query sorts the full table even though it doesn't need to. It could use 
> {{partition_nth_indices}} to find the rows containing the top 10 values of 
> {{Sepal.Length}} and only collect and sort those 10 rows.
> Test to see if this improves performance in practice on larger data.
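The optimization quoted above is select-then-sort: partition out the top-k candidates in roughly linear time, then sort only those k rows. A stdlib sketch of the shape of the saving (not the Arrow {{partition_nth_indices}} kernel itself):

```python
import heapq
import random

rows = [random.random() for _ in range(100_000)]

# Full sort of the table: O(n log n).
top10_full_sort = sorted(rows, reverse=True)[:10]

# Select the 10 candidates first (O(n log k)), returned already sorted
# in descending order, so only k rows ever need ordering.
top10_selected = heapq.nlargest(10, rows)

assert top10_full_sort == top10_selected
```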



--


[jira] [Updated] (ARROW-12763) [R] Optimize dplyr queries that use head/tail after arrange

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12763:

Fix Version/s: 6.0.0

> [R] Optimize dplyr queries that use head/tail after arrange
> ---
>
> Key: ARROW-12763
> URL: https://issues.apache.org/jira/browse/ARROW-12763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Alexander Ocsa
>Priority: Major
> Fix For: 6.0.0
>
>
> Use the Arrow C++ function {{partition_nth_indices}} to optimize dplyr 
> queries like this:
> {code:r}
> iris %>%
>   Table$create() %>% 
>   arrange(desc(Sepal.Length)) %>%
>   head(10) %>%
>   collect()
> {code}
> This query sorts the full table even though it doesn't need to. It could use 
> {{partition_nth_indices}} to find the rows containing the top 10 values of 
> {{Sepal.Length}} and only collect and sort those 10 rows.
> Test to see if this improves performance in practice on larger data.



--


[jira] [Updated] (ARROW-14181) [C++][Compute] Hash Join support for dictionary

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14181:

Fix Version/s: (was: 7.0.0)
   6.0.0

> [C++][Compute] Hash Join support for dictionary   
> 
>
> Key: ARROW-14181
> URL: https://issues.apache.org/jira/browse/ARROW-14181
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Michal Nowakiewicz
>Assignee: Michal Nowakiewicz
>Priority: Critical
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Currently dictionary arrays are not supported at all as input columns to hash 
> join.
> Add support for dictionary arrays in hash join for both key columns and 
> payload columns.



--


[jira] [Resolved] (ARROW-13890) [R] Split up test-dataset.R and test-dplyr.R

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13890.
-
Resolution: Fixed

Issue resolved by pull request 11292
[https://github.com/apache/arrow/pull/11292]

> [R] Split up test-dataset.R and test-dplyr.R
> 
>
> Key: ARROW-13890
> URL: https://issues.apache.org/jira/browse/ARROW-13890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> It's 1700 lines long and growing. One natural way to split is by file format 
> (parquet, csv).



--


[jira] [Resolved] (ARROW-13634) [R] Update distro() in nixlibs.R to map from "bookworm" to 12

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13634.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10939
[https://github.com/apache/arrow/pull/10939]

> [R] Update distro() in nixlibs.R to map from "bookworm" to 12
> -
>
> Key: ARROW-13634
> URL: https://issues.apache.org/jira/browse/ARROW-13634
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--


[jira] [Resolved] (ARROW-14195) [R] Fix ExecPlan binding annotations

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14195.
-
Resolution: Fixed

Issue resolved by pull request 11290
[https://github.com/apache/arrow/pull/11290]

> [R] Fix ExecPlan binding annotations
> 
>
> Key: ARROW-14195
> URL: https://issues.apache.org/jira/browse/ARROW-14195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The minimal build failed with undefined symbols because some of the ExecPlan 
> bindings require ARROW_DATASET but our annotations in R didn't reflect that.



--


[jira] [Updated] (ARROW-14159) [R] Re-allow some multithreading on Windows

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14159:

Description: 
Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
discussion about adding more controls, disabling threading in some places and 
not others, etc. We want to do this soon after release so that we have a few 
months to see how things behave on CI before releasing again.

-

Collecting some CI hangs after ARROW-8379

1. Rtools35, 64bit test suite hangs: 

https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034

{code}
** running tests for arch 'i386' ...
  Running 'testthat.R' [17s]
 OK
** running tests for arch 'x64' ...

Error: Error: 
{code}

  was:
Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
discussion about adding more controls, disabling threading in some places and 
not others, etc. We want to do this soon after release so that we have a few 
months to see how things behave on CI before releasing again.

-

1. Collecting some CI hangs after ARROW-8379: 
https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034

Rtools35, 64bit test suite hangs:

{code}
** running tests for arch 'i386' ...
  Running 'testthat.R' [17s]
 OK
** running tests for arch 'x64' ...

Error: Error: 
{code}


> [R] Re-allow some multithreading on Windows
> ---
>
> Key: ARROW-14159
> URL: https://issues.apache.org/jira/browse/ARROW-14159
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 7.0.0
>
>
> Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
> discussion about adding more controls, disabling threading in some places and 
> not others, etc. We want to do this soon after release so that we have a few 
> months to see how things behave on CI before releasing again.
> -
> Collecting some CI hangs after ARROW-8379
> 1. Rtools35, 64bit test suite hangs: 
> https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034
> {code}
> ** running tests for arch 'i386' ...
>   Running 'testthat.R' [17s]
>  OK
> ** running tests for arch 'x64' ...
> Error: Error:   stderr is not a pipe.>
> {code}



--


[jira] [Updated] (ARROW-14159) [R] Re-allow some multithreading on Windows

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14159:

Description: 
Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
discussion about adding more controls, disabling threading in some places and 
not others, etc. We want to do this soon after release so that we have a few 
months to see how things behave on CI before releasing again.

-

1. Collecting some CI hangs after ARROW-8379: 
https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034

Rtools35, 64bit test suite hangs:

{code}
** running tests for arch 'i386' ...
  Running 'testthat.R' [17s]
 OK
** running tests for arch 'x64' ...

Error: Error: 
{code}

  was:Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
discussion about adding more controls, disabling threading in some places and 
not others, etc. We want to do this soon after release so that we have a few 
months to see how things behave on CI before releasing again.


> [R] Re-allow some multithreading on Windows
> ---
>
> Key: ARROW-14159
> URL: https://issues.apache.org/jira/browse/ARROW-14159
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 7.0.0
>
>
> Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
> discussion about adding more controls, disabling threading in some places and 
> not others, etc. We want to do this soon after release so that we have a few 
> months to see how things behave on CI before releasing again.
> -
> 1. Collecting some CI hangs after ARROW-8379: 
> https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034
> Rtools35, 64bit test suite hangs:
> {code}
> ** running tests for arch 'i386' ...
>   Running 'testthat.R' [17s]
>  OK
> ** running tests for arch 'x64' ...
> Error: Error:   stderr is not a pipe.>
> {code}



--


[jira] [Commented] (ARROW-14184) [C++] allow joins where the keys include new columns on the left

2021-10-01 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423338#comment-17423338
 ] 

Jonathan Keane commented on ARROW-14184:


Turns out this is part of TPC-H query number 8. I've been able to work around 
it by swapping which table is left and which is right.
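The swap workaround can be illustrated with a toy single-key hash join (shown for 
an inner join, where swapping sides is symmetric; {{inner_join}} is a hypothetical 
helper, not Arrow's join implementation):

```python
def inner_join(left, right, left_key, right_key):
    """Toy hash join: build a table on `right`, probe with `left`.

    Illustrative sketch only; assumes unique keys on the build side.
    """
    build = {row[right_key]: row for row in right}
    out = []
    for row in left:
        match = build.get(row[left_key])
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != right_key})
            out.append(merged)
    return out


cars = [{"the_gear": 4, "mpg": 21.0}, {"the_gear": 3, "mpg": 18.7}]
right_df = [{"gear": 3, "col_a": "a"}, {"gear": 4, "col_a": "a"}]

# Joining with the renamed key on the left...
a = inner_join(cars, right_df, "the_gear", "gear")
# ...matches the same rows as swapping which table is left and which is right.
b = inner_join(right_df, cars, "gear", "the_gear")
assert {r["mpg"] for r in a} == {r["mpg"] for r in b}
```

The bug is that the key resolver rejects the renamed field on one side only, so 
flipping the sides moves the renamed key to the side that resolves correctly.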

> [C++] allow joins where the keys include new columns on the left
> 
>
> Key: ARROW-14184
> URL: https://issues.apache.org/jira/browse/ARROW-14184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Michal Nowakiewicz
>Priority: Major
>
> If I try to join where the key column on the left is new (a rename, or made 
> by an expression) I get an error:
> {code}
> ``` r
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> right_df <- data.frame(
>   gear = as.double(c(3:5)),
>   col_a = "a"
> )
> Table$create(mtcars) %>% 
>   rename(the_gear = gear) %>%
>   left_join(
> right_df,
> by = c(the_gear = "gear")
>   ) %>%
>   collect()
> #> Error: Invalid: No match or multiple matches for key field reference 
> FieldRef.Name(the_gear) on left side of the join
> {code}
> Interestingly, if the column is renamed/created on the right side, it works 
> just fine:
> {code}
> Table$create(mtcars) %>% 
>   left_join(
> right_df %>% 
>   rename(the_gear = gear),
> by = c(gear = "the_gear")
>   ) %>%
>   collect()
> #>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb col_a
> #> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     a
> #> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     a
> #> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     a
> #> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     a
> #> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2     a
> #> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     a
> #> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     a
> #> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     a
> #> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     a
> #> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     a
> #> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     a
> #> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3     a
> #> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     a
> #> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     a
> #> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     a
> #> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     a
> #> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     a
> #> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1     a
> #> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     a
> #> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     a
> #> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     a
> #> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2     a
> #> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2     a
> #> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     a
> #> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2     a
> #> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1     a
> #> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     a
> #> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     a
> #> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4     a
> #> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     a
> #> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8     a
> #> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     a
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14137) PyArrow - “The kernel appears to have died”- Segmentation fault

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1742#comment-1742
 ] 

Joris Van den Bossche commented on ARROW-14137:
---

Would you be able to provide a reproducible example?

> PyArrow - “The kernel appears to have died”- Segmentation fault 
> 
>
> Key: ARROW-14137
> URL: https://issues.apache.org/jira/browse/ARROW-14137
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Debian Linux
>Reporter: Eric Li
>Priority: Critical
>
> Hi, 
> I am using PyTorch version 1.9.1+cu102 and PyArrow 1.19.5. When I train the 
> model, I get the error below. Can someone help me?
> During training, jupyter notebook crashes (after the same number of steps 
> each time) with the message “The kernel appears to have died. It will restart 
> automatically”. I get a segmentation fault. I am training on a single GPU, 
> pytorch version 1.9.1.
> Here’s what gdb stacktrace returns:
>  
> {{Thread 1 "python" received signal SIGSEGV, Segmentation fault. 
> 0x7ffea0a62a8a in 
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from 
> /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 (gdb) 
> backtrace #0 0x7ffea0a62a8a in 
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from 
> /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #1 
> 0x7ffea0a8901c in arrow::json::ChunkedListArrayBuilder::InsertNull(long, 
> long) () from /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 
> #2 0x7ffea0a8941f in arrow::json::ChunkedListArrayBuilder::Insert(long, 
> std::shared_ptr const&, std::shared_ptr const&) 
> () from /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #3 
> 0x7ffea0a860bd in 
> arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr*)
>  () from /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #4 
> 0x7ffea0a7dc2c in 
> arrow::json::ChunkedListArrayBuilder::Finish(std::shared_ptr*)
>  () --Type  for more, q to quit, c to continue without paging--c from 
> /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #5 
> 0x7ffea0a8654d in 
> arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr*)
>  () from /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #6 
> 0x7ffea0a8654d in 
> arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr*)
>  () from /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #7 
> 0x7ffea0a94e02 in arrow::json::TableReaderImpl::Read() () from 
> /opt/conda/lib/python3.7/site-packages/pyarrow/libarrow.so.500 #8 
> 0x7ffe9dc66f39 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, 
> _object*) () from 
> /opt/conda/lib/python3.7/site-packages/pyarrow/_json.cpython-37m-x86_64-linux-gnu.so
>  #9 0x55666919 in _PyMethodDef_RawFastCallKeywords (method= out>, self=0x0, args=, nargs=, 
> kwnames=) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/call.c:693
>  #10 0x556c20b8 in _PyCFunction_FastCallKeywords (kwnames= out>, nargs=, args=0x5d179860, func=0x7ffe9de56c80) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/call.c:723
>  #11 call_function (pp_stack=0x7fffc970, oparg=, 
> kwnames=) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Python/ceval.c:4568
>  #12 0x55707fe8 in _PyEval_EvalFrameDefault (f=, 
> throwflag=) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Python/ceval.c:3139
>  #13 0x556d6e3f in PyEval_EvalFrameEx (throwflag=0, f=0x5d179680) 
> at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Python/ceval.c:544
>  #14 gen_send_ex (closing=0, exc=0, arg=0x0, gen=0x7ffe0a8bb8d0) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/genobject.c:221
>  #15 gen_iternext (gen=0x7ffe0a8bb8d0) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/genobject.c:542
>  #16 0x55707598 in _PyEval_EvalFrameDefault (f=, 
> throwflag=) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Python/ceval.c:2809
>  #17 0x556d6e3f in PyEval_EvalFrameEx (throwflag=0, f=0x7ffe9dc17650) 
> at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Python/ceval.c:544
>  #18 gen_send_ex (closing=0, exc=0, arg=0x0, gen=0x7ffe0a8bb850) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/genobject.c:221
>  #19 gen_iternext (gen=0x7ffe0a8bb850) at 
> /home/conda/feedstock_root/build_artifacts/python_1631559780463/work/Objects/genobject.c:542
>  #20 0x55707598 in _PyEval_EvalFrameDefault 

[jira] [Issue Comment Deleted] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Judah (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Judah updated ARROW-14196:
--
Comment: was deleted

(was: [~trucnguyenlam] Is this something that could be flipped now?

https://github.com/pandas-dev/pandas/pull/43690#pullrequestreview-760364590)

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]
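For context, the difference between the two modes is the name of the list's 
inner field in the written Parquet schema (a sketch based on my reading of the 
Parquet LIST logical-type spec and the ARROW-11497 discussion; the field name 
{{a}} and type {{int64}} are placeholders):

```
# Non-compliant (current default): the child keeps Arrow's list-item name
optional group a (LIST) {
  repeated group list {
    optional int64 item;
  }
}

# With compliant_nested_types: the child is named "element", per the spec
optional group a (LIST) {
  repeated group list {
    optional int64 element;
  }
}
```

Flipping the default therefore changes the dotted paths (e.g. {{a.list.item}} vs 
{{a.list.element}}) that column selection and nested field references resolve 
against, which is the compatibility concern discussed below.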



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-14036) [R] Binding for n_distinct() with no grouping

2021-10-01 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reassigned ARROW-14036:


Assignee: Percy Camilo Triveño Aucahuasi  (was: Ian Cook)

> [R] Binding for n_distinct() with no grouping
> -
>
> Key: ARROW-14036
> URL: https://issues.apache.org/jira/browse/ARROW-14036
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: kernel
> Fix For: 6.0.0
>
>
> ARROW-13620 added a binding for {{n_distinct()}} but it only works for 
> _grouped_ aggregation, not whole-table aggregation. 
> This works:
> {code:java}
> Table$create(starwars) %>%
>   group_by(homeworld) %>%
>   summarise(n_distinct(species)) %>%
>   collect(){code}
> but this errors:
> {code:java}
> Table$create(starwars) %>%
>   summarise(n_distinct(species)) %>%
>   collect()
> #> Error: Key error: No function registered with name: count_distinct{code}
> Once we have a non-hash {{count_distinct}} aggregate kernel in the C++ 
> library (ARROW-14035) we should bind the options for it in the R package and 
> add a test.
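The grouped vs whole-table distinction can be sketched in plain Python 
(illustrative only, not the Arrow kernel; nulls are ignored for simplicity):

```python
from collections import defaultdict


def n_distinct(values):
    """Whole-table distinct count, like the requested non-hash kernel."""
    return len({v for v in values if v is not None})


def n_distinct_by(keys, values):
    """Grouped distinct count: one result per key, like the hash kernel."""
    groups = defaultdict(set)
    for k, v in zip(keys, values):
        if v is not None:
            groups[k].add(v)
    return {k: len(s) for k, s in groups.items()}


homeworld = ["Tatooine", "Tatooine", "Naboo", "Naboo"]
species = ["Human", "Droid", "Human", "Human"]
assert n_distinct(species) == 2
assert n_distinct_by(homeworld, species) == {"Tatooine": 2, "Naboo": 1}
```

The error above arises because only the grouped (hash) variant was registered; 
the ungrouped {{summarise()}} call needs the whole-table form.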



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14036) [R] Binding for n_distinct() with no grouping

2021-10-01 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423331#comment-17423331
 ] 

Ian Cook commented on ARROW-14036:
--

[#11257|https://github.com/apache/arrow/pull/11257] (ARROW-14035) resolves this

> [R] Binding for n_distinct() with no grouping
> -
>
> Key: ARROW-14036
> URL: https://issues.apache.org/jira/browse/ARROW-14036
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Critical
>  Labels: kernel
> Fix For: 6.0.0
>
>
> ARROW-13620 added a binding for {{n_distinct()}} but it only works for 
> _grouped_ aggregation, not whole-table aggregation. 
> This works:
> {code:java}
> Table$create(starwars) %>%
>   group_by(homeworld) %>%
>   summarise(n_distinct(species)) %>%
>   collect(){code}
> but this errors:
> {code:java}
> Table$create(starwars) %>%
>   summarise(n_distinct(species)) %>%
>   collect()
> #> Error: Key error: No function registered with name: count_distinct{code}
> Once we have a non-hash {{count_distinct}} aggregate kernel in the C++ 
> library (ARROW-14035) we should bind the options for it in the R package and 
> add a test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2021-10-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6607:
-
Fix Version/s: 6.0.0  (was: 7.0.0)

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 6.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert \{1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `\{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```
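A common pre-processing workaround until set columns are supported is to convert 
each set cell to a sorted list before handing the data to Arrow. A dependency-free 
sketch ({{sets_to_lists}} is a hypothetical helper, not a pyarrow API; sorting 
makes the unordered set contents deterministic):

```python
def sets_to_lists(columns):
    """Convert set cells to sorted lists so list-type conversion can apply.

    `columns` maps column names to lists of cell values; non-set cells
    pass through unchanged. Sketch of the workaround, not pyarrow code.
    """
    return {
        name: [sorted(v) if isinstance(v, (set, frozenset)) else v
               for v in values]
        for name, values in columns.items()
    }


df = {"a": [1, 2, 3], "b": [{1, 2}, {2, 3}, {3, 4, 5}]}
converted = sets_to_lists(df)
assert converted["b"] == [[1, 2], [2, 3], [3, 4, 5]]
assert converted["a"] == [1, 2, 3]
```

This avoids both exploding the frame and the string round-trip mentioned in the 
report, at the cost of losing the "set" semantics on read-back.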



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2021-10-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-6607.
--
Resolution: Fixed

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 6.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert \{1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `\{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2021-10-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-6607:


Assignee: Alessandro Molina  (was: Krisztian Szucs)

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 7.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert \{1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `\{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423330#comment-17423330
 ] 

Joris Van den Bossche commented on ARROW-6607:
--

Indeed, the snippet above works now. This was fixed by ARROW-6626.

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 7.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert \{1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `\{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame(\{'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423328#comment-17423328
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

> I'm also surprised that a non-existing column name wouldn't return an error 
> instead of selecting nothing?

With the new datasets API, it actually raises an error if a column name is not 
found. With the legacy implementation (or the plain {{ParquetFile}} interface), 
it ignores those. 
(and I was using use_legacy_dataset=True because with Datasets we don't yet 
support selecting nested fields ...)

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423323#comment-17423323
 ] 

Antoine Pitrou commented on ARROW-14196:


Is it really common to read only a list's child column without the list column 
itself? Otherwise we can live with the minor compatibility break, IMHO.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423324#comment-17423324
 ] 

Antoine Pitrou commented on ARROW-14196:


I'm also surprised that a non-existing column name wouldn't return an error 
instead of selecting nothing?

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Comment Edited] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423310#comment-17423310
 ] 

Joris Van den Bossche edited comment on ARROW-14196 at 10/1/21, 2:31 PM:
-

I wrote this with the latest pyarrow (master):

{code:python}
table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", 
use_compliant_nested_type=True)
{code}

In the latest pyarrow they both read fine, but they have different list field 
names (which can impact e.g. nested field refs):

{code:python}
>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64
{code}

So e.g. doing {{pq.read_table("test_nested_noncompliant.parquet", 
columns=["a.list.item"], use_legacy_dataset=True)}} works for the noncompliant 
file, but doesn't select anything from the compliant file.

Those files also read fine (and result in the same difference in list field 
names) with older versions of Arrow (tested down to Arrow 1.0). 


was (Author: jorisvandenbossche):
I wrote this with the latest pyarrow (master):

{code:python}
table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", 
use_compliant_nested_type=True)
{code}

In the latest pyarrow they both read fine, but they have different list field 
names (which can impact e.g. nested field refs):

{code:python}
>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64
{code}

So e.g. doing {{pq.read_table("test_nested_noncompliant.parquet", 
columns=["a.list.item"], use_legacy_dataset=True)}} works for the noncompliant 
file, but doesn't select anything from the compliant file.

Those files also read fine (and result in the same difference in list field 
names) with older versions of Arrow (tested down to Arrow 1.0). 

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423310#comment-17423310
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

I wrote this with the latest pyarrow (master):

{code:python}
table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", 
use_compliant_nested_type=True)
{code}

In the latest pyarrow they both read fine, but they have different list field 
names (which can impact e.g. nested field refs):

{code:python}
>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64
{code}

So e.g. doing {{pq.read_table("test_nested_noncompliant.parquet", 
columns=["a.list.item"], use_legacy_dataset=True)}} works for the noncompliant 
file, but doesn't select anything from the compliant file.

Those files also read fine (and result in the same difference in list field 
names) with older versions of Arrow (tested down to Arrow 1.0). 
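For reference, a sketch of the two layouts at the Parquet level, following the 
Parquet format spec's three-level list encoding (the column name comes from the 
example above; the only difference between the two modes is the innermost field 
name):

{code}
// use_compliant_nested_type=False (current default): Arrow's child name "item"
optional group a (LIST) {
  repeated group list {
    optional int64 item;
  }
}

// use_compliant_nested_type=True: the spec-mandated name "element"
optional group a (LIST) {
  repeated group list {
    optional int64 element;
  }
}
{code}

This is why the nested field reference path is "a.list.item" for one file and 
"a.list.element" for the other.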

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Updated] (ARROW-13890) [R] Split up test-dataset.R and test-dplyr.R

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13890:
---
Labels: pull-request-available  (was: )

> [R] Split up test-dataset.R and test-dplyr.R
> 
>
> Key: ARROW-13890
> URL: https://issues.apache.org/jira/browse/ARROW-13890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's 1700 lines long and growing. One natural way to split is by file format 
> (parquet, csv).





[jira] [Updated] (ARROW-13890) [R] Split up test-dataset.R and test-dplyr.R

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13890:

Summary: [R] Split up test-dataset.R and test-dplyr.R  (was: [R] Split up 
test-dataset.R)

> [R] Split up test-dataset.R and test-dplyr.R
> 
>
> Key: ARROW-13890
> URL: https://issues.apache.org/jira/browse/ARROW-13890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 6.0.0
>
>
> It's 1700 lines long and growing. One natural way to split is by file format 
> (parquet, csv).





[jira] [Assigned] (ARROW-13890) [R] Split up test-dataset.R

2021-10-01 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13890:
---

Assignee: Neal Richardson

> [R] Split up test-dataset.R
> ---
>
> Key: ARROW-13890
> URL: https://issues.apache.org/jira/browse/ARROW-13890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 6.0.0
>
>
> It's 1700 lines long and growing. One natural way to split is by file format 
> (parquet, csv).





[jira] [Commented] (ARROW-14190) [R] Should unify_schemas() allow change of type?

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423308#comment-17423308
 ] 

Joris Van den Bossche commented on ARROW-14190:
---

Ah, sorry, my bad. I thought we already had this kind of schema evolution 
enabled when explicitly unifying the schema, and not only when implicitly 
casting to the schema of the first file.

> [R] Should unify_schemas() allow change of type?
> 
>
> Key: ARROW-14190
> URL: https://issues.apache.org/jira/browse/ARROW-14190
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Should {{unify_schemas()}} be able to do schema evolution?  If schemas with 
> different (but compatible) types are combined using {{open_dataset()}}, this 
> works, whereas if done via {{unify_schemas()}}, it results in an error.
> See discussion here: 
> https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220
> {code:r}
> library(dplyr)
> library(arrow)
> # Set up schemas
> schema1 = schema(speed = int32(), dist = int32())
> schema2 = schema(speed = float64(), dist = float64())
> # Try to combine schemas via `unify_schemas()` - results in an error
> unify_schemas(schema1, schema2)
> ## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 
> vs double
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)
> # Create datasets with different schemas and read in via `open_dataset()`
> cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
> cars2 <- Table$create(slice(cars, 26:50), schema = schema2)
> td <- tempfile()
> dir.create(td)
> write_parquet(cars1, paste0(td, "/cars1.parquet"))
> write_parquet(cars2, paste0(td, "/cars2.parquet"))
> new_dataset <- open_dataset(td) 
> new_dataset$schema
> # Schema
> # speed: int32
> # dist: int32
> # 
> # See $metadata for additional Schema metadata
> {code}





[jira] [Updated] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13887:
-
Fix Version/s: 6.0.0

> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, )
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see ARROW-11766 and 
> ARROW-12791.
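The capture-and-resignal pattern suggested above can be sketched generically; 
this Python illustration stands in for the R tryCatch/rlang::abort version, 
and the matched error substring and hint wording are assumptions:

{code:python}
# Hypothetical sketch: wrap a reader, detect the low-level CSV conversion
# error, and re-raise it with an actionable hint about skip=1.
def read_with_hint(read_fn, *args, **kwargs):
    try:
        return read_fn(*args, **kwargs)
    except ValueError as exc:
        if "CSV conversion error" in str(exc):
            # Re-raise with the original message plus a suggestion.
            raise ValueError(
                str(exc)
                + "\nHint: if the file has a header row and you supplied an "
                "explicit schema, pass skip=1 to skip the header."
            ) from exc
        raise
{code}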





[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423307#comment-17423307
 ] 

Antoine Pitrou commented on ARROW-14196:


One possible investigation would be to write two different files, one with the 
option turned on, the other with the option turned off, and find out whether 
they are readable by 1) current version of Arrow 2) past versions of Arrow.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423302#comment-17423302
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

Some relevant quotes from the ARROW-11497:

> [Micah] I think the main reason it isn't enabled by default is it breaks 
> round trips for arrow data.  This could potentially be fixed on the reader 
> side as well.  

>> [Antoine] Perhaps we could convert the field name at the Arrow<->Parquet 
>> boundary.
> [Micah] This should be possible but it potentially needs another flag.  I 
> think in the short term plumbing the additional flag through to python makes 
> sense and we can figure out a longer term solution if this becomes a larger 
> problem.

>> [Antoine] It should simply be the default (and obviously right) behaviour. 
>> Am I missing something?
> [Micah] Backwards compatibility?  It might be possible to make some 
> inferences (haven't thought about it deeply).  But I think if we were reading 
> a conforming java produced parquet file then we would get different column 
> names if we transformed on the border (maybe there can be some rules around 
> Arrow metadata being present).  I think we can make the default to be 
> conforming behavior, but we should give users some level of control to 
> preserve the old behavior.

---

I am not deeply familiar with this, but the simplest option is to just switch 
the default of the {{compliant_nested_types}} option in ArrowWriterProperties. 
What would be the (possibly backwards-incompatible) consequences of that?
We would start writing a different Parquet file (but one actually following the 
spec). But I suppose that when reading such a file, you would then also get a 
different name for the sub-lists (which can impact selecting a sublist with a 
nested field reference?).
To avoid a breaking change on the read path, we could by default also convert 
the names at the Parquet->Arrow boundary (like the {{compliant_nested_types}} 
option already does on the Arrow->Parquet boundary). However, doing that can 
_also_ break code for people currently already reading compliant Parquet files.
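The boundary rename being discussed can be sketched generically. This is a 
hypothetical illustration (not Arrow's implementation); plain (name, children) 
tuples stand in for schema fields:

{code:python}
# Hypothetical sketch (not the Arrow implementation): rename list child
# fields between Arrow's convention ("item") and the Parquet spec's
# ("element") when crossing the Arrow<->Parquet boundary.
def rename_list_child(fields, to_compliant=True):
    src, dst = ("item", "element") if to_compliant else ("element", "item")

    def walk(field):
        name, children = field
        # Rename this field if it matches, then recurse into children.
        return (dst if name == src else name, [walk(c) for c in children])

    return [walk(f) for f in fields]

# An Arrow-style list column "a" with child "item" ...
arrow_fields = [("a", [("list", [("item", [])])])]
# ... gets child name "element" on the way to a compliant Parquet file.
parquet_fields = rename_list_child(arrow_fields, to_compliant=True)
{code}

Applying the inverse mapping on read (to_compliant=False) is the conversion at 
the Parquet->Arrow boundary mentioned above, with the caveat that it would also 
rewrite names in files that were already compliant.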

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Judah (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423299#comment-17423299
 ] 

Judah commented on ARROW-14196:
---

[~trucnguyenlam] Is this something that could be flipped now?

https://github.com/pandas-dev/pandas/pull/43690#pullrequestreview-760364590

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Commented] (ARROW-14197) [C++] Hashjoin + datasets hanging

2021-10-01 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423296#comment-17423296
 ] 

Jonathan Keane commented on ARROW-14197:


I've attached a sample of the R process while it's hung from my machine, in 
case that's helpful

> [C++] Hashjoin + datasets hanging
> -
>
> Key: ARROW-14197
> URL: https://issues.apache.org/jira/browse/ARROW-14197
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
> Attachments: sample-while-hung.out.txt
>
>
> I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not 
> _every_ time). The query is:
> {code}
> l <- input_table("lineitem") %>%
> select(l_orderkey, l_commitdate, l_receiptdate) %>%
> filter(l_commitdate < l_receiptdate) %>%
> select(l_orderkey)
>   o <- input_table("orders") %>%
> select(o_orderkey, o_orderdate, o_orderpriority) %>%
> # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" 
> + interval '3' month) %>%
> filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < 
> as.Date("1993-10-01")) %>%
> select(o_orderkey, o_orderpriority)
>   # distinct after join, tested and indeed faster
>   lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
> distinct() %>%
> select(o_orderpriority)
>   aggr <- lo %>%
> group_by(o_orderpriority) %>%
> summarise(order_count = n()) %>%
> arrange(o_orderpriority) %>% 
> collect()
> {code}
> Basically, filtered lineitems, filtered orders, join those together, 
> group_by, summarise, arrange. 
> This happens pretty reliably when the {{input_table}} is a dataset backed by 
> parquet or feather files (e.g. {{input_table}} returns something like 
> {{arrow::open_dataset("path/to/(unknown).feather", format = "feather")}})
> One can replicate this by installing an arrowbench branch 
> (https://github.com/ursacomputing/arrowbench/pull/37) with, in R: 
> {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then 
> running the following:
> {code}
> library(arrowbench)
> results <- run_benchmark(
>   tpc_h,
>   scale_factor = 1,
>   cpu_count = 8,
>   query_id = 4,
>   lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a 
> recent install of the arrow r package that supports hash joins and want to 
> avoid building a separate copy.
>   format = "feather",
>   n_iter = 20
> )
> {code}
> Note this _sometimes_ finishes, but frequently it does not and stays stuck.





[jira] [Updated] (ARROW-14197) [C++] Hashjoin + datasets hanging

2021-10-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14197:
---
Attachment: sample-while-hung.out.txt

> [C++] Hashjoin + datasets hanging
> -
>
> Key: ARROW-14197
> URL: https://issues.apache.org/jira/browse/ARROW-14197
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
> Attachments: sample-while-hung.out.txt
>
>
> I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not 
> _every_ time). The query is:
> {code}
> l <- input_table("lineitem") %>%
> select(l_orderkey, l_commitdate, l_receiptdate) %>%
> filter(l_commitdate < l_receiptdate) %>%
> select(l_orderkey)
>   o <- input_table("orders") %>%
> select(o_orderkey, o_orderdate, o_orderpriority) %>%
> # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" 
> + interval '3' month) %>%
> filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < 
> as.Date("1993-10-01")) %>%
> select(o_orderkey, o_orderpriority)
>   # distinct after join, tested and indeed faster
>   lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
> distinct() %>%
> select(o_orderpriority)
>   aggr <- lo %>%
> group_by(o_orderpriority) %>%
> summarise(order_count = n()) %>%
> arrange(o_orderpriority) %>% 
> collect()
> {code}
> Basically, filtered lineitems, filtered orders, join those together, 
> group_by, summarise, arrange. 
> This happens pretty reliably when the {{input_table}} is a dataset backed by 
> parquet or feather files (e.g. {{input_table}} returns something like 
> {{arrow::open_dataset("path/to/(unknown).feather", format = "feather")}})
> One can replicate this by installing an arrowbench branch 
> (https://github.com/ursacomputing/arrowbench/pull/37) with, in R: 
> {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then 
> running the following:
> {code}
> library(arrowbench)
> results <- run_benchmark(
>   tpc_h,
>   scale_factor = 1,
>   cpu_count = 8,
>   query_id = 4,
>   lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a 
> recent install of the arrow r package that supports hash joins and want to 
> avoid building a separate copy.
>   format = "feather",
>   n_iter = 20
> )
> {code}
> Note this _sometimes_ finishes, but frequently it does not and stays stuck.





[jira] [Created] (ARROW-14197) [C++] Hashjoin + datasets hanging

2021-10-01 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14197:
--

 Summary: [C++] Hashjoin + datasets hanging
 Key: ARROW-14197
 URL: https://issues.apache.org/jira/browse/ARROW-14197
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Jonathan Keane
 Attachments: sample-while-hung.out.txt

I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not 
_every_ time). The query is:

{code}
l <- input_table("lineitem") %>%
select(l_orderkey, l_commitdate, l_receiptdate) %>%
filter(l_commitdate < l_receiptdate) %>%
select(l_orderkey)

  o <- input_table("orders") %>%
select(o_orderkey, o_orderdate, o_orderpriority) %>%
# kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + 
interval '3' month) %>%
filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < 
as.Date("1993-10-01")) %>%
select(o_orderkey, o_orderpriority)

  # distinct after join, tested and indeed faster
  lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
distinct() %>%
select(o_orderpriority)

  aggr <- lo %>%
group_by(o_orderpriority) %>%
summarise(order_count = n()) %>%
arrange(o_orderpriority) %>% 
collect()
{code}

Basically, filtered lineitems, filtered orders, join those together, group_by, 
summarise, arrange. 

This happens pretty reliably when the {{input_table}} is a dataset backed by 
parquet or feather files (e.g. {{input_table}} returns something like 
{{arrow::open_dataset("path/to/(unknown).feather", format = "feather")}})

One can replicate this by installing an arrowbench branch 
(https://github.com/ursacomputing/arrowbench/pull/37) with, in R: 
{{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then 
running the following:

{code}
library(arrowbench)

results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a 
recent install of the arrow r package that supports hash joins and want to 
avoid building a separate copy.
  format = "feather",
  n_iter = 20
)
{code}

Note this _sometimes_ finishes, but frequently it does not and stays stuck.





[jira] [Created] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14196:
-

 Summary: [C++][Parquet] Default to compliant nested types in 
Parquet writer
 Key: ARROW-14196
 URL: https://issues.apache.org/jira/browse/ARROW-14196
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Reporter: Joris Van den Bossche


In C++ there is already an option to get the "compliant_nested_types" (to have 
the list columns follow the Parquet specification), and ARROW-11497 exposed 
this option in Python.

This is still set to False by default, but in the source it says "TODO: At some 
point we should flip this.", and in ARROW-11497 there was also some discussion 
about what it would take to change the default.

cc [~emkornfield] [~apitrou]







[jira] [Updated] (ARROW-14195) [R] Fix ExecPlan binding annotations

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14195:
---
Labels: pull-request-available  (was: )

> [R] Fix ExecPlan binding annotations
> 
>
> Key: ARROW-14195
> URL: https://issues.apache.org/jira/browse/ARROW-14195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The minimal build failed with undefined symbols because some of the ExecPlan 
> bindings require ARROW_DATASET but our annotations in R didn't reflect that.





[jira] [Created] (ARROW-14195) [R] Fix ExecPlan binding annotations

2021-10-01 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14195:
---

 Summary: [R] Fix ExecPlan binding annotations
 Key: ARROW-14195
 URL: https://issues.apache.org/jira/browse/ARROW-14195
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 6.0.0


The minimal build failed with undefined symbols because some of the ExecPlan 
bindings require ARROW_DATASET but our annotations in R didn't reflect that.





[jira] [Closed] (ARROW-14190) [R] Should unify_schemas() allow change of type?

2021-10-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane closed ARROW-14190.

Resolution: Not A Problem

> [R] Should unify_schemas() allow change of type?
> 
>
> Key: ARROW-14190
> URL: https://issues.apache.org/jira/browse/ARROW-14190
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Should {{unify_schemas()}} be able to do schema evolution?  If schemas with 
> different (but compatible) types are combined using {{open_dataset()}}, this 
> works, whereas if done via {{unify_schemas()}}, it results in an error.
> See discussion here: 
> https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220
> {code:r}
> library(dplyr)
> library(arrow)
> # Set up schemas
> schema1 = schema(speed = int32(), dist = int32())
> schema2 = schema(speed = float64(), dist = float64())
> # Try to combine schemas via `unify_schemas()` - results in an error
> unify_schemas(schema1, schema2)
> ## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 
> vs double
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)
> # Create datasets with different schemas and read in via `open_dataset()`
> cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
> cars2 <- Table$create(slice(cars, 26:50), schema = schema2)
> td <- tempfile()
> dir.create(td)
> write_parquet(cars1, paste0(td, "/cars1.parquet"))
> write_parquet(cars2, paste0(td, "/cars2.parquet"))
> new_dataset <- open_dataset(td) 
> new_dataset$schema
> # Schema
> # speed: int32
> # dist: int32
> # 
> # See $metadata for additional Schema metadata
> {code}





[jira] [Commented] (ARROW-14190) [R] Should unify_schemas() allow change of type?

2021-10-01 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423280#comment-17423280
 ] 

Nicola Crane commented on ARROW-14190:
--

Ah, I misread the code, thanks for explaining that [~npr]!

> [R] Should unify_schemas() allow change of type?
> 
>
> Key: ARROW-14190
> URL: https://issues.apache.org/jira/browse/ARROW-14190
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Should {{unify_schemas()}} be able to do schema evolution?  If schemas with 
> different (but compatible) types are combined using {{open_dataset()}}, this 
> works, whereas if done via {{unify_schemas()}}, it results in an error.
> See discussion here: 
> https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220
> {code:r}
> library(dplyr)
> library(arrow)
> # Set up schemas
> schema1 = schema(speed = int32(), dist = int32())
> schema2 = schema(speed = float64(), dist = float64())
> # Try to combine schemas via `unify_schemas()` - results in an error
> unify_schemas(schema1, schema2)
> ## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 
> vs double
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)
> # Create datasets with different schemas and read in via `open_dataset()`
> cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
> cars2 <- Table$create(slice(cars, 26:50), schema = schema2)
> td <- tempfile()
> dir.create(td)
> write_parquet(cars1, paste0(td, "/cars1.parquet"))
> write_parquet(cars2, paste0(td, "/cars2.parquet"))
> new_dataset <- open_dataset(td) 
> new_dataset$schema
> # Schema
> # speed: int32
> # dist: int32
> # 
> # See $metadata for additional Schema metadata
> {code}





[jira] [Commented] (ARROW-14190) [R] Should unify_schemas() allow change of type?

2021-10-01 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423277#comment-17423277
 ] 

Neal Richardson commented on ARROW-14190:
-

open_dataset isn't (by default) trying to unify schemas; it just takes the 
first one it finds, which is why you see int32 as the types. I'd expect that if 
you unified those schemas, the types would be promoted to float64. You could 
pass unify_schemas = TRUE to it and would probably get the error. 
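That promotion can be sketched in plain Python (hypothetical names and a toy promotion table; Arrow's real unification lives in the C++ Schema/Field merge logic):

```python
# Sketch of schema unification with numeric type promotion.
# Hypothetical helper: only illustrates the int32 -> float64 idea,
# not Arrow's actual MergeWith/AddSchema implementation.

# Toy promotion table: pairs of types that widen to a common type.
PROMOTIONS = {
    ("int32", "float64"): "float64",
    ("float64", "int32"): "float64",
}

def unify_schemas(*schemas):
    """Merge field-name -> type dicts, promoting compatible numeric types."""
    unified = {}
    for schema in schemas:
        for field, typ in schema.items():
            if field not in unified or unified[field] == typ:
                unified[field] = typ
            elif (unified[field], typ) in PROMOTIONS:
                unified[field] = PROMOTIONS[(unified[field], typ)]
            else:
                raise ValueError(
                    f"Field {field} has incompatible types: "
                    f"{unified[field]} vs {typ}")
    return unified

schema1 = {"speed": "int32", "dist": "int32"}
schema2 = {"speed": "float64", "dist": "float64"}
print(unify_schemas(schema1, schema2))
# -> {'speed': 'float64', 'dist': 'float64'}
```

With a promotion table, the example from the issue unifies cleanly instead of erroring; types with no entry (e.g. int32 vs utf8) would still fail as they do today.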

> [R] Should unify_schemas() allow change of type?
> 
>
> Key: ARROW-14190
> URL: https://issues.apache.org/jira/browse/ARROW-14190
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Should {{unify_schemas()}} be able to do schema evolution?  If schemas with 
> different (but compatible) types are combined using {{open_dataset()}}, this 
> works, whereas if done via {{unify_schemas()}}, it results in an error.
> See discussion here: 
> https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220
> {code:r}
> library(dplyr)
> library(arrow)
> # Set up schemas
> schema1 = schema(speed = int32(), dist = int32())
> schema2 = schema(speed = float64(), dist = float64())
> # Try to combine schemas via `unify_schemas()` - results in an error
> unify_schemas(schema1, schema2)
> ## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 
> vs double
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)
> # Create datasets with different schemas and read in via `open_dataset()`
> cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
> cars2 <- Table$create(slice(cars, 26:50), schema = schema2)
> td <- tempfile()
> dir.create(td)
> write_parquet(cars1, paste0(td, "/cars1.parquet"))
> write_parquet(cars2, paste0(td, "/cars2.parquet"))
> new_dataset <- open_dataset(td) 
> new_dataset$schema
> # Schema
> # speed: int32
> # dist: int32
> # 
> # See $metadata for additional Schema metadata
> {code}





[jira] [Updated] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13887:
-
Description: 
When reading in a CSV with headers, and also using a schema, we get an error as 
the code tries to read in the header as a line of data.
{code:r}
share_data <- tibble::tibble(
  company = c("AMZN", "GOOG", "BKNG", "TSLA"),
  price = c(3463.12, 2884.38, 2300.46, 732.39)
)

readr::write_csv(share_data, file = "share_data.csv")

share_schema <- schema(
  company = utf8(),
  price = float64()
)

read_csv_arrow("share_data.csv", schema = share_schema)

{code}
{code:java}
Error: Invalid: In CSV column #1: CSV conversion error to double: invalid value 
'price'
/home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, 
quoted, )
/home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
/home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
parser.VisitColumn(col_index, visit) {code}
The correct thing here would have been for the user to supply the argument 
{{skip=1}} to {{read_csv_arrow()}}, but this is not immediately obvious from 
the error message returned from C++.  We should capture the error and instead 
supply our own error message using {{rlang::abort}}, which informs the user of 
the error and then suggests what they can do to prevent it.

 

For similar examples (and their associated PRs) see ARROW-11766 and 
ARROW-12791

  was:
When reading in a CSV with headers, and also using a schema, we get an error as 
the code tries to read in the header as a line of data.
{code:java}
share_data <- tibble::tibble(
  company = c("AMZN", "GOOG", "BKNG", "TSLA"),
  price = c(3463.12, 2884.38, 2300.46, 732.39),
  date = rep(as.Date("2021-09-03"), 4)
)

readr::write_csv(share_data, file = "share_data.csv")

share_schema <- schema(
  company = utf8(),
  price = float64(),
  date = date32()
)

read_csv_arrow("share_data.csv", schema = share_schema)

{code}
{code:java}
Error: Invalid: In CSV column #1: CSV conversion error to double: invalid value 
'price'
/home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, 
quoted, )
/home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
/home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
parser.VisitColumn(col_index, visit) {code}
The correct thing here would have been for the user to supply the argument 
{{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from the 
error message returned from C++.  We should capture the error and instead 
supply our own error message using {{rlang::abort}} which informs the user of 
the error and then suggests what they can do to prevent it.

 

For similar examples (and their associated PRs) see {color:#1d1c1d}ARROW-11766, 
and ARROW-12791{color}


> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:r}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, )
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}}, but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}}, which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see ARROW-11766 and 
> ARROW-12791
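The capture-and-rethrow pattern this issue proposes can be sketched in Python (hypothetical wrapper and message text; the actual fix would use {{rlang::abort}} on the R side):

```python
import re

def read_csv_with_hint(read_fn, path, **kwargs):
    """Run a CSV reader; if the header row appears to have been parsed
    as data, re-raise with a suggestion to pass skip=1.
    Illustrative sketch only, not the arrow implementation."""
    try:
        return read_fn(path, **kwargs)
    except ValueError as exc:
        # The C++ error names the offending value; if it looks like a
        # column label was converted as data, suggest skipping the header.
        if re.search(r"conversion error .* invalid value", str(exc)):
            raise ValueError(
                f"{exc}\nIf your file has a header row, try skip=1 "
                "so the header is not read as data.") from exc
        raise
```

The wrapper preserves the original low-level error while prepending an actionable hint, which is the behaviour ARROW-11766 and ARROW-12791 implemented for similar cases.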





[jira] [Updated] (ARROW-13472) [R] Remove .engine = "duckdb" argument

2021-10-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13472:
--
Fix Version/s: (was: 7.0.0)
   6.0.0

> [R] Remove .engine = "duckdb" argument
> --
>
> Key: ARROW-13472
> URL: https://issues.apache.org/jira/browse/ARROW-13472
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Critical
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> ARROW-12688 added:
>  * A new function {{to_duckdb()}} which registers an Arrow Dataset with 
> DuckDB and returns a dbplyr object that can be used in dplyr pipelines
>  * An {{.engine = "duckdb"}} option in the {{summarise()}} function which 
> calls {{to_duckdb()}} inside {{summarise()}}
> At the moment, the latter is very convenient because {{summarise()}} is not 
> yet natively supported for Arrow Datasets.
> However, this {{.engine = "duckdb"}} option is probably not such a great 
> design for how users should interact with the arrow package in the longer 
> term after native {{summarise()}} support is added. At that point, it will 
> seem strange that this one particular dplyr verb has an {{.engine}} option 
> while the others do not. Adding the option to all the other dplyr verbs also 
> seems like a poor UX design.
> Consider whether we should ultimately have users choose whether to use the 
> Arrow C++ engine or the DuckDB engine by passing an {{.engine}} argument to 
> the {{collect()}} or {{compute()}} function, as [~jonkeane] suggested in 
> these comments. {{collect()}} would return a tibble whereas {{compute()}} 
> would return an Arrow Table.





[jira] [Resolved] (ARROW-13727) [Doc][Cookbook] Appending Tables to an existing Table - Python

2021-10-01 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-13727.
---
Resolution: Fixed

> [Doc][Cookbook] Appending Tables to an existing Table - Python
> --
>
> Key: ARROW-13727
> URL: https://issues.apache.org/jira/browse/ARROW-13727
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Resolved] (ARROW-13725) [Doc][Cookbook] Combining and Harmonizing Schemas - Python

2021-10-01 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-13725.
---
Resolution: Fixed

> [Doc][Cookbook] Combining and Harmonizing Schemas - Python
> --
>
> Key: ARROW-13725
> URL: https://issues.apache.org/jira/browse/ARROW-13725
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-14188) link error on ubuntu

2021-10-01 Thread Amir Ghamarian (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Ghamarian updated ARROW-14188:
---
Description: 
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 
{code:java}
find_package(Arrow CONFIG REQUIRED)
get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
find_package(Thrift CONFIG REQUIRED)
{code}
and the linking: 

 
{code:java}
target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
arrow_static parquet_static )
{code}
 

 I get a lot of errors

 

  was:
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 

```

find_package(Arrow CONFIG REQUIRED)
 get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
 find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
 find_package(Thrift CONFIG REQUIRED)

```

and the linking: 

 

```

target_link_libraries(vision_obj
 PUBLIC
 
 thrift::thrift
 re2::re2 arrow_static parquet_static
 )

```

 I get a lot of errors

 


> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: linkerror.txt
>
>
> I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
> uses parquet fails by giving link errors of undefined reference.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Assigned] (ARROW-5530) [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior

2021-10-01 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-5530:
-

Assignee: (was: Rok Mihevc)

> [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null 
> behavior
> 
>
> Key: ARROW-5530
> URL: https://issues.apache.org/jira/browse/ARROW-5530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>






[jira] [Updated] (ARROW-14188) link error on ubuntu

2021-10-01 Thread Amir Ghamarian (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Ghamarian updated ARROW-14188:
---
Description: 
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 

```

find_package(Arrow CONFIG REQUIRED)
 get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
 find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
 find_package(Thrift CONFIG REQUIRED)

```

and the linking: 

 

```

target_link_libraries(vision_obj
 PUBLIC
 
 thrift::thrift
 re2::re2 arrow_static parquet_static
 )

```

 I get a lot of errors

 

  was:
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 

```

find_package(Arrow CONFIG REQUIRED)
 get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
 find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
 find_package(Thrift CONFIG REQUIRED)

```

and the linking: 

```

target_link_libraries(vision_obj
 PUBLIC
 
 thrift::thrift
 re2::re2 arrow_static parquet_static
 )

```

 I get a lot of errors

 


> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: linkerror.txt
>
>
> I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
> uses parquet fails by giving link errors of undefined reference.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> ```
> find_package(Arrow CONFIG REQUIRED)
>  get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
>  find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
>  find_package(Thrift CONFIG REQUIRED)
> ```
> and the linking: 
>  
> ```
> target_link_libraries(vision_obj
>  PUBLIC
>  
>  thrift::thrift
>  re2::re2 arrow_static parquet_static
>  )
> ```
>  I get a lot of errors
>  





[jira] [Updated] (ARROW-14188) link error on ubuntu

2021-10-01 Thread Amir Ghamarian (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Ghamarian updated ARROW-14188:
---
Description: 
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 

```

find_package(Arrow CONFIG REQUIRED)
 get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
 find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
 find_package(Thrift CONFIG REQUIRED)

```

and the linking: 

```

target_link_libraries(vision_obj
 PUBLIC
 
 thrift::thrift
 re2::re2 arrow_static parquet_static
 )

```

 I get a lot of errors

 

  was:
I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
uses parquet fails by giving link errors of undefined reference.

The same code works on OSX but fails on ubuntu.

My cmake snippet is as follows:

 

```

find_package(Arrow CONFIG REQUIRED)
get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
find_package(Thrift CONFIG REQUIRED)

```

and the linking: 

```

target_link_libraries(vision_obj
 PUBLIC

 thrift::thrift
 re2::re2 arrow_static parquet_static
 )```

 I get  a lot of errors

 


> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: linkerror.txt
>
>
> I used vcpkg to install arrow versions 4 and 5, trying to build my code that 
> uses parquet fails by giving link errors of undefined reference.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> ```
> find_package(Arrow CONFIG REQUIRED)
>  get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
>  find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
>  find_package(Thrift CONFIG REQUIRED)
> ```
> and the linking: 
> ```
> target_link_libraries(vision_obj
>  PUBLIC
>  
>  thrift::thrift
>  re2::re2 arrow_static parquet_static
>  )
> ```
>  I get a lot of errors
>  





[jira] [Updated] (ARROW-14194) [Docs] Improve vertical spacing in the sphinx API docs

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14194:
---
Labels: pull-request-available  (was: )

> [Docs] Improve vertical spacing in the sphinx API docs
> --
>
> Key: ARROW-14194
> URL: https://issues.apache.org/jira/browse/ARROW-14194
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Brought up by [~apitrou], the vertical spacing in the C++ API docs can be 
> improved (see e.g. 
> https://arrow.apache.org/docs/cpp/api/datatype.html#time-related). A quick 
> fix would be to reduce the vertical spacing within a method, so that there is 
> more spacing between methods than between paragraphs within one method.





[jira] [Created] (ARROW-14194) [Docs] Improve vertical spacing in the sphinx API docs

2021-10-01 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14194:
-

 Summary: [Docs] Improve vertical spacing in the sphinx API docs
 Key: ARROW-14194
 URL: https://issues.apache.org/jira/browse/ARROW-14194
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 6.0.0


Brought up by [~apitrou], the vertical spacing in the C++ API docs can be 
improved (see e.g. 
https://arrow.apache.org/docs/cpp/api/datatype.html#time-related). A quick fix 
would be to reduce the vertical spacing within a method, so that there is more 
spacing between methods than between paragraphs within one method.





[jira] [Resolved] (ARROW-13685) [C++] Cannot write dataset to S3FileSystem if bucket already exists

2021-10-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-13685.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11136
[https://github.com/apache/arrow/pull/11136]

> [C++] Cannot write dataset to S3FileSystem if bucket already exists
> ---
>
> Key: ARROW-13685
> URL: https://issues.apache.org/jira/browse/ARROW-13685
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: Caleb Overman
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> I'm trying to write a parquet file to an existing S3 bucket using the new 
> S3FileSystem interface. However, this is failing with an AWS Access Denied 
> error (I do have the necessary access). It appears to be trying to recreate 
> the bucket, which already exists.
> {code:python}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
> s3 = fs.S3FileSystem(region="us-west-2")
> table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})
> ds.write_dataset(
> table,
> "my-bucket/test.parquet",
> format="parquet",
> filesystem=s3,
> ){code}
> {code:java}
> OSError: When creating bucket 'my-bucket': AWS Error [code 15]: Access Denied
> {code}
> I'm seeing the same behavior using {{S3FileSystem.create_dir}} when 
> {{recursive=True}}.
> {code:python}
> s3.create_dir("my-bucket/test_dir/", recursive=True) # Fails
> s3.create_dir("my-bucket/test_dir/", recursive=False) # Succeeds
> {code}
>  





[jira] [Updated] (ARROW-14191) [C++][Dataset] Dataset writes should respect backpressure

2021-10-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-14191:

Labels: pull-request-available query-engine  (was: kernel 
pull-request-available query-engine)

> [C++][Dataset] Dataset writes should respect backpressure
> -
>
> Key: ARROW-14191
> URL: https://issues.apache.org/jira/browse/ARROW-14191
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If the write destination is slow, the dataset writer should back off and 
> apply backpressure to pause the reader.  This will allow simple dataset API 
> scans to operate on large out-of-core datasets.
> This is dependent on ARROW-13611, which adds a backpressure feature for 
> regular scanning (but not writing data), and on ARROW-13542, which moves the 
> dataset write to a node in the exec plan.
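The back-off behaviour described here can be sketched with a bounded queue: the producer (scanner) blocks automatically whenever the consumer (writer) falls behind. This is a plain-Python illustration of the backpressure idea, not Arrow's actual mechanism:

```python
import threading
import queue

def run_pipeline(batches, write_batch, max_inflight=2):
    """Scan batches on a worker thread and write them on this thread.
    The bounded queue pauses the scanner (Queue.put blocks) once
    max_inflight batches are buffered -- i.e. backpressure."""
    buf = queue.Queue(maxsize=max_inflight)  # bounded buffer => backpressure
    DONE = object()                          # sentinel marking end of scan

    def scanner():
        for b in batches:
            buf.put(b)          # blocks while the writer is slow
        buf.put(DONE)

    t = threading.Thread(target=scanner)
    t.start()
    written = []
    while (b := buf.get()) is not DONE:
        write_batch(b)          # slow destination drains the queue
        written.append(b)
    t.join()
    return written
```

With an unbounded buffer the scanner would race ahead of a slow destination and accumulate the whole dataset in memory; the bounded queue keeps memory use proportional to `max_inflight`.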





[jira] [Updated] (ARROW-14191) [C++][Dataset] Dataset writes should respect backpressure

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14191:
---
Labels: kernel pull-request-available query-engine  (was: kernel 
query-engine)

> [C++][Dataset] Dataset writes should respect backpressure
> -
>
> Key: ARROW-14191
> URL: https://issues.apache.org/jira/browse/ARROW-14191
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, pull-request-available, query-engine
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the write destination is slow, the dataset writer should back off and 
> apply backpressure to pause the reader.  This will allow simple dataset API 
> scans to operate on large out-of-core datasets.
> This is dependent on ARROW-13611, which adds a backpressure feature for 
> regular scanning (but not writing data), and on ARROW-13542, which moves the 
> dataset write to a node in the exec plan.





[jira] [Created] (ARROW-14193) [C++][Gandiva] Implement INSTR function

2021-10-01 Thread Augusto Alves Silva (Jira)
Augusto Alves Silva created ARROW-14193:
---

 Summary: [C++][Gandiva] Implement INSTR function
 Key: ARROW-14193
 URL: https://issues.apache.org/jira/browse/ARROW-14193
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Augusto Alves Silva


Returns the position of the first occurrence of {{substr}} in {{str}}. Returns 
{{null}} if either argument is {{null}}, and returns {{0}} if {{substr}} is 
not found in {{str}}.
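Those semantics can be sketched in Python (illustrative reference only; the Gandiva version would be a C++ precompiled function operating on UTF-8 data):

```python
def instr(s, substr):
    """INSTR semantics: None if either argument is None, 0 if substr
    is absent, otherwise the 1-based position of the first occurrence
    of substr in s. Sketch of the expected behaviour, not Gandiva code."""
    if s is None or substr is None:
        return None
    pos = s.find(substr)   # str.find is 0-based, returns -1 if absent
    return pos + 1         # -1 -> 0 (not found), k -> k + 1

print(instr("hello world", "world"))  # 7
```

Note the 1-based indexing convention (matching SQL's INSTR), so "not found" (0) never collides with a valid position.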





[jira] [Commented] (ARROW-13611) [C++] Scanning datasets does not enforce back pressure

2021-10-01 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423248#comment-17423248
 ] 

Weston Pace commented on ARROW-13611:
-

That is actually what I've been working on today (shame on me for not toggling 
Start Progress).  Right now I have a PR up that adds backpressure back in for 
unordered scans.  I expect to have a PR ready that will add backpressure for 
ordered scans soon (I am hoping tomorrow).  If you are interested I can 
probably get you a test wheel (are you installing with conda or pypi?) sometime 
next week, and I would appreciate it if you could let me know whether it solves 
your issue.

> [C++] Scanning datasets does not enforce back pressure
> --
>
> Key: ARROW-13611
> URL: https://issues.apache.org/jira/browse/ARROW-13611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0, 4.0.1
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I have a simple test case where I scan the batches of a 4GB dataset and print 
> out the currently used memory:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
> num_rows = 0
> for batch in dataset.to_batches():
> print(pa.total_allocated_bytes())
> num_rows += batch.num_rows
> print(num_rows)
> {code}
> In pyarrow 3.0.0 this consumes just over 5MB.  In pyarrow 4.0.0 and 5.0.0 
> this consumes multiple GB of RAM.





[jira] [Assigned] (ARROW-13611) [C++] Scanning datasets does not enforce back pressure

2021-10-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-13611:
---

Assignee: Weston Pace

> [C++] Scanning datasets does not enforce back pressure
> --
>
> Key: ARROW-13611
> URL: https://issues.apache.org/jira/browse/ARROW-13611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0, 4.0.1
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I have a simple test case where I scan the batches of a 4GB dataset and print 
> out the currently used memory:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
> num_rows = 0
> for batch in dataset.to_batches():
> print(pa.total_allocated_bytes())
> num_rows += batch.num_rows
> print(num_rows)
> {code}
> In pyarrow 3.0.0 this consumes just over 5MB.  In pyarrow 4.0.0 and 5.0.0 
> this consumes multiple GB of RAM.





[jira] [Updated] (ARROW-13611) [C++] Scanning datasets does not enforce back pressure

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13611:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [C++] Scanning datasets does not enforce back pressure
> --
>
> Key: ARROW-13611
> URL: https://issues.apache.org/jira/browse/ARROW-13611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0, 4.0.1
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have a simple test case where I scan the batches of a 4GB dataset and print 
> out the currently used memory:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
> num_rows = 0
> for batch in dataset.to_batches():
> print(pa.total_allocated_bytes())
> num_rows += batch.num_rows
> print(num_rows)
> {code}
> In pyarrow 3.0.0 this consumes just over 5MB.  In pyarrow 4.0.0 and 5.0.0 
> this consumes multiple GB of RAM.





[jira] [Created] (ARROW-14192) [C++][Dataset] Backpressure broken on ordered scans

2021-10-01 Thread Weston Pace (Jira)
Weston Pace created ARROW-14192:
---

 Summary: [C++][Dataset] Backpressure broken on ordered scans
 Key: ARROW-14192
 URL: https://issues.apache.org/jira/browse/ARROW-14192
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


ARROW-13611 adds a backpressure mechanism that works for unordered scans.  
However, this backpressure is not properly applied on ordered (i.e. ScanBatches 
rather than ScanBatchesUnordered) scans.

The fix will be to modify the merge generator used on ordered scans so that, 
while it still reads ahead somewhat across several files, it never delivers 
batches except from the file currently being read.
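That constraint can be sketched as a generator (hypothetical names; the real change is in Arrow's C++ merge generator): upcoming files may be read ahead into a buffer, but batches are only ever delivered from the current file, preserving order.

```python
def ordered_scan(files, readahead=1):
    """Deliver batches strictly in file order while still reading
    ahead: up to `readahead` upcoming files are buffered early, but
    their batches are held back until their file's turn comes.
    Sketch only -- the real generator works asynchronously."""
    buffered = {}  # file index -> batches read ahead of time
    for i in range(len(files)):
        # Read ahead into the next files (buffered, never delivered early).
        for j in range(i + 1, min(i + 1 + readahead, len(files))):
            buffered.setdefault(j, list(files[j]))
        # Only the current file's batches are actually delivered.
        batches = buffered.pop(i) if i in buffered else list(files[i])
        for batch in batches:
            yield batch
```

Bounding `readahead` is what lets backpressure propagate: the scan never buffers arbitrarily many files' worth of out-of-order batches.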





[jira] [Created] (ARROW-14191) [C++][Dataset] Dataset writes should respect backpressure

2021-10-01 Thread Weston Pace (Jira)
Weston Pace created ARROW-14191:
---

 Summary: [C++][Dataset] Dataset writes should respect backpressure
 Key: ARROW-14191
 URL: https://issues.apache.org/jira/browse/ARROW-14191
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Weston Pace
Assignee: Weston Pace


If the write destination is slow then the dataset writer should back off and 
apply backpressure to pause the reader.  This will allow simple dataset API 
scans to operate on large out-of-core datasets.
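A minimal sketch of the idea, assuming a bounded queue between scanner and writer (the names `scan_and_write`, `write_batch`, and `max_queued` are hypothetical, not the Arrow API): once the destination falls behind, putting into the full queue blocks the producer instead of buffering unboundedly.

```python
import queue
import threading

def scan_and_write(batches, write_batch, max_queued=4):
    """Producer puts scanned batches; the bounded queue blocks (applies
    backpressure) once `max_queued` batches are waiting on the writer."""
    q = queue.Queue(maxsize=max_queued)
    DONE = object()  # sentinel to signal end of the scan

    def writer():
        while True:
            item = q.get()
            if item is DONE:
                break
            write_batch(item)  # the (possibly slow) destination drains the queue

    t = threading.Thread(target=writer)
    t.start()
    for batch in batches:
        q.put(batch)  # blocks when the queue is full -> pauses the reader
    q.put(DONE)
    t.join()

written = []
scan_and_write(range(10), written.append, max_queued=2)
print(written)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With only `max_queued` batches ever in flight, peak memory is bounded regardless of dataset size, which is what makes out-of-core writes feasible.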

This is dependent on ARROW-13611 which adds a backpressure feature for regular 
scanning (but not writing data) and on ARROW-13542 which moves the dataset 
write to a node in the exec plan.





[jira] [Updated] (ARROW-14187) [Python] File reading regression

2021-10-01 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14187:
--
Component/s: Python

> [Python] File reading regression
> 
>
> Key: ARROW-14187
> URL: https://issues.apache.org/jira/browse/ARROW-14187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jonathan Keane
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After ARROW-6626 was merged, we're seeing a slowdown in file reading 
> benchmarks:
> One example: 
> https://conbench.ursa.dev/benchmarks/b92e91fe8a4041148360d9433552277b/





[jira] [Assigned] (ARROW-14187) [Python] File reading regression

2021-10-01 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina reassigned ARROW-14187:
-

Assignee: Alessandro Molina

> [Python] File reading regression
> 
>
> Key: ARROW-14187
> URL: https://issues.apache.org/jira/browse/ARROW-14187
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jonathan Keane
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After ARROW-6626 was merged, we're seeing a slowdown in file reading 
> benchmarks:
> One example: 
> https://conbench.ursa.dev/benchmarks/b92e91fe8a4041148360d9433552277b/





[jira] [Updated] (ARROW-14187) [Python] File reading regression

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14187:
---
Labels: pull-request-available  (was: )

> [Python] File reading regression
> 
>
> Key: ARROW-14187
> URL: https://issues.apache.org/jira/browse/ARROW-14187
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After ARROW-6626 was merged, we're seeing a slowdown in file reading 
> benchmarks:
> One example: 
> https://conbench.ursa.dev/benchmarks/b92e91fe8a4041148360d9433552277b/





[jira] [Created] (ARROW-14190) [R] Should unify_schemas() allow change of type?

2021-10-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14190:


 Summary: [R] Should unify_schemas() allow change of type?
 Key: ARROW-14190
 URL: https://issues.apache.org/jira/browse/ARROW-14190
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Should {{unify_schemas()}} be able to do schema evolution?  If schemas with 
different (but compatible) types are combined using {{open_dataset()}}, this 
works, whereas if done via {{unify_schemas()}}, it results in an error.

See discussion here: 
https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220


{code:r}
library(dplyr)
library(arrow)

# Set up schemas
schema1 = schema(speed = int32(), dist = int32())
schema2 = schema(speed = float64(), dist = float64())

# Try to combine schemas via `unify_schemas()` - results in an error
unify_schemas(schema1, schema2)
## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 vs double
## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)

# Create datasets with different schemas and read in via `open_dataset()`
cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
cars2 <- Table$create(slice(cars, 26:50), schema = schema2)

td <- tempfile()
dir.create(td)

write_parquet(cars1, paste0(td, "/cars1.parquet"))
write_parquet(cars2, paste0(td, "/cars2.parquet"))

new_dataset <- open_dataset(td) 

new_dataset$schema
# Schema
# speed: int32
# dist: int32
# 
# See $metadata for additional Schema metadata
{code}






[jira] [Updated] (ARROW-14189) [Docs] Add version dropdown to the sphinx docs

2021-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14189:
---
Labels: pull-request-available  (was: )

> [Docs] Add version dropdown to the sphinx docs
> --
>
> Key: ARROW-14189
> URL: https://issues.apache.org/jira/browse/ARROW-14189
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As part of ARROW-13260, we need to add a version dropdown to the sphinx theme 
> layout.
> There is work happening on this in the upstream theme 
> (https://github.com/pydata/pydata-sphinx-theme/pull/436), but since that is 
> not yet merged/released, I will "backport" it to our own docs.




