[jira] [Updated] (ARROW-13717) [Doc][Cookbook] Creating Arrays - Python

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13717:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Creating Arrays - Python
> 
>
> Key: ARROW-13717
> URL: https://issues.apache.org/jira/browse/ARROW-13717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13881) [Python] Error message says "Please use a release of Arrow Flight built with gRPC 1.27 or higher." although I'm using gRPC 1.39

2021-09-02 Thread Oliver Mayer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oliver Mayer updated ARROW-13881:
-
Summary: [Python] Error message says "Please use a release of Arrow Flight 
built with gRPC 1.27 or higher." although I'm using gRPC 1.39  (was: Error 
message says "Please use a release of Arrow Flight built with gRPC 1.27 or 
higher." although I'm using gRPC 1.39)

> [Python] Error message says "Please use a release of Arrow Flight built with 
> gRPC 1.27 or higher." although I'm using gRPC 1.39
> ---
>
> Key: ARROW-13881
> URL: https://issues.apache.org/jira/browse/ARROW-13881
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Packaging, Python
>Affects Versions: 5.0.0
>Reporter: Oliver Mayer
>Priority: Major
> Attachments: image-2021-09-03-11-10-01-004.png
>
>
> When initializing the FlightClient I get the following error message:
> {quote}ArrowNotImplementedError: Using encryption with server verification 
> disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
> 1.27 or higher.
> {quote}
> Traceback:
> !image-2021-09-03-11-10-01-004.png!
>  
> Versions: Python 3.9.5
> {noformat}
> # NameVersion   Build  Channel
> arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
> pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
> grpc-cpp  1.39.1   h1072645_0conda-forge
> grpcio1.38.1   py39hb76b349_0conda-forge
> {noformat}
>  
> Code is similar to:
> {code:java}
> import pandas
> from pyarrow import flight
> class XYZClient():
> __scheme = "grpc+tls"
> __token = None
> __flightclient = None
> 
> def __init__(self, username, password, hostname="some-host", 
> flightport=32010):
> flight_client = flight.FlightClient(
> "{}://{}:{}".format(self.__scheme, hostname, flightport),
> middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
> disable_server_verification=True
> )
> self.__token = flight_client.authenticate_basic_token(
> username, password)
> self.__flightclient = flight_client
> {code}
> There has been a similar issue 
> (https://issues.apache.org/jira/browse/ARROW-11695#) for an earlier version 
> of pyarrow that has been marked as resolved.

>  
>  





[jira] [Commented] (ARROW-13138) [C++] Implement kernel to extract datetime components (year, month, day, etc) from date type objects

2021-09-02 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409253#comment-17409253
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-13138:


PR https://github.com/apache/arrow/pull/11075

> [C++] Implement kernel to extract datetime components (year, month, day, etc) 
> from date type objects
> 
>
> Key: ARROW-13138
> URL: https://issues.apache.org/jira/browse/ARROW-13138
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nic Crane
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>  Labels: Kernels, kernel
> Fix For: 6.0.0
>
>
> ARROW-11759 implemented extraction of datetime components for timestamp 
> objects; please can we have the equivalent extraction functions implemented 
> for date objects too?





[jira] [Updated] (ARROW-13882) [C++] Add compute function min_max support for more types

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13882:

Labels: types  (was: )

> [C++] Add compute function min_max support for more types
> -
>
> Key: ARROW-13882
> URL: https://issues.apache.org/jira/browse/ARROW-13882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> The min_max compute function does not support the following types but should:
>  - time32
>  - time64
>  - timestamp
>  - null
>  - binary
>  - large_binary
>  - fixed_size_binary
>  - string
>  - large_string





[jira] [Updated] (ARROW-13880) [C++] Compute function sort_indices does not support timestamps with time zones

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13880:

Labels: types  (was: )

> [C++] Compute function sort_indices does not support timestamps with time 
> zones
> ---
>
> Key: ARROW-13880
> URL: https://issues.apache.org/jira/browse/ARROW-13880
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> All other temporal types appear to be supported





[jira] [Updated] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13879:

Labels: types  (was: )

> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types, but the functions extract_regex, 
> match_substring, match_substring_regex, split_pattern, and 
> split_pattern_regex do not.
> They either should all accept binary types or none of them should accept 
> binary types.





[jira] [Updated] (ARROW-13878) [C++] Add fixed_size_binary support to compute functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13878:

Labels: types  (was: )

> [C++] Add fixed_size_binary support to compute functions
> 
>
> Key: ARROW-13878
> URL: https://issues.apache.org/jira/browse/ARROW-13878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> The following compute functions do not support fixed_size_binary but do 
> support binary:
>  - binary_length
>  - binary_replace_slice
>  - count_substring
>  - find_substring
>  - find_substring_regex





[jira] [Updated] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13877:

Labels: types  (was: )

> [C++] Added support for fixed sized list to compute functions that process 
> lists
> 
>
> Key: ARROW-13877
> URL: https://issues.apache.org/jira/browse/ARROW-13877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> The following functions do not support fixed size list (and should):
>  - list_flatten
>  - list_parent_indices (one could argue this doesn't need to be supported 
> since this should be obvious and fixed_size_list doesn't have an indices 
> array)
>  - list_value_length (should be easy)
> For reference, the following functions do correctly consume fixed_size_list 
> (there may be more, this isn't an exhaustive list):
>  - count
>  - drop_null
>  - is_null
>  - is_valid





[jira] [Updated] (ARROW-13873) [C++] Duplicate functions array_filter/array_take and filter/take

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13873:

Labels: types  (was: )

> [C++] Duplicate functions array_filter/array_take and filter/take
> -
>
> Key: ARROW-13873
> URL: https://issues.apache.org/jira/browse/ARROW-13873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> This has been explained to me as a metafunction and a backend but in that 
> case the backend should not be registered with the function registry.  Note 
> both:
> {code:cpp}
> const FunctionDoc filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input at positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"input", "selection_filter"}, "FilterOptions");
> {code}
> and
> {code:cpp}
> const FunctionDoc array_filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input `array` at 
> positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"array", "selection_filter"}, "FilterOptions");
> {code}
> which seems wrong as well.
> Also sort_indices / array_sort_indices





[jira] [Updated] (ARROW-13876) [C++] Uniform null handling in compute functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13876:

Labels: types  (was: )

> [C++] Uniform null handling in compute functions
> 
>
> Key: ARROW-13876
> URL: https://issues.apache.org/jira/browse/ARROW-13876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: types
>
> The compute functions today have mixed support for null types.
> Unary arithmetic functions (e.g. abs) don't support null arrays.
> Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
> null) but not two (i.e. null + null); they do, however, support both values 
> being null (e.g. [null] + [null] = [null] when dtype=int32, but not when 
> dtype=null)
> sort_indices should support null arrays.
> Some functions do forward null arrays:
>  - unique
> Some functions output a non-null type given null inputs
> - is_null (=> boolean)
> - is_valid (=> boolean)
> - value_counts (=> struct)
> - dictionary_encode (=> dictionary)
> - count (=> int64)
> Some functions throw an error other than "not implemented"
>  - list_parent_indices





[jira] [Updated] (ARROW-13873) [C++] Duplicate functions array_filter/array_take and filter/take

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13873:

Labels:   (was: types)

> [C++] Duplicate functions array_filter/array_take and filter/take
> -
>
> Key: ARROW-13873
> URL: https://issues.apache.org/jira/browse/ARROW-13873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This has been explained to me as a metafunction and a backend but in that 
> case the backend should not be registered with the function registry.  Note 
> both:
> {code:cpp}
> const FunctionDoc filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input at positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"input", "selection_filter"}, "FilterOptions");
> {code}
> and
> {code:cpp}
> const FunctionDoc array_filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input `array` at 
> positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"array", "selection_filter"}, "FilterOptions");
> {code}
> which seems wrong as well.
> Also sort_indices / array_sort_indices





[jira] [Updated] (ARROW-13881) Error message says "Please use a release of Arrow Flight built with gRPC 1.27 or higher." although I'm using gRPC 1.39

2021-09-02 Thread Oliver Mayer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oliver Mayer updated ARROW-13881:
-
Description: 
When initializing the FlightClient I get the following error message:
{quote}ArrowNotImplementedError: Using encryption with server verification 
disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
1.27 or higher.
{quote}
Traceback:

!image-2021-09-03-11-10-01-004.png!

 

Versions: Python 3.9.5
{noformat}
# NameVersion   Build  Channel
arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
grpc-cpp  1.39.1   h1072645_0conda-forge
grpcio1.38.1   py39hb76b349_0conda-forge
{noformat}
 

Code is similar to:
{code:java}
import pandas
from pyarrow import flight

class XYZClient():
__scheme = "grpc+tls"
__token = None
__flightclient = None

def __init__(self, username, password, hostname="some-host", 
flightport=32010):
flight_client = flight.FlightClient(
"{}://{}:{}".format(self.__scheme, hostname, flightport),
middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
disable_server_verification=True
)
self.__token = flight_client.authenticate_basic_token(
username, password)
self.__flightclient = flight_client
{code}
There has been a similar issue 
(https://issues.apache.org/jira/browse/ARROW-11695#) for an earlier version of 
pyarrow that has been marked as resolved.

 

 

  was:
When initializing the FlightClient I get the following error message:
{quote}ArrowNotImplementedError: Using encryption with server verification 
disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
1.27 or higher.
{quote}
Traceback:

!image-2021-09-03-11-10-01-004.png!

 

Versions: Python 3.9.5
{noformat}
# NameVersion   Build  Channel
arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
grpc-cpp  1.39.1   h1072645_0conda-forge
grpcio1.38.1   py39hb76b349_0conda-forge
{noformat}
 

Code is similar to:
{code:java}
import pandas
from pyarrow import flight

class XYZClient():
__scheme = "grpc+tls"
__token = None
__flightclient = None

def __init__(self, username, password, 
hostname="some-host", flightport=32010):
flight_client = flight.FlightClient(
"{}://{}:{}".format(self.__scheme, hostname, flightport),
middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
disable_server_verification=True
)
self.__token = flight_client.authenticate_basic_token(
username, password)
self.__flightclient = flight_client
{code}
There has been a similar issue 
(https://issues.apache.org/jira/browse/ARROW-11695#) for an earlier version of 
pyarrow that has been marked as resolved.

 

 


> Error message says "Please use a release of Arrow Flight built with gRPC 1.27 
> or higher." although I'm using gRPC 1.39
> --
>
> Key: ARROW-13881
> URL: https://issues.apache.org/jira/browse/ARROW-13881
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Packaging, Python
>Affects Versions: 5.0.0
>Reporter: Oliver Mayer
>Priority: Major
> Attachments: image-2021-09-03-11-10-01-004.png
>
>
> When initializing the FlightClient I get the following error message:
> {quote}ArrowNotImplementedError: Using encryption with server verification 
> disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
> 1.27 or higher.
> {quote}
> Traceback:
> !image-2021-09-03-11-10-01-004.png!
>  
> Versions: Python 3.9.5
> {noformat}
> # NameVersion   Build  Channel
> arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
> pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
> grpc-cpp  1.39.1   h1072645_0conda-forge
> grpcio1.38.1   py39hb76b349_0conda-forge
> {noformat}
>  
> Code is similar to:
> {code:java}
> import pandas
> from pyarrow import flight
> class XYZClient():
> __scheme = "grpc+tls"
> __token = None
> __flightclient = None
> 
> def __init__(self, username, password, hostname="some-host", 
> flightport=32010):
> flight_client = flight.FlightClient(
> 

[jira] [Updated] (ARROW-13881) Error message says "Please use a release of Arrow Flight built with gRPC 1.27 or higher." although I'm using gRPC 1.39

2021-09-02 Thread Oliver Mayer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oliver Mayer updated ARROW-13881:
-
Description: 
When initializing the FlightClient I get the following error message:
{quote}ArrowNotImplementedError: Using encryption with server verification 
disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
1.27 or higher.
{quote}
Traceback:

!image-2021-09-03-11-10-01-004.png!

 

Versions: Python 3.9.5
{noformat}
# NameVersion   Build  Channel
arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
grpc-cpp  1.39.1   h1072645_0conda-forge
grpcio1.38.1   py39hb76b349_0conda-forge
{noformat}
 

Code is similar to:
{code:java}
import pandas
from pyarrow import flight

class XYZClient():
__scheme = "grpc+tls"
__token = None
__flightclient = None

def __init__(self, username, password, 
hostname="some-host", flightport=32010):
flight_client = flight.FlightClient(
"{}://{}:{}".format(self.__scheme, hostname, flightport),
middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
disable_server_verification=True
)
self.__token = flight_client.authenticate_basic_token(
username, password)
self.__flightclient = flight_client
{code}
There has been a similar issue 
(https://issues.apache.org/jira/browse/ARROW-11695#) for an earlier version of 
pyarrow that has been marked as resolved.

 

 

  was:
When initializing the FlightClient I get the following error message:
{quote}ArrowNotImplementedError: Using encryption with server verification 
disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
1.27 or higher.
{quote}
Traceback:

!image-2021-09-03-11-10-01-004.png!

 

Verions: Python 3.9.5

 
{noformat}
# NameVersion   Build  Channel
arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
grpc-cpp  1.39.1   h1072645_0conda-forge
grpcio1.38.1   py39hb76b349_0conda-forge
{noformat}
 

 

Code is similar to:
{code:java}
import pandas
from pyarrow import flight

class XYZClient():
__scheme = "grpc+tls"
__token = None
__flightclient = None

def __init__(self, username, password, 
hostname="some-host", flightport=32010):
flight_client = flight.FlightClient(
"{}://{}:{}".format(self.__scheme, hostname, flightport),
middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
disable_server_verification=True
)
self.__token = flight_client.authenticate_basic_token(
username, password)
self.__flightclient = flight_client
{code}
There has been a similar issue 
(https://issues.apache.org/jira/browse/ARROW-11695) for an earlier version 
of pyarrow that has been marked as resolved.

 

 


> Error message says "Please use a release of Arrow Flight built with gRPC 1.27 
> or higher." although I'm using gRPC 1.39
> --
>
> Key: ARROW-13881
> URL: https://issues.apache.org/jira/browse/ARROW-13881
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Packaging, Python
>Affects Versions: 5.0.0
>Reporter: Oliver Mayer
>Priority: Major
> Attachments: image-2021-09-03-11-10-01-004.png
>
>
> When initializing the FlightClient I get the following error message:
> {quote}ArrowNotImplementedError: Using encryption with server verification 
> disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
> 1.27 or higher.
> {quote}
> Traceback:
> !image-2021-09-03-11-10-01-004.png!
>  
> Versions: Python 3.9.5
> {noformat}
> # NameVersion   Build  Channel
> arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
> pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
> grpc-cpp  1.39.1   h1072645_0conda-forge
> grpcio1.38.1   py39hb76b349_0conda-forge
> {noformat}
>  
> Code is similar to:
> {code:java}
> import pandas
> from pyarrow import flight
> class XYZClient():
> __scheme = "grpc+tls"
> __token = None
> __flightclient = None
> 
> def __init__(self, username, password, 
> hostname="some-host", flightport=32010):
> flight_client = flight.FlightClient(
> "{}://{}:{}".format(self.__scheme, hostname, 

[jira] [Updated] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13879:

Description: 
The functions count_substring, count_substring_regex, find_substring, and 
find_substring_regex all accept binary types, but the functions extract_regex, 
match_substring, match_substring_regex, split_pattern, and split_pattern_regex 
do not.

They either should all accept binary types or none of them should accept binary 
types.

  was:
The functions count_substring, count_substring_regex, find_substring, and 
find_substring_regex all accept binary types, but the functions extract_regex, 
match_substring, and match_substring_regex do not.

They either should all accept binary types or none of them should accept binary 
types.


> [C++] Mixed support for binary types in regex functions
> ---
>
> Key: ARROW-13879
> URL: https://issues.apache.org/jira/browse/ARROW-13879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> The functions count_substring, count_substring_regex, find_substring, and 
> find_substring_regex all accept binary types, but the functions extract_regex, 
> match_substring, match_substring_regex, split_pattern, and 
> split_pattern_regex do not.
> They either should all accept binary types or none of them should accept 
> binary types.





[jira] [Created] (ARROW-13882) [C++] Add compute function min_max support for more types

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13882:
---

 Summary: [C++] Add compute function min_max support for more types
 Key: ARROW-13882
 URL: https://issues.apache.org/jira/browse/ARROW-13882
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The min_max compute function does not support the following types but should:

 - time32
 - time64
 - timestamp
 - null
 - binary
 - large_binary
 - fixed_size_binary
 - string
 - large_string





[jira] [Created] (ARROW-13881) Error message says "Please use a release of Arrow Flight built with gRPC 1.27 or higher." although I'm using gRPC 1.39

2021-09-02 Thread Oliver Mayer (Jira)
Oliver Mayer created ARROW-13881:


 Summary: Error message says "Please use a release of Arrow Flight 
built with gRPC 1.27 or higher." although I'm using gRPC 1.39
 Key: ARROW-13881
 URL: https://issues.apache.org/jira/browse/ARROW-13881
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Packaging, Python
Affects Versions: 5.0.0
Reporter: Oliver Mayer
 Attachments: image-2021-09-03-11-10-01-004.png

When initializing the FlightClient I get the following error message:
{quote}ArrowNotImplementedError: Using encryption with server verification 
disabled is unsupported. Please use a release of Arrow Flight built with gRPC 
1.27 or higher.
{quote}
Traceback:

!image-2021-09-03-11-10-01-004.png!

 

Versions: Python 3.9.5

 
{noformat}
# NameVersion   Build  Channel
arrow-cpp 5.0.0   py39h037c299_3_cpuconda-forge
pyarrow   5.0.0   py39hf9247be_3_cpuconda-forge
grpc-cpp  1.39.1   h1072645_0conda-forge
grpcio1.38.1   py39hb76b349_0conda-forge
{noformat}
 

 

Code is similar to:
{code:java}
import pandas
from pyarrow import flight

class XYZClient():
__scheme = "grpc+tls"
__token = None
__flightclient = None

def __init__(self, username, password, 
hostname="some-host", flightport=32010):
flight_client = flight.FlightClient(
"{}://{}:{}".format(self.__scheme, hostname, flightport),
middleware=[DremioClient.DremioClientAuthMiddlewareFactory()],
disable_server_verification=True
)
self.__token = flight_client.authenticate_basic_token(
username, password)
self.__flightclient = flight_client
{code}
There has been a similar issue 
(https://issues.apache.org/jira/browse/ARROW-11695) for an earlier version 
of pyarrow that has been marked as resolved.

 

 





[jira] [Created] (ARROW-13880) [C++] Compute function sort_indices does not support timestamps with time zones

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13880:
---

 Summary: [C++] Compute function sort_indices does not support 
timestamps with time zones
 Key: ARROW-13880
 URL: https://issues.apache.org/jira/browse/ARROW-13880
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Weston Pace


All other temporal types appear to be supported





[jira] [Updated] (ARROW-13876) [C++] Uniform null handling in compute functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13876:

Description: 
The compute functions today have mixed support for null types.

Unary arithmetic functions (e.g. abs) don't support null arrays.

Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
null) but not two (i.e. null + null); they do, however, support both values 
being null (e.g. [null] + [null] = [null] when dtype=int32, but not when 
dtype=null)

sort_indices should support null arrays.

Some functions do forward null arrays:
 - unique

Some functions output a non-null type given null inputs

- is_null (=> boolean)
- is_valid (=> boolean)
- value_counts (=> struct)
- dictionary_encode (=> dictionary)
- count (=> int64)


Some functions throw an error other than "not implemented"

 - list_parent_indices

  was:
The compute functions today have mixed support for null types.

Unary arithmetic functions (e.g. abs) don't support null arrays.

Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
null) but not two (i.e. null + null); they do, however, support both values 
being null (e.g. [null] + [null] = [null] when dtype=int32, but not when 
dtype=null)

Some functions do forward null arrays:
 - unique

Some functions output a non-null type given null inputs

- is_null (=> boolean)
- is_valid (=> boolean)
- value_counts (=> struct)
- dictionary_encode (=> dictionary)
- count (=> int64)


Some functions throw an error other than "not implemented"

 - list_parent_indices


> [C++] Uniform null handling in compute functions
> 
>
> Key: ARROW-13876
> URL: https://issues.apache.org/jira/browse/ARROW-13876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> The compute functions today have mixed support for null types.
> Unary arithmetic functions (e.g. abs) don't support null arrays
> Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
> null) but not both null arrays (i.e. null + null) but they do support both 
> values being null (e.g. [null] + [null] = [null] if dtype=int32 but not 
> supported if dtype=null)
> sort_indices should support null arrays.
> Some functions do forward null arrays:
>  - unique
> Some functions output a non-null type given null inputs
> - is_null (=> boolean)
> - is_valid (=> boolean)
> - value_counts (=> struct)
> - dictionary_encode (=> dictionary)
> - count (=> int64)
> Some functions throw an error other than "not implemented"
>  - list_parent_indices





[jira] [Updated] (ARROW-13873) [C++] Duplicate functions array_filter/array_take and filter/take

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13873:

Description: 
This has been explained to me as a metafunction and a backend but in that case 
the backend should not be registered with the function registry.  Note both:

{code:cpp}
const FunctionDoc filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"input", "selection_filter"}, "FilterOptions");
{code}

and

{code:cpp}
const FunctionDoc array_filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input `array` at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"array", "selection_filter"}, "FilterOptions");
{code}

which seems wrong as well.

Also sort_indices / array_sort_indices

  was:
This has been explained to me as a metafunction and a backend but in that case 
the backend should not be registered with the function registry.  Note both:

{code:cpp}
const FunctionDoc filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"input", "selection_filter"}, "FilterOptions");
{code}

and

{code:cpp}
const FunctionDoc array_filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input `array` at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"array", "selection_filter"}, "FilterOptions");
{code}

which seems wrong as well.


> [C++] Duplicate functions array_filter/array_take and filter/take
> -
>
> Key: ARROW-13873
> URL: https://issues.apache.org/jira/browse/ARROW-13873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This has been explained to me as a metafunction and a backend but in that 
> case the backend should not be registered with the function registry.  Note 
> both:
> {code:cpp}
> const FunctionDoc filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input at positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"input", "selection_filter"}, "FilterOptions");
> {code}
> and
> {code:cpp}
> const FunctionDoc array_filter_doc(
> "Filter with a boolean selection filter",
> ("The output is populated with values from the input `array` at 
> positions\n"
>  "where the selection filter is non-zero.  Nulls in the selection 
> filter\n"
>  "are handled based on FilterOptions."),
> {"array", "selection_filter"}, "FilterOptions");
> {code}
> which seems wrong as well.
> Also sort_indices / array_sort_indices



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13879) [C++] Mixed support for binary types in regex functions

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13879:
---

 Summary: [C++] Mixed support for binary types in regex functions
 Key: ARROW-13879
 URL: https://issues.apache.org/jira/browse/ARROW-13879
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The functions count_substring, count_substring_regex, find_substring, and 
find_substring_regex all accept binary types, but the functions extract_regex, 
match_substring, and match_substring_regex do not.

Either they should all accept binary types, or none of them should.
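As an aside, regex matching over binary data is well defined; Python's `re` module, for instance, accepts both str and bytes patterns, which is the uniform behavior this issue asks the kernels to share. A minimal sketch (illustrative only, not Arrow's API):

```python
import re

# Counting non-overlapping substring/regex matches works identically
# for text (str) and binary (bytes) inputs in Python's re module.
def count_substring(haystack, pattern):
    return len(re.findall(pattern, haystack))

# count_substring("abcabc", "abc")   -> 2
# count_substring(b"abcabc", b"abc") -> 2  (binary input, same result)
```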



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13878) [C++] Add fixed_size_binary support to compute functions

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13878:
---

 Summary: [C++] Add fixed_size_binary support to compute functions
 Key: ARROW-13878
 URL: https://issues.apache.org/jira/browse/ARROW-13878
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The following compute functions do not support fixed_size_binary but do support 
binary:

 - binary_length
 - binary_replace_slice
 - count_substring
 - find_substring
 - find_substring_regex
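For binary_length in particular, fixed_size_binary support should be cheap: every slot has the same declared byte width, so no offsets buffer is needed. A sketch of that semantic difference (plain Python, not Arrow's implementation; `fixed_width` is an illustrative parameter):

```python
# For variable-width binary, each value's length must be measured;
# for fixed_size_binary, the answer is the declared width for every slot.
def binary_length(values, fixed_width=None):
    return [fixed_width if fixed_width is not None else len(v)
            for v in values]

# binary_length([b"ab", b"abcd"])            -> [2, 4]   (variable-width)
# binary_length([b"ab", b"cd"], fixed_width=2) -> [2, 2] (fixed-size)
```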



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13877:
---

 Summary: [C++] Added support for fixed sized list to compute 
functions that process lists
 Key: ARROW-13877
 URL: https://issues.apache.org/jira/browse/ARROW-13877
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The following functions do not support fixed size list (and should):

 - list_flatten
 - list_parent_indices (one could argue this doesn't need to be supported since 
this should be obvious and fixed_size_list doesn't have an indices array)
 - list_value_length (should be easy)

For reference, the following functions do correctly consume fixed_size_list 
(there may be more, this isn't an exhaustive list):

 - count
 - drop_null
 - is_null
 - is_valid
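The requested kernels are also simple for fixed_size_list, since every non-null slot holds exactly the declared number of child values. A behavioral sketch (illustrative Python, not Arrow's implementation; None stands in for a null slot):

```python
# list_flatten: concatenate the child values of all non-null slots.
def list_flatten(fixed_size_lists):
    return [x for lst in fixed_size_lists if lst is not None for x in lst]

# list_value_length: the declared size for non-null slots, null otherwise.
def list_value_length(fixed_size_lists, size):
    return [size if lst is not None else None for lst in fixed_size_lists]

# list_flatten([[1, 2], [3, 4], None])    -> [1, 2, 3, 4]
# list_value_length([[1, 2], None], 2)    -> [2, None]
```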



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-02 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409227#comment-17409227
 ] 

Weston Pace commented on ARROW-13130:
-

[~yibocai#1] I extended your existing abs/negate JIRA and added info about more 
kernels.  If you want to keep the old description let me know and I can create 
a new JIRA.

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - greater
>  - greater_equal
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - less
>  - less_equal
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique
>  
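For context on what "decimal support" buys over a float round trip: Python's decimal module exhibits the exact, scale-preserving arithmetic the decimal kernels aim for. A small sketch (illustration only, unrelated to Arrow's C++ decimal implementation):

```python
from decimal import Decimal

# Exact decimal arithmetic: no binary floating-point rounding error,
# and abs/negate preserve the value's scale.
a = Decimal("1.10")
b = Decimal("2.20")

assert a + b == Decimal("3.30")    # "add" is exact (0.1 + 0.2 != 0.3 in float)
assert -a == Decimal("-1.10")      # "negate"
assert abs(Decimal("-1.10")) == a  # "abs"
```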



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13130:

Summary: [C++][Compute] Add decimal support for arithmetic compute 
functions  (was: [C++][Compute] Add abs, negate kernel for decimal inputs)

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - greater
>  - greater_equal
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - less
>  - less_equal
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13130) [C++][Compute] Add abs, negate kernel for decimal inputs

2021-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13130:

Description: 
The following arithmetic functions do not support decimal:

 - abs
 - abs_checked
 - acos
 - acos_checked
 - asin
 - asin_checked
 - atan
 - ceil
 - cos
 - cos_checked
 - floor
 - greater
 - greater_equal
 - is_finite (?)
 - is_inf (?)
 - is_nan (?)
 - less
 - less_equal
 - ln
 - ln_checked
 - log1p
 - log1p_checked
 - log2
 - log2_checked
 - logb (float/decimal works int/decimal does not)
 - logb_checked (float/decimal works int/decimal does not)
 - mode
 - negate
 - negate_checked
 - power (float/decimal works int/decimal does not)
 - power_checked (float/decimal works int/decimal does not)
 - quantile
 - sign
 - sin
 - sin_checked
 - stddev
 - tan
 - tan_checked
 - tdigest
 - trunc
 - variance

? - May not be applicable

The following arithmetic functions do support decimal inputs:
 - add
 - add_checked
 - atan2
 - divide
 - divide_checked
 - mean
 - min_max
 - multiply
 - multiply_checked
 - product
 - subtract
 - subtract_checked
 - sum
 - unique

 

> [C++][Compute] Add abs, negate kernel for decimal inputs
> 
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - greater
>  - greater_equal
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - less
>  - less_equal
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13863) [JS] How to implement two-dimensional array. Like pyarrow.array in javascript.

2021-09-02 Thread zizhao.chen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zizhao.chen updated ARROW-13863:

Affects Version/s: 5.0.0

> [JS] How to implement two-dimensional array. Like pyarrow.array in javascript.
> --
>
> Key: ARROW-13863
> URL: https://issues.apache.org/jira/browse/ARROW-13863
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: zizhao.chen
>Priority: Major
>
> How to implement two-dimensional array.
> Like pyarrow.array in javascript.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13876) [C++] Uniform null handling in compute functions

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13876:
---

 Summary: [C++] Uniform null handling in compute functions
 Key: ARROW-13876
 URL: https://issues.apache.org/jira/browse/ARROW-13876
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The compute functions today have mixed support for null types.

Unary arithmetic functions (e.g. abs) don't support null arrays.

Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
null) but not both arrays being null (i.e. null + null). They do, however, 
support both values being null (e.g. [null] + [null] = [null] when 
dtype=int32, but not when dtype=null).

Some functions do forward null arrays:
 - unique

Some functions output a non-null type given null inputs

- is_null (=> boolean)
- is_valid (=> boolean)
- value_counts (=> struct)
- dictionary_encode (=> dictionary)
- count (=> int64)


Some functions throw an error other than "not implemented"

 - list_parent_indices
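The uniform behavior this issue argues for, at the value level, is plain null propagation: any null operand yields a null result, independent of whether the whole array is of null type or merely contains nulls. A sketch of that invariant (illustrative Python, None standing in for null):

```python
# Null propagation for a binary arithmetic function: a null on either
# side produces a null output slot.
def add_arrays(left, right):
    return [None if (a is None or b is None) else a + b
            for a, b in zip(left, right)]

# add_arrays([1, None], [2, 3])  -> [3, None]
# add_arrays([None], [None])     -> [None]   (the null + null case)
```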



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13875) [R] Bindings for stringr::str_extract/str_extract_all ~ "extract_regex" kernel

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13875:
-

 Summary: [R] Bindings for stringr::str_extract/str_extract_all ~ 
"extract_regex" kernel
 Key: ARROW-13875
 URL: https://issues.apache.org/jira/browse/ARROW-13875
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nic Crane


Implementing this may need to make use of the {{list_flatten}} kernel too (and 
maybe others?), as this kernel returns a struct.
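To illustrate the shapes involved: the extract_regex kernel yields one struct of named capture groups per input value (while str_extract_all in stringr yields one list per value, which is where {{list_flatten}} would come in). A rough Python analogue of the per-value struct output, not the Arrow or R implementation:

```python
import re

# extract_regex-style behavior: for each input string, return a
# struct-like dict of named capture groups, or None if there is no match.
def extract_regex(values, pattern):
    out = []
    for v in values:
        m = re.search(pattern, v)
        out.append(m.groupdict() if m else None)
    return out

# extract_regex(["a1", "bb"], r"(?P<letter>[a-z])(?P<digit>\d)")
#   -> [{"letter": "a", "digit": "1"}, None]
```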



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13768) [R] Allow JSON to be an optional component

2021-09-02 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved ARROW-13768.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11046
[https://github.com/apache/arrow/pull/11046]

> [R] Allow JSON to be an optional component
> --
>
> Key: ARROW-13768
> URL: https://issues.apache.org/jira/browse/ARROW-13768
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 5.0.0
>Reporter: Karl Dunkle Werner
>Assignee: Karl Dunkle Werner
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> JSON support requires RapidJSON, a third-party dependency that might not 
> always be available. Particularly for offline static builds (ARROW-12981), it 
> would be nice to allow {{ARROW_JSON=OFF}}.
> Here's the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L290-L292]
>  of {{ThirdpartyToolchain.cmake}}:
> {code:none}
> if(ARROW_JSON)
>   set(ARROW_WITH_RAPIDJSON ON)
> endif()
> {code}
> And the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/r/inst/build_arrow_static.sh#L62]
>  of the {{build_arrow_static.sh}} script.
> As Neal 
> [mentioned|https://github.com/apache/arrow/pull/11001#discussion_r696723923], 
> there's more to do than just replacing {{-DARROW_JSON=ON}} with 
> {{-DARROW_JSON=$\{ARROW_JSON:-ON}}}. "We'll have to conditionally build some 
> of the bindings like we do with dataset and parquet, and we'll have to 
> conditionally skip tests."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13874) [R] Implement TrimOptions

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13874:
-

Assignee: Nic Crane

> [R] Implement TrimOptions
> -
>
> Key: ARROW-13874
> URL: https://issues.apache.org/jira/browse/ARROW-13874
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement TrimOptions for all of:
> utf8_trim_doc
> utf8_ltrim_doc
> utf8_rtrim_doc
> ascii_trim_doc
> ascii_ltrim_doc
> ascii_rtrim_doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13874) [R] Implement TrimOptions

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13874:
---
Labels: pull-request-available  (was: )

> [R] Implement TrimOptions
> -
>
> Key: ARROW-13874
> URL: https://issues.apache.org/jira/browse/ARROW-13874
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nic Crane
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement TrimOptions for all of:
> utf8_trim_doc
> utf8_ltrim_doc
> utf8_rtrim_doc
> ascii_trim_doc
> ascii_ltrim_doc
> ascii_rtrim_doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13768) [R] Allow JSON to be an optional component

2021-09-02 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reassigned ARROW-13768:


Assignee: Karl Dunkle Werner  (was: Ian Cook)

> [R] Allow JSON to be an optional component
> --
>
> Key: ARROW-13768
> URL: https://issues.apache.org/jira/browse/ARROW-13768
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 5.0.0
>Reporter: Karl Dunkle Werner
>Assignee: Karl Dunkle Werner
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> JSON support requires RapidJSON, a third-party dependency that might not 
> always be available. Particularly for offline static builds (ARROW-12981), it 
> would be nice to allow {{ARROW_JSON=OFF}}.
> Here's the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L290-L292]
>  of {{ThirdpartyToolchain.cmake}}:
> {code:none}
> if(ARROW_JSON)
>   set(ARROW_WITH_RAPIDJSON ON)
> endif()
> {code}
> And the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/r/inst/build_arrow_static.sh#L62]
>  of the {{build_arrow_static.sh}} script.
> As Neal 
> [mentioned|https://github.com/apache/arrow/pull/11001#discussion_r696723923], 
> there's more to do than just replacing {{-DARROW_JSON=ON}} with 
> {{-DARROW_JSON=$\{ARROW_JSON:-ON}}}. "We'll have to conditionally build some 
> of the bindings like we do with dataset and parquet, and we'll have to 
> conditionally skip tests."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13768) [R] Allow JSON to be an optional component

2021-09-02 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reassigned ARROW-13768:


Assignee: Ian Cook

> [R] Allow JSON to be an optional component
> --
>
> Key: ARROW-13768
> URL: https://issues.apache.org/jira/browse/ARROW-13768
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 5.0.0
>Reporter: Karl Dunkle Werner
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> JSON support requires RapidJSON, a third-party dependency that might not 
> always be available. Particularly for offline static builds (ARROW-12981), it 
> would be nice to allow {{ARROW_JSON=OFF}}.
> Here's the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L290-L292]
>  of {{ThirdpartyToolchain.cmake}}:
> {code:none}
> if(ARROW_JSON)
>   set(ARROW_WITH_RAPIDJSON ON)
> endif()
> {code}
> And the [relevant 
> section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/r/inst/build_arrow_static.sh#L62]
>  of the {{build_arrow_static.sh}} script.
> As Neal 
> [mentioned|https://github.com/apache/arrow/pull/11001#discussion_r696723923], 
> there's more to do than just replacing {{-DARROW_JSON=ON}} with 
> {{-DARROW_JSON=$\{ARROW_JSON:-ON}}}. "We'll have to conditionally build some 
> of the bindings like we do with dataset and parquet, and we'll have to 
> conditionally skip tests."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13868) [R] Check if utf8 kernels have options implemented and implement if not

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13868:
--
Description: 
The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.


* utf8_capitalize
* utf8_is_alnum
* utf8_is_alpha
* utf8_is_decimal
* utf8_is_digit
* utf8_is_lower
* utf8_is_numeric
* utf8_is_printable
* utf8_is_space
* utf8_is_title
* utf8_is_upper
* utf8_replace_slice
* utf8_reverse
* utf8_swapcase

  was:
The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.


* utf8_capitalize
* utf8_is_alnum
* utf8_is_alpha
* utf8_is_decimal
* utf8_is_digit
* utf8_is_lower
* utf8_is_numeric
* utf8_is_printable
* utf8_is_space
* utf8_is_title
* utf8_is_upper
* utf8_ltrim
* utf8_replace_slice
* utf8_reverse
* utf8_rtrim
* utf8_swapcase
* utf8_trim


> [R] Check if utf8 kernels have options implemented and implement if not
> ---
>
> Key: ARROW-13868
> URL: https://issues.apache.org/jira/browse/ARROW-13868
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nic Crane
>Priority: Major
>
> The following compute kernels may not have direct bindings in the R package, 
> but can be accessed via call_function().  Check if they have Options classes 
> associated with them, and if so, implement these Options classes so that 
> these kernels can be called via call_function without error.
> * utf8_capitalize
> * utf8_is_alnum
> * utf8_is_alpha
> * utf8_is_decimal
> * utf8_is_digit
> * utf8_is_lower
> * utf8_is_numeric
> * utf8_is_printable
> * utf8_is_space
> * utf8_is_title
> * utf8_is_upper
> * utf8_replace_slice
> * utf8_reverse
> * utf8_swapcase
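What implementing an Options class buys, in miniature: call_function can only forward options to a kernel if a binding for that options type exists. A toy registry illustrating the dispatch (all names here are illustrative, not arrow's R or C++ API):

```python
# Toy function registry: options, when supplied, are forwarded to the
# registered kernel as keyword arguments.
REGISTRY = {}

def register(name, fn):
    REGISTRY[name] = fn

def call_function(name, args, options=None):
    fn = REGISTRY[name]
    return fn(*args, **(options or {}))

register("utf8_capitalize", lambda s: s.capitalize())

# call_function("utf8_capitalize", ["hello"]) -> "Hello"
```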



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13867) [R] Check if ascii functions have options implemented and implement if not

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13867:
--
Description: 
The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.

* ascii_capitalize
* ascii_is_alnum
* ascii_is_alpha
* ascii_is_decimal
* ascii_is_lower
* ascii_is_printable
* ascii_is_space
* ascii_is_title
* ascii_is_upper
* ascii_lower
* ascii_lpad
* ascii_ltrim_whitespace
* ascii_rtrim_whitespace
* ascii_swapcase
* ascii_trim_whitespace
* ascii_upper

  was:
The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.

* ascii_capitalize
* ascii_is_alnum
* ascii_is_alpha
* ascii_is_decimal
* ascii_is_lower
* ascii_is_printable
* ascii_is_space
* ascii_is_title
* ascii_is_upper
* ascii_lower
* ascii_lpad
* ascii_ltrim
* ascii_ltrim_whitespace
* ascii_rtrim
* ascii_rtrim_whitespace
* ascii_swapcase
* ascii_trim
* ascii_trim_whitespace
* ascii_upper


> [R] Check if ascii functions have options implemented and implement if not
> --
>
> Key: ARROW-13867
> URL: https://issues.apache.org/jira/browse/ARROW-13867
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nic Crane
>Priority: Major
>
> The following compute kernels may not have direct bindings in the R package, 
> but can be accessed via call_function().  Check if they have Options classes 
> associated with them, and if so, implement these Options classes so that 
> these kernels can be called via call_function without error.
> * ascii_capitalize
> * ascii_is_alnum
> * ascii_is_alpha
> * ascii_is_decimal
> * ascii_is_lower
> * ascii_is_printable
> * ascii_is_space
> * ascii_is_title
> * ascii_is_upper
> * ascii_lower
> * ascii_lpad
> * ascii_ltrim_whitespace
> * ascii_rtrim_whitespace
> * ascii_swapcase
> * ascii_trim_whitespace
> * ascii_upper



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13874) [R] Implement TrimOptions

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13874:
-

 Summary: [R] Implement TrimOptions
 Key: ARROW-13874
 URL: https://issues.apache.org/jira/browse/ARROW-13874
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nic Crane


Implement TrimOptions for all of:

utf8_trim_doc
utf8_ltrim_doc
utf8_rtrim_doc
ascii_trim_doc
ascii_ltrim_doc
ascii_rtrim_doc
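Semantically, TrimOptions carries the set of characters to strip, much like the `characters` argument of Python's str.strip family. A behavioral sketch (illustrative only, not the arrow binding):

```python
# Trim semantics parameterized by a character set, applied element-wise.
def utf8_trim(values, characters):
    return [v.strip(characters) for v in values]

def utf8_ltrim(values, characters):
    return [v.lstrip(characters) for v in values]

# utf8_trim(["xxabcxx"], "x")  -> ["abc"]
# utf8_ltrim(["xxabcxx"], "x") -> ["abcxx"]
```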



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13873) [C++] Duplicate functions array_filter/array_take and filter/take

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13873:
---

 Summary: [C++] Duplicate functions array_filter/array_take and 
filter/take
 Key: ARROW-13873
 URL: https://issues.apache.org/jira/browse/ARROW-13873
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


This has been explained to me as a metafunction and a backend but in that case 
the backend should not be registered with the function registry.  Note both:

{code:cpp}
const FunctionDoc filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"input", "selection_filter"}, "FilterOptions");
{code}

and

{code:cpp}
const FunctionDoc array_filter_doc(
"Filter with a boolean selection filter",
("The output is populated with values from the input `array` at positions\n"
 "where the selection filter is non-zero.  Nulls in the selection filter\n"
 "are handled based on FilterOptions."),
{"array", "selection_filter"}, "FilterOptions");
{code}

which seems wrong as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409150#comment-17409150
 ] 

Nic Crane commented on ARROW-13860:
---

[~icook] I think there might be other things at play as well as we also made 
big changes to {{arrow:::collect.arrow_dplyr_query()}} but honestly, it's 
confusing me more, so I'm leaving it as "the output reported above is as 
expected and it's just the {{is_writable_table()}} bit that is a problem".  

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query'.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409144#comment-17409144
 ] 

Ian Cook commented on ARROW-13860:
--

Ah, thanks [~thisisnic], so I guess we used to define the groups directly in 
the {{Table}} or {{RecordBatch}} object instead of creating an 
{{arrow_dplyr_query}} object whenever there were groups.

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query'.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13872) [Java] ExtensionTypeVector does not work with RangeEqualsVisitor

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13872:
---
Labels: pull-request-available  (was: )

> [Java] ExtensionTypeVector does not work with RangeEqualsVisitor
> 
>
> Key: ARROW-13872
> URL: https://issues.apache.org/jira/browse/ARROW-13872
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 5.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using an ExtensionTypeVector with a RangeEqualsVector to compare with 
> another extension type vector, it fails because in vector.accept() the 
> extension type defers to the underlyingVector, but this is not done for the 
> vector initially set in the RangeEqualsVisitor, so it ends up either failing 
> due to different types or attempting to cast the extension vector to the 
> underlying vector type.
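The shape of the fix described above is to unwrap the extension wrapper to its storage vector on *both* sides before comparing, not only on the visited side. A language-neutral sketch in Python (class and function names are illustrative, not the Arrow Java API):

```python
# Toy model of an extension vector wrapping an underlying storage vector.
class ExtensionVector:
    def __init__(self, underlying):
        self.underlying = underlying

def unwrap(vector):
    """Defer to the storage vector when given an extension wrapper."""
    return vector.underlying if isinstance(vector, ExtensionVector) else vector

def range_equals(left, right, start, length):
    # Unwrap BOTH operands, so an extension vector set on either side
    # is compared by its underlying storage rather than by wrapper type.
    l, r = unwrap(left), unwrap(right)
    return l[start:start + length] == r[start:start + length]

# range_equals(ExtensionVector([1, 2, 3]), [0, 2, 3], 1, 2) -> True
```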



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409141#comment-17409141
 ] 

Nic Crane edited comment on ARROW-13860 at 9/2/21, 10:31 PM:
-

[~icook] I have no idea, but I ran a few things in Arrow 4.1; make what you 
will of the output below, but I think it might answer your question


{code:java}
> iris %>% group_by(Species) %>% record_batch() 

RecordBatch
150 rows x 5 columns
$Sepal.Length <double>
$Sepal.Width <double>
$Petal.Length <double>
$Petal.Width <double>
$Species <dictionary<values=string, indices=int8>>

See $metadata for additional Schema metadata

> iris %>% group_by(Species) %>% record_batch() %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 * <dbl>        <dbl>       <dbl>        <dbl>       <fct>  
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

> iris %>% record_batch() %>% group_by(Species) 
RecordBatch (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>

* Grouped by Species
See $.data for the source Arrow object

> iris %>% record_batch() %>% group_by(Species)  %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <dbl> <dbl> <dbl> <dbl> <fct>
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

{code}



was (Author: thisisnic):
[~icook] I'm not sure, but I ran a few things in arrow 4.1; make what you will 
of the output below, but I think it might answer your question.


{code:java}
iris %>% group_by(Species) %>% record_batch() 

RecordBatch
150 rows x 5 columns
$Sepal.Length <double>
$Sepal.Width <double>
$Petal.Length <double>
$Petal.Width <double>
$Species <dictionary<values=string, indices=int8>>

See $metadata for additional Schema metadata

> iris %>% group_by(Species) %>% record_batch() %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 * <dbl> <dbl> <dbl> <dbl> <fct>
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

> iris %>% record_batch() %>% group_by(Species) 
RecordBatch (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>

* Grouped by Species
See $.data for the source Arrow object

> iris %>% record_batch() %>% group_by(Species)  %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <dbl> <dbl> <dbl> <dbl> <fct>
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

{code}


> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
>  

[jira] [Commented] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409141#comment-17409141
 ] 

Nic Crane commented on ARROW-13860:
---

[~icook] I'm not sure, but I ran a few things in arrow 4.1; make what you will 
of the output below, but I think it might answer your question.


{code:java}
iris %>% group_by(Species) %>% record_batch() 

RecordBatch
150 rows x 5 columns
$Sepal.Length <double>
$Sepal.Width <double>
$Petal.Length <double>
$Petal.Width <double>
$Species <dictionary<values=string, indices=int8>>

See $metadata for additional Schema metadata

> iris %>% group_by(Species) %>% record_batch() %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 * <dbl> <dbl> <dbl> <dbl> <fct>
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

> iris %>% record_batch() %>% group_by(Species) 
RecordBatch (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>

* Grouped by Species
See $.data for the source Arrow object

> iris %>% record_batch() %>% group_by(Species)  %>% collect()
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <dbl> <dbl> <dbl> <dbl> <fct>
 1  5.1 3.5  1.4 0.2 setosa 
 2  4.9 3    1.4 0.2 setosa 
 3  4.7 3.2  1.3 0.2 setosa 
 4  4.6 3.1  1.5 0.2 setosa 
 5  5   3.6  1.4 0.2 setosa 
 6  5.4 3.9  1.7 0.4 setosa 
 7  4.6 3.4  1.4 0.3 setosa 
 8  5   3.4  1.5 0.2 setosa 
 9  4.4 2.9  1.4 0.2 setosa 
10  4.9 3.1  1.5 0.1 setosa 
# … with 140 more rows

{code}


> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query’.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13872) [Java] ExtensionTypeVector does not work with RangeEqualsVisitor

2021-09-02 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-13872:


 Summary: [Java] ExtensionTypeVector does not work with 
RangeEqualsVisitor
 Key: ARROW-13872
 URL: https://issues.apache.org/jira/browse/ARROW-13872
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 5.0.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


When using an ExtensionTypeVector with a RangeEqualsVisitor to compare with 
another extension type vector, the comparison fails: vector.accept() defers 
the extension type to its underlyingVector, but the same is not done for the 
vector initially set in the RangeEqualsVisitor, so it ends up either failing 
due to different types or attempting to cast the extension vector to the 
underlying vector type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13871) [C++] JSON reader can fail if a list array key is present in one chunk but not in a later chunk

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13871:
---
Labels: pull-request-available  (was: )

> [C++] JSON reader can fail if a list array key is present in one chunk but 
> not in a later chunk
> ---
>
> Key: ARROW-13871
> URL: https://issues.apache.org/jira/browse/ARROW-13871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reproduction and more information in 
> https://github.com/apache/arrow/issues/11044



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13871) [C++] JSON reader can fail if a list array key is present in one chunk but not in a later chunk

2021-09-02 Thread Weston Pace (Jira)
Weston Pace created ARROW-13871:
---

 Summary: [C++] JSON reader can fail if a list array key is present 
in one chunk but not in a later chunk
 Key: ARROW-13871
 URL: https://issues.apache.org/jira/browse/ARROW-13871
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


Reproduction and more information in 
https://github.com/apache/arrow/issues/11044



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13798) [Python] Selective projection of struct fields errors with use_legacy_dataset = False

2021-09-02 Thread Mark Grey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409120#comment-17409120
 ] 

Mark Grey commented on ARROW-13798:
---

After looking into the implementation, it seems this is to some extent a known 
limitation of the new Dataset API, perhaps arising from ARROW-11259

> [Python] Selective projection of struct fields errors with use_legacy_dataset 
> = False
> -
>
> Key: ARROW-13798
> URL: https://issues.apache.org/jira/browse/ARROW-13798
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
> Environment: Python 3.6.9
>Reporter: Mark Grey
>Priority: Major
>  Labels: columns, parquet, python
>
> Selectively projecting fields from within a struct when reading from parquet 
> files triggers an {{ArrowInvalid}} error when using the new dataset api 
> ({{use_legacy_dataset=False}}).  Passing {{use_legacy_dataset=True}} yields 
> the expected behavior: loading only the columns enumerated in the {{columns}} 
> argument, recursing into structs if there is a {{.}} delimiter in the field 
> name.
> Using the following test table:
> {code:python}
> df = pd.DataFrame({
> 'user_id': ['abc123', 'qrs456'],
> 'interaction': [{'type': 'click', 'element': 'button'}, {'type':'scroll', 
> 'element': 'window'}]
> })
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'example.parquet')
> {code}
> Using the current default datasets API:
> {code:python}
> table_latest = pq.read_table('example.parquet', columns = ['user_id', 
> 'interaction.type'])
> {code}
> yields:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 table_latest = pq.read_table('/'.join([out_path, 'example.parquet']), 
> columns = ['user_id', 'interaction.type'], filesystem = fs)
>   2 table_latest
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/parquet.py
>  in read_table(source, columns, use_threads, metadata, use_pandas_metadata, 
> memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, 
> use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
>1894 
>1895 return dataset.read(columns=columns, use_threads=use_threads,
> -> 1896 use_pandas_metadata=use_pandas_metadata)
>1897 
>1898 if ignore_prefixes is not None:
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/parquet.py
>  in read(self, columns, use_threads, use_pandas_metadata)
>1744 table = self._dataset.to_table(
>1745 columns=columns, filter=self._filter_expression,
> -> 1746 use_threads=use_threads
>1747 )
>1748 
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/_dataset.pyx
>  in pyarrow._dataset.Dataset.to_table()
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/_dataset.pyx
>  in pyarrow._dataset.Dataset.scanner()
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/_dataset.pyx
>  in pyarrow._dataset.Scanner.from_dataset()
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/_dataset.pyx
>  in pyarrow._dataset._populate_builder()
> /usr/local/share/sciencebox/venv/lib/python3.6/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: No match for FieldRef.Name(interaction.type) in user_id: string
> interaction: struct<type: string, element: string>{noformat}
> Whereas: 
> {code:python}
> table_legacy = pq.read_table('example.parquet', columns = ['user_id', 
> 'interaction.type'], use_legacy_dataset = True)
> {code}
> Yields:
> {noformat}
> pyarrow.Table
> user_id: string
> interaction: struct<type: string>
>   child 0, type: string{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13870) [R] Should we proactively disable multithreading on 32bit windows?

2021-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409109#comment-17409109
 ] 

Antoine Pitrou commented on ARROW-13870:


Why would multithreading not work on 32-bit?

> [R] Should we proactively disable multithreading on 32bit windows?
> --
>
> Key: ARROW-13870
> URL: https://issues.apache.org/jira/browse/ARROW-13870
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Priority: Minor
>
> We have numerous test skips for 32bit windows + multithreading. Should we add 
> something to {{onLoad}} that detects this and proactively sets arrow to 
> single-threaded (using {{set_cpu_count()}} and / or 
> {{options(arrow.use_threads)}})? 
> Should we display a message that we did this and if someone wants to live 
> experimentally they can increase that again?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13051) [Release][Packaging] Update the java post release task to use the crossbow artifacts

2021-09-02 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409108#comment-17409108
 ] 

Kouhei Sutou commented on ARROW-13051:
--

Sorry. I didn't do it...

It seems that we need to create a new "Maven" package type repository named 
"arrow-maven". Our current "arrow" repository can't be reused for it. Could you 
open an INFRA JIRA issue for this? We don't have permission to do it.

BTW: if we use Artifactory for Java packages, users will need to change their 
configuration ("pom.xml"?). Is that acceptable for Java users? I don't know 
because I'm not a Java user...

> [Release][Packaging] Update the java post release task to use the crossbow 
> artifacts
> 
>
> Key: ARROW-13051
> URL: https://issues.apache.org/jira/browse/ARROW-13051
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Packaging
>Reporter: Krisztian Szucs
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> We produce Java jars using crossbow tasks. Ideally we should download and 
> deploy these packages instead of compiling them locally during the Java post 
> release task.
> See the produced jars at: 
> https://github.com/ursacomputing/crossbow/releases/tag/actions-496-github-java-jars
> See more context at: https://github.com/apache/arrow/pull/10411



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13870) [R] Should we proactively disable multithreading on 32bit windows?

2021-09-02 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13870:
--

 Summary: [R] Should we proactively disable multithreading on 32bit 
windows?
 Key: ARROW-13870
 URL: https://issues.apache.org/jira/browse/ARROW-13870
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane


We have numerous test skips for 32bit windows + multithreading. Should we add 
something to {{onLoad}} that detects this and proactively sets arrow to 
single-threaded (using {{set_cpu_count()}} and / or 
{{options(arrow.use_threads)}})? 

Should we display a message that we did this and if someone wants to live 
experimentally they can increase that again?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409102#comment-17409102
 ] 

Ian Cook edited comment on ARROW-13860 at 9/2/21, 8:12 PM:
---

Thanks for the report!

I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.

Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit {{data.frame}} or 
{{ArrowTabular}}.

In version 4.0.1 of the arrow package, this did not trigger an error because 
the {{is_writable_table\(x\)}} function did not exist. It was introduced in 
#10387: 
[https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]

But I am confused: Before ARROW-11769, I thought groups were lost when a 
grouped R data.frame was converted to a {{Table}}. So how is it that in the 
example above, the groups were seemingly written to the Parquet file and read 
back in? Didn't we always call {{Table$create()}} on the input to 
{{write_parquet()}} so shouldn't the groups have been lost?

cc [~jonkeane] [~thisisnic]


was (Author: icook):
Thanks for the report!

I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.

Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit {{data.frame}} or 
{{ArrowTabular}}.

In version 4.0.0 of the arrow package, this did not trigger an error because 
the {{is_writable_table\(x\)}} function did not exist. It was introduced in 
#10387: 
[https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]

But I am confused: Before ARROW-11769, I thought groups were lost when a 
grouped R data.frame was converted to a {{Table}}. So how is it that in the 
example above, the groups were seemingly written to the Parquet file and read 
back in? Didn't we always call {{Table$create()}} on the input to 
{{write_parquet()}} so shouldn't the groups have been lost?

cc [~jonkeane] [~thisisnic]

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query’.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-13860:
-
Component/s: R

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query’.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409102#comment-17409102
 ] 

Ian Cook edited comment on ARROW-13860 at 9/2/21, 8:10 PM:
---

Thanks for the report!

I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.

Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit {{data.frame}} or 
{{ArrowTabular}}.

In version 4.0.0 of the arrow package, this did not trigger an error because 
the {{is_writable_table\(x\)}} function did not exist. It was introduced in 
#10387: 
[https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]

But I am confused: Before ARROW-11769, I thought groups were lost when a 
grouped R data.frame was converted to a {{Table}}. So how is it that in the 
example above, the groups were seemingly written to the Parquet file and read 
back in? Didn't we always call {{Table$create()}} on the input to 
{{write_parquet()}} so shouldn't the groups have been lost?

cc [~jonkeane] [~thisisnic]


was (Author: icook):
Thanks for the report!

I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.

Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit {{data.frame}} or 
{{ArrowTabular}}.

In version 4.0.0 of the arrow package, this did not trigger an error because 
the {{is_writable_table(x)}} function did not exist. It was introduced in 
#10387: 
[https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]

But I am confused: Before ARROW-11769, I thought groups were lost when a 
grouped R data.frame was converted to a {{Table}}. So how is it that in the 
example above, the groups were seemingly written to the Parquet file and read 
back in? Didn't we always call {{Table$create()}} on the input to 
{{write_parquet()}} so shouldn't the groups have been lost?

cc [~jonkeane] [~thisisnic]

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query’.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409102#comment-17409102
 ] 

Ian Cook commented on ARROW-13860:
--

Thanks for the report!

I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.

Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit {{data.frame}} or 
{{ArrowTabular}}.

In version 4.0.0 of the arrow package, this did not trigger an error because 
the {{is_writable_table(x)}} function did not exist. It was introduced in 
#10387: 
[https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]

But I am confused: Before ARROW-11769, I thought groups were lost when a 
grouped R data.frame was converted to a {{Table}}. So how is it that in the 
example above, the groups were seemingly written to the Parquet file and read 
back in? Didn't we always call {{Table$create()}} on the input to 
{{write_parquet()}} so shouldn't the groups have been lost?

cc [~jonkeane] [~thisisnic]

> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> -
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: macOS 11.1 Big Sur
>Reporter: Hideaki Hayashi
>Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{ arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 
> 'Table', not 'arrow_dplyr_query’.}}
>  
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups:   am [2]}}
> {{#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4}}
> {{# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4}}
> {{# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1}}
> {{# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1}}
> {{# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2}}
> {{# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1}}
> {{# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13831) [GLib][Ruby] Add support for writing by Arrow Dataset

2021-09-02 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-13831.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11055
[https://github.com/apache/arrow/pull/11055]

> [GLib][Ruby] Add support for writing by Arrow Dataset
> -
>
> Key: ARROW-13831
> URL: https://issues.apache.org/jira/browse/ARROW-13831
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib, Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13855) [C++] [Python] Add support for exporting extension types

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13855:
---
Labels: pull-request-available  (was: )

> [C++] [Python] Add support for exporting extension types
> 
>
> Key: ARROW-13855
> URL: https://issues.apache.org/jira/browse/ARROW-13855
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Jorge Leitão
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be nice to be able to export extension fields and arrays, i.e.
> {code:python}
> import pyarrow
> class UuidType(pyarrow.PyExtensionType):
> def __init__(self):
> super().__init__(pyarrow.binary(16))
> def __reduce__(self):
> return UuidType, ()
> field = pyarrow.field("aa", UuidType())
> field._export_to_c(pointer)
> {code}
> would not raise
> {code:java}
> pyarrow.lib.ArrowNotImplementedError: Exporting 
> extension<arrow.py_extension_type<UuidType>> array not supported
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13750) [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R binding - R

2021-09-02 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409014#comment-17409014
 ] 

Nic Crane commented on ARROW-13750:
---

Good point. I think this was the plan, but the title having the word "yet" in 
it implies otherwise.

> [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R 
> binding - R
> --
>
> Key: ARROW-13750
> URL: https://issues.apache.org/jira/browse/ARROW-13750
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nic Crane
>Priority: Major
>






[jira] [Created] (ARROW-13869) [R] Check if misc kernels have options implemented and if not, implement them

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13869:
-

 Summary: [R] Check if misc kernels have options implemented and if 
not, implement them
 Key: ARROW-13869
 URL: https://issues.apache.org/jira/browse/ARROW-13869
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nic Crane


Other kernels which may or may not have options properly implemented:

and_not
and_not_kleene
array_filter
array_take
binary_join
binary_replace_slice
bit_wise_and
bit_wise_not
bit_wise_or
bit_wise_xor
choose
count_substring
count_substring_regex
drop_null
ends_with
extract_regex
index
index_in
index_in_meta_binary
iso_calendar
list_flatten
list_parent_indices
list_value_length
max_element_wise
mode
or
partition_nth_indices
replace_substring_regex
replace_with_mask
shift_left
shift_left_checked
shift_right
shift_right_checked
starts_with
string_is_ascii
unique
xor





[jira] [Assigned] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13803:


Assignee: David Li

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Updated] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13803:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Commented] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-09-02 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408997#comment-17408997
 ] 

Eduardo Ponce commented on ARROW-13866:
---

I think the task of identifying missing FunctionOptions can be automated. I am 
currently working on a script to help with this issue.

> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>
> I'm writing a section in the cookbook about calling kernels which don't have 
> R bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
> {{list_compute_functions()}}.  I tried to call it by searching for it in the 
> C++ code, seeing that it requires a TrimOptions class of options, and then 
> saw that it has a single parameter, characters.
> I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
> "abracadabra")), options = list(characters = "ab"))}} to see what would 
> happen, which resulted in:
> {{Error: Invalid: Attempted to initialize KernelState from null 
> FunctionOptions}}
> This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
> should go through all the compute functions listed via 
> {{list_compute_functions()}} and ensure all of them have options implemented.





[jira] [Created] (ARROW-13868) [R] Check if utf8 kernels have options implemented and implement if not

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13868:
-

 Summary: [R] Check if utf8 kernels have options implemented and 
implement if not
 Key: ARROW-13868
 URL: https://issues.apache.org/jira/browse/ARROW-13868
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nic Crane


The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.


* utf8_capitalize
* utf8_is_alnum
* utf8_is_alpha
* utf8_is_decimal
* utf8_is_digit
* utf8_is_lower
* utf8_is_numeric
* utf8_is_printable
* utf8_is_space
* utf8_is_title
* utf8_is_upper
* utf8_ltrim
* utf8_replace_slice
* utf8_reverse
* utf8_rtrim
* utf8_swapcase
* utf8_trim





[jira] [Created] (ARROW-13867) [R] Check if ascii functions have options implemented and implement if not

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13867:
-

 Summary: [R] Check if ascii functions have options implemented and 
implement if not
 Key: ARROW-13867
 URL: https://issues.apache.org/jira/browse/ARROW-13867
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nic Crane


The following compute kernels may not have direct bindings in the R package, 
but can be accessed via call_function().  Check if they have Options classes 
associated with them, and if so, implement these Options classes so that these 
kernels can be called via call_function without error.

* ascii_capitalize
* ascii_is_alnum
* ascii_is_alpha
* ascii_is_decimal
* ascii_is_lower
* ascii_is_printable
* ascii_is_space
* ascii_is_title
* ascii_is_upper
* ascii_lower
* ascii_lpad
* ascii_ltrim
* ascii_ltrim_whitespace
* ascii_rtrim
* ascii_rtrim_whitespace
* ascii_swapcase
* ascii_trim
* ascii_trim_whitespace
* ascii_upper





[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408992#comment-17408992
 ] 

David Li commented on ARROW-13803:
--

There is an off-by-one error in 
[BitUtil::SetBitmap|https://github.com/apache/arrow/blob/8c70a5f5178c5b74cc181dc8bdd4b03ba14f36d9/cpp/src/arrow/util/bit_util.cc#L112-L115].
 In this case, offset started as 0 and length started as 65536. At this point 
in the function, offset is now 65536 and length is now 0. data is a pointer to 
an 8192-byte buffer. Hence it indexes {{data[8192]}} which is past the end of 
the buffer. We then crash because the memory at this region is not mapped on 
this platform. (I'm surprised valgrind/ASan/etc. don't catch the access on x64.)
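The failure mode can be sketched in a few lines (a simplified model for illustration, not the actual Arrow C++ code):

```python
# Simplified model of the off-by-one: after the byte-aligned loop consumes
# all 65536 bits, offset points one past the 8192-byte buffer, yet the
# trailing-bits step still dereferences data[offset // 8].
def set_bitmap_buggy(data, offset, length):
    # byte-aligned fast path: set whole bytes, advancing offset
    while length >= 8 and offset % 8 == 0:
        data[offset // 8] = 0xFF
        offset += 8
        length -= 8
    # trailing-bits step, executed even when length == 0:
    byte = data[offset // 8]  # indexes data[8192] -> past the end
    data[offset // 8] = byte | ((1 << length) - 1)

buf = bytearray(8192)
try:
    set_bitmap_buggy(buf, 0, 65536)
    crashed = False
except IndexError:  # Python catches what C dereferences blindly
    crashed = True
```

In C++ the stray read/write is undefined behavior rather than an exception, which is why it only crashes when the adjacent page happens to be unmapped.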

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Updated] (ARROW-13859) [Java] Add code coverage support

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13859:
---
Labels: pull-request-available  (was: )

> [Java] Add code coverage support
> 
>
> Key: ARROW-13859
> URL: https://issues.apache.org/jira/browse/ARROW-13859
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There's currently no easy way to check code coverage for the Java codebase. 
> Ideally, a profile should be added to enable code coverage reporting with a 
> tool like JaCoCo.
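A profile along these lines could look roughly like the following (a sketch, not the actual Arrow pom.xml; the profile id and plugin version are assumptions):

```xml
<!-- Hypothetical Maven profile sketch for JaCoCo coverage reporting -->
<profile>
  <id>code-coverage</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.jacoco</groupId>
        <artifactId>jacoco-maven-plugin</artifactId>
        <version>0.8.7</version>
        <executions>
          <!-- attach the JaCoCo agent before tests run -->
          <execution>
            <goals><goal>prepare-agent</goal></goals>
          </execution>
          <!-- write the HTML/XML report during verify -->
          <execution>
            <id>report</id>
            <phase>verify</phase>
            <goals><goal>report</goal></goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```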





[jira] [Updated] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13866:
--
Description: 
I'm writing a section in the cookbook about calling kernels which don't have R 
bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
{{list_compute_functions()}}.  I tried to call it by searching for it in the 
C++ code, seeing that it requires a TrimOptions class of options, and then saw 
that it has a single parameter, characters.

I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
"abracadabra")), options = list(characters = "ab"))}} to see what would happen, 
which resulted in:
{{Error: Invalid: Attempted to initialize KernelState from null 
FunctionOptions}}

This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
should go through all the compute functions listed via 
{{list_compute_functions()}} and ensure all of them have options implemented.



  was:
I'm writing a section in the cookbook about calling kernels which don't have R 
bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
{{list_compute_functions()}}.  I tried to call it by searching for it in the 
C++ code, seeing that it requires a TrimOptions class of options, and then saw 
that it has a single parameter, characters.

I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
"abracadabra")), options = list(characters = "ab"))}} to see what would happen, 
which resulted in:
{{Error: Invalid: Attempted to initialize KernelState from null 
FunctionOptions}}

This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
should go through all the compute functions listed via 
{{list_compute_functions()}} and ensure all of them have options implemented.

Functions to implement (will create subtasks shortly) are any of these which 
have options to implement (some do not):

* abs
* abs_checked
* acos
* acos_checked
* add
* all
* and
* and_not
* and_not_kleene
* any
* array_filter
* array_take
* ascii_capitalize
* ascii_center
* ascii_is_alnum
* ascii_is_alpha
* ascii_is_decimal
* ascii_is_lower
* ascii_is_printable
* ascii_is_space
* ascii_is_title
* ascii_is_upper
* ascii_lower
* ascii_lpad
* ascii_ltrim
* ascii_ltrim_whitespace
* ascii_reverse
* ascii_rtrim
* ascii_rtrim_whitespace
* ascii_split_whitespace
* ascii_swapcase
* ascii_trim
* ascii_trim_whitespace
* ascii_upper
* asin
* asin_checked
* atan
* atan2
* binary_join
* binary_length
* binary_replace_slice
* bit_wise_and
* bit_wise_not
* bit_wise_or
* bit_wise_xor
* case_when
* ceil
* choose
* coalesce
* cos
* cos_checked
* count_substring
* count_substring_regex
* day
* day_of_year
* divide
* drop_null
* ends_with
* extract_regex
* find_substring
* find_substring_regex
* floor
* hash_any
* hash_count
* hash_count_distinct
* hash_distinct
* hash_mean
* hash_min_max
* hash_product
* hash_sum
* hash_tdigest
* hash_variance
* hour
* if_else
* index
* index_in
* index_in_meta_binary
* is_finite
* is_inf
* iso_calendar
* iso_week
* iso_year
* list_flatten
* list_parent_indices
* list_value_length
* ln
* ln_checked
* log10
* log10_checked
* log1p
* log1p_checked
* log2
* log2_checked
* logb
* logb_checked
* match_substring
* match_substring_regex
* max_element_wise
* mean
* microsecond
* millisecond
* min_max
* minute
* mode
* month
* multiply
* nanosecond
* negate
* negate_checked
* or
* partition_nth_indices
* power
* product
* quarter
* replace_substring_regex
* replace_with_mask
* second
* shift_left
* shift_left_checked
* shift_right
* shift_right_checked
* sign
* sin
* sin_checked
* split_pattern_regex
* starts_with
* stddev
* strftime
* string_is_ascii
* subsecond
* subtract
* sum
* tan
* tan_checked
* tdigest
* trunc
* unique
* utf8_capitalize
* utf8_center
* utf8_is_alnum
* utf8_is_alpha
* utf8_is_decimal
* utf8_is_digit
* utf8_is_lower
* utf8_is_numeric
* utf8_is_printable
* utf8_is_space
* utf8_is_title
* utf8_is_upper
* utf8_lpad
* utf8_ltrim
* utf8_ltrim_whitespace
* utf8_replace_slice
* utf8_reverse
* utf8_rpad
* utf8_rtrim
* utf8_rtrim_whitespace
* utf8_swapcase
* utf8_trim
* utf8_trim_whitespace
* value_counts
* variance
* xor
* year


> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>
> I'm writing a section in the cookbook about calling kernels which don't have 
> R bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
> {{list_compute_functions()}}.  I tried to call it by searching for it in 

[jira] [Updated] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13866:
--
Description: 
I'm writing a section in the cookbook about calling kernels which don't have R 
bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
{{list_compute_functions()}}.  I tried to call it by searching for it in the 
C++ code, seeing that it requires a TrimOptions class of options, and then saw 
that it has a single parameter, characters.

I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
"abracadabra")), options = list(characters = "ab"))}} to see what would happen, 
which resulted in:
{{Error: Invalid: Attempted to initialize KernelState from null 
FunctionOptions}}

This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
should go through all the compute functions listed via 
{{list_compute_functions()}} and ensure all of them have options implemented.

Functions to implement (will create subtasks shortly) are any of these which 
have options to implement (some do not):

* abs
* abs_checked
* acos
* acos_checked
* add
* all
* and
* and_not
* and_not_kleene
* any
* array_filter
* array_take
* ascii_capitalize
* ascii_center
* ascii_is_alnum
* ascii_is_alpha
* ascii_is_decimal
* ascii_is_lower
* ascii_is_printable
* ascii_is_space
* ascii_is_title
* ascii_is_upper
* ascii_lower
* ascii_lpad
* ascii_ltrim
* ascii_ltrim_whitespace
* ascii_reverse
* ascii_rtrim
* ascii_rtrim_whitespace
* ascii_split_whitespace
* ascii_swapcase
* ascii_trim
* ascii_trim_whitespace
* ascii_upper
* asin
* asin_checked
* atan
* atan2
* binary_join
* binary_length
* binary_replace_slice
* bit_wise_and
* bit_wise_not
* bit_wise_or
* bit_wise_xor
* case_when
* ceil
* choose
* coalesce
* cos
* cos_checked
* count_substring
* count_substring_regex
* day
* day_of_year
* divide
* drop_null
* ends_with
* extract_regex
* find_substring
* find_substring_regex
* floor
* hash_any
* hash_count
* hash_count_distinct
* hash_distinct
* hash_mean
* hash_min_max
* hash_product
* hash_sum
* hash_tdigest
* hash_variance
* hour
* if_else
* index
* index_in
* index_in_meta_binary
* is_finite
* is_inf
* iso_calendar
* iso_week
* iso_year
* list_flatten
* list_parent_indices
* list_value_length
* ln
* ln_checked
* log10
* log10_checked
* log1p
* log1p_checked
* log2
* log2_checked
* logb
* logb_checked
* match_substring
* match_substring_regex
* max_element_wise
* mean
* microsecond
* millisecond
* min_max
* minute
* mode
* month
* multiply
* nanosecond
* negate
* negate_checked
* or
* partition_nth_indices
* power
* product
* quarter
* replace_substring_regex
* replace_with_mask
* second
* shift_left
* shift_left_checked
* shift_right
* shift_right_checked
* sign
* sin
* sin_checked
* split_pattern_regex
* starts_with
* stddev
* strftime
* string_is_ascii
* subsecond
* subtract
* sum
* tan
* tan_checked
* tdigest
* trunc
* unique
* utf8_capitalize
* utf8_center
* utf8_is_alnum
* utf8_is_alpha
* utf8_is_decimal
* utf8_is_digit
* utf8_is_lower
* utf8_is_numeric
* utf8_is_printable
* utf8_is_space
* utf8_is_title
* utf8_is_upper
* utf8_lpad
* utf8_ltrim
* utf8_ltrim_whitespace
* utf8_replace_slice
* utf8_reverse
* utf8_rpad
* utf8_rtrim
* utf8_rtrim_whitespace
* utf8_swapcase
* utf8_trim
* utf8_trim_whitespace
* value_counts
* variance
* xor
* year

  was:
I'm writing a section in the cookbook about calling kernels which don't have R 
bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
{{list_compute_functions()}}.  I tried to call it by searching for it in the 
C++ code, seeing that it requires a TrimOptions class of options, and then saw 
that it has a single parameter, characters.

I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
"abracadabra")), options = list(characters = "ab"))}} to see what would happen, 
which resulted in:
{{Error: Invalid: Attempted to initialize KernelState from null 
FunctionOptions}}

This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
should go through all the compute functions listed via 
{{list_compute_functions()}} and ensure all of them have options implemented.


> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>
> I'm writing a section in the cookbook about calling kernels which don't have 
> R bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
> {{list_compute_functions()}}.  I tried to call it by searching for it in 

[jira] [Commented] (ARROW-13799) [R] case_when error handling is capturing strings

2021-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408971#comment-17408971
 ] 

Neal Richardson commented on ARROW-13799:
-

Yep that's ARROW-12632

> [R] case_when error handling is capturing strings
> -
>
> Key: ARROW-13799
> URL: https://issues.apache.org/jira/browse/ARROW-13799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>
> This test, now unskipped since case_when supports string data, fails:
> {code}
>   expect_dplyr_equal(
> input %>%
>   mutate(
> cw = case_when(!(!(!(lgl))) ~ factor(chr), TRUE ~ fct)
>   ) %>%
>   collect(),
> tbl
>   )
> {code}
> On inspection, it seems that `factor(chr)` is hitting `base::factor()`, which 
> tries to call `unique()` on the Expression and that fails with "unique() 
> applies only to vectors". This is getting propagated through to the resulting 
> dataset column because `arrow_eval()` returns a `try-error` on error and 
> `nse_funcs$case_when()` isn't checking for errors.
> cc [~icook]





[jira] [Commented] (ARROW-13799) [R] case_when error handling is capturing strings

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408969#comment-17408969
 ] 

Ian Cook commented on ARROW-13799:
--

Related to this: we have a binding for {{as.factor()}} that calls the 
{{dictionary_encode}} kernel. I would expect that using {{as.factor()}} instead 
of {{factor()}} would solve this, but instead it throws this error:
{code:java}
 Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression 
case_when({1=invert(invert(invert(lgl))), 2=true}, cast(dictionary_encode(chr, 
{null_encoding_behavior=MASK}), {to_type=string, allow_int_overflow=false, 
allow_time_truncate=false, allow_time_overflow=false, 
allow_decimal_truncate=false, allow_float_truncate=false, 
allow_invalid_utf8=false}), cast(fct, {to_type=string, 
allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, 
allow_decimal_truncate=false, allow_float_truncate=false, 
allow_invalid_utf8=false}))  {code}
Reducing this down to a simpler example:
{code:java}
> Table$create(tbl) %>% mutate(as.factor(chr)) %>% collect()

Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression 
dictionary_encode(chr, {null_encoding_behavior=MASK})  {code}
So it looks like something is going wrong with passing the 
{{null_encoding_behavior}} option to the {{dictionary_encode}} kernel.

> [R] case_when error handling is capturing strings
> -
>
> Key: ARROW-13799
> URL: https://issues.apache.org/jira/browse/ARROW-13799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>
> This test, now unskipped since case_when supports string data, fails:
> {code}
>   expect_dplyr_equal(
> input %>%
>   mutate(
> cw = case_when(!(!(!(lgl))) ~ factor(chr), TRUE ~ fct)
>   ) %>%
>   collect(),
> tbl
>   )
> {code}
> On inspection, it seems that `factor(chr)` is hitting `base::factor()`, which 
> tries to call `unique()` on the Expression and that fails with "unique() 
> applies only to vectors". This is getting propagated through to the resulting 
> dataset column because `arrow_eval()` returns a `try-error` on error and 
> `nse_funcs$case_when()` isn't checking for errors.
> cc [~icook]





[jira] [Commented] (ARROW-13750) [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R binding - R

2021-09-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408956#comment-17408956
 ] 

Ian Cook commented on ARROW-13750:
--

Minor nit: over the next few Arrow releases, I think we will get to a state 
where virtually every Arrow compute function that has an analogous R function 
will already have a binding in the arrow R package. So I think it would be best 
to phrase the language in this part of the cookbook to be about functions that 
do not have a binding because there is no natural binding (instead of being 
about functions that do not _yet_ have a binding). We will always have some 
Arrow functions like that.

> [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R 
> binding - R
> --
>
> Key: ARROW-13750
> URL: https://issues.apache.org/jira/browse/ARROW-13750
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nic Crane
>Priority: Major
>






[jira] [Assigned] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13866:
-

Assignee: Nic Crane

> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>
> I'm writing a section in the cookbook about calling kernels which don't have 
> R bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
> {{list_compute_functions()}}.  I tried to call it by searching for it in the 
> C++ code, seeing that it requires a TrimOptions class of options, and then 
> saw that it has a single parameter, characters.
> I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
> "abracadabra")), options = list(characters = "ab"))}} to see what would 
> happen, which resulted in:
> {{Error: Invalid: Attempted to initialize KernelState from null 
> FunctionOptions}}
> This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
> should go through all the compute functions listed via 
> {{list_compute_functions()}} and ensure all of them have options implemented.





[jira] [Created] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-09-02 Thread Nic Crane (Jira)
Nic Crane created ARROW-13866:
-

 Summary: [R] Implement Options for all compute kernels available 
via list_compute_functions
 Key: ARROW-13866
 URL: https://issues.apache.org/jira/browse/ARROW-13866
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nic Crane


I'm writing a section in the cookbook about calling kernels which don't have R 
bindings.  I'm using {{utf8_ltrim}} as an example - it appears when we call 
{{list_compute_functions()}}.  I tried to call it by searching for it in the 
C++ code, seeing that it requires a TrimOptions class of options, and then saw 
that it has a single parameter, characters.

I tried calling {{call_function("utf8_ltrim", Array$create(c("abc", "abacus", 
"abracadabra")), options = list(characters = "ab"))}} to see what would happen, 
which resulted in:
{{Error: Invalid: Attempted to initialize KernelState from null 
FunctionOptions}}

This is because TrimOptions isn't implemented in arrow/r/src/compute.cpp.  We 
should go through all the compute functions listed via 
{{list_compute_functions()}} and ensure all of them have options implemented.





[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Summary: [C++][R] Writing moderate-size parquet files of nested dataframes 
from R slows down/process hangs  (was: Writing moderate-size parquet files of 
nested dataframes from R slows down/process hangs)

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in parquet writes (and ultimately the 
> process just hangs for minutes without completion) while writing 
> moderate-size nested dataframes from R. I have replicated the issue on MacOS 
> and Ubuntu so far.
>  
> An example:
> ```
> testdf <- dplyr::tibble(
>  id = uuid::UUIDgenerate(n = 5000),
>  l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
>  l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
>  )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>  
>  # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
>  # This write does not complete within a few minutes in my testing but throws 
> no errors
>  arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I can't guess at why this is true, but the slowdown is closely tied to row 
> counts:
> ```
>  # screenshot attached; 12ms, 56ms, and 680ms respectively.
> microbenchmark::microbenchmark(
>  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>  times = 5
>  )
> ```
> I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu 
> is
>  R version 4.0.5 (2021-03-31)
>  Platform: x86_64-pc-linux-gnu (64-bit)
>  Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
>  BLAS/LAPACK: 
> /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  [1] stats graphics grDevices utils datasets methods base
> other attached packages:
>  [1] arrow_5.0.0
> And sessionInfo for MacOS is:
>  R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
> Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: 
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
> locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
> attached base packages: [1] stats graphics grDevices utils datasets methods 
> base other attached packages: [1] arrow_5.0.0





[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Component/s: C++

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in parquet writes (and ultimately the 
> process just hangs for minutes without completion) while writing 
> moderate-size nested dataframes from R. I have replicated the issue on MacOS 
> and Ubuntu so far.
>  
> An example:
> ```
> testdf <- dplyr::tibble(
>  id = uuid::UUIDgenerate(n = 5000),
>  l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
>  l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
>  )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>  
>  # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
>  # This write does not complete within a few minutes on my testing but throws 
> no errors
>  arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I can't guess at why this is true, but the slowdown is closely tied to row 
> counts:
> ```
>  # screenshot attached; 12ms, 56ms, and 680ms respectively.
> microbenchmark::microbenchmark(
>  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>  times = 5
>  )
> ```
> I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu 
> is
>  R version 4.0.5 (2021-03-31)
>  Platform: x86_64-pc-linux-gnu (64-bit)
>  Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
>  BLAS/LAPACK: 
> /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  [1] stats graphics grDevices utils datasets methods base
> other attached packages:
>  [1] arrow_5.0.0
> And sessionInfo for MacOS is:
>  R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
> Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: 
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
> locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
> attached base packages: [1] stats graphics grDevices utils datasets methods 
> base other attached packages: [1] arrow_5.0.0





[jira] [Assigned] (ARROW-13750) [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R binding - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13750:
-

Assignee: Nic Crane

> [Doc][Cookbook] Call an Arrow compute function which doesn't yet have an R 
> binding - R
> --
>
> Key: ARROW-13750
> URL: https://issues.apache.org/jira/browse/ARROW-13750
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nic Crane
>Priority: Major
>






[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408929#comment-17408929
 ] 

David Li commented on ARROW-13803:
--

It still doesn't replicate on Linux/x64 or MacOS/x64, unfortunately, so it does 
seem ARM-specific.

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   <lgl>  <lgl>  <int>
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Assigned] (ARROW-13793) [C++] Migrate ORCFileReader to Result

2021-09-02 Thread Junwang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junwang Zhao reassigned ARROW-13793:


Assignee: Junwang Zhao

> [C++] Migrate ORCFileReader to Result
> 
>
> Key: ARROW-13793
> URL: https://issues.apache.org/jira/browse/ARROW-13793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Junwang Zhao
>Priority: Trivial
>  Labels: beginner, easy, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{ORCFileReader}} currently returns Status together with an out-pointer for 
> most methods returning a useful value. New variants of these methods 
> returning a {{Result}} should be created, and the legacy variants be 
> marked deprecated.
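The Status-plus-out-pointer to Result migration described above can be sketched in miniature — in Python, as an analogy only; the names below (read_stripe, Result) are illustrative and are not ORCFileReader's actual C++ API:

```python
# Analogy for the migration pattern, not Arrow's real API. The legacy style
# reports errors via a status value and writes the result through an
# out-parameter; the new style returns one Result object holding either the
# value or the error.

class Result:
    """Holds either a value or an error message, never both."""
    def __init__(self, value=None, error=None):
        self._value, self._error = value, error

    def ok(self):
        return self._error is None

    def value(self):
        if not self.ok():
            raise RuntimeError(self._error)
        return self._value

def read_stripe_legacy(stripe_index, out):
    """Legacy variant: status return + out-parameter (out is a 1-element list)."""
    if stripe_index < 0:
        return "Invalid: negative stripe index"  # non-None status == error
    out[0] = {"stripe": stripe_index}
    return None                                  # None status == OK

def read_stripe(stripe_index):
    """New variant: a single Result carries both outcomes."""
    if stripe_index < 0:
        return Result(error="Invalid: negative stripe index")
    return Result(value={"stripe": stripe_index})
```

The new style removes the out-parameter from every signature, which is why the legacy variants can only be deprecated rather than changed in place.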





[jira] [Updated] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-13865:
---
Description: 
I observed a significant slowdown in parquet writes (and ultimately the process 
just hangs for minutes without completion) while writing moderate-size nested 
dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.

 

An example:

```

testdf <- dplyr::tibble(
 id = uuid::UUIDgenerate(n = 5000),
 l1 = as.list(lapply(1:5000, (function(x) runif(1000)))),
 l2 = as.list(lapply(1:5000, (function(x) rnorm(1000))))
 )

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

 
 # This works

arrow::write_parquet(testdf_long, "testdf_long.parquet")
 # This write does not complete within a few minutes on my testing but throws 
no errors
 arrow::write_parquet(testdf, "testdf.parquet")

```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```
 # screenshot attached; 12ms, 56ms, and 680ms respectively.

microbenchmark::microbenchmark(
 arrow::write_parquet(testdf[1, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
 times = 5
 )

```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is
 R version 4.0.5 (2021-03-31)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 20.04.3 LTS

Matrix products: default
 BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

And sessionInfo for MacOS is:
 R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base other attached packages: [1] arrow_5.0.0

  was:
I observed a significant slowdown in parquet writes (and ultimately the process 
just hangs for minutes without completion) while writing moderate-size nested 
dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.

 

An example:

```

testdf <- dplyr::tibble(
 id = uuid::UUIDgenerate(n = 5000),
 l1 = as.list(lapply(1:5000, (function(x) runif(1000)))),
 l2 = as.list(lapply(1:5000, (function(x) rnorm(1000))))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

 

# This works

arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes on my testing but throws no 
errors
arrow::write_parquet(testdf, "testdf.parquet")

```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```

# screenshot attached; 12ms, 56ms, and 680ms respectively.

microbenchmark::microbenchmark(
 arrow::write_parquet(testdf[1, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
 times = 5
)

```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C   LC_TIME=en_US.UTF-8  
  LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8LC_MESSAGES=C   
  
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C  LC_ADDRESS=C 
  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C 
  

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] arrow_5.0.0

And sessionInfo for MacOS is:
R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) Running 
under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base other attached packages: [1] arrow_5.0.0


> Writing moderate-size parquet 

[jira] [Updated] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-13865:
---
Description: 
I observed a significant slowdown in parquet writes (and ultimately the process 
just hangs for minutes without completion) while writing moderate-size nested 
dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.

 

An example:

```

testdf <- dplyr::tibble(
 id = uuid::UUIDgenerate(n = 5000),
 l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
 l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
 )

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

 
 # This works

arrow::write_parquet(testdf_long, "testdf_long.parquet")
 # This write does not complete within a few minutes on my testing but throws 
no errors
 arrow::write_parquet(testdf, "testdf.parquet")

```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```
 # screenshot attached; 12ms, 56ms, and 680ms respectively.

microbenchmark::microbenchmark(
 arrow::write_parquet(testdf[1, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
 times = 5
 )

```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is
 R version 4.0.5 (2021-03-31)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 20.04.3 LTS

Matrix products: default
 BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

And sessionInfo for MacOS is:
 R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base other attached packages: [1] arrow_5.0.0

  was:
I observed a significant slowdown in parquet writes (and ultimately the process 
just hangs for minutes without completion) while writing moderate-size nested 
dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.

 

An example:

```

testdf <- dplyr::tibble(
 id = uuid::UUIDgenerate(n = 5000),
 l1 = as.list(lapply(1:5000, (function(x) runif(1000)))),
 l2 = as.list(lapply(1:5000, (function(x) rnorm(1000))))
 )

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

 
 # This works

arrow::write_parquet(testdf_long, "testdf_long.parquet")
 # This write does not complete within a few minutes on my testing but throws 
no errors
 arrow::write_parquet(testdf, "testdf.parquet")

```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```
 # screenshot attached; 12ms, 56ms, and 680ms respectively.

microbenchmark::microbenchmark(
 arrow::write_parquet(testdf[1, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
 times = 5
 )

```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is
 R version 4.0.5 (2021-03-31)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 20.04.3 LTS

Matrix products: default
 BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

And sessionInfo for MacOS is:
 R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base other attached packages: [1] arrow_5.0.0


> Writing moderate-size parquet files of nested dataframes from R slows 
> down/process hangs
> 

[jira] [Created] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread John Sheffield (Jira)
John Sheffield created ARROW-13865:
--

 Summary: Writing moderate-size parquet files of nested dataframes 
from R slows down/process hangs
 Key: ARROW-13865
 URL: https://issues.apache.org/jira/browse/ARROW-13865
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 5.0.0
Reporter: John Sheffield
 Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png

I observed a significant slowdown in parquet writes (and ultimately the process 
just hangs for minutes without completion) while writing moderate-size nested 
dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.

 

An example:

```

testdf <- dplyr::tibble(
 id = uuid::UUIDgenerate(n = 5000),
 l1 = as.list(lapply(1:5000, (function(x) runif(1000)))),
 l2 = as.list(lapply(1:5000, (function(x) rnorm(1000))))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

 

# This works

arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes on my testing but throws no 
errors
arrow::write_parquet(testdf, "testdf.parquet")

```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```

# screenshot attached; 12ms, 56ms, and 680ms respectively.

microbenchmark::microbenchmark(
 arrow::write_parquet(testdf[1, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
 arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
 times = 5
)

```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C   LC_TIME=en_US.UTF-8  
  LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8LC_MESSAGES=C   
  
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C  LC_ADDRESS=C 
  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C 
  

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] arrow_5.0.0

And sessionInfo for MacOS is:
R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) Running 
under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base other attached packages: [1] arrow_5.0.0





[jira] [Commented] (ARROW-13743) [CI] OSX job fails due to incompatible git and libcurl

2021-09-02 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408916#comment-17408916
 ] 

Uwe Korn commented on ARROW-13743:
--

The issue here is that {{git}} is pulled from {{pkgs/main}} and not from 
{{conda-forge}}. You should switch to {{channel_priority: strict}} in the conda 
configuration to avoid channel clashes.

> [CI] OSX job fails due to incompatible git and libcurl
> --
>
> Key: ARROW-13743
> URL: https://issues.apache.org/jira/browse/ARROW-13743
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Yibo Cai
>Priority: Major
> Fix For: 6.0.0
>
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=10372=logs=cf796865-97b7-5cd1-be8e-6e00ce4fd8cf=9f7de14c-8ff0-55c4-a998-d852f888262c=15
> [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0
> https://www.mail-archive.com/builds@arrow.apache.org/msg00109.html





[jira] [Comment Edited] (ARROW-13743) [CI] OSX job fails due to incompatible git and libcurl

2021-09-02 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408913#comment-17408913
 ] 

Krisztian Szucs edited comment on ARROW-13743 at 9/2/21, 3:02 PM:
--

I guess we should sync the CI config again with the upstream feedstock. cc 
[~uwe]


was (Author: kszucs):
I guess we should sync the CI config again with the upstream feedstock. cc @uwe

> [CI] OSX job fails due to incompatible git and libcurl
> --
>
> Key: ARROW-13743
> URL: https://issues.apache.org/jira/browse/ARROW-13743
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Yibo Cai
>Priority: Major
> Fix For: 6.0.0
>
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=10372=logs=cf796865-97b7-5cd1-be8e-6e00ce4fd8cf=9f7de14c-8ff0-55c4-a998-d852f888262c=15
> [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0
> https://www.mail-archive.com/builds@arrow.apache.org/msg00109.html





[jira] [Commented] (ARROW-13743) [CI] OSX job fails due to incompatible git and libcurl

2021-09-02 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408913#comment-17408913
 ] 

Krisztian Szucs commented on ARROW-13743:
-

I guess we should sync the CI config again with the upstream feedstock. cc @uwe

> [CI] OSX job fails due to incompatible git and libcurl
> --
>
> Key: ARROW-13743
> URL: https://issues.apache.org/jira/browse/ARROW-13743
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Yibo Cai
>Priority: Major
> Fix For: 6.0.0
>
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=10372=logs=cf796865-97b7-5cd1-be8e-6e00ce4fd8cf=9f7de14c-8ff0-55c4-a998-d852f888262c=15
> [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0
> https://www.mail-archive.com/builds@arrow.apache.org/msg00109.html





[jira] [Updated] (ARROW-13844) [Doc] Document the release process part of the CI

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13844:
--
Description: 
(ensure not to duplicate the release management guide)

Maybe a more high-level overview of the steps for someone with no prior 
knowledge, any diagrams etc that show how stuff fits together. More on the why 
not the what.

May be helpful for someone to shadow Kristian.

> [Doc] Document the release process part of the CI
> -
>
> Key: ARROW-13844
> URL: https://issues.apache.org/jira/browse/ARROW-13844
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nic Crane
>Priority: Major
>
> (ensure not to duplicate the release management guide)
> Maybe a more high-level overview of the steps for someone with no prior 
> knowledge, any diagrams etc that show how stuff fits together. More on the 
> why not the what.
> May be helpful for someone to shadow Kristian.





[jira] [Updated] (ARROW-13844) [Doc] Document the release process part of the CI

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13844:
--
Description: (was: How the )

> [Doc] Document the release process part of the CI
> -
>
> Key: ARROW-13844
> URL: https://issues.apache.org/jira/browse/ARROW-13844
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nic Crane
>Priority: Major
>






[jira] [Updated] (ARROW-13844) [Doc] Document the release process part of the CI

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13844:
--
Description: How the 

> [Doc] Document the release process part of the CI
> -
>
> Key: ARROW-13844
> URL: https://issues.apache.org/jira/browse/ARROW-13844
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nic Crane
>Priority: Major
>
> How the 





[jira] [Updated] (ARROW-13716) [Doc][Cookbook] Creating RecordBatches - Python

2021-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13716:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Creating RecordBatches - Python
> ---
>
> Key: ARROW-13716
> URL: https://issues.apache.org/jira/browse/ARROW-13716
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408900#comment-17408900
 ] 

David Li commented on ARROW-13803:
--

Ok! I can reproduce it, turns out a release build was very important (should've 
thought of that earlier…)

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   <lgl>  <lgl>  <int>
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Commented] (ARROW-13806) [Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval Type

2021-09-02 Thread Tim Swast (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408893#comment-17408893
 ] 

Tim Swast commented on ARROW-13806:
---

Regarding "Python" conversion, we decided in the Python BigQuery client that 
dateutil is widely used (including by pandas) to go with relativedelta for a 
similar conversion from this data type to Python object. 
[https://github.com/googleapis/python-bigquery/pull/840]

The package appears to be widely used and from what I can tell from 
[https://github.com/dateutil/dateutil] no additional transitive dependencies to 
worry about.

That said, a namedtuple or dict where the names match the arguments to 
relativedelta (months, days, microseconds) would be pretty easy to convert to a 
relativedelta if not.
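The namedtuple idea above can be sketched as follows. This is a hedged, stdlib-only sketch: the MonthDayNano name and field layout are illustrative, not pyarrow's actual API, and the nanosecond handling is one possible choice, not a settled design.

```python
# Illustrative sketch, not pyarrow's real API: a namedtuple whose fields
# line up with relativedelta's keyword arguments makes the conversion trivial.

from collections import namedtuple

# Arrow's month/day/nanosecond interval, modeled as a plain namedtuple.
MonthDayNano = namedtuple("MonthDayNano", ["months", "days", "nanoseconds"])

def to_relativedelta_kwargs(interval):
    """Map the interval onto relativedelta's keyword arguments.

    relativedelta has no nanoseconds argument, so we truncate to whole
    microseconds; sub-microsecond precision is lost, and a real conversion
    would need to decide how to surface that."""
    return {
        "months": interval.months,
        "days": interval.days,
        "microseconds": interval.nanoseconds // 1000,
    }

iv = MonthDayNano(months=1, days=15, nanoseconds=3_500_000)
kwargs = to_relativedelta_kwargs(iv)
# dateutil.relativedelta.relativedelta(**kwargs) would then build the object.
```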

> [Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval 
> Type
> ---
>
> Key: ARROW-13806
> URL: https://issues.apache.org/jira/browse/ARROW-13806
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>
> [https://github.com/apache/arrow/pull/10177] has been merged we should 
> support conversion to and from this type for standard python surface areas.





[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408881#comment-17408881
 ] 

David Li commented on ARROW-13803:
--

Thanks, I'll give it a try with bundled dependencies.

I'm testing using the entire dataset already as well.

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   <lgl>              <lgl>                      <int>
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408877#comment-17408877
 ] 

Neal Richardson commented on ARROW-13803:
-

Which years did you test? It's possible there's a data issue in some file 
that's not being handled correctly; I know there are quirks.

I am testing with the Ursa bucket data.

No conda here; the dependency source is AUTO, and I haven't installed much on 
the system, so it's basically bundling everything except lz4 and zlib AFAICT. 
My cmake invocation is:

{code}
cmake \
  -GNinja \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_JEMALLOC=ON \
  -DARROW_JSON=ON \
  -DARROW_PARQUET=ON \
  -DCMAKE_BUILD_TYPE=release \
  -DARROW_INSTALL_NAME_RPATH=OFF \
  -DARROW_S3=ON \
  -DARROW_MIMALLOC=OFF \
  -DARROW_WITH_BROTLI=ON \
  -DARROW_WITH_BZ2=ON \
  -DARROW_WITH_LZ4=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  -DARROW_WITH_ZSTD=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DARROW_BUILD_TESTS=OFF \
  -DARROW_WITH_UTF8PROC=ON \
  ..
{code}

No special compilation flags; cmake reports

{code}
-- CMAKE_C_FLAGS:  -Qunused-arguments -O3 -DNDEBUG  -Wall 
-Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a 
-- CMAKE_CXX_FLAGS:   -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG  
-Wall -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ 
-march=armv8-a 
{code}

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
> frame #0: 0x00013a79d9cc 
> libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long 
> long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
> 0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8 
> 0x13a79d9d4 <+304>: cset   w11, lo
> 0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, I can evaluate those filter expressions just fine, and it only 
> seems to crash if both are provided. And I can count over the data with both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`  n
>   <lgl>              <lgl>                      <int>
> 1 FALSE  FALSE805
> 2 FALSE  TRUE  368680
> 3 TRUE   FALSE5810556
> 4 TRUE   TRUE  1541561340
> {code}





[jira] [Assigned] (ARROW-13718) [Doc][Cookbook] Creating Arrays - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13718:
-

Assignee: Nic Crane

> [Doc][Cookbook] Creating Arrays - R
> ---
>
> Key: ARROW-13718
> URL: https://issues.apache.org/jira/browse/ARROW-13718
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nic Crane
>Priority: Major
>






[jira] [Assigned] (ARROW-13714) [Doc][Cookbook] Sharing data between R and Python - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13714:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Sharing data between R and Python - R
> -
>
> Key: ARROW-13714
> URL: https://issues.apache.org/jira/browse/ARROW-13714
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13711) [Doc][Cookbook] Sending and receiving data over a network using an Arrow Flight RPC server - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13711:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Sending and receiving data over a network using an Arrow 
> Flight RPC server - R
> --
>
> Key: ARROW-13711
> URL: https://issues.apache.org/jira/browse/ARROW-13711
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13724) [Doc][Cookbook] Reading Schemas - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13724:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Reading Schemas - R
> ---
>
> Key: ARROW-13724
> URL: https://issues.apache.org/jira/browse/ARROW-13724
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13718) [Doc][Cookbook] Creating Arrays - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13718:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Creating Arrays - R
> ---
>
> Key: ARROW-13718
> URL: https://issues.apache.org/jira/browse/ARROW-13718
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13720) [Doc][Cookbook] Working with Data Types - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13720:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Working with Data Types - R
> ---
>
> Key: ARROW-13720
> URL: https://issues.apache.org/jira/browse/ARROW-13720
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13722) [Doc][Cookbook] Specifying Schemas - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13722:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Specifying Schemas - R
> --
>
> Key: ARROW-13722
> URL: https://issues.apache.org/jira/browse/ARROW-13722
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13713) [Doc][Cookbook] Reading and Writing Compressed Data - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13713:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Reading and Writing Compressed Data - R
> ---
>
> Key: ARROW-13713
> URL: https://issues.apache.org/jira/browse/ARROW-13713
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13748) [Doc][Cookbook] Work with character data (stringr functions and Arrow functions) - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13748:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Work with character data (stringr functions and Arrow 
> functions) - R
> 
>
> Key: ARROW-13748
> URL: https://issues.apache.org/jira/browse/ARROW-13748
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Assigned] (ARROW-13728) [Doc][Cookbook] Appending Tables to an existing Table - R

2021-09-02 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13728:
-

Assignee: (was: Nic Crane)

> [Doc][Cookbook] Appending Tables to an existing Table - R
> -
>
> Key: ARROW-13728
> URL: https://issues.apache.org/jira/browse/ARROW-13728
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>





