[jira] [Commented] (ARROW-17046) [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table function

2022-07-12 Thread Amir Khosroshahi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566161#comment-17566161
 ] 

Amir Khosroshahi commented on ARROW-17046:
--

Got it, thank you. The pull request has now actually been created: 
https://github.com/apache/arrow/pull/13591
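
For context, the change boils down to forwarding unrecognized keyword 
arguments from write_to_dataset to write_table instead of rejecting them. A 
rough sketch of the pattern (simplified; split_by_partition is a made-up 
placeholder for the partitioning logic, not pyarrow API):

{code:python}
# Illustrative sketch only, not the actual pyarrow implementation.
def write_to_dataset(table, root_path, partition_cols=None, **kwargs):
    for subdir, subtable in split_by_partition(table, partition_cols):
        # 'flavor', 'version', etc. flow through to each per-partition
        # write_table call untouched:
        write_table(subtable, subdir / "part-0.parquet", **kwargs)
{code}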

> [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table 
> function
> --
>
> Key: ARROW-17046
> URL: https://issues.apache.org/jira/browse/ARROW-17046
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Amir Khosroshahi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> According to PyArrow 8.0.0 
> [documentation|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html]
>  {{kwargs}} is "Additional kwargs for {{write_table}} function." However, 
> when I try to pass additional arguments, for example {{flavor}}, to the 
> underlying {{write_table}} I get the following error:
> {code:java}
> TypeError: unexpected parquet write option: flavor{code}
> This used to work in PyArrow versions as late as 7.0.0 but started to break 
> in 8.0.0.
> Minimal example to reproduce the error
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> pq.write_to_dataset(tb, "test.parquet", flavor="spark") {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17046) [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table function

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17046:
---
Labels: pull-request-available  (was: )

> [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table 
> function
> --
>
> Key: ARROW-17046
> URL: https://issues.apache.org/jira/browse/ARROW-17046
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Amir Khosroshahi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> According to PyArrow 8.0.0 
> [documentation|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html]
>  {{kwargs}} is "Additional kwargs for {{write_table}} function." However, 
> when I try to pass additional arguments, for example {{flavor}}, to the 
> underlying {{write_table}} I get the following error:
> {code:java}
> TypeError: unexpected parquet write option: flavor{code}
> This used to work in PyArrow versions as late as 7.0.0 but started to break 
> in 8.0.0.
> Minimal example to reproduce the error
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> pq.write_to_dataset(tb, "test.parquet", flavor="spark") {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17063) [GLib] Add examples to send/receive record batches via network

2022-07-12 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17063.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13590
[https://github.com/apache/arrow/pull/13590]

> [GLib] Add examples to send/receive record batches via network
> --
>
> Key: ARROW-17063
> URL: https://issues.apache.org/jira/browse/ARROW-17063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17046) [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table function

2022-07-12 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566145#comment-17566145
 ] 

Alenka Frim commented on ARROW-17046:
-

Great, now you only need to click on "Create pull request" so that the change 
to the code will be visible on the Apache Arrow repo as a PR.

When setting the title of the pull request, please follow the _second item in 
the guidelines_: 
[https://arrow.apache.org/docs/developers/overview.html#pull-request-and-review]
 so that the PR gets connected to the Jira issue.

> [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table 
> function
> --
>
> Key: ARROW-17046
> URL: https://issues.apache.org/jira/browse/ARROW-17046
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Amir Khosroshahi
>Priority: Minor
>
> According to PyArrow 8.0.0 
> [documentation|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html]
>  {{kwargs}} is "Additional kwargs for {{write_table}} function." However, 
> when I try to pass additional arguments, for example {{flavor}}, to the 
> underlying {{write_table}} I get the following error:
> {code:java}
> TypeError: unexpected parquet write option: flavor{code}
> This used to work in PyArrow versions as late as 7.0.0 but started to break 
> in 8.0.0.
> Minimal example to reproduce the error
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> pq.write_to_dataset(tb, "test.parquet", flavor="spark") {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16791) [R] Expose Azure Blob Storage filesystem

2022-07-12 Thread Dean MacGregor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566128#comment-17566128
 ] 

Dean MacGregor commented on ARROW-16791:


I don't think so, but I don't really know enough about how pyarrow or R arrow 
works behind the scenes to answer your question.

I just know that Azure Blob works in pyarrow, that R arrow works with S3 (or 
at least the documentation is there), and I'm assuming that the difference 
between S3 and Azure isn't too big.

> [R] Expose Azure Blob Storage filesystem
> 
>
> Key: ARROW-16791
> URL: https://issues.apache.org/jira/browse/ARROW-16791
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dean MacGregor
>Priority: Critical
>
> I'd like to see the R arrow package be able to interface with the Azure Blob 
> Storage file system from the AzureStor package.
>  
> In python, pyarrow and adlfs work together so I'd like for AzureStor and 
> arrow under R to also work together.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17063) [GLib] Add examples to send/receive record batches via network

2022-07-12 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17063:
-
Summary: [GLib] Add examples to send/receive record batches via network  
(was: [GLib] Add examples to send/recive record batches via network)

> [GLib] Add examples to send/receive record batches via network
> --
>
> Key: ARROW-17063
> URL: https://issues.apache.org/jira/browse/ARROW-17063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17063) [GLib] Add examples to send/recive record batches via network

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17063:
---
Labels: pull-request-available  (was: )

> [GLib] Add examples to send/recive record batches via network
> -
>
> Key: ARROW-17063
> URL: https://issues.apache.org/jira/browse/ARROW-17063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17063) [GLib] Add examples to send/recive record batches via network

2022-07-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17063:


 Summary: [GLib] Add examples to send/recive record batches via 
network
 Key: ARROW-17063
 URL: https://issues.apache.org/jira/browse/ARROW-17063
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17062) write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()

2022-07-12 Thread Todd West (Jira)
Todd West created ARROW-17062:
-

 Summary: write_feather() in R doesn't interop with 
ArrowFileReader.ReadNextRecordBatch()
 Key: ARROW-17062
 URL: https://issues.apache.org/jira/browse/ARROW-17062
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#, R
Affects Versions: 8.0.0
 Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
Reporter: Todd West
 Fix For: 8.0.1


Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch() 
fails with default settings. This is specific to compressed files (see 
workaround below). It looks like the C# reader parses the batches correctly 
but provides the caller with the compressed versions of the data arrays 
instead of the uncompressed ones. While all of the various Length properties 
are set correctly in C#, the data arrays are too short to contain all of the 
values in the file, the bytes do not match what the decompressed bytes should 
be, and basic data accessors like PrimitiveArray.Values can't be used because 
they throw ArgumentOutOfRangeException. Looking through the C# classes in the 
github repo, there doesn't appear to be a way for the caller to request 
decompression. So I'm guessing decompression is supposed to be automatic but, 
for some reason, isn't.

 

While functionally successful, the workaround of using uncompressed feather 
isn't great as the uncompressed files are bigger than .csv. In my application 
the resulting disk space penalty is hundreds of megabytes compared to the 
footprint of using compressed feather.

 

Simple single-field reprex:

In R (arrow 8.0.0):
{{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test 
lz4.feather")}}

In C# (Apache.Arrow 8.0.0):

{code:c#}
using Apache.Arrow;
using Apache.Arrow.Ipc;
using System.IO;
using System.Runtime.InteropServices;

using FileStream stream = new("test lz4.feather", FileMode.Open, FileAccess.Read, FileShare.Read);
using ArrowFileReader arrowFile = new(stream);

for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch != null; batch = arrowFile.ReadNextRecordBatch())
{
    IArrowArray[] fields = batch.Arrays.ToArray();

    // 15 incorrect values instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)
    ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, double>(((DoubleArray)fields[0]).ValueBuffer.Span);
}
{code}

Workaround in R:

{{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", 
compression = "uncompressed")}}

 

Apologies if this is a known issue. I didn't find anything on a Jira search and 
this isn't included in the [known issues list on github|http://example.com/].
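
As a side check (not part of the original report), reading the compressed 
file back with pyarrow shows the data itself is intact, which points at the 
C# reader rather than the R writer:

{code:python}
import pyarrow.feather as feather

# pyarrow decompresses lz4 feather transparently; all 21 values come back.
table = feather.read_table("test lz4.feather")
print(table.column("value").to_pylist())  # [0.0, 0.05, 0.1, ..., 1.0]
{code}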



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17061) [Python][Substrait] Acero consumer is unable to consume count function from substrait query plan

2022-07-12 Thread Richard Tia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Tia updated ARROW-17061:

Summary: [Python][Substrait] Acero consumer is unable to consume count 
function from substrait query plan  (was: [Python] Acero consumer is unable to 
consume count function from substrait query plan)

> [Python][Substrait] Acero consumer is unable to consume count function from 
> substrait query plan
> 
>
> Key: ARROW-17061
> URL: https://issues.apache.org/jira/browse/ARROW-17061
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Tia
>Priority: Major
>
> SQL
> {code:java}
> SELECT
> o_orderpriority,
> count(*) AS order_count
> FROM
> orders
> GROUP BY
> o_orderpriority{code}
> The substrait plan generated from SQL, using Isthmus.
>  
> substrait count: 
> [https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]
>  
> Running the substrait plan with Acero returns this error:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.aggregate.measures[0].measure) 
> arguments: Cannot find field.  {code}
>  
> From substrait query plan:
> relations[0].root.input.aggregate.measures[0].measure
> {code:java}
> "measure": {
>   "functionReference": 0,
>   "args": [],
>   "sorts": [],
>   "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
>   "outputType": {
> "i64": {
>   "typeVariationReference": 0,
>   "nullability": "NULLABILITY_REQUIRED"
> }
>   },
>   "invocation": "AGGREGATION_INVOCATION_ALL",
>   "arguments": []
> }{code}
> {code:java}
> "extensions": [{
>   "extensionFunction": {
> "extensionUriReference": 1,
> "functionAnchor": 0,
> "name": "count:opt"
>   }
> }],{code}
> Count is a unary function and should be consumable, but isn't in this case.
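
For reference, the aggregation this plan encodes can be expressed directly 
against pyarrow's compute layer, which gives the expected output to compare 
against (a sketch using a small made-up orders table):

{code:python}
import pyarrow as pa

orders = pa.table({"o_orderpriority": ["1-URGENT", "2-HIGH", "1-URGENT"]})

# Equivalent of: SELECT o_orderpriority, count(*) ... GROUP BY
# o_orderpriority. count_all is the zero-argument counterpart of count,
# matching the zero-argument "count:opt" measure in the plan above.
result = orders.group_by("o_orderpriority").aggregate([([], "count_all")])
print(result.to_pydict())
{code}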



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17049) [C++] arrow-compute-expression-benchmark aborts with sanity check failure

2022-07-12 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566088#comment-17566088
 ] 

Yibo Cai commented on ARROW-17049:
--

Same issue as ARROW-17059?

> [C++] arrow-compute-expression-benchmark aborts with sanity check failure
> -
>
> Key: ARROW-17049
> URL: https://issues.apache.org/jira/browse/ARROW-17049
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 9.0.0
>
>
> {code}
> $ ./build-release/relwithdebinfo/arrow-compute-expression-benchmark 
> 2022-07-12T11:56:06+02:00
> Running ./build-release/relwithdebinfo/arrow-compute-expression-benchmark
> Run on (24 X 3800 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x12)
>   L1 Instruction 32 KiB (x12)
>   L2 Unified 512 KiB (x12)
>   L3 Unified 16384 KiB (x4)
> Load Average: 0.44, 3.87, 2.60
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> -
> Benchmark 
>   Time CPU   Iterations
> -
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_simple   
>5734 ns 5733 ns   122775
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_simple 
>9094 ns 9092 ns76172
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_dictionary   
>   12992 ns12989 ns53601
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_dictionary 
>   16395 ns16392 ns42601
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_simple   
>5756 ns 5755 ns   120485
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_simple 
>9197 ns 9195 ns76168
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_dictionary   
>   12875 ns12872 ns54240
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_dictionary 
>   16567 ns16563 ns42539
> BindAndEvaluate/simple_array  
> 255 ns  255 ns  2748813
> BindAndEvaluate/simple_scalar 
> 252 ns  252 ns  2765200
> BindAndEvaluate/nested_array  
>2251 ns 2251 ns   310424
> BindAndEvaluate/nested_scalar 
>2687 ns 2686 ns   261939
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length
> Abandon
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15229) [C++][FlightRPC] Evaluate UCX/RDMA transport performance

2022-07-12 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai reassigned ARROW-15229:


Assignee: Yibo Cai

> [C++][FlightRPC] Evaluate UCX/RDMA transport performance
> 
>
> Key: ARROW-15229
> URL: https://issues.apache.org/jira/browse/ARROW-15229
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, FlightRPC
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> Evaluate FlightRPC UCX transport performance over 100Gb/RDMA network, based 
> on https://github.com/lidavidm/arrow/tree/flight-ucx.
> cc [~lidavidm], it would be great if you could give a stable branch (or 
> commit id) for evaluation purposes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16476) [Dev] TEST_WHEELS=1 verify-release-candidate.sh fails on Arm host

2022-07-12 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-16476.

Resolution: Fixed

Sure, thanks for the reminder.

> [Dev] TEST_WHEELS=1 verify-release-candidate.sh fails on Arm host
> -
>
> Key: ARROW-16476
> URL: https://issues.apache.org/jira/browse/ARROW-16476
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Yibo Cai
>Priority: Major
>
> {code:bash}
> $ TEST_DEFAULT=0 TEST_WHEELS=1 dev/release/verify-release-candidate.sh 8.0.0 3
> ..
> ERROR: 
> pyarrow-8.0.0-cp38-cp38-manylinux_2_12_aarch64.manylinux2010_aarch64.whl is 
> not a supported wheel on this platform.
> {code}
> Looks like there are aarch64 wheels for {{manylinux_2_17}}, but not 
> {{manylinux_2_12}}.
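
As a side note, the set of wheel tags a given interpreter accepts can be 
listed with the packaging library (which pip vendors); on aarch64 the 
manylinux tags start at {{manylinux_2_17}} (manylinux2014), so a 
{{manylinux_2_12}} wheel can never match there:

{code:python}
from packaging.tags import sys_tags

# Prints the tags pip will accept on this machine, most specific first,
# e.g. cp38-cp38-manylinux_2_17_aarch64 on an Arm host.
for tag in list(sys_tags())[:10]:
    print(tag)
{code}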



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17060) [C++] Change AsOfJoinNode to use ExecContext's Memory Pool

2022-07-12 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-17060.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13585
[https://github.com/apache/arrow/pull/13585]

> [C++] Change AsOfJoinNode to use ExecContext's Memory Pool
> --
>
> Key: ARROW-17060
> URL: https://issues.apache.org/jira/browse/ARROW-17060
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ivan Chau
>Assignee: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-12 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566081#comment-17566081
 ] 

Weston Pace commented on ARROW-17056:
-

[~apitrou] is right.  Substrait is just a spec, so bumping to newer versions 
isn't likely to be needed for security reasons.  Plus, Substrait uses protobuf 
and aims to be generally backwards compatible, so a newer producer should be 
fine even if we are targeting an older version as a consumer.  Substrait is 
also iterating rapidly and will tend to have a new version per week for a 
while.  For the time being we should probably only bump Substrait when we 
need some kind of new feature for our own implementation.

> [C++] Bump version of bundled substrait
> ---
>
> Key: ARROW-17056
> URL: https://issues.apache.org/jira/browse/ARROW-17056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> There has been a new substrait version released:
> https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16733) [C++] Bump version of bundled opentelemetry

2022-07-12 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16733.
--
Resolution: Fixed

Issue resolved by pull request 13580
[https://github.com/apache/arrow/pull/13580]

> [C++] Bump version of bundled opentelemetry
> ---
>
> Key: ARROW-16733
> URL: https://issues.apache.org/jira/browse/ARROW-16733
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This includes opentelemetry-cpp and opentelemetry-proto.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17046) [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table function

2022-07-12 Thread Amir Khosroshahi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566036#comment-17566036
 ] 

Amir Khosroshahi commented on ARROW-17046:
--

I created a pull request to the master branch here: 
https://github.com/apache/arrow/compare/master...mirkhosro:arrow:patch-1

> [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table 
> function
> --
>
> Key: ARROW-17046
> URL: https://issues.apache.org/jira/browse/ARROW-17046
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Amir Khosroshahi
>Priority: Minor
>
> According to PyArrow 8.0.0 
> [documentation|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html]
>  {{kwargs}} is "Additional kwargs for {{write_table}} function." However, 
> when I try to pass additional arguments, for example {{flavor}}, to the 
> underlying {{write_table}} I get the following error:
> {code:java}
> TypeError: unexpected parquet write option: flavor{code}
> This used to work in PyArrow versions as late as 7.0.0 but started to break 
> in 8.0.0.
> Minimal example to reproduce the error
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> pq.write_to_dataset(tb, "test.parquet", flavor="spark") {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-13155) [C++] MapGenerator should optionally forward reentrant pressure

2022-07-12 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-13155:
---

Assignee: Weston Pace

> [C++] MapGenerator should optionally forward reentrant pressure
> ---
>
> Key: ARROW-13155
> URL: https://issues.apache.org/jira/browse/ARROW-13155
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the map generator will allow the map function to run in parallel 
> but it will not forward reentrant pressure onto the source generator.  
> Instead it queues requests.
> In some cases this is the right decision (if source is not async reentrant) 
> but in some cases we want it to forward the pressure (so that the entire 
> chain can run in parallel).
> By making it an option when the mapped generator is created we can allow 
> pressure to be forwarded where appropriate.
>  
> Phrasing it another way.  If we have source, map function A, map function B, 
> map function C, and then a reentrant pull we would currently only run C in 
> parallel.
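
As a toy model of the two behaviors (asyncio standing in for Arrow's C++ 
async generators; none of this is Arrow API), the difference is whether a 
reentrant pull reaches the source or waits in a queue:

{code:python}
import asyncio

async def source():
    # Stand-in for an async-reentrant source generator.
    await asyncio.sleep(0.01)
    return 1

def map_gen(src, fn, forward_pressure):
    if forward_pressure:
        # Reentrant pulls each reach the source immediately, so the whole
        # chain can run in parallel (the source must be async reentrant).
        async def pull():
            return fn(await src())
    else:
        # Reentrant pulls queue behind a lock; only the map function
        # itself overlaps, matching the current queueing behavior.
        lock = asyncio.Lock()
        async def pull():
            async with lock:
                item = await src()
            return fn(item)
    return pull

async def main():
    pull = map_gen(source, lambda x: x + 1, forward_pressure=True)
    # Two reentrant pulls in flight at once:
    print(await asyncio.gather(pull(), pull()))  # [2, 2]

asyncio.run(main())
{code}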



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13004) [C++] Allow the creation of future "chains" to better control parallelism

2022-07-12 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace closed ARROW-13004.
---
Resolution: Won't Fix

This is handled better by the exec plan and we are not likely to pursue this 
in the futures API.

> [C++] Allow the creation of future "chains" to better control parallelism
> -
>
> Key: ARROW-13004
> URL: https://issues.apache.org/jira/browse/ARROW-13004
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This is a bit tricky to explain.  ShouldSchedule::Always works well for 
> AddCallback but falls short for Transfer and Then.  An example may explain 
> best.
> Consider three operators, Source, Transform, and Sink.  They are setup as...
> {code:java}
> source_fut = source(); // 1
> transform_fut = source_fut.Then(Transform(), ScheduleAlways); // 2
> sink_fut = transform_fut.Then(Consume()); // 3
> {code}
> The intent is to run Transform + Consume as a single thread task on each item 
> generated by source().  This is what happens if source() is slow.  If 
> source() is fast (let's pretend it's always finished) then this is not what 
> happens.
> Line 2 causes a new thread task to be launched (since source_fut is 
> finished).  It is possible that new thread task can mark transform_fut 
> finished before line 3 is executed by the original thread.  This causes 
> Consume() and Transform() to run on separate threads.
> The solution (at least as best I can come up with) is unfortunately a little 
> complex (though the complexity can be hidden in future/async_generator 
> internals).  Basically, it is worth waiting to schedule until the future 
> chain has had a chance to finish connecting the pressure.  This means a 
> future created with ScheduleAlways is created in an "unconsumed" mode.  Any 
> callbacks that would normally be launched will not be launched until the 
> future switches to "consumed".  Future.Wait(), VisitAsyncGenerator, 
> CollectAsyncGenerator, and some of the async_generator operators would cause 
> the future to be "consumed".  The "consume" signal will need to propagate 
> backwards up the chain so futures will need to keep a reference to their 
> antecedent future.
> This work meshes well with some other improvements I have been considering, 
> in particular, splitting future/promise and restricting futures to a single 
> callback.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17061) [Python] Acero consumer is unable to consume count function from substrait query plan

2022-07-12 Thread Richard Tia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Tia updated ARROW-17061:

Description: 
SQL
{code:java}
SELECT
o_orderpriority,
count(*) AS order_count
FROM
orders
GROUP BY
o_orderpriority{code}
The substrait plan generated from SQL, using Isthmus.

 

substrait count: 

[https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]

 

Running the substrait plan with Acero returns this error:
{code:java}
E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
INVALID_ARGUMENT:(relations[0].root.input.aggregate.measures[0].measure) 
arguments: Cannot find field.  {code}
 

From substrait query plan:

relations[0].root.input.aggregate.measures[0].measure
{code:java}
"measure": {
  "functionReference": 0,
  "args": [],
  "sorts": [],
  "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
  "outputType": {
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_REQUIRED"
}
  },
  "invocation": "AGGREGATION_INVOCATION_ALL",
  "arguments": []
}{code}
{code:java}
"extensions": [{
  "extensionFunction": {
"extensionUriReference": 1,
"functionAnchor": 0,
"name": "count:opt"
  }
}],{code}
Count is a unary function and should be consumable, but isn't in this case.

  was:
SQL
{code:java}
select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from
  '{}'
where
  l_shipdate <= date '1998-12-01' - interval '120' day (3)
group by
  l_returnflag,
  l_linestatus
order by
  l_returnflag,
  l_linestatus {code}
The substrait plan generated from SQL, using Isthmus.

 

substrait count: 

[https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]

 

Running the substrait plan with Acero returns this error:
{code:java}
E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[7].measure)
 arguments: Cannot find field.
 {code}
 

From substrait query plan:

relations[0].root.input.sort.input.aggregate.measures[7].measure
{code:java}
"measure": {
  "functionReference": 7,
  "args": [],
  "sorts": [],
  "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
  "outputType": {
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_REQUIRED"
}
  },
  "invocation": "AGGREGATION_INVOCATION_ALL",
  "arguments": []
} {code}
{code:java}
"extensionFunction": {
  "extensionUriReference": 3,
  "functionAnchor": 7,
  "name": "count:opt"
} {code}
Count is a unary function and should be consumable, but isn't in this case.


> [Python] Acero consumer is unable to consume count function from substrait 
> query plan
> -
>
> Key: ARROW-17061
> URL: https://issues.apache.org/jira/browse/ARROW-17061
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Tia
>Priority: Major
>
> SQL
> {code:java}
> SELECT
> o_orderpriority,
> count(*) AS order_count
> FROM
> orders
> GROUP BY
> o_orderpriority{code}
> The substrait plan generated from SQL, using Isthmus.
>  
> substrait count: 
> [https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]
>  
> Running the substrait plan with Acero returns this error:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.aggregate.measures[0].measure) 
> arguments: Cannot find field.  {code}
>  
> From substrait query plan:
> relations[0].root.input.aggregate.measures[0].measure
> {code:java}
> "measure": {
>   "functionReference": 0,
>   "args": [],
>   "sorts": [],
>   "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
>   "outputType": {
> "i64": {
>   "typeVariationReference": 0,
>   "nullability": "NULLABILITY_REQUIRED"
> }
>   },
>   "invocation": "AGGREGATION_INVOCATION_ALL",
>   "arguments": []
> }{code}
> {code:java}
> "extensions": [{
>   "extensionFunction": {
> "extensionUriReference": 1,
> "functionAnchor": 0,
> "name": "count:opt"
>   }
> }],{code}
> Count is a unary function and should be consumable, but isn't in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17005) [Java] Incorrect results from JDBC Adapter from Postgres of non-nullable column through left join

2022-07-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17005.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13558
[https://github.com/apache/arrow/pull/13558]

> [Java] Incorrect results from JDBC Adapter from Postgres of non-nullable 
> column through left join
> -
>
> Key: ARROW-17005
> URL: https://issues.apache.org/jira/browse/ARROW-17005
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Java
>Reporter: Jonathan Swenson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Unsure whether to consider this a bug or a wish, but the JDBC to Arrow 
> Adapter produces incorrect results when wrapping the postgres driver in 
> certain cases. 
> If you left join a non-nullable column, the column becomes nullable (if the 
> join condition does not match any rows). However, the postgres 
> ResultSetMetaData lies to you and indicates that the column is still 
> non-nullable. 
> When iterating through the data, results come back as null (isNull will 
> return true). 
> However, because of the way that the JDBCConsumer is created, it creates a 
> non-nullable consumer and will not check the nullability of these results. 
> Unfortunately, this results in incorrect data or errors depending on the data 
> types returned. 
> The postgres JDBC team has closed a ticket about this, indicating that it 
> would be impossible for them to return the correct nullability information 
> to the JDBC driver. See: [https://github.com/pgjdbc/pgjdbc/issues/2079]
> An example: 
> Table: 
> ||t1.id||
> |2|
> |3|
> {code:java}
> CREATE TABLE t1 (id integer NOT NULL);
> INSERT INTO t1 VALUES (2), (3);
> {code}
> Query
> {code:java}
> WITH t2 AS (SELECT 1 AS id UNION SELECT 2)
> SELECT 
>   t1.id 
> FROM t2 
> LEFT JOIN t1 on t1.id = t2.id;{code}
> This returns the result set:
> ||id||
> |2|
> |null|
> The ResultSetMetaData indicates that the column is non-nullable (as t1.id is 
> non-nullable) but there is null data in the result. 
> The Arrow Vector that is present after the result set is consumed, looks like 
> this: 
> ||id||
> |2|
> |0|
> ResultSet.getInt(1) will return 0 when the source data is null, with an 
> expectation that you check isNull. 
> The data is incorrect and silently fails potentially leading to clients / 
> consumers getting bad data. 
>  
> In other cases, such as UUID (mapped to UTF-8 vectors) the value will fail to 
> load into arrow due to expecting null data and throwing a NPE when 
> deserializing / converting to bytearrays. 
>  
> I was able to work around this problem by wrapping the postgres JDBC 
> ResultSetMetadata and always forcing the nullability to nullable (or 
> nullability unknown). 
> Unfortunately I don't think there is a great way to solve this, but perhaps 
> some way to configure / override the JDBCConsumer creation would allow for 
> users of this library to override this behavior, however the silent failure 
> and incorrect data might lead to users not noticing. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17059) [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with Invalid: Value lengths differed from ExecBatch length

2022-07-12 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-17059.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13584
[https://github.com/apache/arrow/pull/13584]

> [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with 
> Invalid: Value lengths differed from ExecBatch length
> ---
>
> Key: ARROW-17059
> URL: https://issues.apache.org/jira/browse/ARROW-17059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, C++
>Reporter: Elena Henderson
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  [https://github.com/apache/arrow/pull/13179] causes 
> {{arrow-compute-expression-benchmark}}  to fail with:
> {code:java}
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17057) [Python] S3FileSystem has no parameter for retry strategy

2022-07-12 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566011#comment-17566011
 ] 

Kouhei Sutou commented on ARROW-17057:
--

Do you want to open a pull request with the suggested fix?

> [Python] S3FileSystem has no parameter for retry strategy
> -
>
> Key: ARROW-17057
> URL: https://issues.apache.org/jira/browse/ARROW-17057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 6.0.1, 7.0.0, 8.0.0
>Reporter: Duncan
>Priority: Major
>
> The Python wrapper for S3Fs does not accept a {{retry_strategy}} parameter, 
> but the underlying C++ implementation supports it.
>  
> Python wrapper's constructor arguments:
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_s3fs.pyx#L181]
>  
> C++ base: 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L729]
>  
> The result is that Python users of S3Fs always default to the legacy retry 
> strategy, which is very limited.
>  
> Suggested fix is to allow the Python wrapper to specify a retry strategy to 
> be passed through to the wrapped C++ implementation.
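
To illustrate the shape of the suggested fix (everything below is 
hypothetical; neither the keyword nor the strategy class exists in the 8.0.0 
Python wrapper):

{code:python}
import pyarrow.fs

# Hypothetical API sketch: expose the C++ S3RetryStrategy through a
# retry_strategy keyword. The class name and signature are assumptions.
s3 = pyarrow.fs.S3FileSystem(
    region="us-east-1",
    retry_strategy=pyarrow.fs.AwsStandardS3RetryStrategy(max_attempts=5),
)
{code}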



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN

2022-07-12 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566009#comment-17566009
 ] 

David Li commented on ARROW-17051:
--

Thanks for the help, Raúl. I have it on my list to take a look this week.

> [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
> -
>
> Key: ARROW-17051
> URL: https://issues.apache.org/jira/browse/ARROW-17051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Major
>
> The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 
> C++ ASAN UBSAN*  
> Since Flight and Flight SQL can be built on Ubuntu 20.04, the ASAN UBSAN job 
> will also build with Flight and Flight SQL. This triggers some 
> arrow-flight-sql-test failures like:
> {code:java}
>   [ RUN      ] TestFlightSqlClient.TestGetDbSchemas
> unknown file: Failure
> Unexpected mock function call - taking default action specified at:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:151:
>     Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 
> 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 
> 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 
> @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>)
>           Returns: (nullptr)
> Google Mock tried the following 1 expectation, but it didn't match:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>   Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B 
> 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE 
> BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00>
>            Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure
> Actual function call count doesn't match EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> [  FAILED  ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code}
> The error can be seen here: 
> [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true]
> This is the initial PR that triggered it:
> [https://github.com/apache/arrow/pull/13548]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17004) [Java] Implement Arrow->JDBC prepared statement parameters for arrow-jdbc

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17004:
---
Labels: pull-request-available  (was: )

> [Java] Implement Arrow->JDBC prepared statement parameters for arrow-jdbc
> -
>
> Key: ARROW-17004
> URL: https://issues.apache.org/jira/browse/ARROW-17004
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> arrow-jdbc can turn JDBC ResultSets into Arrow VectorSchemaRoots. However, it 
> would also be useful to have the opposite: bind values from a 
> VectorSchemaRoot to a PreparedStatement for inserting/updating data.
> This is necessary for the ADBC project but isn't ADBC specific, so it could 
> be added to arrow-jdbc. We should also document the type mapping it uses and 
> how to customize the mapping.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17045) [C++] Reject trailing slashes on file path

2022-07-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
---
Labels: break pull-request-available  (was: pull-request-available)

> [C++] Reject trailing slashes on file path
> --
>
> Key: ARROW-17045
> URL: https://issues.apache.org/jira/browse/ARROW-17045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: break, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR moves them all to the behavior of LocalFileSystem: return 
> IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
> endpoint_override="localhost:9001",
> scheme="http",
> anonymous=True,
> retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> gcs.get_file_info(FileSelector("py_test"))
> # [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]
> s3 = pyarrow.fs.S3FileSystem(
> access_key="minioadmin",
> secret_key="minioadmin",
> scheme="http",
> endpoint_override="localhost:9000",
> allow_bucket_creation=True,
> allow_bucket_deletion=True,
> )
> s3.create_dir("py-test")
> # Writing to test.txt with and without slash writes to same file
> with s3.open_output_stream("py-test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with s3.open_output_stream("py-test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> s3.get_file_info(FileSelector("py-test"))
> # [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17045) [C++] Reject trailing slashes on file path

2022-07-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
---
Labels: breaking-api pull-request-available  (was: break 
pull-request-available)

> [C++] Reject trailing slashes on file path
> --
>
> Key: ARROW-17045
> URL: https://issues.apache.org/jira/browse/ARROW-17045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: breaking-api, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR moves them all to the behavior of LocalFileSystem: return 
> IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
> endpoint_override="localhost:9001",
> scheme="http",
> anonymous=True,
> retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> gcs.get_file_info(FileSelector("py_test"))
> # [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]
> s3 = pyarrow.fs.S3FileSystem(
> access_key="minioadmin",
> secret_key="minioadmin",
> scheme="http",
> endpoint_override="localhost:9000",
> allow_bucket_creation=True,
> allow_bucket_deletion=True,
> )
> s3.create_dir("py-test")
> # Writing to test.txt with and without slash writes to same file
> with s3.open_output_stream("py-test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with s3.open_output_stream("py-test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> s3.get_file_info(FileSelector("py-test"))
> # [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17045) [C++] GCS doesn't drop ending slash for files

2022-07-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
---
Description: 
We had several different behaviors when passing in file paths with trailing 
slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
slash, and GCS would keep the trailing slash as part of the file name (later 
creating confusion as the file would be labelled a "directory" in list calls). 
This PR moves them all to the behavior of LocalFileSystem: return IOError.

The R filesystem bindings relied on the behavior provided by S3, so they are 
now modified to trim the trailing slash before passing down to C++.

Here is an example of the differences in behavior between S3 and GCS:

{code:python}
import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
endpoint_override="localhost:9001",
scheme="http",
anonymous=True,
retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")

# Writing to test.txt with and without slash produces a file and a directory!?
with gcs.open_output_stream("py_test/test.txt") as out_stream:
out_stream.write(b"Hello world!")
with gcs.open_output_stream("py_test/test.txt/") as out_stream:
out_stream.write(b"Hello world!")
gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]

s3 = pyarrow.fs.S3FileSystem(
access_key="minioadmin",
secret_key="minioadmin",
scheme="http",
endpoint_override="localhost:9000",
allow_bucket_creation=True,
allow_bucket_deletion=True,
)

s3.create_dir("py-test")

# Writing to test.txt with and without slash writes to same file
with s3.open_output_stream("py-test/test.txt") as out_stream:
out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
out_stream.write(b"Hello world!")
s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
{code}


  was:
There is inconsistent behavior between GCS and S3 when it comes to creating 
files. I'm still not sure whether this is an implementation difference or a 
difference between minio and the GCS testbench.

Example:

{code:python}
import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
endpoint_override="localhost:9001",
scheme="http",
anonymous=True,
retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")
with gcs.open_output_stream("py_test/test.txt") as out_stream:
out_stream.write(b"Hello world!")

with gcs.open_output_stream("py_test/test.txt/") as out_stream:
out_stream.write(b"Hello world!")

gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]

s3 = pyarrow.fs.S3FileSystem(
access_key="minioadmin",
secret_key="minioadmin",
scheme="http",
endpoint_override="localhost:9000",
allow_bucket_creation=True,
allow_bucket_deletion=True,
)

s3.create_dir("py-test")
with s3.open_output_stream("py-test/test.txt") as out_stream:
out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
out_stream.write(b"Hello world!")

s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
{code}


> [C++] GCS doesn't drop ending slash for files
> -
>
> Key: ARROW-17045
> URL: https://issues.apache.org/jira/browse/ARROW-17045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR moves them all to the behavior of LocalFileSystem: return 
> IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
> endpoint_override="localhost:9001",
> scheme="http",
> anonymous=True,
> retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
> 

[jira] [Created] (ARROW-17061) [Python] Acero consumer is unable to consume count function from substrait query plan

2022-07-12 Thread Richard Tia (Jira)
Richard Tia created ARROW-17061:
---

 Summary: [Python] Acero consumer is unable to consume count 
function from substrait query plan
 Key: ARROW-17061
 URL: https://issues.apache.org/jira/browse/ARROW-17061
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Richard Tia


SQL
{code:java}
select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from
  '{}'
where
  l_shipdate <= date '1998-12-01' - interval '120' day (3)
group by
  l_returnflag,
  l_linestatus
order by
  l_returnflag,
  l_linestatus {code}
The substrait plan generated from SQL, using Isthmus.

 

substrait count: 

[https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]

 

Running the substrait plan with Acero returns this error:
{code:java}
E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[7].measure)
 arguments: Cannot find field.
 {code}
 

From substrait query plan:

relations[0].root.input.sort.input.aggregate.measures[7].measure
{code:java}
"measure": {
  "functionReference": 7,
  "args": [],
  "sorts": [],
  "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
  "outputType": {
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_REQUIRED"
}
  },
  "invocation": "AGGREGATION_INVOCATION_ALL",
  "arguments": []
} {code}
{code:java}
"extensionFunction": {
  "extensionUriReference": 3,
  "functionAnchor": 7,
  "name": "count:opt"
} {code}
Count is a unary function and should be consumable, but isn't in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17045) [C++] Reject trailing slashes on file path

2022-07-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
---
Summary: [C++] Reject trailing slashes on file path  (was: [C++] GCS 
doesn't drop ending slash for files)

> [C++] Reject trailing slashes on file path
> --
>
> Key: ARROW-17045
> URL: https://issues.apache.org/jira/browse/ARROW-17045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR moves them all to the behavior of LocalFileSystem: return 
> IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
> endpoint_override="localhost:9001",
> scheme="http",
> anonymous=True,
> retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> gcs.get_file_info(FileSelector("py_test"))
> # [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]
> s3 = pyarrow.fs.S3FileSystem(
> access_key="minioadmin",
> secret_key="minioadmin",
> scheme="http",
> endpoint_override="localhost:9000",
> allow_bucket_creation=True,
> allow_bucket_deletion=True,
> )
> s3.create_dir("py-test")
> # Writing to test.txt with and without slash writes to same file
> with s3.open_output_stream("py-test/test.txt") as out_stream:
> out_stream.write(b"Hello world!")
> with s3.open_output_stream("py-test/test.txt/") as out_stream:
> out_stream.write(b"Hello world!")
> s3.get_file_info(FileSelector("py-test"))
> # [<FileInfo for 'py-test/test.txt': type=FileType.File>]
> {code}
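> For comparison, a minimal sketch of the target behavior (all filesystems are to 
> match LocalFileSystem; the path here is a hypothetical local one):
> {code:python}
> import pyarrow.fs
> local = pyarrow.fs.LocalFileSystem()
> local.create_dir("/tmp/py_test")
> # a trailing slash on a file path should raise an IOError/OSError
> try:
>     with local.open_output_stream("/tmp/py_test/test.txt/") as out_stream:
>         out_stream.write(b"Hello world!")
> except OSError as e:
>     print(e)
> {code}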



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15904) [C++] Support rolling backwards and forwards with temporal arithmetic

2022-07-12 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-15904:
--

Assignee: Rok Mihevc

> [C++] Support rolling backwards and forwards with temporal arithmetic
> -
>
> Key: ARROW-15904
> URL: https://issues.apache.org/jira/browse/ARROW-15904
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Rok Mihevc
>Priority: Blocker
>
> Original description in ARROW-11090: 
> "This should also cover the ability to do with and without rollback (so have 
> the ability to do e.g. 2021-03-30 minus 1 month and either get a null back, 
> or 2021-02-28), plus the ability to specify whether to rollback to the first 
> or last, and whether to preserve or reset the time.)"
> For example, in R, lubridate has the following functionality:
> * {{rollbackward()}} or {{rollback()}} which changes a date to the last day 
> of the previous month or to the first day of the current month
> * {{rollforward()}} which rolls to the last day of the current month or to 
> the first day of the next month.
> * all of the above also offer the option to preserve hms (hours, minutes and 
> seconds) when rolling. 
> This functionality underpins functions such as {{%m-%}} and {{%m+%}} which 
> are used to add or subtract months to a date without exceeding the last day 
> of the new month.
> {code:r}
> library(lubridate)
> jan <- ymd_hms("2010-01-31 03:04:05")
> jan + months(1:3) # Feb 31 and April 31 returned as NA
> #> [1] NA"2010-03-31 03:04:05 UTC"
> #> [3] NA
> # NA "2010-03-31 03:04:05 UTC" NA
> jan %m+% months(1:3) # No rollover
> #> [1] "2010-02-28 03:04:05 UTC" "2010-03-31 03:04:05 UTC"
> #> [3] "2010-04-30 03:04:05 UTC"
> leap <- ymd("2012-02-29")
> "2012-02-29 UTC"
> #> [1] "2012-02-29 UTC"
> leap %m+% years(1)
> #> [1] "2013-02-28"
> leap %m+% years(-1)
> #> [1] "2011-02-28"
> leap %m-% years(1)
> #> [1] "2011-02-28"
> x <- ymd_hms("2019-01-29 01:02:03")
> add_with_rollback(x, months(1))
> #> [1] "2019-02-28 01:02:03 UTC"
> add_with_rollback(x, months(1), preserve_hms = FALSE)
> #> [1] "2019-02-28 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE)
> #> [1] "2019-03-01 01:02:03 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE, preserve_hms = FALSE)
> #> [1] "2019-03-01 UTC"
> {code}
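> For a concrete reading of these semantics, here is a small Python sketch (a 
> hypothetical helper mirroring lubridate's add_with_rollback(); not an Arrow 
> API):
> {code:python}
> import calendar
> from datetime import datetime
> 
> def add_months_with_rollback(dt, months, roll_to_first=False, preserve_time=True):
>     # shift the month, then handle days that exceed the target month's length
>     y, m = divmod(dt.month - 1 + months, 12)
>     year, month = dt.year + y, m + 1
>     last = calendar.monthrange(year, month)[1]
>     if dt.day <= last:
>         day = dt.day
>     elif roll_to_first:
>         # roll forward to the first day of the following month
>         month = month % 12 + 1
>         year += month == 1
>         day = 1
>     else:
>         day = last  # roll back to the last valid day of the month
>     t = (dt.hour, dt.minute, dt.second) if preserve_time else (0, 0, 0)
>     return datetime(year, month, day, *t)
> 
> print(add_months_with_rollback(datetime(2019, 1, 29, 1, 2, 3), 1))
> # 2019-02-28 01:02:03
> {code}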



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-11341) [Python] [Gandiva] Check parameters are not None

2022-07-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-11341:
--

Assignee: Will Jones

> [Python] [Gandiva] Check parameters are not None
> 
>
> Key: ARROW-11341
> URL: https://issues.apache.org/jira/browse/ARROW-11341
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Python
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Most of the functions in Gandiva's Python Expression builder interface 
> currently accept None in their arguments, but will segfault once they are used.
> Example:
> {code:python}
> import pyarrow
> import pyarrow.gandiva as gandiva
> builder = gandiva.TreeExprBuilder()
> field = pyarrow.field('whatever', type=pyarrow.date64())
> date_col = builder.make_field(field)
> func = builder.make_function('less_than_or_equal_to', [date_col, None], 
> pyarrow.bool_())
> condition = builder.make_condition(func)
> # Will segfault on this line:
> gandiva.make_filter(pyarrow.schema([field]), condition)
> {code}
> I think this is just a matter of adding {{not None}} to the appropriate 
> function arguments.
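> A sketch of what that Cython-level change might look like (hypothetical 
> signature, not the actual gandiva.pyx code; Cython's `not None` clause turns a 
> None argument into an up-front TypeError instead of a segfault):
> {code:python}
> def make_function(self, name, list children not None,
>                   DataType return_type not None):
>     ...
> {code}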



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-1614) [C++] Add a Tensor logical value type with constant shape, implemented using ExtensionType

2022-07-12 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-1614:
-

Assignee: Rok Mihevc

> [C++] Add a Tensor logical value type with constant shape, implemented using 
> ExtensionType
> --
>
> Key: ARROW-1614
> URL: https://issues.apache.org/jira/browse/ARROW-1614
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> In an Arrow table, we would like to add support for a column whose cells each 
> contain a tensor value, with all tensors having the same 
> shape/dimensions. These would be stored as a binary value, plus some metadata 
> to store type and shape/strides.
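> A minimal sketch of that idea using the existing Python ExtensionType machinery 
> (FixedShapeTensorType is a hypothetical name, not an existing Arrow type):
> {code:python}
> import pyarrow as pa
> 
> class FixedShapeTensorType(pa.PyExtensionType):
>     def __init__(self, value_type, shape):
>         self._value_type = value_type
>         self._shape = shape
>         n = 1
>         for d in shape:
>             n *= d
>         # storage: one fixed-size list of n values per cell
>         super().__init__(pa.list_(value_type, n))
> 
>     def __reduce__(self):  # lets the type be reconstructed after pickling
>         return FixedShapeTensorType, (self._value_type, self._shape)
> 
> ty = FixedShapeTensorType(pa.float32(), (2, 2))
> storage = pa.array([[1.0, 2.0, 3.0, 4.0]], pa.list_(pa.float32(), 4))
> tensors = pa.ExtensionArray.from_storage(ty, storage)
> {code}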



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14108) [C++] interval_between(timestamptz, timestamptz) -> struct kernel

2022-07-12 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-14108:
--

Assignee: Rok Mihevc

> [C++] interval_between(timestamptz, timestamptz) -> struct kernel
> -
>
> Key: ARROW-14108
> URL: https://issues.apache.org/jira/browse/ARROW-14108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Given two timestamps that each have time zones this function should return an 
> interval from the first timestamp to the second timestamp according to the 
> following rules (postgres rules):
> * 1 day is 24 physical hours (which may not exactly equal 1 calendar day).
> * Intervals returned will never contain a months/years field
> Examples:
> interval_between('2021-03-14 00:00:00 America/Denver', '2021-03-15 00:00:00 
> America/Denver') => { "hours": 23 }
> interval_between('2021-03-14 00:00:00 UTC', '2021-03-15 00:00:00 UTC') => { 
> "days": 1}
> interval_between('2021-03-14 00:00:00 UTC', '2020-03-14 00:00:00 UTC') => { 
> "days": -365}
> If the first timestamp is larger than the second timestamp then the interval 
> will be negative.
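> A small Python sketch of these rules (hypothetical helper; negative intervals 
> would need extra care around floor division; requires Python 3.9+ for zoneinfo):
> {code:python}
> from datetime import datetime, timedelta
> from zoneinfo import ZoneInfo
> 
> def interval_between(a, b):
>     # postgres-style: a "day" is 24 physical hours; never emit months/years
>     delta = b - a
>     days, rem = divmod(delta, timedelta(days=1))
>     return {"days": days, "hours": rem // timedelta(hours=1)}
> 
> denver = ZoneInfo("America/Denver")
> print(interval_between(datetime(2021, 3, 14, tzinfo=denver),
>                        datetime(2021, 3, 15, tzinfo=denver)))
> # {'days': 0, 'hours': 23}  (the DST transition day)
> {code}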



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17060) [C++] Change AsOfJoinNode to use ExecContext's Memory Pool

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17060:
---
Labels: pull-request-available  (was: )

> [C++] Change AsOfJoinNode to use ExecContext's Memory Pool
> --
>
> Key: ARROW-17060
> URL: https://issues.apache.org/jira/browse/ARROW-17060
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ivan Chau
>Assignee: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17060) [C++] Change AsOfJoinNode to use ExecContext's Memory Pool

2022-07-12 Thread Ivan Chau (Jira)
Ivan Chau created ARROW-17060:
-

 Summary: [C++] Change AsOfJoinNode to use ExecContext's Memory Pool
 Key: ARROW-17060
 URL: https://issues.apache.org/jira/browse/ARROW-17060
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ivan Chau
Assignee: Ivan Chau






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16825) [Java] Inclusion of a git.properties resource in Arrow JARs causes confusion in Spring Boot applications

2022-07-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-16825.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13578
[https://github.com/apache/arrow/pull/13578]

> [Java] Inclusion of a git.properties resource in Arrow JARs causes confusion 
> in Spring Boot applications
> 
>
> Key: ARROW-16825
> URL: https://issues.apache.org/jira/browse/ARROW-16825
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 8.0.0
>Reporter: Nils Breunese
>Assignee: David Dali Susanibar Arce
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> By default Spring Boot reads {{classpath:git.properties}} to get information 
> about the Git repository of an application. However, Arrow JARs also include 
> a resource called {{git.properties}}, and this can cause Spring Boot to read 
> the information from one of the Arrow libraries instead of from the 
> application. [This was reported to Spring Boot as an 
> issue|https://github.com/spring-projects/spring-boot/issues/18137], but the 
> Spring Boot developers say that they cannot automatically distinguish the 
> application Git properties from {{git.properties}} resources from 
> dependencies.
> Would you consider omitting {{git.properties}} from Arrow JARs in future 
> releases or will Spring Boot users that (directly or indirectly) use Arrow 
> need to work around this by letting {{git-commit-id-plugin}} generate the 
> application Git properties in an alternative location and configuring Spring 
> Boot to read the information from that alternative location 
> ({{spring.info.git.location}})? Of course other libraries could also cause 
> this issue, but Arrow is the first and only library that I've encountered so 
> far that publishes JARs with a {{git.properties}} resource in them.
> It seems that the fact that Arrow JARs include {{git.properties}} resources 
> also caused ARROW-6361.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16825) [Java] Inclusion of a git.properties resource in Arrow JARs causes confusion in Spring Boot applications

2022-07-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-16825:
-
Summary: [Java] Inclusion of a git.properties resource in Arrow JARs causes 
confusion in Spring Boot applications  (was: Inclusion of a git.properties 
resource in Arrow JARs causes confusion in Spring Boot applications)

> [Java] Inclusion of a git.properties resource in Arrow JARs causes confusion 
> in Spring Boot applications
> 
>
> Key: ARROW-16825
> URL: https://issues.apache.org/jira/browse/ARROW-16825
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 8.0.0
>Reporter: Nils Breunese
>Assignee: David Dali Susanibar Arce
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> By default Spring Boot reads {{classpath:git.properties}} to get information 
> about the Git repository of an application. However, Arrow JARs also include 
> a resource called {{git.properties}}, and this can cause Spring Boot to read 
> the information from one of the Arrow libraries instead of from the 
> application. [This was reported to Spring Boot as an 
> issue|https://github.com/spring-projects/spring-boot/issues/18137], but the 
> Spring Boot developers say that they cannot automatically distinguish the 
> application Git properties from {{git.properties}} resources from 
> dependencies.
> Would you consider omitting {{git.properties}} from Arrow JARs in future 
> releases or will Spring Boot users that (directly or indirectly) use Arrow 
> need to work around this by letting {{git-commit-id-plugin}} generate the 
> application Git properties in an alternative location and configuring Spring 
> Boot to read the information from that alternative location 
> ({{spring.info.git.location}})? Of course other libraries could also cause 
> this issue, but Arrow is the first and only library that I've encountered so 
> far that publishes JARs with a {{git.properties}} resource in them.
> It seems that the fact that Arrow JARs include {{git.properties}} resources 
> also caused ARROW-6361.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16776) [R] dplyr::glimpse method for arrow table and datasets

2022-07-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-16776.
-
Resolution: Fixed

Issue resolved by pull request 13563
[https://github.com/apache/arrow/pull/13563]

> [R] dplyr::glimpse method for arrow table and datasets
> --
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Thomas Mock
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> I can perform exploratory data analysis on large out-of-memory datasets via 
> Arrow + dplyr but in order to print the returned values I have to collect() 
> into memory or send to_duckdb().
>  * compute() - returns number of rows/columns, but no data
>  * collect() - returns data fully into memory, can be combined with head()
>  * to_duckdb() - keeps data out of memory, always returns top 10 rows and all 
> columns, optionally increase/decrease number of printed rows
> While to_duckdb() gives me the ability to do true EDA, it seems 
> counterintuitive to need to send the arrow table over to a duckdb database 
> just to see the glimpse()/head() equivalent.
> My feature request is that there is a dplyr::glimpse() method that will 
> lazily print the first few values of table/dataset. The expected output would 
> be something like the below.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 2…
> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
> Currently glimpse() returns a list output where the majority of the 
> output is unrelated to the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>% 
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' 
> 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     .class_title: function () 
> #>     clone: function (deep = FALSE) 
> #>     files: active binding
> #>     filesystem: active binding
> #>     format: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     metadata: active binding
> #>     NewScan: function () 
> #>     num_cols: active binding
> #>     num_rows: active binding
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: active binding
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: active binding 
> #>  $ cyl :List of 11
> #>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' 
> #>   Inherits from: 
> #>   Public:
> #>     .:xp:.: externalptr
> #>     cast: function (to_type, safe = TRUE, ...) 
> #>     clone: function (deep = FALSE) 
> #>     Equals: function (other, ...) 
> #>     field_name: active binding
> #>     initialize: function (xp) 
> #>     invalidate: function () 
> #>     pointer: function () 
> #>     print: function (...) 
> #>     schema: Schema, ArrowObject, R6
> #>     set_pointer: function (xp) 
> #>     ToString: function () 
> #>     type: function (schema = self$schema) 

[jira] [Updated] (ARROW-17059) [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with Invalid: Value lengths differed from ExecBatch length

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17059:
---
Labels: pull-request-available  (was: )

> [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with 
> Invalid: Value lengths differed from ExecBatch length
> ---
>
> Key: ARROW-17059
> URL: https://issues.apache.org/jira/browse/ARROW-17059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, C++
>Reporter: Elena Henderson
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  [https://github.com/apache/arrow/pull/13179] causes 
> {{arrow-compute-expression-benchmark}}  to fail with:
> {code:java}
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565989#comment-17565989
 ] 

Antoine Pitrou commented on ARROW-17056:


No, it's ok. However, I'm not sure we want to bump Substrait blindly as the 
newer versions may have incompatible changes. cc [~westonpace]

> [C++] Bump version of bundled substrait
> ---
>
> Key: ARROW-17056
> URL: https://issues.apache.org/jira/browse/ARROW-17056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> There has been a new substrait version released:
> https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17059) [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with Invalid: Value lengths differed from ExecBatch length

2022-07-12 Thread Elena Henderson (Jira)
Elena Henderson created ARROW-17059:
---

 Summary: [C++] Archery cpp-micro 
arrow-compute-expression-benchmark fails with Invalid: Value lengths differed 
from ExecBatch length
 Key: ARROW-17059
 URL: https://issues.apache.org/jira/browse/ARROW-17059
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery, C++
Reporter: Elena Henderson
Assignee: Sasha Krassovsky


 [https://github.com/apache/arrow/pull/13179] causes 
{{arrow-compute-expression-benchmark}}  to fail with:
{code:java}
-- Arrow Fatal Error --

Invalid: Value lengths differed from ExecBatch length {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset

2022-07-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12688:

Component/s: (was: C++)

> [R] Use DuckDB to query an Arrow Dataset
> 
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide 
> that struct from R, presumably DuckDB could query on that stream. 
> A first step is just connecting the pieces. A second step would be to handle 
> parts of the DuckDB query and push down filtering/projection to Arrow. 
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
>   # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a 
> RecordBatchReader 
>   reader <- Scanner$create(.data)$ToRecordBatchReader()
>   # ARROW-12689: [R] Implement ArrowArrayStream C interface
>   stream_ptr <- allocate_arrow_array_stream()
>   on.exit(delete_arrow_array_stream(stream_ptr))
> ExportRecordBatchReader(reader, stream_ptr)
>   # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
>   duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of 
> https://arrow.apache.org/docs/r/articles/dataset.html):
> {code}
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> ds %>%
>   filter(total_amount > 100, year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
>   to_duckdb() %>%
>   group_by(passenger_count) %>%
>   summarise(
> median_tip_pct = median(tip_pct),
> n = n()
>   )
> {code}
> and duckdb would do the aggregation, while data reading, predicate 
> pushdown, filtering, and projection would happen in Arrow. Or you could do 
> {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data. 
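> The same flow is already expressible from Python, which makes a useful reference 
> point (a sketch with toy data; duckdb scans the registered Arrow table in place):
> {code:python}
> import duckdb
> import pyarrow as pa
> 
> tb = pa.table({"passenger_count": [1, 2, 2], "tip_pct": [10.0, 12.5, 20.0]})
> con = duckdb.connect()
> con.register("tips", tb)  # duckdb reads the Arrow table without copying it
> print(con.execute(
>     "SELECT passenger_count, median(tip_pct) AS median_tip_pct, count(*) AS n "
>     "FROM tips GROUP BY passenger_count"
> ).fetch_arrow_table())
> {code}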



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind

2022-07-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565977#comment-17565977
 ] 

Antoine Pitrou commented on ARROW-17041:


I'll take a look.

> [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
> 
>
> Key: ARROW-17041
> URL: https://issues.apache.org/jira/browse/ARROW-17041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Critical
>  Labels: Nightly
> Fix For: 9.0.0
>
>
> There seems to be an issue on the arrow-compute-scalar-test as it has been 
> failing for the last days, example: 
> [https://github.com/ursacomputing/crossbow/runs/7274655770]
> See [https://crossbow.voltrondata.com/]
> Error:
> {code:java}
> ==13125== 
> ==13125== HEAP SUMMARY:
> ==13125== in use at exit: 16,090 bytes in 161 blocks
> ==13125==   total heap usage: 14,612,979 allocs, 14,612,818 frees, 
> 2,853,741,784 bytes allocated
> ==13125== 
> ==13125== LEAK SUMMARY:
> ==13125==definitely lost: 0 bytes in 0 blocks
> ==13125==indirectly lost: 0 bytes in 0 blocks
> ==13125==  possibly lost: 0 bytes in 0 blocks
> ==13125==still reachable: 16,090 bytes in 161 blocks
> ==13125== suppressed: 0 bytes in 0 blocks
> ==13125== Reachable blocks (those to which a pointer was found) are not shown.
> ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> ==13125== 
> ==13125== Use --track-origins=yes to see where uninitialised values come from
> ==13125== For lists of detected and suppressed errors, rerun with: -s
> ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from 
> 44) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17058) Timezone aware parquet read with schema and filters

2022-07-12 Thread Jira
Blaž Zupančič created ARROW-17058:
-

 Summary: Timezone aware parquet read with schema and filters
 Key: ARROW-17058
 URL: https://issues.apache.org/jira/browse/ARROW-17058
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Blaž Zupančič
 Attachments: output.txt, pyarrow_bug.py, spark-3.1.parquet, 
spark-3.2.parquet, spark_parquet.py

The parquet.read_table() method in pyarrow 8.0.0 added a `schema` parameter, which 
is great for handling timestamps: they are correctly converted from UTC 
to the timezone specified in the schema.

However, when `schema` is used together with `filters`, timezone conversion 
fails with a "Cannot compare timestamp with timezone to timestamp without 
timezone" error. This was tested on two files created with different versions of 
Spark. The test code, files, and the output are attached.
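
The failing call shape looks roughly like this (a sketch; the file name, column 
name, and timezone are hypothetical stand-ins for the attached test code):
{code:python}
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("ts", pa.timestamp("us", tz="Europe/Ljubljana"))])

# schema alone: timestamps are converted from UTC to the schema's timezone
tb = pq.read_table("spark.parquet", schema=schema)

# schema combined with filters: raises "Cannot compare timestamp with
# timezone to timestamp without timezone"
tb = pq.read_table("spark.parquet", schema=schema,
                   filters=[("ts", ">=", datetime(2022, 1, 1))])
{code}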



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17005) [Java] Incorrect results from JDBC Adapter from Postgres of non-nullable column through left join

2022-07-12 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565931#comment-17565931
 ] 

Jonathan Swenson commented on ARROW-17005:
--

Excellent, I'll play with removing my workaround and moving to this in the near 
future. 

On another note, the Redshift JDBC driver also seems to exhibit this same 
problem (likely because it is very close to the postgres JDBC driver / 
implementation). 

> [Java] Incorrect results from JDBC Adapter from Postgres of non-nullable 
> column through left join
> -
>
> Key: ARROW-17005
> URL: https://issues.apache.org/jira/browse/ARROW-17005
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Java
>Reporter: Jonathan Swenson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Unsure whether to consider this a bug or a wish, but the JDBC to Arrow Adapter 
> produces incorrect results when wrapping the postgres driver in certain cases. 
> If you left join a non-nullable column, the column becomes nullable (if the 
> join condition does not match any rows). However, the postgres 
> ResultSetMetaData lies to you and indicates that the column is still 
> non-nullable. 
> When iterating through the data, results come back as null (isNull will 
> return true). 
> However, because of the way that the JDBCConsumer is created, it creates a 
> non-nullable consumer and will not check the nullability of these results. 
> Unfortunately, this results in incorrect data or errors depending on the data 
> types returned. 
> The postgres JDBC team has closed a ticket about this indicating that it 
> would be impossible for them to return the correct data nullability data to 
> the JDBC driver. see: [https://github.com/pgjdbc/pgjdbc/issues/2079]
> An example: 
> Table: 
> ||t1.id||
> |2|
> |3|
> {code:java}
> CREATE TABLE t1 (id integer NOT NULL);
> INSERT INTO t1 VALUES (2), (3);
> {code}
> Query
> {code:java}
> WITH t2 AS (SELECT 1 AS id UNION SELECT 2)
> SELECT 
>   t1.id 
> FROM t2 
> LEFT JOIN t1 on t1.id = t2.id;{code}
> This returns the result set:
> ||id||
> |2|
> |null|
> The ResultSetMetaData indicates that the column is non-nullable (as t1.id is 
> non-nullable) but there is null data in the result. 
> The Arrow Vector that is present after the result set is consumed, looks like 
> this: 
> ||id||
> |2|
> |0|
> ResultSet.getInt(1) will return 0 when the source data is null, with an 
> expectation that you check isNull. 
> The data is incorrect and silently fails potentially leading to clients / 
> consumers getting bad data. 
>  
> In other cases, such as UUID (mapped to UTF-8 vectors), the value will fail to 
> load into arrow: the consumer does not expect null data and throws an NPE when 
> deserializing / converting to byte arrays. 
>  
> I was able to work around this problem by wrapping the postgres JDBC 
> ResultSetMetadata and always forcing the nullability to nullable (or 
> nullability unknown). 
> Unfortunately I don't think there is a great way to solve this, but perhaps 
> some way to configure / override the JDBCConsumer creation would allow for 
> users of this library to override this behavior, however the silent failure 
> and incorrect data might lead to users not noticing. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16807:
---
Labels: pull-request-available  (was: )

> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: > arrow::arrow_info()
> Arrow package version: 8.0.0.9000
> Capabilities:
>
> dataset     TRUE
> substrait  FALSE
> parquet     TRUE
> json        TRUE
> s3          TRUE
> utf8proc    TRUE
> re2         TRUE
> snappy      TRUE
> gzip        TRUE
> brotli      TRUE
> zstd        TRUE
> lz4         TRUE
> lz4_frame   TRUE
> lzo        FALSE
> bz2         TRUE
> jemalloc    TRUE
> mimalloc   FALSE
> Memory:
>
> Allocator   jemalloc
> Current     37.25 Kb
> Max        925.42 Kb
> Runtime:
> 
> SIMD Level           none
> Detected SIMD Level  none
> Build:
>  
> C++ Library Version   9.0.0-SNAPSHOT
> C++ Compiler              AppleClang
> C++ Compiler Version  13.1.6.13160021
> Git ID    d9d78946607f36e25e9d812a5cc956bd00ab2bc9
>Reporter: Edward Visel
>Assignee: Aldrin M
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0, 8.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When reading from parquet files with multiple row groups, {{count_distinct}} 
> (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
> {code:r}
> library(dplyr, warn.conflicts = FALSE)
> path <- tempfile(fileext = '.parquet')
> arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
> ds <- arrow::open_dataset(path)
> ds %>% count(sex) %>% collect()
> #> # A tibble: 5 × 2
> #>   sex                n
> #>   <chr>          <int>
> #> 1 male              60
> #> 2 none               6
> #> 3 female            16
> #> 4 hermaphroditic     1
> #> 5 <NA>               4
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    19
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> # correct
> ds %>% collect() %>% summarise(n = n_distinct(sex))
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1     5
> {code}
> If the file is stored as a single row group, results are correct. When 
> grouped, results are correct.
> I can reproduce this in Python as well using the same file and 
> {{pyarrow.compute.count_distinct}}:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> pa.__version__
> #> 8.0.0
> starwars = 
> pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
> pa.compute.count_distinct(starwars.column('sex')).as_py()
> #> 15
> pa.compute.unique(starwars.column('sex'))
> #> [
> #>   "male",
> #>   "none",
> #>   "female",
> #>   "hermaphroditic",
> #>   null
> #> ]
> {code}
> This seems likely to be the same problem in this StackOverflow question: 
> https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
>  which is working from orc files.
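> A self-contained sketch of the same reproduction (hypothetical file name; on an 
> affected build the first count comes out larger than the true distinct count):
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
> 
> # write many small row groups so the aggregation has to merge partial states
> tb = pa.table({"sex": ["male", "none", "female", "hermaphroditic", None] * 20})
> pq.write_table(tb, "test.parquet", row_group_size=20)
> 
> col = pq.read_table("test.parquet").column("sex")
> print(pc.count_distinct(col))  # should be 4 (nulls excluded by default)
> print(pc.unique(col))          # the 4 values plus null
> {code}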



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-07-12 Thread Robert On (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565922#comment-17565922
 ] 

Robert On edited comment on ARROW-16904 at 7/12/22 4:28 PM:


Confirmed that the nightly build of arrow (v8.0.0.20220712) computes the correct 
minimums 100 out of 100 times:
{code:java}
> sapply(1:100, function(x) {
+   # create parquet file with a single column with numbers 1 to 1,000,000
+   arrow::write_parquet(
+     data.frame(val = 1:1e6), "test.parquet")
+ 
+   arrow::open_dataset("test.parquet") %>%
+     dplyr::summarise(min_val = min(val)) %>%
+     dplyr::collect() %>% dplyr::pull(min_val)
+ }) %>% table()
.
  1 
100{code}


was (Author: JIRAUSER291598):
Confirmed that nightly build of arrow (v8.0.0.20220712) computes the correct 
minimums.
> sapply(1:100, function(x) {
+   # create parquet file with a single column with numbers 1 to 1,000,000
+   arrow::write_parquet(
+     data.frame(val = 1:1e6), "test.parquet")
+ 
+   arrow::open_dataset("test.parquet") %>%
+     dplyr::summarise(min_val = min(val)) %>%
+     dplyr::collect() %>% dplyr::pull(min_val)
+ }) %>% table()
.
  1 
100

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Assignee: Aldrin M
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The following code produces non-deterministic results when computing the 
> minimum of a sequence of 1e5 and of 1e6 integers.
> {code:r}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) 
> all 100 times.
> The second 100 simulations, using numbers 1 to 1e6, only find the minimum value 
> (1) 65 out of 100 times; the remaining runs return 131073, 262145, and 393217 
> (values just past multiples of 131072) 25, 8, and 2 times respectively.
> {code:r}
> .
>   1 
> 100 
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-07-12 Thread Robert On (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565922#comment-17565922
 ] 

Robert On edited comment on ARROW-16904 at 7/12/22 4:28 PM:


Confirmed that the nightly build of arrow (v8.0.0.20220712) computes the correct 
minimums 100 out of 100 times:
{code:java}
> sapply(1:100, function(x) {
+   # create parquet file with a single column with numbers 1 to 1,000,000
+   arrow::write_parquet(
+     data.frame(val = 1:1e6), "test.parquet")
+ 
+   arrow::open_dataset("test.parquet") %>%
+     dplyr::summarise(min_val = min(val)) %>%
+     dplyr::collect() %>% dplyr::pull(min_val)
+ }) %>% table()
.
  1 
100{code}


was (Author: JIRAUSER291598):
Confirmed that nightly build of arrow (v8.0.0.20220712) computes the correct 
minimums 100 out of 100 times:
{code:java}
> sapply(1:100, function(x) {
+   # create parquet file with a single column with numbers 1 to 1,000,000
+   arrow::write_parquet(
+     data.frame(val = 1:1e6), "test.parquet")
+ 
+   arrow::open_dataset("test.parquet") %>%
+     dplyr::summarise(min_val = min(val)) %>%
+     dplyr::collect() %>% dplyr::pull(min_val)
+ }) %>% table()
.
  1 
100{code}

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Assignee: Aldrin M
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The following code produces non-deterministic results when computing the 
> minimum of a sequence of 1e5 and of 1e6 integers.
> {code:r}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) 
> all 100 times.
> The second 100 simulations, using numbers 1 to 1e6, only find the minimum value 
> (1) 65 out of 100 times; the remaining runs return 131073, 262145, and 393217 
> (values just past multiples of 131072) 25, 8, and 2 times respectively.
> {code:r}
> .
>   1 
> 100 
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-07-12 Thread Robert On (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565922#comment-17565922
 ] 

Robert On commented on ARROW-16904:
---

Confirmed that nightly build of arrow (v8.0.0.20220712) computes the correct 
minimums.
> sapply(1:100, function(x) {
+   # create parquet file with a single column with numbers 1 to 1,000,000
+   arrow::write_parquet(
+     data.frame(val = 1:1e6), "test.parquet")
+ 
+   arrow::open_dataset("test.parquet") %>%
+     dplyr::summarise(min_val = min(val)) %>%
+     dplyr::collect() %>% dplyr::pull(min_val)
+ }) %>% table()
.
  1 
100

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Assignee: Aldrin M
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The following code produces non-deterministic results when computing the 
> minimum of a sequence of 1e5 and of 1e6 integers.
> {code:r}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) 
> all 100 times.
> The second 100 simulations, using numbers 1 to 1e6, only find the minimum value 
> (1) 65 out of 100 times; the remaining runs return 131073, 262145, and 393217 
> (values just past multiples of 131072) 25, 8, and 2 times respectively.
> {code:r}
> .
>   1 
> 100 
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17057) [Python] S3FileSystem has no parameter for retry strategy

2022-07-12 Thread Duncan (Jira)
Duncan created ARROW-17057:
--

 Summary: [Python] S3FileSystem has no parameter for retry strategy
 Key: ARROW-17057
 URL: https://issues.apache.org/jira/browse/ARROW-17057
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0, 7.0.0, 6.0.1
Reporter: Duncan


The Python wrapper for S3FileSystem does not accept a {{retry_strategy}} parameter, 
but the underlying C++ implementation supports it.

 

Python wrapper's constructor arguments:
[https://github.com/apache/arrow/blob/master/python/pyarrow/_s3fs.pyx#L181]
 
C++ base: 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L729]
 
The result is that Python users of S3FileSystem always default to the legacy retry 
strategy, which is very limited.
 
Suggested fix is to allow the Python wrapper to specify a retry strategy to be 
passed through to the wrapped C++ implementation.
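
A sketch of what the suggested fix would enable (note: {{retry_strategy}} is not 
an actual S3FileSystem parameter as of 8.0.0, and the strategy class shown is a 
hypothetical name for whatever wrapper ends up exposing the C++ RetryStrategy):
{code:python}
from pyarrow.fs import S3FileSystem

# hypothetical API once the C++ retry strategy is exposed through the wrapper
fs = S3FileSystem(
    region="us-east-1",
    retry_strategy=AwsStandardS3RetryStrategy(max_attempts=5),  # hypothetical
)
{code}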



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16734) [C++] Bump version of bundled protobuf

2022-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16734:
---
Labels: pull-request-available  (was: )

> [C++] Bump version of bundled protobuf
> --
>
> Key: ARROW-16734
> URL: https://issues.apache.org/jira/browse/ARROW-16734
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565893#comment-17565893
 ] 

Raúl Cumplido commented on ARROW-17056:
---

[~apitrou] I've created this as a subtask on the "Bump versions of bundled 
dependencies" let me know if you would prefer this to be separate.

> [C++] Bump version of bundled substrait
> ---
>
> Key: ARROW-17056
> URL: https://issues.apache.org/jira/browse/ARROW-17056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> There has been a new substrait version released:
> https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17055) [Java][FlightSql] flight-core and flight-sql jars delivering same class names

2022-07-12 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565892#comment-17565892
 ] 

David Li commented on ARROW-17055:
--

Yup, those classes shouldn't make it into both jars. (What do you use for this 
check? Ideally we'd set up a similar one.)

CC [~dsusanibara] 

> [Java][FlightSql] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Priority: Minor
>
> Hello. I am trying to adopt arrow flight sql. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight sql dependency to our project, a warning is raised that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto get generated into both 
> the flight-sql and flight-core jars. Since these classes get generated in 
> flight-core, and flight-sql is dependent on flight-core, can the generation 
> of Flight.java and FlightServiceGrpc.java be removed from flight-sql and 
> instead rely on it to be pulled directly from flight-core?
>  
> thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names

2022-07-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17055:
-
Summary: [Java][FlightRPC] flight-core and flight-sql jars delivering same 
class names  (was: [Java][FlightSql] flight-core and flight-sql jars delivering 
same class names)

> [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Priority: Minor
>
> Hello. I am trying to adopt arrow flight sql. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight sql dependency to our project, a warning is raised that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto get generated into both 
> the flight-sql and flight-core jars. Since these classes get generated in 
> flight-core, and flight-sql is dependent on flight-core, can the generation 
> of Flight.java and FlightServiceGrpc.java be removed from flight-sql and 
> instead rely on it to be pulled directly from flight-core?
>  
> thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-12 Thread Jira
Raúl Cumplido created ARROW-17056:
-

 Summary: [C++] Bump version of bundled substrait
 Key: ARROW-17056
 URL: https://issues.apache.org/jira/browse/ARROW-17056
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Raúl Cumplido
 Fix For: 9.0.0


There has been a new substrait version released:

https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17055) [Java][FlightSql] flight-core and flight-sql jars delivering same class names

2022-07-12 Thread Kevin Bambrick (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Bambrick updated ARROW-17055:
---
Description: 
Hello. I am trying to adopt arrow flight sql. We have a check in our build to 
make sure that there are no overlapping class files in our project. When adding 
the flight sql dependency to our project, a warning is raised that flight-sql and 
flight-core overlap and the jars deliver the same class files.


{code:java}
Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
[org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
 
It seems that the classes generated from Flight.proto get generated into both 
the flight-sql and flight-core jars. Since these classes get generated in 
flight-core, and flight-sql is dependent on flight-core, can the generation of 
Flight.java and FlightServiceGrpc.java be removed from flight-sql and instead 
rely on it to be pulled directly from flight-core?
 
thanks in advance!

  was:
Hello. I am trying to adopt arrow flight sql. We have a check in our build to 
make sure that there are no overlapping class files in our project. When adding 
the flight sql dependency to our project, a warning is raised that flight-sql and 
flight-core overlap and the jars deliver the same class files.
Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
[org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class
 
It seems that the classes generated from Flight.proto get generated into both 
the flight-sql and flight-core jars. Since these classes get generated in 
flight-core, and flight-sql is dependent on flight-core, can the generation of 
Flight.java and FlightServiceGrpc.java be removed from flight-sql and instead 
rely on it to be pulled directly from flight-core?
 
thanks in advance!


> [Java][FlightSql] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Priority: Minor
>
> Hello. I am trying to adopt arrow flight sql. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight sql dependency to our project, a warning is raised that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto get generated into both 
> the flight-sql and flight-core jars. Since these classes get generated in 
> flight-core, and flight-sql is dependent on flight-core, can the generation 
> of Flight.java and FlightServiceGrpc.java be removed from flight-sql and 
> instead rely on it to be pulled directly from flight-core?
>  
> thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17055) [Java][FlightSql] flight-core and flight-sql jars delivering same class names

2022-07-12 Thread Kevin Bambrick (Jira)
Kevin Bambrick created ARROW-17055:
--

 Summary: [Java][FlightSql] flight-core and flight-sql jars 
delivering same class names
 Key: ARROW-17055
 URL: https://issues.apache.org/jira/browse/ARROW-17055
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Java
Reporter: Kevin Bambrick


Hello. I am trying to adopt arrow flight sql. We have a check in our build to 
make sure that there are no overlapping class files in our project. When adding 
the flight sql dependency to our project, a warning is raised that flight-sql and 
flight-core overlap and the jars deliver the same class files.
Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
[org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class
 
It seems that the classes generated from Flight.proto get generated into both 
the flight-sql and flight-core jars. Since these classes get generated in 
flight-core, and flight-sql is dependent on flight-core, can the generation of 
Flight.java and FlightServiceGrpc.java be removed from flight-sql and instead 
rely on it to be pulled directly from flight-core?
 
thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0

2022-07-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565868#comment-17565868
 ] 

Antoine Pitrou edited comment on ARROW-17054 at 7/12/22 3:07 PM:
-

Interesting. I suspect that there's some incorrect casting along the way. This 
doesn't happen with PyArrow.

[~paleolimbot]

{code:python}
>>> import numpy as np
>>> arr = np.zeros(2**31, dtype='bool')
>>> arr.dtype
dtype('bool')
>>> arr.nbytes
2147483648
>>> 
>>> parr = pa.array(arr)
>>> parr

[
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  ...
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false
]
>>> len(parr)
2147483648
{code}


was (Author: pitrou):
Interesting. I suspect that there's some incorrect casting along the way. This 
doesn't happen with PyArrow.

[~paleolimbot]

> [R] Creating an Array from an object bigger than 2^31 results in an Array of 
> length 0
> -
>
> Key: ARROW-17054
> URL: https://issues.apache.org/jira/browse/ARROW-17054
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Apologies for the lack of a proper reprex, but it crashes my session when I try 
> to make one.
> I'm working on ARROW-16977 which is all about the reporting of object size 
> having integer overflow issues, but this affects object creation.
> {code:r}
> library(arrow, warn.conflicts = TRUE)
> # works - creates a huge array, hurrah
> big_logical <- vector(mode = "logical", length = .Machine$integer.max)
> big_logical_array <- Array$create(big_logical)
> length(big_logical)
> ## [1] 2147483647
> length(big_logical_array)
> ## [1] 2147483647
> # creates an array of length 0, boo!
> too_big <- vector(mode = "logical", length = .Machine$integer.max + 1) 
> too_big_array <- Array$create(too_big)
> length(too_big)
> ## [1] 2147483648
> length(too_big_array)
> ## [1] 0
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0

2022-07-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565868#comment-17565868
 ] 

Antoine Pitrou commented on ARROW-17054:


Interesting. I suspect that there's some incorrect casting along the way. This 
doesn't happen with PyArrow.

[~paleolimbot]

> [R] Creating an Array from an object bigger than 2^31 results in an Array of 
> length 0
> -
>
> Key: ARROW-17054
> URL: https://issues.apache.org/jira/browse/ARROW-17054
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Apologies for the lack of proper reprex but it crashes my session when I try 
> to make one.
> I'm working on ARROW-16977 which is all about the reporting of object size 
> having integer overflow issues, but this affects object creation.
> {code:r}
> library(arrow, warn.conflicts = TRUE)
> # works - creates a huge array, hurrah
> big_logical <- vector(mode = "logical", length = .Machine$integer.max)
> big_logical_array <- Array$create(big_logical)
> length(big_logical)
> ## [1] 2147483647
> length(big_logical_array)
> ## [1] 2147483647
> # creates an array of length 0, boo!
> too_big <- vector(mode = "logical", length = .Machine$integer.max + 1) 
> too_big_array <- Array$create(too_big)
> length(too_big)
> ## [1] 2147483648
> length(too_big_array)
> ## [1] 0
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17052) [C++][Python][FlightRPC] Ensure ::Serialize and ::Deserialize are consistently implemented

2022-07-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17052:
-
Labels: good-first-issue good-second-issue  (was: )

> [C++][Python][FlightRPC] Ensure ::Serialize and ::Deserialize are 
> consistently implemented
> --
>
> Key: ARROW-17052
> URL: https://issues.apache.org/jira/browse/ARROW-17052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Python
>Reporter: David Li
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Structures like Action don't expose these methods even though ones like 
> FlightInfo do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0

2022-07-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17054:


 Summary: [R] Creating an Array from an object bigger than 2^31 
results in an Array of length 0
 Key: ARROW-17054
 URL: https://issues.apache.org/jira/browse/ARROW-17054
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Apologies for the lack of proper reprex but it crashes my session when I try to 
make one.

I'm working on ARROW-16977 which is all about the reporting of object size 
having integer overflow issues, but this affects object creation.

{code:r}
library(arrow, warn.conflicts = TRUE)

# works - creates a huge array, hurrah
big_logical <- vector(mode = "logical", length = .Machine$integer.max)
big_logical_array <- Array$create(big_logical)

length(big_logical)
## [1] 2147483647
length(big_logical_array)
## [1] 2147483647

# creates an array of length 0, boo!
too_big <- vector(mode = "logical", length = .Machine$integer.max + 1) 
too_big_array <- Array$create(too_big)

length(too_big)
## [1] 2147483648
length(too_big_array)
## [1] 0
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15016) [R] show_exec_plan() for an arrow_dplyr_query

2022-07-12 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-15016:
-
Summary: [R] show_exec_plan() for an arrow_dplyr_query  (was: [R] 
show_query() for an arrow_dplyr_query)

> [R] show_exec_plan() for an arrow_dplyr_query
> -
>
> Key: ARROW-15016
> URL: https://issues.apache.org/jira/browse/ARROW-15016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> *Proposed approach*: [design 
> doc|https://docs.google.com/document/d/1Ep8aV4jDsNCCy9uv1bjWY_JF17nzHQogv0EnGJvraQI/edit#]
> *Steps*
>  * Read about ExecPlan and ExecPlan::ToString
>  ** https://issues.apache.org/jira/browse/ARROW-14233
>  ** https://issues.apache.org/jira/browse/ARROW-15138
>  ** https://issues.apache.org/jira/browse/ARROW-13785
>  * Hook up to the existing C++ ToString method for ExecPlans 
>  * Implement a {{ToString()}} method for {{ExecPlan}} R6 class
>  * Implement and document {{show_exec_plan()}}
> *Original description*:
> Now that we can print a query plan (ARROW-13785) we should wire this up in R 
> so we can see what execution plans are being put together for various queries 
> (like the TPC-H queries)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12701) [Website][Release] Include Rust contributors, committers, and commits in release notes

2022-07-12 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-12701:
--
Fix Version/s: (was: 9.0.0)

> [Website][Release] Include Rust contributors, committers, and commits in 
> release notes
> --
>
> Key: ARROW-12701
> URL: https://issues.apache.org/jira/browse/ARROW-12701
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Attachments: contributor_count.html
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> For the 5.0.0 release, we should change the code in 
> {{dev/releasepost-03-website.sh}} to include commits, contributors, and 
> changes to the official {{apache/arrow-rs}} and {{apache/arrow-datafusion}} 
> repos. This is important to ensure that the contributions to Rust, DataFusion, 
> and Ballista are recognized in our release notes and blog posts going forward.
> [~alamb] [~andygrove] [~Dandandan] [~jorgecarleitao] could one of you take 
> this on? Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17053) [R] Update make_date, make_datetime, ISOdate and ISOdatetime to use `tz`

2022-07-12 Thread Jira
Dragoș Moldovan-Grünfeld created ARROW-17053:


 Summary: [R] Update make_date, make_datetime, ISOdate and 
ISOdatetime to use `tz`
 Key: ARROW-17053
 URL: https://issues.apache.org/jira/browse/ARROW-17053
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Dragoș Moldovan-Grünfeld


ARROW-12820 is now solved. Look for other ramifications of that ticket and 
update the binding definitions. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17052) [C++][Python][FlightRPC] Ensure ::Serialize and ::Deserialize are consistently implemented

2022-07-12 Thread David Li (Jira)
David Li created ARROW-17052:


 Summary: [C++][Python][FlightRPC] Ensure ::Serialize and 
::Deserialize are consistently implemented
 Key: ARROW-17052
 URL: https://issues.apache.org/jira/browse/ARROW-17052
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC, Python
Reporter: David Li


Structures like Action don't expose these methods even though ones like 
FlightInfo do.
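
For context, a minimal sketch (assuming the current pyarrow.flight API) of the 
round-trip that FlightInfo supports and Action lacks:

{code:python}
import pyarrow as pa
import pyarrow.flight as flight

# FlightInfo already round-trips through bytes; Action and friends do not.
descriptor = flight.FlightDescriptor.for_path("example")
info = flight.FlightInfo(
    pa.schema([("a", pa.int64())]),  # schema served for this flight
    descriptor,
    [],   # no endpoints in this sketch
    -1,   # total_records unknown
    -1,   # total_bytes unknown
)
restored = flight.FlightInfo.deserialize(info.serialize())
{code}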



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN

2022-07-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565863#comment-17565863
 ] 

Raúl Cumplido commented on ARROW-17051:
---

cc [~lidavidm] 

> [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
> -
>
> Key: ARROW-17051
> URL: https://issues.apache.org/jira/browse/ARROW-17051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Major
>
> The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 
> C++ ASAN UBSAN*  
> Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN 
> will also build with Flight and Flight SQL. This triggers some 
> arrow-flight-sql-test failures like:
> {code:java}
>   [ RUN      ] TestFlightSqlClient.TestGetDbSchemas
> unknown file: Failure
> Unexpected mock function call - taking default action specified at:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:151:
>     Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 
> 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 
> 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 
> @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>)
>           Returns: (nullptr)
> Google Mock tried the following 1 expectation, but it didn't match:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>   Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B 
> 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE 
> BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00>
>            Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure
> Actual function call count doesn't match EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> [  FAILED  ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code}
> The error can be seen here: 
> [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true]
> This is the initial PR that triggered it:
> [https://github.com/apache/arrow/pull/13548]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN

2022-07-12 Thread Jira
Raúl Cumplido created ARROW-17051:
-

 Summary: [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
 Key: ARROW-17051
 URL: https://issues.apache.org/jira/browse/ARROW-17051
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Raúl Cumplido


The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 
C++ ASAN UBSAN*  

Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN 
will also build with Flight and Flight SQL. This triggers some 
arrow-flight-sql-test failures like:
{code:java}
  [ RUN      ] TestFlightSqlClient.TestGetDbSchemas
unknown file: Failure
Unexpected mock function call - taking default action specified at:
/arrow/cpp/src/arrow/flight/sql/client_test.cc:151:
    Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 
00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 
00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 
@0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 00-00 
65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>)
          Returns: (nullptr)
Google Mock tried the following 1 expectation, but it didn't match:
/arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, 
GetFlightInfo(Ref(call_options_), descriptor))...
  Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B 
05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE 
BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00>
           Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
         Expected: to be called once
           Actual: never called - unsatisfied and active
/arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure
Actual function call count doesn't match EXPECT_CALL(sql_client_, 
GetFlightInfo(Ref(call_options_), descriptor))...
         Expected: to be called once
           Actual: never called - unsatisfied and active
[  FAILED  ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code}
The error can be seen here: 
[https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true]

This is the initial PR that triggered it:

[https://github.com/apache/arrow/pull/13548]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17046) [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table function

2022-07-12 Thread Amir Khosroshahi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565858#comment-17565858
 ] 

Amir Khosroshahi commented on ARROW-17046:
--

The edit button is disabled with the message "You must be on a branch to make 
or propose changes to this file".

Which branch should I use as the base for my pull request?

> [Python] pyarrow.parquet.write_to_dataset fails to pass kwargs to write_table 
> function
> --
>
> Key: ARROW-17046
> URL: https://issues.apache.org/jira/browse/ARROW-17046
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Amir Khosroshahi
>Priority: Minor
>
> According to PyArrow 8.0.0 
> [documentation|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html]
>  {{kwargs}} is "Additional kwargs for {{write_table}} function." However when 
> I try to pass additional arguments, for example {{flavor}}, to the 
> underlying write_table I get the following error
> {code:java}
> TypeError: unexpected parquet write option: flavor{code}
> This used to work in PyArrow versions as late as 7.0.0 but started to break 
> in 8.0.0.
> Minimal example to reproduce the error
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> pq.write_to_dataset(tb, "test.parquet", flavor="spark") {code}
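> A possible workaround sketch while this is broken, assuming no partitioning is 
> needed: {{write_table}} still forwards format options such as {{flavor}}:
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> 
> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
> tb = pa.Table.from_pandas(df)
> # write_table accepts the Parquet write options directly
> pq.write_table(tb, "test.parquet", flavor="spark")
> {code}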



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14077) Compute IR source consumer

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14077:
---

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> Compute IR source consumer
> --
>
> Key: ARROW-14077
> URL: https://issues.apache.org/jira/browse/ARROW-14077
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Compute IR
>Reporter: Phillip Cloud
>Priority: Major
>
> This task tracks the implementation of the source IR consumer in Arrow C++.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-11689) [Rust][DataFusion] Reduce copies in DataFusion LogicalPlan and Expr creation

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-11689:
---

Assignee: (was: Andrew Lamb)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Rust][DataFusion] Reduce copies in DataFusion LogicalPlan and Expr creation
> 
>
> Key: ARROW-11689
> URL: https://issues.apache.org/jira/browse/ARROW-11689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>
> The theme of this overall epic is to make the plan and expression rewriting 
> phases of DataFusion more efficient by avoiding copies, leveraging the Rust 
> type system.
> Benefits:
> * More standard / idiomatic Rust usage
> * Faster / more efficient (I don't have numbers to back this up)
> Downsides:
> * These will be backwards incompatible changes
> h1. Background
> Many things in DataFusion look like
> Input --transformation--> Output
> and the input is not used again. In Rust, you can model this by giving 
> ownership to the transformation.
> At a high level, the idea is to avoid so much cloning in DataFusion.
> The basic principle is that if a function needs to `clone` one of its 
> arguments, the caller should be given the choice of when to do that. Often, 
> the caller can give up ownership without issue.
> I envision at least the following items:
> 1. Optimizer passes that take `LogicalPlan` and produce a new `LogicalPlan` 
> even though most callsites do not need the original
> 2. Expr builder calls that take `Expr` and return a new `Expr`
> 3. An expression rewriter (TODO) while running down optimizer passes
> I think this style takes advantage of Rust's ownership model and will let us 
> avoid a lot of copying and allocations, and avoid the need for something like 
> slab allocators.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-4133) [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing instead of aborting

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-4133:
--

Assignee: (was: Ian Alexander Joiner)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing 
> instead of aborting
> ---
>
> Key: ARROW-4133
> URL: https://issues.apache.org/jira/browse/ARROW-4133
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: orc
>
> The following core was genereted by nightly build: 
> https://travis-ci.org/kszucs/crossbow/builds/473397855
> {code}
> Core was generated by `/opt/conda/bin/python /opt/conda/bin/pytest -v 
> --pyargs pyarrow'.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> [Current thread is 1 (Thread 0x7fea61f9e740 (LWP 179))]
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7fea608c8801 in __GI_abort () at abort.c:79
> #2  0x7fea4b3483df in __gnu_cxx::__verbose_terminate_handler ()
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
> #3  0x7fea4b346b16 in __cxxabiv1::__terminate (handler=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
> #4  0x7fea4b346b4c in std::terminate ()
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
> #5  0x7fea4b346d28 in __cxxabiv1::__cxa_throw (obj=0x2039220,
> tinfo=0x7fea494803d0 ,
> dest=0x7fea49087e52 )
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
> #6  0x7fea49086824 in orc::getTimezoneByFilename (filename=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:704
> #7  0x7fea490868d2 in orc::getLocalTimezone () at 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:713   
>   
> #8  0x7fea49063e59 in 
> orc::RowReaderImpl::RowReaderImpl (this=0x204fe30, _contents=..., opts=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:185
> #9  0x7fea4906651e in orc::ReaderImpl::createRowReader (this=0x1fb41b0, 
> opts=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:630
> #10 0x7fea48c2d904 in 
> arrow::adapters::orc::ORCFileReader::Impl::ReadSchema (this=0x1270600, 
> opts=..., 
>
> out=0x7ffe0ccae7b0) at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:264
> #11 0x7fea48c2e18d in arrow::adapters::orc::ORCFileReader::Impl::Read 
> (this=0x1270600, out=0x7ffe0ccaea00)
> at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:302
> #12 0x7fea48c2a8b9 in arrow::adapters::orc::ORCFileReader::Read 
> (this=0x1e14d10, out=0x7ffe0ccaea00)
> at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:697   
>   
>   
> #13 0x7fea48218c9d in __pyx_pf_7pyarrow_4_orc_9ORCReader_12read 
> (__pyx_v_self=0x7fea43de8688,
> __pyx_v_include_indices=0x7fea61d07b70 <_Py_NoneStruct>) at _orc.cpp:3865
> #14 0x7fea48218b31 in __pyx_pw_7pyarrow_4_orc_9ORCReader_13read 
> (__pyx_v_self=0x7fea43de8688,
> __pyx_args=0x7fea61f5e048, __pyx_kwds=0x7fea444f78b8) at _orc.cpp:3813
> #15 0x7fea61910cbd in _PyCFunction_FastCallDict 
> (func_obj=func_obj@entry=0x7fea444b9558,
> args=args@entry=0x7fea44a40fa8, nargs=nargs@entry=0, 
> kwargs=kwargs@entry=0x7fea444f78b8)
> at Objects/methodobject.c:231
> #16 0x7fea61910f16 in _PyCFunction_FastCallKeywords 
> (func=func@entry=0x7fea444b9558,
> stack=stack@entry=0x7fea44a40fa8, nargs=0, 
> kwnames=kwnames@entry=0x7fea47d81d30) at Objects/methodobject.c:294
> #17 

[jira] [Assigned] (ARROW-11341) [Python] [Gandiva] Check parameters are not None

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-11341:
---

Assignee: (was: Will Jones)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Python] [Gandiva] Check parameters are not None
> 
>
> Key: ARROW-11341
> URL: https://issues.apache.org/jira/browse/ARROW-11341
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Python
>Reporter: Will Jones
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Most of the functions in Gandiva's Python Expression builder interface 
> currently accept None in their arguments, but will segfault once they are used.
> Example:
> {code:python}
> import pyarrow
> import pyarrow.gandiva as gandiva
> builder = gandiva.TreeExprBuilder()
> field = pyarrow.field('whatever', type=pyarrow.date64())
> date_col = builder.make_field(field)
> func = builder.make_function('less_than_or_equal_to', [date_col, None], 
> pyarrow.bool_())
> condition = builder.make_condition(func)
> # Will segfault on this line:
> gandiva.make_filter(pyarrow.schema([field]), condition)
> {code}
> I think this is just a matter of adding {{not None}} to the appropriate 
> function arguments.
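> For illustration, a sketch of the behaviour the fix should give (assuming the 
> standard Cython {{not None}} argument checks are added; today the call below 
> is silently accepted):
> {code:python}
> import pyarrow
> import pyarrow.gandiva as gandiva
> 
> builder = gandiva.TreeExprBuilder()
> 
> # With `not None` checks in place, passing None should raise TypeError
> # here, at the call site, instead of segfaulting later in make_filter.
> try:
>     builder.make_condition(None)
> except TypeError as exc:
>     print(exc)
> {code}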



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-4354:
--

Assignee: (was: Tanya Schlusser)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Priority: Major
>  Labels: performance
> Attachments: codespeed-data-model.png
>
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-5272:
--

Assignee: (was: Pindikura Ravindra)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] [Gandiva] JIT code executed over uninitialized values
> ---
>
> Key: ARROW-5272
> URL: https://issues.apache.org/jira/browse/ARROW-5272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When running Gandiva tests with Valgrind, I get the following errors:
> {code}
> [==] Running 4 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 4 tests from TestDecimal
> [ RUN  ] TestDecimal.TestSimple
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110D5: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110E8: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x44B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x47B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> [   OK ] TestDecimal.TestSimple (16625 ms)
> [ RUN  ] TestDecimal.TestLiteral
> [   OK ] TestDecimal.TestLiteral (3480 ms)
> [ RUN  ] TestDecimal.TestIfElse
> [   OK ] TestDecimal.TestIfElse (2408 ms)
> [ RUN  ] TestDecimal.TestCompare
> [   OK ] TestDecimal.TestCompare (5303 ms)
> {code}
> I think this is legitimate. Gandiva runs computations over all values, even 
> when the bitmap indicates a null value. But decimal computations are complex 
> and involve conditional jumps, hence the error ("Conditional jump or move 
> depends on uninitialised value(s)").
> [~pravindra]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-4570) [Gandiva] Add overflow checks for decimals

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-4570:
--

Assignee: (was: Pindikura Ravindra)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Gandiva] Add overflow checks for decimals
> --
>
> Key: ARROW-4570
> URL: https://issues.apache.org/jira/browse/ARROW-4570
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Priority: Major
>
> For decimals, overflows can occur at two places :
>  # input array can have values that are outside the bound (eg. > 38 digits)
>  # When an operation can result in overflows. eg. add of two decimals of (38, 
> 6) can result in an overflow, if the input numbers are very large.
> In both the above cases, just verifying that an overflow occurred can be a 
> perf overhead. We should do this based on a conf variable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15543) [Doc] Improve documentation on usage of Schema metadata and usage/limitations in Execplan

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15543:
---

Assignee: (was: Vibhatha Lakmal Abeykoon)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Doc] Improve documentation on usage of Schema metadata and usage/limitations 
> in Execplan
> -
>
> Key: ARROW-15543
> URL: https://issues.apache.org/jira/browse/ARROW-15543
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> Schema metadata can be an important aspect of computation and of maintaining 
> state. How it is affected by the execution plan, and how it needs to be 
> handled, can be documented further. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-11508) [C++][Compute] Add support for generic conversions to Function::DispatchBest

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-11508:
---

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++][Compute] Add support for generic conversions to Function::DispatchBest
> 
>
> Key: ARROW-11508
> URL: https://issues.apache.org/jira/browse/ARROW-11508
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: Ben Kietzman
>Priority: Major
>
> ARROW-8919 adds support for execution with implicit casts to any function 
> which overrides DispatchBest, allowing functions to specify conversions which 
> make sense in that function's context. For example "add" can promote its 
> arguments if their types disagree. By contrast, some conversions are more 
> generic and could be applicable to any function's arguments. For example if 
> any datum is dictionary encoded, a kernel which accepts the decoded type 
> should be usable with an implicit decoding cast:
> {code:java}
> import pyarrow as pa
> import pyarrow.compute as pc
> arr = pa.array(['hello '] * 10)
> enc = arr.dictionary_encode()
> # result should not depend on encoding:
> assert pc.ascii_is_alnum(arr) == pc.ascii_is_alnum(enc)
> # currently raises:
> # ArrowNotImplementedError: Function ascii_is_alnum has no kernel matching
> #    input types (array[dictionary])
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15967) [C++] Return a nice error if the user types the wrong node name in an exec plan

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15967:
---

Assignee: (was: Vibhatha Lakmal Abeykoon)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Return a nice error if the user types the wrong node name in an exec 
> plan
> ---
>
> Key: ARROW-15967
> URL: https://issues.apache.org/jira/browse/ARROW-15967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> When writing code to generate exec plans we often write something like...
> {noformat}
> MakeExecNodeOrStop("write", plan.get(), {table}, options);
> {noformat}
> A few times now a user has (possibly via copy / paste) written the wrong node 
> (e.g. "sink" instead of "write") and when that happens we try and do a 
> checked cast on the options (e.g. to WriteNodeOptions) and this raises an 
> exception.  We should do a safer cast and return an invalid status with a 
> nice message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15904) [C++] Support rolling backwards and forwards with temporal arithmetic

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15904:
---

Assignee: (was: Rok Mihevc)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Support rolling backwards and forwards with temporal arithmetic
> -
>
> Key: ARROW-15904
> URL: https://issues.apache.org/jira/browse/ARROW-15904
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Blocker
>
> Original description in ARROW-11090: 
> "This should also cover the ability to do with and without rollback (so have 
> the ability to do e.g. 2021-03-30 minus 1 month and either get a null back, 
> or 2021-02-28), plus the ability to specify whether to rollback to the first 
> or last, and whether to preserve or reset the time.)"
> For example, in R, lubridate has the following functionality:
> * {{rollbackward()}} or {{rollback()}} which changes a date to the last day 
> of the previous month or to the first day of the current month
> * {{rollforward()}} which rolls to the last day of the current month or to 
> the first day of the next month.
> * all of the above also offer the option to preserve hms (hours, minutes and 
> seconds) when rolling. 
> This functionality underpins functions such as {{%m-%}} and {{%m+%}} which 
> are used to add or subtract months to a date without exceeding the last day 
> of the new month.
> {code:r}
> library(lubridate)
> jan <- ymd_hms("2010-01-31 03:04:05")
> jan + months(1:3) # Feb 31 and April 31 returned as NA
> #> [1] NA                         "2010-03-31 03:04:05 UTC"
> #> [3] NA
> jan %m+% months(1:3) # No rollover
> #> [1] "2010-02-28 03:04:05 UTC" "2010-03-31 03:04:05 UTC"
> #> [3] "2010-04-30 03:04:05 UTC"
> leap <- ymd("2012-02-29")
> #> [1] "2012-02-29 UTC"
> leap %m+% years(1)
> #> [1] "2013-02-28"
> leap %m+% years(-1)
> #> [1] "2011-02-28"
> leap %m-% years(1)
> #> [1] "2011-02-28"
> x <- ymd_hms("2019-01-29 01:02:03")
> add_with_rollback(x, months(1))
> #> [1] "2019-02-28 01:02:03 UTC"
> add_with_rollback(x, months(1), preserve_hms = FALSE)
> #> [1] "2019-02-28 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE)
> #> [1] "2019-03-01 01:02:03 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE, preserve_hms = FALSE)
> #> [1] "2019-03-01 UTC"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-13775) [C++] Allow Partitioning objects to be created with a vector of field names

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-13775:
---

Assignee: (was: Weston Pace)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Allow Partitioning objects to be created with a vector of field names
> ---
>
> Key: ARROW-13775
> URL: https://issues.apache.org/jira/browse/ARROW-13775
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Formatting expressions should be doable without the data types.  If data 
> types are provided then the expression literals should still be cast to the 
> appropriate data type as before.
> Parsing paths could be doable without the data types (we could infer this) 
> but since we already have PartitioningFactory / FilesystemDatasetFactory for 
> this use case I don't see much advantage in supporting that.
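> For reference, a sketch of the related factory path the Python bindings 
> already expose (assuming pyarrow.dataset); the C++ change would make 
> field-name construction similarly direct:
> {code:python}
> import pyarrow.dataset as ds
> 
> # With only field names, a PartitioningFactory is returned and the field
> # types are inferred when the dataset paths are inspected.
> part = ds.partitioning(field_names=["year", "month"])
> {code}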



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-2932) [Packaging] crossbow status should output shortened log URLs

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-2932:
--

Assignee: (was: Phillip Cloud)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Packaging] crossbow status should output shortened log URLs
> 
>
> Key: ARROW-2932
> URL: https://issues.apache.org/jira/browse/ARROW-2932
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Affects Versions: 0.9.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Would be nice to be able to easily get to the logs of a particular task. 
> Right now I have to click through GitHub stuff.
> I'll have a patch up for this shortly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-8032) [C++] example parquet-arrow project includes broken FindParquet.cmake

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-8032:
--

Assignee: (was: Krisztian Szucs)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] example parquet-arrow project includes broken FindParquet.cmake
> -
>
> Key: ARROW-8032
> URL: https://issues.apache.org/jira/browse/ARROW-8032
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.0
> Environment: NA
>Reporter: Tomasz Cheda
>Priority: Minor
>  Labels: beginner, easyfix
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> The example project at 
> [https://github.com/apache/arrow/tree/master/cpp/examples/parquet/parquet-arrow/cmake_modules]
>  includes a broken version of FindParquet.cmake ( 
> [https://github.com/apache/arrow/blob/master/cpp/examples/parquet/parquet-arrow/cmake_modules/FindParquet.cmake]
>  )
> The other module is, correctly, a link to FindArrow.cmake in 
> [https://github.com/apache/arrow/tree/master/cpp/cmake_modules]
> For the curious, the broken part is assuming that FindPkgConfig variables 
> will be set if the module is found - this can be false if the include 
> directory is /usr/include. This can be controlled by one of FindPkgConfig's 
> config variables, but the default behaviour changes as of CMake 3.10. It then 
> erroneously reports that Parquet has not been found.
> This is not a major bug, but I based my build files off of those in the 
> example directory and it took me a LONG time to figure out the error. It can 
> be really confusing for new users and is simple to fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-5033) [C++] JSON table writer

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-5033:
--

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] JSON table writer
> ---
>
> Key: ARROW-5033
> URL: https://issues.apache.org/jira/browse/ARROW-5033
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>
> Users who need to emit JSON in line-delimited format currently cannot do so 
> using Arrow. It should be straightforward to implement this efficiently, and 
> it will be very helpful for testing and benchmarking.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-10573) Zero-Copy Pandas Dataframe Reads

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-10573:
---

Assignee: (was: Nick White)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> Zero-Copy Pandas Dataframe Reads
> 
>
> Key: ARROW-10573
> URL: https://issues.apache.org/jira/browse/ARROW-10573
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 2.0.0
>Reporter: Nick White
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Arrow should be able to expose an IPC message as a Pandas dataframe without 
> copying the data. This involves fixing:
>  * the alignment of the IPC message, so all columns of the same type are in 
> contiguous buffers and so can be exposed as the 2D arrays Pandas' block 
> manager expects.
>  * the `arrow_to_pandas.cc` code to make the UX work transparently.
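> For context, a sketch of the existing options that reduce (but do not 
> eliminate) copying today, assuming pyarrow >= 0.17:
> {code:python}
> import pyarrow as pa
> 
> table = pa.table({"a": [1.0, 2.0], "b": [3.0, 4.0]})
> # split_blocks avoids consolidating columns into 2D blocks up front;
> # self_destruct frees Arrow memory as each column is converted.
> df = table.to_pandas(split_blocks=True, self_destruct=True)
> {code}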



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15991) [Website] Remove mentions of master branch from Apache Arrow website content

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-15991:
---

Assignee: (was: Kevin Gurney)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Website] Remove mentions of master branch from Apache Arrow website content
> 
>
> Key: ARROW-15991
> URL: https://issues.apache.org/jira/browse/ARROW-15991
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Website
>Reporter: Kevin Gurney
>Priority: Major
>
> This is a follow up task to 
> [ARROW-15694|https://issues.apache.org/jira/browse/ARROW-15694] and 
> [ARROW-15988|https://issues.apache.org/jira/browse/ARROW-15988].
> We need to update the actual content of the Apache Arrow website to no longer 
> refer to the "master" branch. This includes updating links, text snippets, 
> and the {{README.md}} for {{arrow-site}}. There may also be other files 
> that need to be updated, as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14972) [Python][Doc] Document automatic partitioning discovery

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14972:
---

Assignee: (was: Weston Pace)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Python][Doc] Document automatic partitioning discovery
> ---
>
> Key: ARROW-14972
> URL: https://issues.apache.org/jira/browse/ARROW-14972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> If a dataset is using the hive partitioning scheme the partitioning can be 
> automatically discovered.  If it is using the directory partitioning scheme 
> it will not be automatically discovered.  This has led to [some 
> confusion|https://github.com/apache/arrow/issues/11826] and should be 
> documented.
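> A short sketch of the distinction to document (hypothetical {{data_hive/}} 
> and {{data_dir/}} directories):
> {code:python}
> import pyarrow.dataset as ds
> 
> # Hive-style layouts (key=value directories) are discovered automatically:
> hive_ds = ds.dataset("data_hive/", format="parquet", partitioning="hive")
> 
> # Directory-style layouts are not; the field names must be spelled out:
> dir_ds = ds.dataset("data_dir/", format="parquet",
>                     partitioning=["year", "month"])
> {code}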



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12629) [C++] Configurable read-ahead in CSV and JSON readers

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-12629:
---

Assignee: (was: Supun Kamburugamuva)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Configurable read-ahead in CSV and JSON readers
> -
>
> Key: ARROW-12629
> URL: https://issues.apache.org/jira/browse/ARROW-12629
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andre Kohn
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We are compiling Arrow C++ to WebAssembly and ran into the following issue 
> with the CSV reader:
> Browsers became very picky about the use of SharedArrayBuffers after the 
> events around Spectre and Meltdown.
> As a result, you have to compile Arrow to WebAssembly without threads if you 
> don't want to run your website with very strict cross-origin isolation.
> Unfortunately, the CSV reader seems to always spawn a thread for the 
> read-ahead in both the SerialStreamingReader and the SerialTableReader, 
> independent of whether use_threads is set.
> Right now, this effectively means that you cannot use the CSV (and JSON) 
> readers in threadless WebAssembly builds.
>  
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L839]
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L913]
>  
>  
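> For context, a sketch of how the serial path is selected from Python 
> (assuming pyarrow.csv); the point of this issue is that a readahead thread is 
> spawned even then:
> {code:python}
> import io
> import pyarrow.csv as csv
> 
> # use_threads=False selects the serial reader, but the C++ implementation
> # still spawns a readahead thread internally.
> opts = csv.ReadOptions(use_threads=False)
> table = csv.read_csv(io.BytesIO(b"a,b\n1,2\n"), read_options=opts)
> {code}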



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16087) [C++] Add backpressure to TPC-H dbgen

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-16087:
---

Assignee: (was: Sasha Krassovsky)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Add backpressure to TPC-H dbgen
> -
>
> Key: ARROW-16087
> URL: https://issues.apache.org/jira/browse/ARROW-16087
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
>
> So that we don't OOM when generating large scale factors



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-7122) [CI][Documenation] docker-compose developer guide in the sphinx documentation

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-7122:
--

Assignee: (was: Krisztian Szucs)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [CI][Documenation] docker-compose developer guide in the sphinx documentation
> -
>
> Key: ARROW-7122
> URL: https://issues.apache.org/jira/browse/ARROW-7122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Krisztian Szucs
>Priority: Major
>
> We have a short guide in the sphinx documentation under integration.rst
> It needs to be updated with the recent docker-compose changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-721) [Java] Read and write record batches to shared memory

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-721:
-

Assignee: (was: Ji Liu)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Java] Read and write record batches to shared memory
> -
>
> Key: ARROW-721
> URL: https://issues.apache.org/jira/browse/ARROW-721
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
>
> It would be useful for a Java application to be able to read a record batch 
> as a set of memory mapped byte buffers given a file name and a memory address 
> for the metadata. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-11631) [R] Implement RPrimitiveConverter for Decimal type

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-11631:
---

Assignee: (was: Romain Francois)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [R] Implement RPrimitiveConverter for Decimal type
> --
>
> Key: ARROW-11631
> URL: https://issues.apache.org/jira/browse/ARROW-11631
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Priority: Major
>
> This succeeds:
> {code:java}
> Array$create(1)$cast(decimal(10, 2)){code}
> but this fails:
> {code:java}
> Array$create(1, type = decimal(10, 2)){code}
> with error:
> {code:java}
> NotImplemented: Extend{code}
> because the {{Extend}} method of the {{RPrimitiveConverter}} class for the 
> Decimal type is not yet implemented.
> The error is thrown here: 
> [https://github.com/apache/arrow/blob/7184c3f46981dd52c3c521b2676796e82f17da77/r/src/r_to_arrow.cpp#L601]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-8175) [Python] Setup type checking with mypy

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-8175:
--

Assignee: (was: Uwe Korn)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Python] Setup type checking with mypy
> --
>
> Key: ARROW-8175
> URL: https://issues.apache.org/jira/browse/ARROW-8175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Get mypy checks running, activate things like {{check_untyped_defs}} later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-7798) [R] Refactor R <-> Array conversion

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-7798:
--

Assignee: (was: Romain Francois)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [R] Refactor R <-> Array conversion
> ---
>
> Key: ARROW-7798
> URL: https://issues.apache.org/jira/browse/ARROW-7798
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> There's a bit of technical debt accumulated in array_to_vector and 
> vector_to_array:
>  * Mix of conversion *and* casting, ideally we'd move casting out of there 
> (at the cost of more memory copy). The rationale is that the conversion logic 
> will differ from the CastKernels, e.g. when to raise errors, benefits from 
> complex conversions like timezone... The current implementation is fast, e.g. 
> it fuses the conversion and casting in a single loop at the cost of code 
> clarity and divergence.
>  * There should be 2 paths, zero-copy, non zero-copy. The non-zero copy 
> should use the newly introduced VectorToArrayConverter which will work with 
> complex nested types.
>  * In array_to_vector, the Converter should work primarily with Array and not 
> ArrayVector
>  * The vector_to_array should not use builders, sizes are known, the null 
> bitmap should be constructed separately. There's probably a chance that we 
> can re-use R's memory with zero-copy for the raw data.
>  * There seem to be multiple paths that do the same conversion: 
> [https://github.com/apache/arrow/pull/7514#discussion_r446706140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-9657) [R][Dataset] Expose more FileSystemDatasetFactory options

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-9657:
--

Assignee: (was: Ian Cook)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [R][Dataset] Expose more FileSystemDatasetFactory options
> -
>
> Key: ARROW-9657
> URL: https://issues.apache.org/jira/browse/ARROW-9657
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: dataset
>
> Among the features:
> * ignore_prefixes option
> * Pass an explicit list of files + base directory
> * Exclude invalid files (boolean) option
> An important use case this would enable is opening a single huge file, e.g. 
> open_dataset("really_really_big_file.csv"), so you can partition and rewrite 
> it.
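> For reference, the Python binding already exposes equivalent knobs; a hedged 
> sketch of what the R surface would mirror (path and format illustrative):
> {code:java}
> import pyarrow.dataset as ds
> 
> # Skip dot-files/underscore-prefixed paths and silently drop files
> # that fail to parse as the given format.
> dataset = ds.dataset(
>     "path/to/data",              # illustrative path
>     format="parquet",
>     ignore_prefixes=[".", "_"],
>     exclude_invalid_files=True,
> )
> {code}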



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16111) [C++][FlightRPC] Migrate SQL Client API to Result<>

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-16111:
---

Assignee: (was: Tobias Zagorni)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++][FlightRPC] Migrate SQL Client API to Result<>
> ---
>
> Key: ARROW-16111
> URL: https://issues.apache.org/jira/browse/ARROW-16111
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, FlightRPC
>Reporter: Tobias Zagorni
>Priority: Major
>
> convert this API too as suggested here: 
> [https://github.com/apache/arrow/pull/12719#discussion_r839570822]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-13954:
---

Assignee: (was: Weston Pace)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [Python] Extend compute kernel type testing to supply scalars
> -
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> ARROW-13952 introduced testing for the various compute kernel signatures.  
> The current compute kernel type testing passes in all arguments as arrays.  
> We should extend it to account for cases where an argument is allowed to be a 
> scalar.
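> A small sketch (example ours, not from the issue) of the array/scalar 
> combinations such a test would need to exercise:
> {code:java}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> arr = pa.array([1, 2, 3])
> pc.add(arr, arr)           # array + array: covered today
> pc.add(arr, pa.scalar(1))  # array + scalar: the case to add
> pc.add(pa.scalar(1), arr)  # scalar + array
> {code}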



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-6228) [C++] Add context lines to Diff formatting

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-6228:
--

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Add context lines to Diff formatting
> --
>
> Key: ARROW-6228
> URL: https://issues.apache.org/jira/browse/ARROW-6228
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Trivial
>
> Diff currently renders only inserted or deleted elements, but context lines 
> can be helpful to viewers of the diff. Add an option for configurable context 
> line count to Diff and EqualOptions.
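> For context, the Python binding exposes this rendering via {{Array.diff}}; a 
> minimal sketch of today's context-free output (example ours):
> {code:java}
> import pyarrow as pa
> 
> a = pa.array([1, 2, 3, 4])
> b = pa.array([1, 3, 4, 5])
> # Prints only insertions/deletions; the proposal would add surrounding
> # unchanged elements (context lines), configurable via EqualOptions.
> print(a.diff(b))
> {code}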



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-7822) [C++] Allocation free error Status constants

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-7822:
--

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] Allocation free error Status constants
> 
>
> Key: ARROW-7822
> URL: https://issues.apache.org/jira/browse/ARROW-7822
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> {{Status::state_}} could be made a tagged pointer without affecting the fast 
> path (passing around a non-error status). The extra bit could be used to mark 
> a Status' state as heap-allocated or not, allowing common error statuses to be 
> extremely cheap when their error state is known to be immutable. For example, 
> this would allow a cheap default of {{Result<>::status_}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-6676) [C++] [Parquet] Refactor encoding/decoding APIs for clarity

2022-07-12 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-6676:
--

Assignee: (was: Ben Kietzman)

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned. Please feel free to re-take assignment of the issue if it 
is being actively worked, or if you plan to start that work soon.

> [C++] [Parquet] Refactor encoding/decoding APIs for clarity
> ---
>
> Key: ARROW-6676
> URL: https://issues.apache.org/jira/browse/ARROW-6676
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> {{encoding.h}} and {{encoding.cc}} are difficult to read and rewrite. There 
> are also lost opportunities for more generic implementations. Simplify/winnow 
> the interfaces while keeping an eye on the benchmarks for performance 
> regressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   >