[jira] [Commented] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happens

2022-07-13 Thread Alejandro Marco Ramos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566637#comment-17566637
 ] 

Alejandro Marco Ramos commented on ARROW-17068:
---

Hi Will, thanks for the response.

Passing `use_legacy_dataset=False` doesn't fix the situation; the list remains 
empty.

I will follow your recommendation to use the new dataset API.

Thanks.

 

> [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing 
> happen
> -
>
> Key: ARROW-17068
> URL: https://issues.apache.org/jira/browse/ARROW-17068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Alejandro Marco Ramos
>Priority: Minor
>
> When trying to use the callback "file_visitor", nothing happens.
>  
> Example:
> {code:java}
> import pyarrow as pa
> from pyarrow import parquet as pa_parquet
> table = pa.table([
>         pa.array([1, 2, 3, 4, 5]),
>         pa.array(["a", "b", "c", "d", "e"]),
>         pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
>     ], names=["col1", "col2", "col3"])
> written_files = []
> pa_parquet.write_to_dataset(table, partition_cols=["col2"], 
> root_path="tests", file_visitor=lambda x: written_files.append(x.path)))
> assert len(written_files) > 0  # This raises, length is 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Richard Tia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Tia updated ARROW-17066:

Priority: Critical  (was: Blocker)

> [C++][Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> ---
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17066:
---
Labels: pull-request-available  (was: )

> [C++][Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> ---
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon updated ARROW-17066:
-
Summary: [C++][Python][Substrait] "ignore_unknown_fields" should be 
specified when converting JSON to binary  (was: [Python][Substrait] 
"ignore_unknown_fields" should be specified when converting JSON to binary)

> [C++][Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> ---
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Blocker
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Fix Version/s: 8.0.1
   (was: 8.0.2)

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0, 6.0.2, 7.0.1, 8.0.1
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> that has an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to revisit our dependencies to ensure this security vulnerability 
> is addressed.
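
The usual remediation under minimal version selection is to bump the test 
dependency explicitly and tidy the module graph; a sketch (the exact patched 
testify version is not pinned here):
{noformat}
go get github.com/stretchr/testify@latest
go mod tidy
{noformat}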



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets

2022-07-13 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17069:
---
Description: 
GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
{{anonymous}} as the user:
{code:python}
import pyarrow.dataset as ds

# Fails:
dataset = 
ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 749, in dataset
return _filesystem_dataset(source, **kwargs)
  File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 441, in _filesystem_dataset
fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 408, in _ensure_single_source
file_info = filesystem.get_file_info(path)
  File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
info = GetResultValue(self.fs.GetFileInfo(path))
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
raise IOError(message)
OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)

# This works fine:
>>> dataset = 
>>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
{code}

I would expect that we could connect.

  was:
GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
{{anonymous}} as the user:
{code:python}
import pyarrow.dataset as ds

# Fails:
dataset = 
ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3")
# Traceback (most recent call last):
#   File "", line 1, in 
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 749, in dataset
# return _filesystem_dataset(source, **kwargs)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 441, in _filesystem_dataset
# fs, paths_or_selector = _ensure_single_source(source, filesystem)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 417, in _ensure_single_source
# raise FileNotFoundError(path)
# FileNotFoundError: voltrondata-labs-datasets/taxi-data

# This works fine:
>>> dataset = 
>>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
{code}

I would expect that we could connect.


> [Python][R] GCSFileSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
> Fix For: 9.0.0
>
>
> GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets

2022-07-13 Thread Will Jones (Jira)
Will Jones created ARROW-17069:
--

 Summary: [Python][R] GCSFileSystem reports cannot resolve host on 
public buckets
 Key: ARROW-17069
 URL: https://issues.apache.org/jira/browse/ARROW-17069
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python, R
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
{{anonymous}} as the user:
{code:python}
import pyarrow.dataset as ds

# Fails:
dataset = 
ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3")
# Traceback (most recent call last):
#   File "", line 1, in 
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 749, in dataset
# return _filesystem_dataset(source, **kwargs)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 441, in _filesystem_dataset
# fs, paths_or_selector = _ensure_single_source(source, filesystem)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
line 417, in _ensure_single_source
# raise FileNotFoundError(path)
# FileNotFoundError: voltrondata-labs-datasets/taxi-data

# This works fine:
>>> dataset = 
>>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
{code}

I would expect that we could connect.
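
For comparison, anonymous access does work when requested explicitly; a short 
sketch of the two current workarounds (the explicit-filesystem form assumes the 
same public bucket):
{code:python}
import pyarrow.dataset as ds
import pyarrow.fs

# Workaround 1: name "anonymous" as the user in the URI.
dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/")

# Workaround 2: construct the filesystem explicitly with anonymous=True.
gcs = pyarrow.fs.GcsFileSystem(anonymous=True)
dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=gcs)
{code}
The expectation in this issue is that neither step should be needed for public 
buckets.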



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-17066:


Assignee: Vibhatha Lakmal Abeykoon

> [Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> --
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Blocker
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16918) [Gandiva][C++] Adding UTC and local time zone conversion functions to Gandiva

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16918:


Assignee: Palak Pariawala

> [Gandiva][C++] Adding UTC and local time zone conversion functions to Gandiva
> -
>
> Key: ARROW-16918
> URL: https://issues.apache.org/jira/browse/ARROW-16918
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Palak Pariawala
>Assignee: Palak Pariawala
>Priority: Minor
>  Labels: newbie, pull-request-available
>   Original Estimate: 168h
>  Time Spent: 3h
>  Remaining Estimate: 165h
>
> Adding functions in Gandiva to convert timestamps between UTC and local time 
> zones
> to_utc_timestamp(timestamp, timezone name)
> from_utc_timestamp(timestamp, timezone name)
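
A plain-Python sketch of the intended semantics (using the standard-library 
zoneinfo module; the Gandiva function names are taken from the description 
above):
{code:python}
from datetime import datetime
from zoneinfo import ZoneInfo

# from_utc_timestamp(ts, tz): treat ts as UTC, express it in tz.
utc = datetime(2022, 7, 13, 14, 0, tzinfo=ZoneInfo("UTC"))
local = utc.astimezone(ZoneInfo("America/New_York"))  # 2022-07-13 10:00 EDT

# to_utc_timestamp(ts, tz): treat ts as tz-local, express it in UTC.
local = datetime(2022, 7, 13, 10, 0, tzinfo=ZoneInfo("America/New_York"))
utc = local.astimezone(ZoneInfo("UTC"))               # 2022-07-13 14:00 UTC
{code}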



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data

2022-07-13 Thread Sam Albers (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566564#comment-17566564
 ] 

Sam Albers commented on ARROW-13062:


I have not added this ability. We certainly have the ability to add annotations 
like this, though it does introduce some manual futzing unless we connect the 
report to this Jira board.

> [Dev] Add a way for people to add information to our saved crossbow data
> 
>
> Key: ARROW-13062
> URL: https://issues.apache.org/jira/browse/ARROW-13062
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We should have a simple + lightweight way to annotate specific builds with 
> information like "won't be fixed until dask has a new release" or "this is 
> supposed to be fixed in ARROW-XXX".
> We should find an easy, lightweight way to add this kind of information. 
> Only relevant in its previous parent: -We *should not* require, ask, or allow 
> people to add this information to the JSON that is saved as part of 
> ARROW-13509. That JSON should be kept pristine and not have manual edits. 
> Instead, we should have a plain-text look up file that matches notes to 
> specific builds (maybe to specific dates?)-



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happens

2022-07-13 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566543#comment-17566543
 ] 

Will Jones commented on ARROW-17068:


My guess is that if you pass {{use_legacy_dataset=False}} it should work. This 
option will become the default in 9.0.0, and we are removing the legacy dataset 
implementation eventually, so we might not fix this.

If you can, it would be preferable to use the dataset writer in 
{{pyarrow.dataset}}:

{code:python}
import pyarrow.dataset as ds

ds.write_dataset(table, base_dir="tests", partitioning=["col2"], 
file_visitor=lambda x: written_files.append(x.path))
{code}
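
Spelled out end to end, a runnable sketch of that suggestion (the sample table 
and the explicit format="parquet" are additions for completeness):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

written_files = []
ds.write_dataset(table, base_dir="tests", format="parquet",
                 partitioning=["col2"],
                 file_visitor=lambda f: written_files.append(f.path))
assert len(written_files) > 0  # the visitor fires with the new writer
{code}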

> [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing 
> happen
> -
>
> Key: ARROW-17068
> URL: https://issues.apache.org/jira/browse/ARROW-17068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Alejandro Marco Ramos
>Priority: Minor
>
> When trying to use the callback "file_visitor", nothing happens.
>  
> Example:
> {code:java}
> import pyarrow as pa
> from pyarrow import parquet as pa_parquet
> table = pa.table([
>         pa.array([1, 2, 3, 4, 5]),
>         pa.array(["a", "b", "c", "d", "e"]),
>         pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
>     ], names=["col1", "col2", "col3"])
> written_files = []
> pa_parquet.write_to_dataset(table, partition_cols=["col2"], 
> root_path="tests", file_visitor=lambda x: written_files.append(x.path)))
> assert len(written_files) > 0  # This raises, length is 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16887:
---
Labels: pull-request-available  (was: )

> [Doc][R] Document GCSFileSystem for R package
> -
>
> Key: ARROW-16887
> URL: https://issues.apache.org/jira/browse/ARROW-16887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should update the [cloud storage 
> vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem 
> RD to show configuration and usage of GCSFileSystem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts

2022-07-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-13656.
--
Resolution: Won't Fix

This is an old issue.

> [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
> ---
>
> Key: ARROW-13656
> URL: https://issues.apache.org/jira/browse/ARROW-13656
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andy Grove
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow-datafusion/issues/881



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16002) [Go] fileBlock.NewMessage should use memory.Allocator

2022-07-13 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16002.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13554
[https://github.com/apache/arrow/pull/13554]

> [Go] fileBlock.NewMessage should use memory.Allocator
> -
>
> Key: ARROW-16002
> URL: https://issues.apache.org/jira/browse/ARROW-16002
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 8.0.0
>Reporter: Arjan Topolovec
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current implementation of ipc.FileReader does not use the 
> memory.Allocator interface. Reading records from a file results in a large 
> number of allocations since the record body buffer is allocated each time 
> without reuse.
> https://github.com/apache/arrow/blob/master/go/arrow/ipc/metadata.go#L106



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-14182) [C++][Compute] Hash Join performance improvement

2022-07-13 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-14182.
-
Resolution: Fixed

Issue resolved by pull request 13493
[https://github.com/apache/arrow/pull/13493]

> [C++][Compute] Hash Join performance improvement
> 
>
> Key: ARROW-14182
> URL: https://issues.apache.org/jira/browse/ARROW-14182
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Michal Nowakiewicz
>Assignee: Michal Nowakiewicz
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 9.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Add micro-benchmarks for hash join exec node.
> Write a new implementation of the interface HashJoinImpl making sure that it 
> is efficient for all types of join. The current implementation, based on an 
> unordered map, trades performance for simpler code and is likely not as 
> fast as it could be.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16288) [C++] ValueDescr::SCALAR nearly unused and does not work for projection

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-16288.

Resolution: Not A Problem

ValueDescr was simply removed.

> [C++] ValueDescr::SCALAR nearly unused and does not work for projection
> ---
>
> Key: ARROW-16288
> URL: https://issues.apache.org/jira/browse/ARROW-16288
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> First, there are almost no kernels that actually use this shape.  Only the 
> functions "all", "any", "list_element", "mean", "product", "struct_field", 
> and "sum" have kernels with this shape.  Most kernels that have special logic 
> for scalars handle it by using {{ValueDescr::ANY}}.
> Second, when passing an expression to the project node, the expression must 
> be bound based on the dataset schema.  Since the binding happens based on a 
> schema (and not a batch) the function is bound to ValueDescr::ARRAY 
> (https://github.com/apache/arrow/blob/a16be6b7b6c8271202ff766b99c199b2e29bdfa8/cpp/src/arrow/compute/exec/expression.cc#L461)
> This results in an error if the function has only ValueDescr::SCALAR kernels 
> and would likely be a problem even if the function had both types of kernels 
> because it would get bound to the wrong kernel.
> The simplest fix may be to just get rid of ValueDescr and change all kernels 
> to ValueDescr::ANY behavior.  If we choose to keep it we will need to figure 
> out how to handle this kind of binding.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happens

2022-07-13 Thread Alejandro Marco Ramos (Jira)
Alejandro Marco Ramos created ARROW-17068:
-

 Summary: [Python] "pyarrow.parquet.write_to_dataset", option 
"file_visitor" nothing happens
 Key: ARROW-17068
 URL: https://issues.apache.org/jira/browse/ARROW-17068
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
Reporter: Alejandro Marco Ramos


When trying to use the callback "file_visitor", nothing happens.

 

Example:
{code:java}
import pyarrow as pa
from pyarrow import parquet as pa_parquet

table = pa.table([
        pa.array([1, 2, 3, 4, 5]),
        pa.array(["a", "b", "c", "d", "e"]),
        pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
    ], names=["col1", "col2", "col3"])

written_files = []
pa_parquet.write_to_dataset(table, partition_cols=["col2"], root_path="tests", 
file_visitor=lambda x: written_files.append(x.path))

assert len(written_files) > 0  # This raises, length is 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17064) [Python] Python hangs when using pyarrow.fs.copy_files with "use_threads=True"

2022-07-13 Thread Alejandro Marco Ramos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Marco Ramos updated ARROW-17064:
--
Summary: [Python] Python hangs when using pyarrow.fs.copy_files with 
"use_threads=True"  (was: Python hangs when pyarrow.fs.copy_files is used 
with "use_threads=True")

> [Python] Python hangs when using pyarrow.fs.copy_files with "use_threads=True"
> -
>
> Key: ARROW-17064
> URL: https://issues.apache.org/jira/browse/ARROW-17064
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Alejandro Marco Ramos
>Priority: Major
>
> When trying to copy a local path to a remote S3 filesystem using 
> `pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the 
> system hangs. With `use_threads=False` the operation completes OK (but is 
> slower).
>  
> My code is:
> {code:java}
> >>> import pyarrow as pa
> >>> s3fs = pa.fs.S3FileSystem(endpoint_override="http://xx")
> >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", 
> >>> destination_filesystem=s3fs)
> ... (doesn't return){code}
> If I check the remote S3 bucket, all files appear, but the function doesn't return.
>  
> Platform: Windows
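
A minimal sketch of the workaround described in the report (the endpoint is the 
report's elided placeholder):
{code:python}
import pyarrow as pa
import pyarrow.fs

s3fs = pa.fs.S3FileSystem(endpoint_override="http://xx")

# use_threads=False avoids the reported hang, at the cost of speed.
pa.fs.copy_files("tests/data/payments", "bucket/payments",
                 destination_filesystem=s3fs, use_threads=False)
{code}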



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16919) [C++] Flight integration tests fail on verify rc nightly on linux amd64

2022-07-13 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566462#comment-17566462
 ] 

David Li commented on ARROW-16919:
--

This is still happening, and I wasn't able to get a backtrace… I'll make 
another attempt soon.

> [C++] Flight integration tests fail on verify rc nightly on linux amd64
> ---
>
> Key: ARROW-16919
> URL: https://issues.apache.org/jira/browse/ARROW-16919
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Some of our nightly builds to verify the release are failing:
> {color:#1d1c1d}- 
> {color}[verify-rc-source-integration-linux-almalinux-8-amd64|https://github.com/ursacomputing/crossbow/runs/7073206980?check_suite_focus=true]
> {color:#1d1c1d}- 
> {color}[verify-rc-source-integration-linux-ubuntu-18.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073217433?check_suite_focus=true]
> {color:#1d1c1d}- 
> {color}[verify-rc-source-integration-linux-ubuntu-20.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073210299?check_suite_focus=true]
> {color:#1d1c1d}- 
> {color}[verify-rc-source-integration-linux-ubuntu-22.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073273051?check_suite_focus=true]
> with the following:
> {code:java}
>  # FAILURES #
> FAILED TEST: middleware C++ producing,  C++ consuming
> 1 failures
>   File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd
>     output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
>   File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
>     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
>   File "/usr/lib/python3.8/subprocess.py", line 512, in run
>     raise CalledProcessError(retcode, process.args,
> subprocess.CalledProcessError: Command 
> '['/tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client', 
> '-host', 'localhost', '-port=36719', '-scenario', 'middleware']' died with 
> .
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/arrow/dev/archery/archery/integration/runner.py", line 379, in 
> _run_flight_test_case
>     consumer.flight_request(port, **client_args)
>   File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 134, in 
> flight_request
>     run_cmd(cmd)
>   File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
>     raise RuntimeError(sio.getvalue())
> RuntimeError: Command failed: 
> /tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client -host 
> localhost -port=36719 -scenario middleware
> With output:
> --
> Headers received successfully on failing call.
> Headers received successfully on passing call.
> free(): double free detected in tcache 2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17067) Implement Substring_Index

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17067:
---
Labels: pull-request-available  (was: )

> Implement Substring_Index
> -
>
> Key: ARROW-17067
> URL: https://issues.apache.org/jira/browse/ARROW-17067
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Sahaj Gupta
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Adding a substring_index function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17067) Implement Substring_Index

2022-07-13 Thread Sahaj Gupta (Jira)
Sahaj Gupta created ARROW-17067:
---

 Summary: Implement Substring_Index
 Key: ARROW-17067
 URL: https://issues.apache.org/jira/browse/ARROW-17067
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sahaj Gupta


Adding a substring_index function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Richard Tia (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566446#comment-17566446
 ] 

Richard Tia commented on ARROW-17066:
-

CC: [~westonpace] 

> [Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> --
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Priority: Blocker
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Richard Tia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Tia updated ARROW-17066:

Summary: [Python][Substrait] "ignore_unknown_fields" should be specified 
when converting JSON to binary  (was: [C++][Substrait] "ignore_unknown_fields" 
should be specified when converting JSON to binary)

> [Python][Substrait] "ignore_unknown_fields" should be specified when 
> converting JSON to binary
> --
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Tia
>Priority: Blocker
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting Substrait JSON to binary, many unknown fields may exist, 
> since Substrait releases new versions every week. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17066) [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-13 Thread Richard Tia (Jira)
Richard Tia created ARROW-17066:
---

 Summary: [C++][Substrait] "ignore_unknown_fields" should be 
specified when converting JSON to binary
 Key: ARROW-17066
 URL: https://issues.apache.org/jira/browse/ARROW-17066
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Richard Tia


[https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]

 

When converting Substrait JSON to binary, many unknown fields may exist, since 
Substrait releases new versions every week. ignore_unknown_fields should be 
specified when doing this conversion.

 

This is resulting in frequent errors similar to this:
{code:java}
E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
 arguments: Cannot find field.

pyarrow/error.pxi:100: ArrowInvalid {code}
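
For reference, the option in question is exposed like this in the protobuf 
Python API (a sketch; Arrow's own conversion happens in C++ through 
JsonToBinaryStream with JsonParseOptions, where the field carries the same 
name):
{code:python}
from google.protobuf import json_format

# `json_str` and `plan` are placeholders: `plan` stands in for any generated
# message instance (e.g. a Substrait Plan); ignore_unknown_fields tells the
# parser to skip fields the compiled schema doesn't know about yet.
json_format.Parse(json_str, plan, ignore_unknown_fields=True)
{code}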



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17051:
---
Labels: pull-request-available  (was: )

> [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
> -
>
> Key: ARROW-17051
> URL: https://issues.apache.org/jira/browse/ARROW-17051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 
> C++ ASAN UBSAN*  
> Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN 
> will also build with Flight and Flight SQL. This triggers some 
> arrow-flight-sql-test failures like:
> {code:java}
>   [ RUN      ] TestFlightSqlClient.TestGetDbSchemas
> unknown file: Failure
> Unexpected mock function call - taking default action specified at:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:151:
>     Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 
> 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 
> 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 
> @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>)
>           Returns: (nullptr)
> Google Mock tried the following 1 expectation, but it didn't match:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>   Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B 
> 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE 
> BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00>
>            Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure
> Actual function call count doesn't match EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> [  FAILED  ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code}
> The error can be seen here: 
> [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true]
> This is the initial PR that triggered it:
> [https://github.com/apache/arrow/pull/13548]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN

2022-07-13 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566440#comment-17566440
 ] 

David Li commented on ARROW-17051:
--

OK, it only occurs with bundled (static) Protobuf/gRPC. It's not related to 
ASAN/UBSAN; this configuration will reproduce it:
{noformat}
-DARROW_FLIGHT=ON -DARROW_FLIGHT_SQL=ON -DARROW_BUILD_TESTS=ON 
-DProtobuf_SOURCE=BUNDLED -DgRPC_SOURCE=BUNDLED -DGTest_SOURCE=BUNDLED 
-DARROW_BUILD_SHARED=ON -DARROW_BUILD_STATIC=OFF {noformat}
It also fails differently when only a single test is run.

I suspect that gRPC/Protobuf is getting linked twice, which is a common issue. 
Both libarrow_flight and libarrow_flight_sql contain Protobuf symbols. {{env 
LD_DEBUG=all}} shows the dynamic linker is not resolving any Protobuf symbols - 
so presumably each library is using its own copy of Protobuf. But Protobuf has 
global state.

To wit, it passes if we set {{-DARROW_BUILD_SHARED=OFF 
-DARROW_BUILD_STATIC=ON}} instead.

So I think the solution here is: change this job to link statically instead of 
dynamically, and prevent Flight from building shared libraries if Protobuf/gRPC 
are static dependencies.
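
That is, the same configuration passes with the linkage flipped:
{noformat}
-DARROW_FLIGHT=ON -DARROW_FLIGHT_SQL=ON -DARROW_BUILD_TESTS=ON 
-DProtobuf_SOURCE=BUNDLED -DgRPC_SOURCE=BUNDLED -DGTest_SOURCE=BUNDLED 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON {noformat}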

> [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
> -
>
> Key: ARROW-17051
> URL: https://issues.apache.org/jira/browse/ARROW-17051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Major
>
> The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 
> C++ ASAN UBSAN*  
> Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN 
> will also build with Flight and Flight SQL. This triggers some 
> arrow-flight-sql-test failures like:
> {code:java}
>   [ RUN      ] TestFlightSqlClient.TestGetDbSchemas
> unknown file: Failure
> Unexpected mock function call - taking default action specified at:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:151:
>     Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 
> 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 
> 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 
> @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>)
>           Returns: (nullptr)
> Google Mock tried the following 1 expectation, but it didn't match:
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>   Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B 
> 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE 
> BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00>
>            Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 
> 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure
> Actual function call count doesn't match EXPECT_CALL(sql_client_, 
> GetFlightInfo(Ref(call_options_), descriptor))...
>          Expected: to be called once
>            Actual: never called - unsatisfied and active
> [  FAILED  ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code}
> The error can be seen here: 
> [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true]
> This is the initial PR that triggered it:
> [https://github.com/apache/arrow/pull/13548]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15762) [R] Revisit binding_format_datetime and remove manual casting

2022-07-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-15762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-15762:


Assignee: Dragoș Moldovan-Grünfeld

> [R] Revisit binding_format_datetime and remove manual casting 
> --
>
> Key: ARROW-15762
> URL: https://issues.apache.org/jira/browse/ARROW-15762
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> This is a follow-up issue to revisit the casting step in format once 
> [https://github.com/apache/arrow/pull/12240] gets merged.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-16863.
---
  Assignee: Neal Richardson
Resolution: Not A Problem

> [R] open_dataset() silently drops the missing values from a csv file
> 
>
> Key: ARROW-16863
> URL: https://issues.apache.org/jira/browse/ARROW-16863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Zsolt Kegyes-Brassai
>Assignee: Neal Richardson
>Priority: Major
>
> The {{open_dataset()}} +silently+ drops the empty/missing values from a csv 
> file. This empty string was generated when writing a dataframe containing an 
> NA value using {{write_csv_arrow()}}.
>  
> {code:java}
> df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8))
> arrow::write_csv_arrow(df_numbers, "numbers.csv")
> readLines("numbers.csv")
> #> [1] "\"number\"" "\"1\""      "\"2\""      "\"error\""  "\"4\""     
> #> [6] "\"5\""      ""           "\"7\""      "\"8\""
> arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect()
> #> # A tibble: 7 x 1
> #>   number
> #>   <chr>
> #> 1 1     
> #> 2 2     
> #> 3 error 
> #> 4 4     
> #> 5 5     
> #> 6 7     
> #> 7 8
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file

2022-07-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566426#comment-17566426
 ] 

Neal Richardson commented on ARROW-16863:
-

I think this is only an issue because the "csv" just has a single column (no 
commas involved really). So your missing value shows up as just an extra 
newline character. This behavior is consistent with base::read.csv() and 
readr::read_csv():

{code}
> read.csv("numbers.csv")
  number
1  1
2  2
3  error
4  4
5  5
6  7
7  8
> readr::read_csv("numbers.csv")

 Rows: 7 Columns: 1
── Column specification 
─
Delimiter: ","
chr (1): number

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
message.
# A tibble: 7 × 1
  number
  <chr>
1 1 
2 2 
3 error 
4 4 
5 5 
6 7 
7 8  
{code}

And if you have more than one column, there is no issue:

{code}
> df_numbers$num2 <- df_numbers$number
> tf <- tempfile()
> write_csv_arrow(df_numbers, tf)
> open_dataset(tf, format = "csv") %>% collect()
# A tibble: 8 × 2
  number  num2   
  <chr>   <chr>
1 "1" "1"
2 "2" "2"
3 "error" "error"
4 "4" "4"
5 "5" "5"
6 ""  "" 
7 "7" "7"
8 "8" "8"
{code}


> [R] open_dataset() silently drops the missing values from a csv file
> 
>
> Key: ARROW-16863
> URL: https://issues.apache.org/jira/browse/ARROW-16863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Zsolt Kegyes-Brassai
>Priority: Major
>
> The {{open_dataset()}} +silently+ drops the empty/missing values from a csv 
> file. This empty string was generated when writing a dataframe containing an 
> NA value using {{write_csv_arrow()}}.
>  
> {code:java}
> df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8))
> arrow::write_csv_arrow(df_numbers, "numbers.csv")
> readLines("numbers.csv")
> #> [1] "\"number\"" "\"1\""      "\"2\""      "\"error\""  "\"4\""     
> #> [6] "\"5\""      ""           "\"7\""      "\"8\""
> arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect()
> #> # A tibble: 7 x 1
> #>   number
> #>   <chr>
> #> 1 1     
> #> 2 2     
> #> 3 error 
> #> 4 4     
> #> 5 5     
> #> 6 7     
> #> 7 8
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17045) [C++] Reject trailing slashes on file path

2022-07-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17045.

Resolution: Fixed

Issue resolved by pull request 13577
[https://github.com/apache/arrow/pull/13577]

> [C++] Reject trailing slashes on file path
> --
>
> Key: ARROW-17045
> URL: https://issues.apache.org/jira/browse/ARROW-17045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: breaking-api, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR moves them all to the behavior of LocalFileSystem: return 
> IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
> endpoint_override="localhost:9001",
> scheme="http",
> anonymous=True,
> retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
>     out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
>     out_stream.write(b"Hello world!")
> gcs.get_file_info(FileSelector("py_test"))
> # [,  for 'py_test/test.txt': type=FileType.Directory>]
> s3 = pyarrow.fs.S3FileSystem(
> access_key="minioadmin",
> secret_key="minioadmin",
> scheme="http",
> endpoint_override="localhost:9000",
> allow_bucket_creation=True,
> allow_bucket_deletion=True,
> )
> s3.create_dir("py-test")
> # Writing to test.txt with and without slash writes to same file
> with s3.open_output_stream("py-test/test.txt") as out_stream:
>     out_stream.write(b"Hello world!")
> with s3.open_output_stream("py-test/test.txt/") as out_stream:
>     out_stream.write(b"Hello world!")
> s3.get_file_info(FileSelector("py-test"))
> # []
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-11341) [Python] [Gandiva] Check parameters are not None

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-11341.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 9289
[https://github.com/apache/arrow/pull/9289]

> [Python] [Gandiva] Check parameters are not None
> 
>
> Key: ARROW-11341
> URL: https://issues.apache.org/jira/browse/ARROW-11341
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Python
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Most of the functions in Gandiva's Python expression builder interface 
> currently accept None in their arguments, but will segfault once they are used.
> Example:
> {code:python}
> import pyarrow
> import pyarrow.gandiva as gandiva
> builder = gandiva.TreeExprBuilder()
> field = pyarrow.field('whatever', type=pyarrow.date64())
> date_col = builder.make_field(field)
> func = builder.make_function('less_than_or_equal_to', [date_col, None], 
> pyarrow.bool_())
> condition = builder.make_condition(func)
> # Will segfault on this line:
> gandiva.make_filter(pyarrow.schema([field]), condition)
> {code}
> I think this is just a matter of adding {{not None}} to the appropriate 
> function arguments.
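
In Cython that pattern looks roughly like this (a sketch; the actual signatures 
in the Gandiva bindings may differ):
{code:python}
# Cython: `not None` on a typed argument makes passing None raise TypeError
# instead of letting the C++ layer segfault on a null object.
def make_function(self, name, list children not None,
                  DataType return_type not None):
    ...
{code}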



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16324) [Go] Implement Dictionary Unification

2022-07-13 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16324.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13529
[https://github.com/apache/arrow/pull/13529]

> [Go] Implement Dictionary Unification
> -
>
> Key: ARROW-16324
> URL: https://issues.apache.org/jira/browse/ARROW-16324
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-8043.
-
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Sam Albers
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reopened ARROW-8043:
---
  Assignee: Sam Albers

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Sam Albers
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13936) Add a column to show us the number of time that this job is failing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-13936.
--
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> Add a column to show us the number of time that this job is failing
> ---
>
> Key: ARROW-13936
> URL: https://issues.apache.org/jira/browse/ARROW-13936
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Assignee: Sam Albers
>Priority: Minor
>
> Try to use an external repository to collect information about the names of failing jobs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (ARROW-13936) Add a column to show us the number of time that this job is failing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reopened ARROW-13936:

  Assignee: Sam Albers

> Add a column to show us the number of time that this job is failing
> ---
>
> Key: ARROW-13936
> URL: https://issues.apache.org/jira/browse/ARROW-13936
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Assignee: Sam Albers
>Priority: Minor
>
> Try to use an external repository to collect information about the names of failing jobs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-12845) [R] [C++] S3 connections for different providers

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12845.
--
Resolution: Won't Fix

> [R] [C++] S3 connections for different providers
> 
>
> Key: ARROW-12845
> URL: https://issues.apache.org/jira/browse/ARROW-12845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Affects Versions: 4.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi
> As part of my thesis, I want to create an S3 bucket on DigitalOcean (which 
> PUC uses). While I can write parquet files on my laptop and upload them to 
> DigitalOcean Spaces (i.e. an "S3 + Google Drive") from the browser or with 
> rclone, I could work on editing the existing code that connects to Amazon S3 
> and provide a function that connects to DigitalOcean/Linode/IBM/etc.
> This could be done so that the Amazon URL is the default, and the user could 
> specify something like `new_s3_fun(..., provider = "Tencent")` to connect to 
> an S3-compatible store that is not Amazon.
> Also, this involves the need to write more S3 documentation.
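> 
> A minimal sketch of what this could look like with the existing bindings, 
> assuming only the current S3FileSystem API (the DigitalOcean endpoint and 
> bucket name are illustrative):
> {code}
> library(arrow)
> 
> # Point the existing S3 filesystem at a non-AWS, S3-compatible provider by
> # overriding the endpoint, instead of adding a provider-specific function.
> spaces <- S3FileSystem$create(
>   access_key = Sys.getenv("SPACES_KEY"),
>   secret_key = Sys.getenv("SPACES_SECRET"),
>   endpoint_override = "nyc3.digitaloceanspaces.com"
> )
> 
> write_dataset(mtcars, spaces$path("my-bucket/mtcars"), format = "parquet")
> {code}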



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-12862) [CI] Gather + display reliability of crossbow builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12862.
--
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> [CI] Gather + display reliability of crossbow builds
> 
>
> Key: ARROW-12862
> URL: https://issues.apache.org/jira/browse/ARROW-12862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jonathan Keane
>Assignee: Sam Albers
>Priority: Major
>
> From Wes's suggestion on the mailing list:
> Having a website
> dashboard showing build health over time along with a ~ weekly e-mail
> to dev@ indicating currently broken builds and the reliability of each
> build over the trailing 7 or 30 days would be useful. Knowing that a
> particular build is only passing 20% of the time would help steer our
> efforts.
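> 
> A minimal sketch of the trailing-window reliability computation, assuming a 
> hypothetical data frame of nightly results (one row per job per night) 
> rather than any real crossbow schema:
> {code}
> library(dplyr)
> 
> builds <- tibble::tibble(
>   job    = c("gandiva-jar-osx", "gandiva-jar-osx", "test-conda-cpp"),
>   date   = as.Date(c("2022-07-11", "2022-07-12", "2022-07-12")),
>   passed = c(FALSE, FALSE, TRUE)
> )
> 
> builds %>%
>   filter(date >= Sys.Date() - 30) %>%   # trailing 30-day window
>   group_by(job) %>%
>   summarise(pass_rate = mean(passed), runs = n())
> {code}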



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12862) [CI] Gather + display reliability of crossbow builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-12862:
--

Assignee: Sam Albers

> [CI] Gather + display reliability of crossbow builds
> 
>
> Key: ARROW-12862
> URL: https://issues.apache.org/jira/browse/ARROW-12862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jonathan Keane
>Assignee: Sam Albers
>Priority: Major
>
> From Wes's suggestion on the mailing list:
> Having a website
> dashboard showing build health over time along with a ~ weekly e-mail
> to dev@ indicating currently broken builds and the reliability of each
> build over the trailing 7 or 30 days would be useful. Knowing that a
> particular build is only passing 20% of the time would help steer our
> efforts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-14378) [R] Make custom extension classes for (some) cols with row-level metadata

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-14378.
--
Resolution: Won't Fix

We ended up supporting geo columns using the geoarrow package + extension types

> [R] Make custom extension classes for (some) cols with row-level metadata
> -
>
> Key: ARROW-14378
> URL: https://issues.apache.org/jira/browse/ARROW-14378
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> The major use case for this is SF columns, which have attributes/metadata for 
> each element of a column. We originally stored these in our standard 
> column-level metadata, but that was very fragile and took forever, so we 
> disabled it ARROW-13189
> This will likely take some steps to accomplish. I've sketched out some in the 
> subtasks here (though if we have a different approach, we could do that 
> directly)
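> 
> A minimal sketch of the extension-type direction, assuming arrow's R 
> extension-type API; the type name and metadata are illustrative, not the 
> geoarrow design:
> {code}
> library(arrow)
> 
> # Register a custom extension type whose storage is a plain binary column,
> # as a stand-in for a geometry column carrying row-level metadata.
> geom_type <- new_extension_type(
>   storage_type       = binary(),
>   extension_name     = "example.geometry",
>   extension_metadata = charToRaw('{"crs":"EPSG:4326"}')
> )
> register_extension_type(geom_type)
> {code}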



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-12182) [R] [Dev] new helpers and suggests for testing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12182.
--
Resolution: Won't Fix

> [R] [Dev] new helpers and suggests for testing
> --
>
> Key: ARROW-12182
> URL: https://issues.apache.org/jira/browse/ARROW-12182
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, R
>Affects Versions: 3.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Minor
>
> _Related to https://issues.apache.org/jira/browse/ARROW-11705_
> While working on the related tickets I've found the next blockers:
> 1. Does it make sense to create expect_dplyr_named()? (i.e. to mimic 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L56-L59)
> 2. Does it make sense to create expect_dplyr_identical() (i.e. to mimic 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L61-L69
>  and 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L83-L91)
> 3. Should we need to add glue to Suggests? (i.e. replicate 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L95-L100)
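> 
> A minimal sketch of what such a helper might look like (the helper name is 
> hypothetical; only testthat::expect_identical and the arrow dplyr bindings 
> are assumed):
> {code}
> library(arrow)
> library(dplyr)
> library(testthat)
> 
> # Run the same dplyr pipeline against an arrow Table and a plain data frame
> # and expect identical collected results.
> expect_dplyr_identical <- function(pipeline, df) {
>   via_arrow <- pipeline(arrow_table(df)) %>% collect() %>% as.data.frame()
>   via_dplyr <- pipeline(df) %>% as.data.frame()
>   expect_identical(via_arrow, via_dplyr)
> }
> 
> expect_dplyr_identical(function(x) x %>% filter(speed > 10), cars)
> {code}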



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-14624) [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-14624.
--
Resolution: Fixed

This was fixed as part of the work to update the version switcher in the docs.

> [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown
> -
>
> Key: ARROW-14624
> URL: https://issues.apache.org/jira/browse/ARROW-14624
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> tabsets are now supported natively in pkgdown (with bootstrap 5)
> https://github.com/r-lib/pkgdown/pull/1694
> So we can pull out the hack we have to make that work for our dev docs 
> vignette



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16076) [R] Bindings for the new TPC-H generator

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-16076.
--
Resolution: Won't Fix

Since the TPC-H generator does not generate compliant data, there's not a big 
need to expose this in R.

> [R] Bindings for the new TPC-H generator
> 
>
> Key: ARROW-16076
> URL: https://issues.apache.org/jira/browse/ARROW-16076
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Now that https://github.com/apache/arrow/pull/12537 is merged, we should 
> implement the R changes needed to make that useable from R.
> We should basically do the opposite of 
> https://github.com/apache/arrow/pull/12537/commits/4b16296b4ef8cd3b3d440e8b7f8af32a89a16788
> But also add in the fixes from weston: 
> https://github.com/westonpace/arrow/commit/7c4c0e0b4e208918eb195701fab5d631b8c9517a



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data

2022-07-13 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566383#comment-17566383
 ] 

Jonathan Keane commented on ARROW-13062:


[~boshek] Did you already add this ability? I know it's a slightly different 
set of tickets than the ones we actually worked on, but we should either close 
it as a duplicate, done, or won't fix (and feel free to take credit for it if 
you did it elsewhere as part of a larger ticket!)

> [Dev] Add a way for people to add information to our saved crossbow data
> 
>
> Key: ARROW-13062
> URL: https://issues.apache.org/jira/browse/ARROW-13062
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We should have a simple + lightweight way to annotate specific builds with 
> information like "won't be fixed until dask has a new release" or "this is 
> supposed to be fixed in ARROW-XXX".
> We should find an easy, lightweight way to add this kind of information. 
> Only relevant in its previous parent: -We *should not* require, ask, or allow 
> people to add this information to the JSON that is saved as part of 
> ARROW-13509. That JSON should be kept pristine and not have manual edits. 
> Instead, we should have a plain-text look up file that matches notes to 
> specific builds (maybe to specific dates?)-



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind

2022-07-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17041.

Resolution: Fixed

Issue resolved by pull request 13597
[https://github.com/apache/arrow/pull/13597]

> [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
> 
>
> Key: ARROW-17041
> URL: https://issues.apache.org/jira/browse/ARROW-17041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> There seems to be an issue with arrow-compute-scalar-test, as it has been 
> failing for the last few days; example: 
> [https://github.com/ursacomputing/crossbow/runs/7274655770]
> See [https://crossbow.voltrondata.com/]
> Error:
> {code:java}
> ==13125== 
> ==13125== HEAP SUMMARY:
> ==13125== in use at exit: 16,090 bytes in 161 blocks
> ==13125==   total heap usage: 14,612,979 allocs, 14,612,818 frees, 
> 2,853,741,784 bytes allocated
> ==13125== 
> ==13125== LEAK SUMMARY:
> ==13125==definitely lost: 0 bytes in 0 blocks
> ==13125==indirectly lost: 0 bytes in 0 blocks
> ==13125==  possibly lost: 0 bytes in 0 blocks
> ==13125==still reachable: 16,090 bytes in 161 blocks
> ==13125== suppressed: 0 bytes in 0 blocks
> ==13125== Reachable blocks (those to which a pointer was found) are not shown.
> ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> ==13125== 
> ==13125== Use --track-origins=yes to see where uninitialised values come from
> ==13125== For lists of detected and suppressed errors, rerun with: -s
> ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from 
> 44) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17055.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13596
[https://github.com/apache/arrow/pull/13596]

> [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Assignee: Kevin Bambrick
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hello. I am trying to adopt Arrow Flight SQL. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight-sql dependency to our project, a warning is raised that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto are generated in both 
> the flight-sql and flight-core jars. Since these classes are generated in 
> flight-core, and flight-sql depends on flight-core, can the generation of 
> Flight.java and FlightServiceGrpc.java be removed from flight-sql so that 
> they are pulled directly from flight-core?
>  
> thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-17055:


Assignee: Kevin Bambrick

> [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Assignee: Kevin Bambrick
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Hello. I am trying to adopt Arrow Flight SQL. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight-sql dependency to our project, a warning is raised that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto are generated in both 
> the flight-sql and flight-core jars. Since these classes are generated in 
> flight-core, and flight-sql depends on flight-core, can the generation of 
> Flight.java and FlightServiceGrpc.java be removed from flight-sql so that 
> they are pulled directly from flight-core?
>  
> thanks in advance!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16992) [Java][C++] Separate JNI compilation & linking from main arrow CMakeLists

2022-07-13 Thread David Dali Susanibar Arce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566332#comment-17566332
 ] 

David Dali Susanibar Arce commented on ARROW-16992:
---

I agree with all of these points.

 

A PoC could help us a lot to understand how the JNI Java modules build in 
isolation, and then we could try to invoke that build from the Maven side.

> [Java][C++] Separate JNI compilation & linking from main arrow CMakeLists 
> --
>
> Key: ARROW-16992
> URL: https://issues.apache.org/jira/browse/ARROW-16992
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Larry White
>Priority: Major
>
> We need to separate the JNI elements from CMakeLists, with related 
> modifications to the CI build scripts likely. Separating the JNI portion 
> serves two related purposes:
>  # Simplify building JNI code against precompiled lib arrow C++ code
>  # Enable control of JNI build through Maven, rather than requiring Java devs 
> to work with CMake directly
> [~dsusanibara]
> [~kou] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-07-13 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-15938:
---

Assignee: Weston Pace

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> 
>
> Key: ARROW-15938
> URL: https://issues.apache.org/jira/browse/ARROW-15938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.2
> Environment: ubuntu linux, R4.1.2
>Reporter: Vitalie Spinu
>Assignee: Weston Pace
>Priority: Major
>  Labels: query-engine
> Fix For: 9.0.0
>
>
> When the right table in a join is empty as a result of filtering on a 
> partition group, the join segfaults:
> {code:java}
>   library(arrow)
>   library(glue)
>   df <- mutate(iris, id = runif(n()))
>   dir <- "./tmp/iris"
>   dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
>   dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
>   write_parquet(df, glue("{dir}/group=a/part1.parquet"))
>   write_parquet(df, glue("{dir}/group=b/part2.parquet")) 
>  db1 <- open_dataset(dir) %>%
>     filter(group == "blabla")  
> open_dataset(dir) %>%
>     filter(group == "b") %>%
>     select(id) %>%
>     left_join(db1, by = "id") %>%
>     collect()
>   {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     
> left_join(db1, by = "id") %>% collect()Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned 
> variable the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16523) [C++] Move ExecPlan scheduling into the plan

2022-07-13 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-16523:

Labels: pull-request-available query-engine  (was: acero 
pull-request-available)

> [C++] Move ExecPlan scheduling into the plan
> 
>
> Key: ARROW-16523
> URL: https://issues.apache.org/jira/browse/ARROW-16523
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available, query-engine
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Source nodes and pipeline breakers need to schedule new thread tasks.  These 
> tasks run entire fused pipelines (e.g. the thread task could be thought of as 
> analogous to a "driver" in some other models).
> At the moment every node that needs to schedule tasks (scan node, hash-join 
> node, aggregate node, etc.) handles this independently.  The result is a lot 
> of similar looking code and bugs like ARROW-15221 where one node takes care 
> of cleanup but another doesn't.
> We can centralize this by moving this scheduling into the ExecPlan itself and 
> giving nodes an ability to schedule tasks via the ExecPlan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16628) [C++] Support limit operation

2022-07-13 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-16628:

Labels: query-engine  (was: acero)

> [C++] Support limit operation
> -
>
> Key: ARROW-16628
> URL: https://issues.apache.org/jira/browse/ARROW-16628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
>
> Either an option to a SinkNode (TopK already takes a number of results to 
> keep) or a streaming LimitNode that only lets N rows through.
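> 
> A minimal sketch of the user-facing behaviour this would enable, assuming 
> only the existing R dplyr bindings (the dataset path is hypothetical); a 
> streaming LimitNode would let this stop after N rows instead of computing 
> the full result:
> {code}
> library(arrow)
> library(dplyr)
> 
> # head() on a query is the natural way to express a limit from R.
> open_dataset("path/to/dataset") %>%
>   filter(speed > 10) %>%
>   head(5) %>%
>   collect()
> {code}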



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread David Dali Susanibar Arce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Dali Susanibar Arce closed ARROW-8043.

Resolution: Abandoned

Implemented at: https://issues.apache.org/jira/browse/ARROW-16333

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread David Dali Susanibar Arce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566325#comment-17566325
 ] 

David Dali Susanibar Arce commented on ARROW-8043:
--

This implementation is covered in a more complete fashion by this ticket: 
https://issues.apache.org/jira/browse/ARROW-16333

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16575) [R] arrow::write_dataset() does nothing with 0 row dataframes in R

2022-07-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566324#comment-17566324
 ] 

Neal Richardson commented on ARROW-16575:
-

This matches my expectations. write_dataset also won't write files for 
partitions that don't exist. 

If you want a file/dataset with 0 rows and just the schema, you can use the 
single file writer, write_feather:

{code}
> write_feather(cars[cars$speed > 1000, ], "test.arrow")
> read_feather("test.arrow", as_data_frame=FALSE)
Table
0 rows x 2 columns
$speed 
$dist 

See $metadata for additional Schema metadata
{code}

> [R] arrow::write_dataset() does nothing with 0 row dataframes in R
> --
>
> Key: ARROW-16575
> URL: https://issues.apache.org/jira/browse/ARROW-16575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Environment: Mac OS 12.3, R 4.1
>Reporter: Adam Black
>Priority: Minor
>
> In R a dataframe can have 0 rows. It still has column names and types. 
>  
> Expected behavior of arrow::write_dataset
> I would expect that it would be possible to have a FileSystemDataset with 
> zero rows that would contain metadata about the column names and types. 
> arrow::write_dataset would create the FileSystemDataset metadata when given a 
> dataframe with zero rows.
>  
> Actual behavior
> arrow::write_dataset() does nothing when passed a dataframe with zero rows.
>  
> Reproducible example using the current arrow package on CRAN
> {code:java}
> arrow::write_dataset(cars, here::here("cars"))
> arrow::open_dataset(here::here("cars"))
> #> FileSystemDataset with 1 Parquet file
> #> speed: double
> #> dist: double
> #> 
> #> See $metadata for additional Schema metadata
> file.exists(here::here("cars"))
> #> [1] TRUE
> df <- cars[cars$speed > 1000, ]
> nrow(df)
> #> [1] 0
> arrow::write_dataset(df, here::here("df"), format = "feather")
> arrow::open_dataset(here::here("df"))
> #> Error: IOError: Cannot list directory 
> '/private/var/folders/xx/01v98b6546ldnm1rg1_bvk00gn/T/RtmpGkX0gK/reprex-17c305ed29ad5-nerdy-ram/df'.
>  Detail: [errno 2] No such file or directory
> file.exists(here::here("df"))
> #> [1] FALSE{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-15938:

Labels: query-engine  (was: )

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> 
>
> Key: ARROW-15938
> URL: https://issues.apache.org/jira/browse/ARROW-15938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.2
> Environment: ubuntu linux, R4.1.2
>Reporter: Vitalie Spinu
>Priority: Major
>  Labels: query-engine
>
> When the right table in a join is empty as a result of filtering on a 
> partition group, the join segfaults:
> {code:java}
>   library(arrow)
>   library(glue)
>   df <- mutate(iris, id = runif(n()))
>   dir <- "./tmp/iris"
>   dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
>   dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
>   write_parquet(df, glue("{dir}/group=a/part1.parquet"))
>   write_parquet(df, glue("{dir}/group=b/part2.parquet")) 
>  db1 <- open_dataset(dir) %>%
>     filter(group == "blabla")  
> open_dataset(dir) %>%
>     filter(group == "b") %>%
>     select(id) %>%
>     left_join(db1, by = "id") %>%
>     collect()
>   {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     
> left_join(db1, by = "id") %>% collect()Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned 
> variable the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-15938:

Fix Version/s: 9.0.0

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> 
>
> Key: ARROW-15938
> URL: https://issues.apache.org/jira/browse/ARROW-15938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.2
> Environment: ubuntu linux, R4.1.2
>Reporter: Vitalie Spinu
>Priority: Major
>  Labels: query-engine
> Fix For: 9.0.0
>
>
> When the right table in a join is empty as a result of filtering on a 
> partition group, the join segfaults:
> {code:java}
>   library(arrow)
>   library(glue)
>   df <- mutate(iris, id = runif(n()))
>   dir <- "./tmp/iris"
>   dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
>   dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
>   write_parquet(df, glue("{dir}/group=a/part1.parquet"))
>   write_parquet(df, glue("{dir}/group=b/part2.parquet")) 
>  db1 <- open_dataset(dir) %>%
>     filter(group == "blabla")  
> open_dataset(dir) %>%
>     filter(group == "b") %>%
>     select(id) %>%
>     left_join(db1, by = "id") %>%
>     collect()
>   {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     
> left_join(db1, by = "id") %>% collect()Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned 
> variable the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-07-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566322#comment-17566322
 ] 

Neal Richardson commented on ARROW-15938:
-

Confirmed that this is still an issue.

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> 
>
> Key: ARROW-15938
> URL: https://issues.apache.org/jira/browse/ARROW-15938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.2
> Environment: ubuntu linux, R4.1.2
>Reporter: Vitalie Spinu
>Priority: Major
>  Labels: query-engine
> Fix For: 9.0.0
>
>
> When the right table in a join is empty as a result of filtering on a 
> partition group, the join segfaults:
> {code:java}
>   library(arrow)
>   library(glue)
>   df <- mutate(iris, id = runif(n()))
>   dir <- "./tmp/iris"
>   dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
>   dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
>   write_parquet(df, glue("{dir}/group=a/part1.parquet"))
>   write_parquet(df, glue("{dir}/group=b/part2.parquet")) 
>  db1 <- open_dataset(dir) %>%
>     filter(group == "blabla")  
> open_dataset(dir) %>%
>     filter(group == "b") %>%
>     select(id) %>%
>     left_join(db1, by = "id") %>%
>     collect()
>   {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     
> left_join(db1, by = "id") %>% collect()Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned 
> variable the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13936) Add a column to show us the number of time that this job is failing

2022-07-13 Thread David Dali Susanibar Arce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Dali Susanibar Arce closed ARROW-13936.
-
Resolution: Abandoned

> Add a column to show us the number of time that this job is failing
> ---
>
> Key: ARROW-13936
> URL: https://issues.apache.org/jira/browse/ARROW-13936
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Priority: Minor
>
> Try to use an external repository to collect information about the names of failing jobs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-15938:

Component/s: (was: Compute IR)

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> 
>
> Key: ARROW-15938
> URL: https://issues.apache.org/jira/browse/ARROW-15938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.2
> Environment: ubuntu linux, R4.1.2
>Reporter: Vitalie Spinu
>Priority: Major
>
> When the right table in a join is empty as a result of filtering on a 
> partition group, the join segfaults:
> {code:java}
>   library(arrow)
>   library(glue)
>   df <- mutate(iris, id = runif(n()))
>   dir <- "./tmp/iris"
>   dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
>   dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
>   write_parquet(df, glue("{dir}/group=a/part1.parquet"))
>   write_parquet(df, glue("{dir}/group=b/part2.parquet")) 
>  db1 <- open_dataset(dir) %>%
>     filter(group == "blabla")  
> open_dataset(dir) %>%
>     filter(group == "b") %>%
>     select(id) %>%
>     left_join(db1, by = "id") %>%
>     collect()
>   {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::compute::MapNode::SubmitTask(std::function
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     
> left_join(db1, by = "id") %>% collect()Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned 
> variable the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17041:
---
Labels: Nightly pull-request-available  (was: Nightly)

> [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
> 
>
> Key: ARROW-17041
> URL: https://issues.apache.org/jira/browse/ARROW-17041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There seems to be an issue with arrow-compute-scalar-test, as it has been 
> failing for the last few days; example: 
> [https://github.com/ursacomputing/crossbow/runs/7274655770]
> See [https://crossbow.voltrondata.com/]
> Error:
> {code:java}
> ==13125== 
> ==13125== HEAP SUMMARY:
> ==13125== in use at exit: 16,090 bytes in 161 blocks
> ==13125==   total heap usage: 14,612,979 allocs, 14,612,818 frees, 
> 2,853,741,784 bytes allocated
> ==13125== 
> ==13125== LEAK SUMMARY:
> ==13125==definitely lost: 0 bytes in 0 blocks
> ==13125==indirectly lost: 0 bytes in 0 blocks
> ==13125==  possibly lost: 0 bytes in 0 blocks
> ==13125==still reachable: 16,090 bytes in 161 blocks
> ==13125== suppressed: 0 bytes in 0 blocks
> ==13125== Reachable blocks (those to which a pointer was found) are not shown.
> ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> ==13125== 
> ==13125== Use --track-origins=yes to see where uninitialised values come from
> ==13125== For lists of detected and suppressed errors, rerun with: -s
> ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from 
> 44) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind

2022-07-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-17041:
--

Assignee: Antoine Pitrou

> [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
> 
>
> Key: ARROW-17041
> URL: https://issues.apache.org/jira/browse/ARROW-17041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: Nightly
> Fix For: 9.0.0
>
>
> There seems to be an issue with arrow-compute-scalar-test, as it has been 
> failing for the last few days; example: 
> [https://github.com/ursacomputing/crossbow/runs/7274655770]
> See [https://crossbow.voltrondata.com/]
> Error:
> {code:java}
> ==13125== 
> ==13125== HEAP SUMMARY:
> ==13125== in use at exit: 16,090 bytes in 161 blocks
> ==13125==   total heap usage: 14,612,979 allocs, 14,612,818 frees, 
> 2,853,741,784 bytes allocated
> ==13125== 
> ==13125== LEAK SUMMARY:
> ==13125==definitely lost: 0 bytes in 0 blocks
> ==13125==indirectly lost: 0 bytes in 0 blocks
> ==13125==  possibly lost: 0 bytes in 0 blocks
> ==13125==still reachable: 16,090 bytes in 161 blocks
> ==13125== suppressed: 0 bytes in 0 blocks
> ==13125== Reachable blocks (those to which a pointer was found) are not shown.
> ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> ==13125== 
> ==13125== Use --track-origins=yes to see where uninitialised values come from
> ==13125== For lists of detected and suppressed errors, rerun with: -s
> ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from 
> 44) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17062) [C#] Support compression in IPC format

2022-07-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566301#comment-17566301
 ] 

Neal Richardson commented on ARROW-17062:
-

It looks like the C# implementation does not yet support compression: 
https://arrow.apache.org/docs/status.html#ipc-format

> [C#] Support compression in IPC format
> --
>
> Key: ARROW-17062
> URL: https://issues.apache.org/jira/browse/ARROW-17062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 8.0.0
> Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
>Reporter: Todd West
>Priority: Major
> Fix For: 8.0.2
>
>
> Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch() 
> fails with default settings. This is specific to compressed files (see 
> workaround below) and it looks like what happens is C# correctly decompresses 
> the batches but provides the caller with the compressed versions of the data 
> arrays instead of the uncompressed ones. While all of the various Length 
> properties are set correctly in C#, the data arrays are too short to contain 
> all of the values in the file, the bytes do not match what the decompressed 
> bytes should be, and basic data accessors like PrimitiveArray<T>.Values can't 
> be used because they throw ArgumentOutOfRangeException. Looking through the 
> C# classes in the github repo it doesn't appear there's a way for the caller 
> to request decompression. So I'm guessing decompression is supposed to be 
> automatic but, for some reason, isn't.
>  
> While functionally successful, the workaround of using uncompressed feather 
> isn't great as the uncompressed files are bigger than .csv. In my application 
> the resulting disk space penalty is hundreds of megabytes compared to the 
> footprint of using compressed feather.
>  
> Simple single-field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test 
> lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System.IO;}}
> {{using System.Runtime.InteropServices;}}
> {{            using FileStream stream = new("test lz4.feather", 
> FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{            using ArrowFileReader arrowFile = new(stream);}}
> {{            for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch 
> != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{            {}}
> {{                IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{                ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, 
> double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values 
> instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{            }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", 
> compression = "uncompressed")}}
>  
> Apologies if this is a known issue. I didn't find anything on a Jira search 
> and this isn't included in the [known issues list on 
> github|http://example.com/].
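For a cross-check, the same file should round-trip through implementations that
do support IPC compression. A hedged pyarrow sketch (the file name follows the R
reprex above; transparent LZ4 decompression on read is assumed, not confirmed by
the report):

{code:python}
import pyarrow.feather as feather

# Read the R-written, LZ4-compressed feather file; decompression is
# expected to happen transparently here.
table = feather.read_table("test lz4.feather")
print(table.column("value"))  # 21 values incrementing from 0 to 1
{code}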



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17062) [C#] Support compression in IPC format

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17062:

Summary: [C#] Support compression in IPC format  (was: [C#] write_feather() 
in R doesn't interop with ArrowFileReader.ReadNextRecordBatch())

> [C#] Support compression in IPC format
> --
>
> Key: ARROW-17062
> URL: https://issues.apache.org/jira/browse/ARROW-17062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 8.0.0
> Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
>Reporter: Todd West
>Priority: Major
> Fix For: 8.0.2
>
>
> A hello-world round trip between write_feather() and 
> ArrowFileReader.ReadNextRecordBatch() fails with default settings. This is 
> specific to compressed files (see 
> workaround below) and it looks like what happens is C# correctly decompresses 
> the batches but provides the caller with the compressed versions of the data 
> arrays instead of the uncompressed ones. While all of the various Length 
> properties are set correctly in C#, the data arrays are too short to contain 
> all of the values in the file, the bytes do not match what the decompressed 
> bytes should be, and basic data accessors like PrimitiveArray.Values can't 
> be used because they throw ArgumentOutOfRangeException. Looking through the 
> C# classes in the github repo it doesn't appear there's a way for the caller 
> to request decompression. So I'm guessing decompression is supposed to be 
> automatic but, for some reason, isn't.
>  
> While functionally successful, the workaround of using uncompressed feather 
> isn't great as the uncompressed files are bigger than .csv. In my application 
> the resulting disk space penalty is hundreds of megabytes compared to the 
> footprint of using compressed feather.
>  
> Simple single-field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test 
> lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System.IO;}}
> {{using System.Runtime.InteropServices;}}
> {{            using FileStream stream = new("test lz4.feather", 
> FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{            using ArrowFileReader arrowFile = new(stream);}}
> {{            for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch 
> != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{            {}}
> {{                IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{                ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, 
> double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values 
> instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{            }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", 
> compression = "uncompressed")}}
>  
> Apologies if this is a known issue. I didn't find anything on a Jira search 
> and this isn't included in the [known issues list on 
> github|http://example.com/].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17062) [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()

2022-07-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17062:

Summary: [C#] write_feather() in R doesn't interop with 
ArrowFileReader.ReadNextRecordBatch()  (was: write_feather() in R doesn't 
interop with ArrowFileReader.ReadNextRecordBatch())

> [C#] write_feather() in R doesn't interop with 
> ArrowFileReader.ReadNextRecordBatch()
> 
>
> Key: ARROW-17062
> URL: https://issues.apache.org/jira/browse/ARROW-17062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 8.0.0
> Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
>Reporter: Todd West
>Priority: Major
> Fix For: 8.0.2
>
>
> A hello-world round trip between write_feather() and 
> ArrowFileReader.ReadNextRecordBatch() fails with default settings. This is 
> specific to compressed files (see 
> workaround below) and it looks like what happens is C# correctly decompresses 
> the batches but provides the caller with the compressed versions of the data 
> arrays instead of the uncompressed ones. While all of the various Length 
> properties are set correctly in C#, the data arrays are too short to contain 
> all of the values in the file, the bytes do not match what the decompressed 
> bytes should be, and basic data accessors like PrimitiveArray.Values can't 
> be used because they throw ArgumentOutOfRangeException. Looking through the 
> C# classes in the github repo it doesn't appear there's a way for the caller 
> to request decompression. So I'm guessing decompression is supposed to be 
> automatic but, for some reason, isn't.
>  
> While functionally successful, the workaround of using uncompressed feather 
> isn't great as the uncompressed files are bigger than .csv. In my application 
> the resulting disk space penalty is hundreds of megabytes compared to the 
> footprint of using compressed feather.
>  
> Simple single-field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test 
> lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System.IO;}}
> {{using System.Runtime.InteropServices;}}
> {{            using FileStream stream = new("test lz4.feather", 
> FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{            using ArrowFileReader arrowFile = new(stream);}}
> {{            for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch 
> != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{            {}}
> {{                IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{                ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, 
> double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values 
> instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{            }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", 
> compression = "uncompressed")}}
>  
> Apologies if this is a known issue. I didn't find anything on a Jira search 
> and this isn't included in the [known issues list on 
> github|http://example.com/].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14889) [C++] GCSFS tests hang if testbench not installed

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-14889:
-
Summary: [C++] GCSFS tests hang if testbench not installed  (was: [C++] 
GCFS tests hang if testbench not installed)

> [C++] GCSFS tests hang if testbench not installed
> -
>
> Key: ARROW-14889
> URL: https://issues.apache.org/jira/browse/ARROW-14889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> They should probably error out instead of hanging.
> {code}
> Running main() from 
> /home/antoine/arrow/dev/cpp/build-preset/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
> [==] Running 22 tests from 2 test suites.
> [--] Global test environment set-up.
> [--] 13 tests from GcsFileSystem
> [ RUN  ] GcsFileSystem.OptionsCompare
> [   OK ] GcsFileSystem.OptionsCompare (0 ms)
> [ RUN  ] GcsFileSystem.ToArrowStatusOK
> [   OK ] GcsFileSystem.ToArrowStatusOK (0 ms)
> [ RUN  ] GcsFileSystem.ToArrowStatus
> [   OK ] GcsFileSystem.ToArrowStatus (0 ms)
> [ RUN  ] GcsFileSystem.FileSystemCompare
> [   OK ] GcsFileSystem.FileSystemCompare (2 ms)
> [ RUN  ] GcsFileSystem.ToEncryptionKey
> [   OK ] GcsFileSystem.ToEncryptionKey (0 ms)
> [ RUN  ] GcsFileSystem.ToEncryptionKeyEmpty
> [   OK ] GcsFileSystem.ToEncryptionKeyEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToKmsKeyName
> [   OK ] GcsFileSystem.ToKmsKeyName (0 ms)
> [ RUN  ] GcsFileSystem.ToKmsKeyNameEmpty
> [   OK ] GcsFileSystem.ToKmsKeyNameEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToPredefinedAcl
> [   OK ] GcsFileSystem.ToPredefinedAcl (0 ms)
> [ RUN  ] GcsFileSystem.ToPredefinedAclEmpty
> [   OK ] GcsFileSystem.ToPredefinedAclEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadata
> [   OK ] GcsFileSystem.ToObjectMetadata (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadataEmpty
> [   OK ] GcsFileSystem.ToObjectMetadataEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadataInvalidCustomTime
> [   OK ] GcsFileSystem.ToObjectMetadataInvalidCustomTime (0 ms)
> [--] 13 tests from GcsFileSystem (3 ms total)
> [--] 9 tests from GcsIntegrationTest
> [ RUN  ] GcsIntegrationTest.GetFileInfoBucket
> /home/antoine/miniconda3/envs/pyarrow/bin/python3: No module named testbench
> ^C
> {code}
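Editorially, the failure mode suggests a simple pre-flight check. A minimal
Python sketch of "error out instead of hanging" (hedged: the module name is
taken from the log above; the harness wiring is hypothetical, and the real fix
belongs in the C++ test code):

{code:python}
# Hypothetical pre-flight check: fail fast instead of hanging when the GCS
# testbench module cannot be imported.
import importlib.util
import sys

if importlib.util.find_spec("testbench") is None:
    sys.exit("No module named testbench; install the GCS testbench "
             "before running GcsIntegrationTest")
{code}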



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16093) [Python] Address docstrings in Filesystems (Python Implementations)

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16093:
---
Labels: pull-request-available  (was: )

> [Python] Address docstrings in Filesystems (Python Implementations)
> ---
>
> Key: ARROW-16093
> URL: https://issues.apache.org/jira/browse/ARROW-16093
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ensure docstrings for Filesystem Interface have an {{Examples}} section:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystemHandler.html#pyarrow.fs.FileSystemHandler]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FSSpecHandler.html#pyarrow.fs.FSSpecHandler]
>  
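For context, the requested {{Examples}} sections would look roughly like the
sketch below (hedged: the pyarrow.fs class names are real, but the example
content is illustrative only and assumes fsspec is installed):

{code:python}
# Sketch of the kind of "Examples" usage an FSSpecHandler docstring could show.
import fsspec
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Wrap an fsspec filesystem (in-memory here, purely for illustration)
# so it can be used anywhere pyarrow expects a FileSystem.
pa_fs = PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))
pa_fs.create_dir("data")
print(pa_fs.get_file_info("data"))
{code}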



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-14889) [C++] GCFS tests hang if testbench not installed

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-14889.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13520
[https://github.com/apache/arrow/pull/13520]

> [C++] GCFS tests hang if testbench not installed
> 
>
> Key: ARROW-14889
> URL: https://issues.apache.org/jira/browse/ARROW-14889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> They should probably error out instead of hanging.
> {code}
> Running main() from 
> /home/antoine/arrow/dev/cpp/build-preset/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
> [==] Running 22 tests from 2 test suites.
> [--] Global test environment set-up.
> [--] 13 tests from GcsFileSystem
> [ RUN  ] GcsFileSystem.OptionsCompare
> [   OK ] GcsFileSystem.OptionsCompare (0 ms)
> [ RUN  ] GcsFileSystem.ToArrowStatusOK
> [   OK ] GcsFileSystem.ToArrowStatusOK (0 ms)
> [ RUN  ] GcsFileSystem.ToArrowStatus
> [   OK ] GcsFileSystem.ToArrowStatus (0 ms)
> [ RUN  ] GcsFileSystem.FileSystemCompare
> [   OK ] GcsFileSystem.FileSystemCompare (2 ms)
> [ RUN  ] GcsFileSystem.ToEncryptionKey
> [   OK ] GcsFileSystem.ToEncryptionKey (0 ms)
> [ RUN  ] GcsFileSystem.ToEncryptionKeyEmpty
> [   OK ] GcsFileSystem.ToEncryptionKeyEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToKmsKeyName
> [   OK ] GcsFileSystem.ToKmsKeyName (0 ms)
> [ RUN  ] GcsFileSystem.ToKmsKeyNameEmpty
> [   OK ] GcsFileSystem.ToKmsKeyNameEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToPredefinedAcl
> [   OK ] GcsFileSystem.ToPredefinedAcl (0 ms)
> [ RUN  ] GcsFileSystem.ToPredefinedAclEmpty
> [   OK ] GcsFileSystem.ToPredefinedAclEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadata
> [   OK ] GcsFileSystem.ToObjectMetadata (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadataEmpty
> [   OK ] GcsFileSystem.ToObjectMetadataEmpty (0 ms)
> [ RUN  ] GcsFileSystem.ToObjectMetadataInvalidCustomTime
> [   OK ] GcsFileSystem.ToObjectMetadataInvalidCustomTime (0 ms)
> [--] 13 tests from GcsFileSystem (3 ms total)
> [--] 9 tests from GcsIntegrationTest
> [ RUN  ] GcsIntegrationTest.GetFileInfoBucket
> /home/antoine/miniconda3/envs/pyarrow/bin/python3: No module named testbench
> ^C
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17003) [Java][Docs] Document JDBC module

2022-07-13 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17003.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13543
[https://github.com/apache/arrow/pull/13543]

> [Java][Docs] Document JDBC module
> -
>
> Key: ARROW-17003
> URL: https://issues.apache.org/jira/browse/ARROW-17003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The arrow-jdbc submodule could use its own documentation page.
> In particular, we should document the type mapping it uses (and the rationale 
> where applicable) and how to customize it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17055:
---
Labels: pull-request-available  (was: )

> [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
> -
>
> Key: ARROW-17055
> URL: https://issues.apache.org/jira/browse/ARROW-17055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: Kevin Bambrick
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hello. I am trying to adopt Arrow Flight SQL. We have a check in our build 
> to make sure that there are no overlapping class files in our project. When 
> adding the flight-sql dependency to our project, the check warns that 
> flight-sql and flight-core overlap and the jars deliver the same class files.
> {code:java}
> Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class 
> files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: 
> [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, 
> org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code}
>  
> It seems that the classes generated from Flight.proto get generated in both 
> the flight-sql and flight-core jars. Since these classes are generated in 
> flight-core, and flight-sql depends on flight-core, can the generation 
> of Flight.java and FlightServiceGrpc.java be removed from flight-sql and 
> instead rely on them being pulled directly from flight-core?
>  
> thanks in advance!
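The overlap check described above is easy to reproduce outside the build. A
minimal sketch (hedged: the jar file names follow the warning text and are
assumed to be present locally):

{code:python}
# Sketch: list .class entries delivered by both jars and intersect them.
import zipfile

def class_entries(jar_path):
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}

overlap = class_entries("flight-core-7.0.0.jar") & class_entries("flight-sql-7.0.0.jar")
print(f"{len(overlap)} overlapping classes; e.g. {sorted(overlap)[:3]}")
{code}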



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17065:
---
Labels: pull-request-available  (was: )

> [Python] Allow using subclassed ExtensionScalar in ExtensionType
> 
>
> Key: ARROW-17065
> URL: https://issues.apache.org/jira/browse/ARROW-17065
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to ARROW-13612.
> [See 
> discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType

2022-07-13 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-17065:
--

Assignee: Rok Mihevc

> [Python] Allow using subclassed ExtensionScalar in ExtensionType
> 
>
> Key: ARROW-17065
> URL: https://issues.apache.org/jira/browse/ARROW-17065
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
> Fix For: 9.0.0
>
>
> This is a follow-up to ARROW-13612.
> [See 
> discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType

2022-07-13 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17065:
--

 Summary: [Python] Allow using subclassed ExtensionScalar in 
ExtensionType
 Key: ARROW-17065
 URL: https://issues.apache.org/jira/browse/ARROW-17065
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Rok Mihevc
 Fix For: 9.0.0


This is a follow-up to ARROW-13612.

[See 
discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141]
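For readers without the PR context, the idea is to let an ExtensionType declare
which ExtensionScalar subclass its elements should materialize as. A minimal
sketch of the intended usage (hedged: the __arrow_ext_scalar_class__ hook is
the mechanism discussed in the linked PR, and the UUID type is purely
illustrative):

{code:python}
import uuid
import pyarrow as pa

class UuidScalar(pa.ExtensionScalar):
    def as_py(self):
        # Convert the 16-byte storage value into a Python UUID.
        return None if self.value is None else uuid.UUID(bytes=self.value.as_py())

class UuidType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def __arrow_ext_scalar_class__(self):
        # The hook under discussion: which scalar subclass to instantiate.
        return UuidScalar
{code}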



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido closed ARROW-17056.
-
Fix Version/s: (was: 9.0.0)
   Resolution: Won't Fix

This is not required, as discussed in the ticket comments. We will update 
Substrait once an Arrow feature requires it. At the moment Substrait is 
evolving rapidly and we don't need to keep up with the latest version.

> [C++] Bump version of bundled substrait
> ---
>
> Key: ARROW-17056
> URL: https://issues.apache.org/jira/browse/ARROW-17056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
>
> There has been a new substrait version released:
> https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17056) [C++] Bump version of bundled substrait

2022-07-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566243#comment-17566243
 ] 

Raúl Cumplido commented on ARROW-17056:
---

OK, I'll close this one then, as for the moment we will update based on 
feature need.

> [C++] Bump version of bundled substrait
> ---
>
> Key: ARROW-17056
> URL: https://issues.apache.org/jira/browse/ARROW-17056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> There has been a new substrait version released:
> https://github.com/substrait-io/substrait/releases/tag/v0.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17064) Python hangs when pyarrow.fs.copy_files is used with "use_threads=True"

2022-07-13 Thread Alejandro Marco Ramos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Marco Ramos updated ARROW-17064:
--
Description: 
When trying to copy a local path to an S3 remote filesystem using 
`pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the 
system hangs. With `use_threads=False` the operation completes correctly (but 
more slowly).

 

My code is:
{code:java}
>>> import pyarrow as pa
>>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx")
>>> pa.fs.copy_files("tests/data/payments", "bucket/payments", 
>>> destination_filesystem=s3fs)
... (doesn't return){code}
If I check the remote S3, all the files appear, but the function doesn't return.

 

Platform: Windows

  was:
When trying to copy a local path to an S3 remote filesystem using 
`pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the 
system hangs. With `use_threads=False` the operation completes correctly (but 
more slowly).

 

My code is:
{code:java}
>>> import pyarrow as pa
>>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx")
>>> pa.fs.copy_files("tests/data/payments", "bucket/payments", 
>>> destination_filesystem=s3fs)
... (doesn't return){code}
If I check the remote S3, all the files appear, but the function doesn't return.


> Python hangs when pyarrow.fs.copy_files is used with "use_threads=True"
> 
>
> Key: ARROW-17064
> URL: https://issues.apache.org/jira/browse/ARROW-17064
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Alejandro Marco Ramos
>Priority: Major
>
> When trying to copy a local path to an S3 remote filesystem using 
> `pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the 
> system hangs. With `use_threads=False` the operation completes correctly (but 
> more slowly).
>  
> My code is:
> {code:java}
> >>> import pyarrow as pa
> >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx")
> >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", 
> >>> destination_filesystem=s3fs)
> ... (doesn't return){code}
> If I check the remote S3, all the files appear, but the function doesn't return.
>  
> Platform: Windows



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16093) [Python] Address docstrings in Filesystems (Python Implementations)

2022-07-13 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-16093:
---

Assignee: Alenka Frim

> [Python] Address docstrings in Filesystems (Python Implementations)
> ---
>
> Key: ARROW-16093
> URL: https://issues.apache.org/jira/browse/ARROW-16093
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Ensure docstrings for Filesystem Interface have an {{Examples}} section:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystemHandler.html#pyarrow.fs.FileSystemHandler]
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FSSpecHandler.html#pyarrow.fs.FSSpecHandler]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16091) [Python] Continuation of improving Classes and Methods Docstrings

2022-07-13 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-16091:
---

Assignee: Alenka Frim

> [Python] Continuation of improving Classes and Methods Docstrings 
> --
>
> Key: ARROW-16091
> URL: https://issues.apache.org/jira/browse/ARROW-16091
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Continuation of the initiative aimed at improving methods and classes 
> docstrings, especially from the point of view of ensuring they have an 
> {{Examples}} section.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-14494) [C++] signal cancel test fails occasionally on Windows

2022-07-13 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-14494.

Resolution: Cannot Reproduce

> [C++] signal cancel test fails occasionally on Windows
> -
>
> Key: ARROW-14494
> URL: https://issues.apache.org/jira/browse/ARROW-14494
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Yibo Cai
>Priority: Major
>
> Log: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/41278276/job/j9k4897e9ppwt2q4#L782



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16665) [Release] Update 03-binary-submit.sh to comment on PR and track binary submission with badges

2022-07-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-16665:
-

Assignee: Raúl Cumplido

> [Release] Update 03-binary-submit.sh to comment on PR and track binary 
> submission with badges
> -
>
> Key: ARROW-16665
> URL: https://issues.apache.org/jira/browse/ARROW-16665
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-15793) [C++][FlightRPC] DoPutLargeBatch test sometimes stuck for 10 seconds

2022-07-13 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-15793.

Resolution: Not A Bug

> [C++][FlightRPC] DoPutLargeBatch test sometimes stuck for 10 seconds
> 
>
> Key: ARROW-15793
> URL: https://issues.apache.org/jira/browse/ARROW-15793
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Yibo Cai
>Priority: Major
>
> Normally the test finishes in 100ms, but it often takes 10s on my test 
> machine.
> The debug build is fine.
> I did a brief debug; it looks like it's related to 
> [https://github.com/apache/arrow/pull/12302].
> It gets stuck for 10 seconds destructing grpc::Server at 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/server.cc#L863]
> To reproduce:
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo 
> -DARROW_FLIGHT=ON ..
> $ ninja arrow-flight-test
> $ relwithdebinfo/arrow-flight-test --gtest_filter="*DoPutLargeBatch*"
> [==] Running 1 test from 1 test suite.
> [--] Global test environment set-up.
> [--] 1 test from TestDoPut
> [ RUN  ] TestDoPut.DoPutLargeBatch
> [   OK ] TestDoPut.DoPutLargeBatch (10017 ms)
> [--] 1 test from TestDoPut (10017 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test suite ran. (10017 ms total)
> [  PASSED  ] 1 test.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow

2022-07-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-16667:
-

Assignee: Raúl Cumplido

> [Release] Post merge script for release should not be necessary with the new 
> workflow
> -
>
> Key: ARROW-16667
> URL: https://issues.apache.org/jira/browse/ARROW-16667
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In discussion with Krisztián we think that post-01-merge.sh is not required 
> as we should be using archery cherry-pick on the maintenance branch instead 
> of creating branches and cherry picking manually for patch releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow

2022-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16667:
---
Labels: pull-request-available  (was: )

> [Release] Post merge script for release should not be necessary with the new 
> workflow
> -
>
> Key: ARROW-16667
> URL: https://issues.apache.org/jira/browse/ARROW-16667
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In discussion with Krisztián we think that post-01-merge.sh is not required 
> as we should be using archery cherry-pick on the maintenance branch instead 
> of creating branches and cherry picking manually for patch releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow

2022-07-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16667:
--
Description: In discussion with Krisztián we think that post-01-merge.sh is 
not required as we should be using archery cherry-pick on the maintenance 
branch instead of creating branches and cherry picking manually for patch 
releases.  (was: In discussion with Krisztián we think that post-01-merge.sh is 
not required if we fix post-12-bump-versions.sh to support minor releases. 
Investigate, fix and remove if not necessary.)

> [Release] Post merge script for release should not be necessary with the new 
> workflow
> -
>
> Key: ARROW-16667
> URL: https://issues.apache.org/jira/browse/ARROW-16667
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> In discussion with Krisztián we think that post-01-merge.sh is not required 
> as we should be using archery cherry-pick on the maintenance branch instead 
> of creating branches and cherry picking manually for patch releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Fix Version/s: 8.0.2

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0, 8.0.2, 6.0.2, 7.0.1
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Fix Version/s: 7.0.1

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0, 6.0.2, 7.0.1
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Fix Version/s: 6.0.2
   (was: 6.0.3)

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0, 6.0.2
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Fix Version/s: 6.0.2

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.2, 9.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Summary: [Go] Update testify to fix security vulnerability  (was: [Go] 
update testify to fix security vulnerability)

> [Go] Update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16759) [Go] update testify to fix security vulnerability

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16759:
-
Summary: [Go] update testify to fix security vulnerability  (was: [Go])

> [Go] update testify to fix security vulnerability
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Assignee: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> with an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain or by other tools like Snyk that scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [Minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to visit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17064) Python hangs when pyarrow.fs.copy_files is used with "use_threads=True"

2022-07-13 Thread Alejandro Marco Ramos (Jira)
Alejandro Marco Ramos created ARROW-17064:
-

 Summary: Python hangs when pyarrow.fs.copy_files is used with 
"use_threads=True"
 Key: ARROW-17064
 URL: https://issues.apache.org/jira/browse/ARROW-17064
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
Reporter: Alejandro Marco Ramos


When trying to copy a local path to an S3 remote filesystem using 
`pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the 
system hangs. With `use_threads=False` the operation completes correctly (but 
more slowly).

 

My code is:
{code:java}
>>> import pyarrow as pa
>>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx")
>>> pa.fs.copy_files("tests/data/payments", "bucket/payments", 
>>> destination_filesystem=s3fs)
... (doesn't return){code}
If I check the remote S3, all the files appear, but the function doesn't return.
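For completeness, the workaround implied by the description is a one-parameter
change (a sketch; the endpoint and paths are the placeholders from the report):

{code:python}
import pyarrow as pa
import pyarrow.fs  # make sure the fs submodule is loaded

s3fs = pa.fs.S3FileSystem(endpoint_override="http://xx")
# use_threads=False avoids the hang, at the cost of a slower copy.
pa.fs.copy_files("tests/data/payments", "bucket/payments",
                 destination_filesystem=s3fs, use_threads=False)
{code}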



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17049) [C++] arrow-compute-expression-benchmark aborts with sanity check failure

2022-07-13 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566202#comment-17566202
 ] 

Antoine Pitrou commented on ARROW-17049:


Yes!

> [C++] arrow-compute-expression-benchmark aborts with sanity check failure
> -
>
> Key: ARROW-17049
> URL: https://issues.apache.org/jira/browse/ARROW-17049
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 9.0.0
>
>
> {code}
> $ ./build-release/relwithdebinfo/arrow-compute-expression-benchmark 
> 2022-07-12T11:56:06+02:00
> Running ./build-release/relwithdebinfo/arrow-compute-expression-benchmark
> Run on (24 X 3800 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x12)
>   L1 Instruction 32 KiB (x12)
>   L2 Unified 512 KiB (x12)
>   L3 Unified 16384 KiB (x4)
> Load Average: 0.44, 3.87, 2.60
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> -
> Benchmark 
>   Time CPU   Iterations
> -
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_simple   
>5734 ns 5733 ns   122775
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_simple 
>9094 ns 9092 ns76172
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_dictionary   
>   12992 ns12989 ns53601
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_dictionary 
>   16395 ns16392 ns42601
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_simple   
>5756 ns 5755 ns   120485
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_simple 
>9197 ns 9195 ns76168
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_dictionary   
>   12875 ns12872 ns54240
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_dictionary 
>   16567 ns16563 ns42539
> BindAndEvaluate/simple_array  
> 255 ns  255 ns  2748813
> BindAndEvaluate/simple_scalar 
> 252 ns  252 ns  2765200
> BindAndEvaluate/nested_array  
>2251 ns 2251 ns   310424
> BindAndEvaluate/nested_scalar 
>2687 ns 2686 ns   261939
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length
> Abandon
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17049) [C++] arrow-compute-expression-benchmark aborts with sanity check failure

2022-07-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-17049.
--
Fix Version/s: (was: 9.0.0)
   Resolution: Duplicate

> [C++] arrow-compute-expression-benchmark aborts with sanity check failure
> -
>
> Key: ARROW-17049
> URL: https://issues.apache.org/jira/browse/ARROW-17049
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Antoine Pitrou
>Priority: Blocker
>
> {code}
> $ ./build-release/relwithdebinfo/arrow-compute-expression-benchmark 
> 2022-07-12T11:56:06+02:00
> Running ./build-release/relwithdebinfo/arrow-compute-expression-benchmark
> Run on (24 X 3800 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x12)
>   L1 Instruction 32 KiB (x12)
>   L2 Unified 512 KiB (x12)
>   L3 Unified 16384 KiB (x4)
> Load Average: 0.44, 3.87, 2.60
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> -
> Benchmark 
>   Time CPU   Iterations
> -
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_simple   
>5734 ns 5733 ns   122775
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_simple 
>9094 ns 9092 ns76172
> SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_dictionary   
>   12992 ns12989 ns53601
> SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_dictionary 
>   16395 ns16392 ns42601
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_simple   
>5756 ns 5755 ns   120485
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_simple 
>9197 ns 9195 ns76168
> SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_dictionary   
>   12875 ns12872 ns54240
> SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_dictionary 
>   16567 ns16563 ns42539
> BindAndEvaluate/simple_array  
> 255 ns  255 ns  2748813
> BindAndEvaluate/simple_scalar 
> 252 ns  252 ns  2765200
> BindAndEvaluate/nested_array  
>2251 ns 2251 ns   310424
> BindAndEvaluate/nested_scalar 
>2687 ns 2686 ns   261939
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length
> Abandon
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15213) [Ruby] Add bindings for between kernel

2022-07-13 Thread Benson Muite (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Muite reassigned ARROW-15213:


Assignee: Benson Muite

> [Ruby] Add bindings for between kernel
> --
>
> Key: ARROW-15213
> URL: https://issues.apache.org/jira/browse/ARROW-15213
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Major
>
> Ruby bindings for the between kernel. A follow-on to 
> https://issues.apache.org/jira/browse/ARROW-9843



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-6322) [C#] Implement a plasma client

2022-07-13 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566190#comment-17566190
 ] 

Kouhei Sutou commented on ARROW-6322:
-

[~eerhardt] Can we close this because Plasma is deprecated?

> [C#] Implement a plasma client
> --
>
> Key: ARROW-6322
> URL: https://issues.apache.org/jira/browse/ARROW-6322
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Eric Erhardt
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> We should create a C# plasma client, so .NET code can get and put objects 
> into the plasma store.
> An easy-ish way of implementing this would be to build on the c_glib C APIs 
> already exposed for the plasma client. Unfortunately, I haven't found a 
> decent C# GObject generator, so I think the C bindings will need to be 
> written by hand, but there aren't too many of them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-10911) [C++] Improve *_SOURCE CMake variables naming

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10911:


Assignee: Kouhei Sutou

> [C++] Improve *_SOURCE CMake variables naming
> -
>
> Key: ARROW-10911
> URL: https://issues.apache.org/jira/browse/ARROW-10911
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> https://github.com/apache/arrow/pull/8908#issuecomment-744780934
> {quote}
> > This change also renamed our Boost dependency name to "Boost" from
> "BOOST". It means that users need to use -DBoost_SOURCE not
> -DBOOST_SOURCE. To keep backward compatibility, -DBOOST_SOURCE is
> still accepted when -DBoost_SOURCE isn't specified.
> > Users also need to use -Dre2_SOURCE not -DRE2_SOURCE. To keep backward
> compatibility, -DRE2_SOURCE is still accepted when -Dre2_SOURCE isn't
> specified.
> I would love to have this kind of case-insensitive handling for all 
> dependencies. This has tripped me up many times and it is difficult to 
> explain to others why everything else is ALL_CAPS but these dependencies are 
> a mix.
> {quote}
> https://github.com/apache/arrow/pull/8908#issuecomment-744898897
> {quote}
> OK. How about using `ARROW_${UPPERCASE_DEPENDENCY_NAME}_SOURCE` CMake 
> variables for them like `ARROW_*_USE_SHARED`?
> If it sounds reasonable, we can work on it as a separated task.
> {quote}
> https://github.com/apache/arrow/pull/8908#issuecomment-744954917
> {quote}
> Why does it need the `ARROW_` namespace prefix?
> I'm fine with anything that is intuitive and trivial to document.
> {quote}
> https://github.com/apache/arrow/pull/8908#issuecomment-745005158
> {quote}
> Because of consistency.
> If we use `ARROW_${UPPERCASE_DEPENDENCY_NAME}_SOURCE` not 
> `${UPPERCASE_DEPENDENCY_NAME}_SOURCE`,  we can explain that you can customize 
> how to use `${DEPENDENCY}` by 
> `ARROW_${UPPERCASE_DEPENDENCY_NAME}_{SOURCE,USE_SHARED}` CMake variables. 
> It'll be more intuitive than using `${UPPERCASE_DEPENDENCY_NAME}_SOURCE` and 
> `ARROW_${UPPERCASE_DEPENDENCY_NAME}_USE_SHARED`.
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17061) [Python][Substrait] Acero consumer is unable to consume count function from substrait query plan

2022-07-13 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-17061:


Assignee: Vibhatha Lakmal Abeykoon

> [Python][Substrait] Acero consumer is unable to consume count function from 
> substrait query plan
> 
>
> Key: ARROW-17061
> URL: https://issues.apache.org/jira/browse/ARROW-17061
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> SQL
> {code:java}
> SELECT
> o_orderpriority,
> count(*) AS order_count
> FROM
> orders
> GROUP BY
> o_orderpriority{code}
> The Substrait plan was generated from the SQL using Isthmus.
>  
> substrait count: 
> [https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml]
>  
> Running the substrait plan with Acero returns this error:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.aggregate.measures[0].measure) 
> arguments: Cannot find field.  {code}
>  
> From substrait query plan:
> relations[0].root.input.aggregate.measures[0].measure
> {code:java}
> "measure": {
>   "functionReference": 0,
>   "args": [],
>   "sorts": [],
>   "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT",
>   "outputType": {
> "i64": {
>   "typeVariationReference": 0,
>   "nullability": "NULLABILITY_REQUIRED"
> }
>   },
>   "invocation": "AGGREGATION_INVOCATION_ALL",
>   "arguments": []
> }{code}
> {code:java}
> "extensions": [{
>   "extensionFunction": {
> "extensionUriReference": 1,
> "functionAnchor": 0,
> "name": "count:opt"
>   }
> }],{code}
> Count is a unary function and should be consumable, but isn't in this case.
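For reproduction context, plans like this are typically handed to Acero through
pyarrow's experimental Substrait consumer, roughly as below (a hedged sketch:
run_query is the experimental entry point, and the serialized-plan file name
is hypothetical):

{code:python}
import pyarrow.substrait as pa_substrait

# Hypothetical file holding the Isthmus-generated plan serialized to
# protobuf bytes; consuming it raises ArrowInvalid for the count measure.
with open("count_plan.pb", "rb") as f:
    plan_bytes = f.read()

reader = pa_substrait.run_query(plan_bytes)
table = reader.read_all()
{code}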



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-7131) [GLib][CI] Fail to execute lua examples in the MacOS build

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-7131.
---
Resolution: Won't Do

We don't need this.

> [GLib][CI] Fail to execute lua examples in the MacOS build
> --
>
> Key: ARROW-7131
> URL: https://issues.apache.org/jira/browse/ARROW-7131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration, GLib
>Reporter: Krisztian Szucs
>Priority: Major
>
> Fails to load 'lgi.corelgilua51' despite lgi being installed in the macOS 
> build.
> References:
> - https://github.com/apache/arrow/blob/master/.github/workflows/ruby.yml#L77
> - https://github.com/apache/arrow/blob/master/ci/scripts/c_glib_test.sh#L35



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-14316) [CI] extends is removed from docker v3

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-14316.

Resolution: Invalid

We can use extends again with recent docker-compose.

> [CI] extends is removed from docker v3
> --
>
> Key: ARROW-14316
> URL: https://issues.apache.org/jira/browse/ARROW-14316
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Benson Muite
>Priority: Minor
>
> As explained in [https://github.com/docker/compose/issues/4315], extends has 
> been removed from the docker compose v3 schema; it should therefore be removed 
> from the schema used in Arrow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-3902) [Gandiva] [C++] Remove static c++ linked in Gandiva.

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-3902.
---
Resolution: Invalid

Nothing further needs to be done for this now.

> [Gandiva] [C++] Remove static c++ linked in Gandiva.
> 
>
> Key: ARROW-3902
> URL: https://issues.apache.org/jira/browse/ARROW-3902
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Affects Versions: 0.12.0
>Reporter: Praveen Kumar
>Priority: Major
>
> Hi,
> [~wesm_impala_7e40], I am looking into switching Gandiva to the Red Hat 
> developer toolchain. We are not too familiar with it and not sure of the 
> effort required there.
> In the meanwhile, for the short term, can we get Crossbow builds to do 
> static linking only for Dremio builds (through a Travis env variable), and 
> have Arrow ship Gandiva linked to std-c++ dynamically?
> We can then move to the Red Hat toolchain for the 0.13 version of Arrow.
> Thx.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14360) [Ruby] Add DSL to build expression

2022-07-13 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-14360:


Assignee: Kouhei Sutou

> [Ruby] Add DSL to build expression
> --
>
> Key: ARROW-14360
> URL: https://issues.apache.org/jira/browse/ARROW-14360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

