[jira] [Created] (ARROW-16907) [C++][R][CI] homebrew-r-brew job always installs from apache/arrow master

2022-06-25 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16907:
---

 Summary: [C++][R][CI] homebrew-r-brew job always installs from 
apache/arrow master
 Key: ARROW-16907
 URL: https://issues.apache.org/jira/browse/ARROW-16907
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration
Reporter: Neal Richardson


In one run on ARROW-16510:

{code}
  brew install -v --HEAD apache-arrow
  # for testing
  brew install minio
  shell: /usr/local/bin/bash -e {0}
  env:
ARROW_GLIB_FORMULA: ./arrow/dev/tasks/homebrew-formulae/apache-arrow-glib.rb
Warning: apache-arrow HEAD-65a6929 is already installed and up-to-date.
To reinstall HEAD, run:
  brew reinstall apache-arrow
{code}

But 65a6929 is the SHA of apache/arrow@master, not the SHA of the commit being 
tested: 
https://github.com/ursacomputing/crossbow/runs/7055249700?check_suite_focus=true#step:3:17

I tried to force an uninstall and then a reinstall, but that errored 
differently: 
https://github.com/ursacomputing/crossbow/runs/7056474511?check_suite_focus=true#step:4:29

Note that the revision pinning logic does work correctly in the autobrew 
version of this job.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-06-25 Thread Robert On (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert On updated ARROW-16904:
--
Description: 
 

The following code produces non-deterministic results when getting the minimum 
value of a sequence of 1e6 integers.
{code:java}
sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 100,000
  arrow::write_parquet(
    data.frame(val = 1:1e5), "test.parquet")
  # find minimum value
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 1,000,000
  arrow::write_parquet(
    data.frame(val = 1:1e6), "test.parquet")
  # find minimum value
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()
{code}
The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 100 
times.

The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
{code:java}
.
  1 
100 

.
     1 131073 262145 393217 
    65     25      8      2 {code}
 

 

  was:
 

The following code produces non-deterministic results when getting the minimum 
value of a sequence of 1e6 integers.
{code:java}
sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 100,000
  arrow::write_parquet(
    data.frame(val = 1:1e5), "test.parquet")  
arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 1,000,000
  arrow::write_parquet(
    data.frame(val = 1:1e6), "test.parquet")
  
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()
{code}
The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 100 
times.

The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
{code:java}
.
  1 
100 

.
     1 131073 262145 393217 
    65     25      8      2 {code}
 

 


> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Priority: Blocker
> Fix For: 9.0.0
>
>
>  
> The following code produces non-deterministic results when getting the minimum 
> value of a sequence of 1e6 integers.
> {code:java}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 
> 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
> 65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
> just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
> {code:java}
> .
>   1 
> 100 
>
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets

2022-06-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16692:
---
Priority: Blocker  (was: Major)

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
> Attachments: backtrace.txt
>
>
> I'm still working on a minimal reproducer for this. I can reliably reproduce 
> it with the code below, though that means downloading a bunch of data first. 
> I've cleaned out much of the unnecessary code (so the query below is a bit 
> silly, and not what I'm actually trying to do), but I haven't been able to 
> make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned-up taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most runs end in a segfault (though I have gotten it to work on occasion). 
> I've tried with smaller files and constructed datasets and haven't been able 
> to replicate it yet. One thing that might be important: 
> {{pickup_location_id}} is all NA/null for roughly the first 8 years of the 
> data.
> I've attached a backtrace in case that's enough to see what's going on here.
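
For reference, a sketch of the kind of constructed dataset described above 
(all-null {{pickup_location_id}} in the earliest files). Per the report, 
datasets like this have not reproduced the crash, and the directory name, file 
count, and sizes here are illustrative assumptions only:

{code}
library(arrow)
library(dplyr)

# Ten files; the first eight have an entirely-null pickup_location_id column,
# mirroring "all NA/null for roughly the first 8 years of the data"
dir.create("toy_taxi")
for (i in 1:10) {
  vals <- if (i <= 8) rep(NA_integer_, 1e5) else 1:1e5
  write_parquet(
    data.frame(pickup_location_id = vals),
    sprintf("toy_taxi/part-%02d.parquet", i)
  )
}

# Same query shape as the failing one above
ds <- open_dataset("toy_taxi")
ds %>%
  filter(!is.na(pickup_location_id)) %>%
  summarise(n = n()) %>%
  collect()
{code}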



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets

2022-06-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16692:
---
Fix Version/s: 9.0.0

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
> Attachments: backtrace.txt
>
>
> I'm still working on a minimal reproducer for this. I can reliably reproduce 
> it with the code below, though that means downloading a bunch of data first. 
> I've cleaned out much of the unnecessary code (so the query below is a bit 
> silly, and not what I'm actually trying to do), but I haven't been able to 
> make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned-up taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most runs end in a segfault (though I have gotten it to work on occasion). 
> I've tried with smaller files and constructed datasets and haven't been able 
> to replicate it yet. One thing that might be important: 
> {{pickup_location_id}} is all NA/null for roughly the first 8 years of the 
> data.
> I've attached a backtrace in case that's enough to see what's going on here.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16906) [C++][CI] Enable ARROW_GCS on MinGW workflows

2022-06-25 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16906:
---

 Summary: [C++][CI] Enable ARROW_GCS on MinGW workflows
 Key: ARROW-16906
 URL: https://issues.apache.org/jira/browse/ARROW-16906
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Continuous Integration
Reporter: Neal Richardson


See discussion at 
https://github.com/apache/arrow/pull/13404#issuecomment-1166353926



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-06-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558825#comment-17558825
 ] 

Neal Richardson commented on ARROW-16904:
-

That's not good. I am pretty sure this is the same issue as ARROW-16807. I can 
reproduce it with the starwars data example used there: 

{code}
> replicate(100, ds %>% summarize(min(height, na.rm = TRUE)) %>% pull())
  [1] 79 79 66 66 79 79 66 79 66 66 66 66 66 79 66 66 66 66 79 94 88 79 66 79 79
 [26] 66 66 79 66 66 66 66 66 66 66 79 79 66 79 79 66 88 79 66 66 94 66 66 66 79
 [51] 66 66 66 66 79 66 66 66 79 66 94 79 79 79 66 79 66 79 79 66 79 66 79 66 88
 [76] 88 66 66 66 66 66 66 66 66 66 66 66 79 79 66 66 79 66 66 66 66 66 66 66 66
{code}
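
A rough way to check the row-group connection locally (a sketch; it assumes 
{{write_parquet()}}'s {{chunk_size}} argument and 
{{ParquetFileReader$num_row_groups}} behave as documented in the arrow R 
package):

{code}
library(arrow)
library(dplyr)

# The failing repro: by default 1e6 rows get split across several row groups
write_parquet(data.frame(val = 1:1e6), "test.parquet")
ParquetFileReader$create("test.parquet")$num_row_groups  # more than 1 here

# Forcing a single row group should, if the diagnosis is right, stabilize min()
write_parquet(data.frame(val = 1:1e6), "single.parquet", chunk_size = 1e6)
open_dataset("single.parquet") %>%
  summarise(min_val = min(val)) %>%
  collect()
{code}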


> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Priority: Blocker
> Fix For: 9.0.0
>
>
>  
> The following code produces non-deterministic results when getting the minimum 
> value of a sequence of 1e6 integers.
> {code:java}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")  
> arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 
> 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
> 65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
> just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
> {code:java}
> .
>   1 
> 100 
>
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16904:

Priority: Blocker  (was: Critical)

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Priority: Blocker
>
>  
> The following code produces non-deterministic results when getting the minimum 
> value of a sequence of 1e6 integers.
> {code:java}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")  
> arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 
> 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
> 65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
> just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
> {code:java}
> .
>   1 
> 100 
>
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16904:

Fix Version/s: 9.0.0

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Priority: Blocker
> Fix For: 9.0.0
>
>
>  
> The following code produces non-deterministic results when getting the minimum 
> value of a sequence of 1e6 integers.
> {code:java}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")  
> arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 
> 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
> 65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
> just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
> {code:java}
> .
>   1 
> 100 
>
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16904:

Summary: [C++] min/max not deterministic if Parquet files have multiple row 
groups  (was: dplyr summarise using min/max aggregate function 
non-deterministic for large number of elements)

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -
>
> Key: ARROW-16904
> URL: https://issues.apache.org/jira/browse/ARROW-16904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 20.04.4 LTS
> Release:20.04
> Codename:   focal
>Reporter: Robert On
>Priority: Critical
>
>  
> The following code produces non-deterministic results when getting the minimum 
> value of a sequence of 1e6 integers.
> {code:java}
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")  
> arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 
> 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
> 65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
> just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
> {code:java}
> .
>   1 
> 100 
>
> .
>      1 131073 262145 393217 
>     65     25      8      2 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16905) Table.to_pandas() fails for dictionary encoded columns with an is_null partition_expression

2022-06-25 Thread Thomas Newton (Jira)
Thomas Newton created ARROW-16905:
-

 Summary: Table.to_pandas() fails for dictionary encoded columns 
with an is_null partition_expression
 Key: ARROW-16905
 URL: https://issues.apache.org/jira/browse/ARROW-16905
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
 Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
Reporter: Thomas Newton
 Attachments: reproduce_null_dictionary_issue.zip

Minimal steps to reproduce:

I attached a `.zip` file containing a Python script and a test Parquet file; 
running the script reproduces the issue.

The steps taken to reproduce:
 # Create a test Parquet file with one column containing only nulls.
 # Create a Parquet fragment from this file, adding a `partition_expression` 
with an `is_null` guarantee on the fragment.
 # Create a `FileSystemDataset` from this fragment, setting the schema so the 
column is dictionary-encoded.
 # Call `.to_table().to_pandas()` on the resulting PyArrow dataset. It fails 
with the following error:

{code:java}
  File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in 
validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null {code}
 

My understanding of why this doesn't work:
 # There are 2 ways of dictionary-encoding nulls, `mask` and `encode`, as 
described in the [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions]. 
PyArrow supports both, but pandas categoricals only support `mask`. Arguably 
the real issue here is that pandas should support `encode`-style categoricals.
 # When you provide an `is_null` guarantee on a fragment, Arrow will not 
actually read the data: it knows the type from the schema, the guarantee says 
the values are all null, and it can get the length from the Parquet metadata, 
so it has everything it needs.
 # Instead of reading the data it uses the [Null 
ArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc]. 
For dictionary-type columns I believe that calls [this DictionaryArray 
constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93], 
which appears to create the dictionary in the `encode` style.

Would it be possible to make this configurable? The `mask` style of dictionary 
encoding seems to be the default elsewhere in PyArrow, and it would solve the 
pandas compatibility issue. I appreciate this is probably an extremely niche 
issue, but my options for a workaround are looking pretty horrible. 
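
To make the two styles concrete, here is a sketch using arrow's R bindings (my 
own illustration, not from this report; the `DictionaryArray$create()` call and 
its argument order are assumptions about that API):

{code}
library(arrow)

# "mask" style: the null lives in the indices; the dictionary has no nulls
mask_style <- DictionaryArray$create(
  Array$create(c(0L, NA_integer_)),   # index 0, then a masked-out null
  Array$create("a")                   # dictionary contains no null entry
)

# "encode" style: the null is itself a dictionary entry, referenced by index
encode_style <- DictionaryArray$create(
  Array$create(c(0L, 1L)),             # both indices are valid
  Array$create(c("a", NA_character_))  # dictionary contains the null
)
{code}

pandas' `Categorical` validation quoted above rejects the `encode` shape 
because the category list itself contains a null.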



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied

2022-06-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16753:


Assignee: David Rauschenbach

> [C++] LocalFileSystem cannot list Linux directory recursively when permission 
> to subdirectory contents are denied
> -
>
> Key: ARROW-16753
> URL: https://issues.apache.org/jira/browse/ARROW-16753
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
> Environment: Ubuntu 20.04 LTS
>Reporter: David Rauschenbach
>Assignee: David Rauschenbach
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The following code to list my root directory fails:
>  
> {code:java}
> FileSelector file_selector;
> file_selector.base_dir = "/";
> file_selector.allow_not_found = true;
> file_selector.recursive = true;
> auto result = fs.GetFileInfo(file_selector);{code}
> The result.ok() value returns {+}false{+}, and then result.status().message() 
> returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. 
> An examination of the /run directory (which /var/run symlinks to) shows:
>  
> {code:java}
> $ ls -al /run
> drwxr-xr-x 35 root              root  1040 Jun  6 06:11 .
> drwxr-xr-x 20 root              root  4096 May 20 12:42 ..
> ...
> drwxr-x---  2 root              root    60 Jun  4 12:14 wpa_supplicant{code}
> And then attempting to list this directory reveals:
>  
> {code:java}
> $ ls -al /run/wpa_supplicant/
> ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code}
>  
> As a user of LocalFileSystem, I should be able to list all of the files that 
> I have access to.
>  
>  
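
The same listing is reachable from the arrow R bindings, for anyone who finds 
that easier to test (a sketch; it assumes a subdirectory the process cannot 
read, like /run/wpa_supplicant above):

{code}
library(arrow)

# Recursively list "/"; a single unreadable subdirectory fails the whole call
fs <- LocalFileSystem$create()
selector <- FileSelector$create("/", allow_not_found = TRUE, recursive = TRUE)

# Throws "Cannot list directory ..." instead of skipping the unreadable entry
infos <- fs$GetFileInfo(selector)
{code}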



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied

2022-06-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16753:
---
Labels: pull-request-available  (was: )

> [C++] LocalFileSystem cannot list Linux directory recursively when permission 
> to subdirectory contents are denied
> -
>
> Key: ARROW-16753
> URL: https://issues.apache.org/jira/browse/ARROW-16753
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
> Environment: Ubuntu 20.04 LTS
>Reporter: David Rauschenbach
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following code to list my root directory fails:
>  
> {code:java}
> FileSelector file_selector;
> file_selector.base_dir = "/";
> file_selector.allow_not_found = true;
> file_selector.recursive = true;
> auto result = fs.GetFileInfo(file_selector);{code}
> The result.ok() value returns {+}false{+}, and then result.status().message() 
> returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. 
> An examination of the /run directory (which /var/run symlinks to) shows:
>  
> {code:java}
> $ ls -al /run
> drwxr-xr-x 35 root              root  1040 Jun  6 06:11 .
> drwxr-xr-x 20 root              root  4096 May 20 12:42 ..
> ...
> drwxr-x---  2 root              root    60 Jun  4 12:14 wpa_supplicant{code}
> And then attempting to list this directory reveals:
>  
> {code:java}
> $ ls -al /run/wpa_supplicant/
> ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code}
>  
> As a user of LocalFileSystem, I should be able to list all of the files that 
> I have access to.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16904) dplyr summarise using min/max aggregate function non-deterministic for large number of elements

2022-06-25 Thread Robert On (Jira)
Robert On created ARROW-16904:
-

 Summary: dplyr summarise using min/max aggregate function 
non-deterministic for large number of elements
 Key: ARROW-16904
 URL: https://issues.apache.org/jira/browse/ARROW-16904
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 8.0.0
 Environment: $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 20.04.4 LTS
Release:20.04
Codename:   focal
Reporter: Robert On


 

The following code produces non-deterministic results when getting the minimum 
value of a sequence of 1e6 integers.
{code:java}
sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 100,000
  arrow::write_parquet(
    data.frame(val = 1:1e5), "test.parquet")  
arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 1,000,000
  arrow::write_parquet(
    data.frame(val = 1:1e6), "test.parquet")
  
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()
{code}
The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 100 
times.

The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 
65 out of 100 times; they instead return 131073, 262145, and 393217 (values 
just past multiples of 131072, i.e. 2^17) 25, 8, and 2 times respectively.
{code:java}
.
  1 
100 

.
     1 131073 262145 393217 
    65     25      8      2 {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16878:

Description: 
On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it 
in the arrow build. A better solution would be to put google-cloud-cpp in 
rtools-packages so we don't have to build it every time. 

There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
either we'd have to write one for rtools-packages, or we'd use the bundled 
google-cloud-cpp in our CMake build and see if we can put as many of its 
dependencies as possible in rtools-packages to ease the build. Either way, we'd 
want to start by adding its dependencies.

https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
exists in MINGW-packages and could be brought over, but I don't think it's a 
big deal if it is bundled.

https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
 exists and could be brought over, but note that it uses C++17. That doesn't 
seem to be a hard requirement, at least for what we're using, since we're 
building it with C++11.


  was:
On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it 
in the arrow build. A better solution would be to put google-cloud-cpp in 
rtools-packages so we don't have to build it every time. 

There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
either we'd have to write one for rtools-packages, or we'd use the bundled 
google-cloud-cpp in our CMake build and see if we can put as many of its 
dependencies as possible in rtools-packages to ease the build. 

https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
exists in MINGW-packages and could be brought over, but I don't think it's a 
big deal if it is bundled.

https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
 exists and could be brought over, but note that it uses C++17. That doesn't 
seem to be a hard requirement, at least for what we're using, since we're 
building it with C++11.



> [R] Move Windows GCS dependency building upstream
> -
>
> Key: ARROW-16878
> URL: https://issues.apache.org/jira/browse/ARROW-16878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging, R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.0
>
>
> On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it 
> in the arrow build. A better solution would be to put google-cloud-cpp in 
> rtools-packages so we don't have to build it every time. 
> There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
> either we'd have to write one for rtools-packages, or we'd use the bundled 
> google-cloud-cpp in our CMake build and see if we can put as many of its 
> dependencies as possible in rtools-packages to ease the build. Either way, 
> we'd want to start by adding its dependencies.
> https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
> exists in MINGW-packages and could be brought over, but I don't think it's a 
> big deal if it is bundled.
> https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
>  exists and could be brought over, but note that it uses C++17. That doesn't 
> seem to be a hard requirement, at least for what we're using, since we're 
> building it with C++11.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16878:

Description: 
On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it 
in the arrow build. A better solution would be to put google-cloud-cpp in 
rtools-packages so we don't have to build it every time. 

There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
either we'd have to write one for rtools-packages, or we'd use the bundled 
google-cloud-cpp in our CMake build and see if we can put as many of its 
dependencies as possible in rtools-packages to ease the build. 

https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
exists in MINGW-packages and could be brought over, but I don't think it's a 
big deal if it is bundled.

https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
 exists and could be brought over, but note that it uses C++17. That doesn't 
seem to be a hard requirement, at least for what we're using, since we're 
building it with C++11.


  was:
On ARROW-16510, I made some progress on this but had to back out the changes. 
There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
either we'd have to write one for rtools-packages, or we'd use the bundled 
google-cloud-cpp in our CMake build and see if we can put as many of its 
dependencies as possible in rtools-packages to ease the build. 

https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
exists in MINGW-packages and could be brought over, but I don't think it's a 
big deal if it is bundled.

https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
 exists and could be brought over, but note that it uses C++17. 

It may be that we have to bump arrow up to C++17 for this to work anyway, 
judging from the undefined-symbol errors I got (see below). 
https://github.com/msys2/MINGW-packages/pull/11758 and 
https://github.com/apache/arrow/pull/13407 suggest maybe so. That should work 
ok for rtools >= 40, but we'll see what other problems it brings.

In case it's relevant, 
https://github.com/r-windows/rtools-packages/blob/master/mingw-w64-grpc/PKGBUILD
 exists in rtools-packages already, and grpc depends on abseil, but note how it 
handles abseil. It doesn't use all of the same parts of abseil that 
google-cloud-cpp does, so maybe that's fine there but not here?

Here's the diff I backed out of ARROW-16510, and below that are the 
undefined-symbol messages from the build failure. There's something up with 
libcurl too that I don't understand, since I did add it.

{code}
diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD
index b9b0194f5c8c..566ec881e404 100644
--- a/ci/scripts/PKGBUILD
+++ b/ci/scripts/PKGBUILD
@@ -25,6 +25,7 @@ arch=("any")
 url="https://arrow.apache.org/";
 license=("Apache-2.0")
 depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp"
+ "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build
  "${MINGW_PACKAGE_PREFIX}-libutf8proc"
  "${MINGW_PACKAGE_PREFIX}-re2"
  "${MINGW_PACKAGE_PREFIX}-thrift"
@@ -79,11 +80,13 @@ build() {
 export PATH="/C/Rtools${MINGW_PREFIX/mingw/mingw_}/bin:$PATH"
 export CPPFLAGS="${CPPFLAGS} -I${MINGW_PREFIX}/include"
 export LIBS="-L${MINGW_PREFIX}/libs"
+export ARROW_GCS=OFF
 export ARROW_S3=OFF
 export ARROW_WITH_RE2=OFF
 # Without this, some dataset functionality segfaults
 export CMAKE_UNITY_BUILD=ON
   else
+export ARROW_GCS=ON
 export ARROW_S3=ON
 export ARROW_WITH_RE2=ON
 # Without this, some compute functionality segfaults in tests
@@ -101,6 +104,7 @@ build() {
 -DARROW_CSV=ON \
 -DARROW_DATASET=ON \
 -DARROW_FILESYSTEM=ON \
+-DARROW_GCS="${ARROW_GCS}" \
 -DARROW_HDFS=OFF \
 -DARROW_JEMALLOC=OFF \
 -DARROW_JSON=ON \
diff --git a/ci/scripts/r_windows_build.sh b/ci/scripts/r_windows_build.sh
index 89d5737a09bd..3334eab8663a 100755
--- a/ci/scripts/r_windows_build.sh
+++ b/ci/scripts/r_windows_build.sh
@@ -87,7 +87,7 @@ if [ -d mingw64/lib/ ]; then
   # These may be from https://dl.bintray.com/rtools/backports/
   cp $MSYS_LIB_DIR/mingw64/lib/lib{thrift,snappy}.a 
$DST_DIR/${RWINLIB_LIB_DIR}/x64
   # These are from https://dl.bintray.com/rtools/mingw{32,64}/
-  cp 
$MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a 
$DST_DIR/lib/x64
+  cp 
$MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a
 $DST_DIR/lib/x64
 fi
 
 # Same for the 32-bit versions
@@ -97,7 +97,7 @@ if [ -d mingw32/lib/ ]; then
   mkdir -p $DST_DIR/lib/i386
   mv mingw32/lib/*.a $DST_DIR/${RWINLIB_LIB_DIR}/i386
   cp $MSYS_LIB_DIR/mingw32/lib/lib{thrift,snappy}.a 
$DST_DIR/${RWINLIB_LIB_DIR}/i386
-  cp 
$MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a 
$DST_DIR/lib/i386
+  cp 
$MSYS_LIB_DIR/mingw32/lib/lib

[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream

2022-06-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16878:

Summary: [R] Move Windows GCS dependency building upstream  (was: [R] Add 
GCS to Windows builds)

> [R] Move Windows GCS dependency building upstream
> -
>
> Key: ARROW-16878
> URL: https://issues.apache.org/jira/browse/ARROW-16878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging, R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.0
>
>
> On ARROW-16510, I made some progress on this but had to back out the changes. 
> There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
> either we'd have to write one for rtools-packages, or we'd use the bundled 
> google-cloud-cpp in our CMake build and see if we can put as many of its 
> dependencies as possible in rtools-packages to ease the build. 
> https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
> exists in MINGW-packages and could be brought over, but I don't think it's a 
> big deal if it is bundled.
> https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
>  exists and could be brought over, but note that it uses C++17. 
> It may be that we have to bump arrow up to C++17 for this to work anyway, 
> judging from the undefined-symbol errors I got (see below). 
> https://github.com/msys2/MINGW-packages/pull/11758 and 
> https://github.com/apache/arrow/pull/13407 suggest maybe so. That should work 
> ok for rtools >= 40, but we'll see what other problems it brings.
> In case it's relevant, 
> https://github.com/r-windows/rtools-packages/blob/master/mingw-w64-grpc/PKGBUILD
>  exists in rtools-packages already, and grpc depends on abseil, but note how 
> it handles abseil. It doesn't use all of the same parts of abseil that 
> google-cloud-cpp does, so maybe that's fine there but not here?
> Here's the diff I backed out of ARROW-16510, and below that are the 
> undefined-symbol messages from the build failure. There's something up with 
> libcurl too that I don't understand, since I did add it.
> {code}
> diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD
> index b9b0194f5c8c..566ec881e404 100644
> --- a/ci/scripts/PKGBUILD
> +++ b/ci/scripts/PKGBUILD
> @@ -25,6 +25,7 @@ arch=("any")
>  url="https://arrow.apache.org/";
>  license=("Apache-2.0")
>  depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp"
> + "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build
>   "${MINGW_PACKAGE_PREFIX}-libutf8proc"
>   "${MINGW_PACKAGE_PREFIX}-re2"
>   "${MINGW_PACKAGE_PREFIX}-thrift"
> @@ -79,11 +80,13 @@ build() {
>  export PATH="/C/Rtools${MINGW_PREFIX/mingw/mingw_}/bin:$PATH"
>  export CPPFLAGS="${CPPFLAGS} -I${MINGW_PREFIX}/include"
>  export LIBS="-L${MINGW_PREFIX}/libs"
> +export ARROW_GCS=OFF
>  export ARROW_S3=OFF
>  export ARROW_WITH_RE2=OFF
>  # Without this, some dataset functionality segfaults
>  export CMAKE_UNITY_BUILD=ON
>else
> +export ARROW_GCS=ON
>  export ARROW_S3=ON
>  export ARROW_WITH_RE2=ON
>  # Without this, some compute functionality segfaults in tests
> @@ -101,6 +104,7 @@ build() {
>  -DARROW_CSV=ON \
>  -DARROW_DATASET=ON \
>  -DARROW_FILESYSTEM=ON \
> +-DARROW_GCS="${ARROW_GCS}" \
>  -DARROW_HDFS=OFF \
>  -DARROW_JEMALLOC=OFF \
>  -DARROW_JSON=ON \
> diff --git a/ci/scripts/r_windows_build.sh b/ci/scripts/r_windows_build.sh
> index 89d5737a09bd..3334eab8663a 100755
> --- a/ci/scripts/r_windows_build.sh
> +++ b/ci/scripts/r_windows_build.sh
> @@ -87,7 +87,7 @@ if [ -d mingw64/lib/ ]; then
># These may be from https://dl.bintray.com/rtools/backports/
>cp $MSYS_LIB_DIR/mingw64/lib/lib{thrift,snappy}.a 
> $DST_DIR/${RWINLIB_LIB_DIR}/x64
># These are from https://dl.bintray.com/rtools/mingw{32,64}/
> -  cp 
> $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a 
> $DST_DIR/lib/x64
> +  cp 
> $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a
>  $DST_DIR/lib/x64
>  fi
>  
>  # Same for the 32-bit versions
> @@ -97,7 +97,7 @@ if [ -d mingw32/lib/ ]; then
>mkdir -p $DST_DIR/lib/i386
>mv mingw32/lib/*.a $DST_DIR/${RWINLIB_LIB_DIR}/i386
>cp $MSYS_LIB_DIR/mingw32/lib/lib{thrift,snappy}.a 
> $DST_DIR/${RWINLIB_LIB_DIR}/i386
> -  cp 
> $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a 
> $DST_DIR/lib/i386
> +  cp 
> $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a
>  $DST_DIR/lib/i386
>  fi
>  
>  # Do the same also for ucrt64
> @@ -105,7 +105,7 @@ if [ -d ucrt64/lib/ ]; then
>ls $MSYS_LIB_DIR/ucrt64/lib/
>mkdir -p $DST_DIR/lib/x64-ucrt
>mv ucrt64/lib/*.a $DS