[jira] [Created] (ARROW-16907) [C++][R][CI] homebrew-r-brew job always installs from apache/arrow master
Neal Richardson created ARROW-16907: --- Summary: [C++][R][CI] homebrew-r-brew job always installs from apache/arrow master Key: ARROW-16907 URL: https://issues.apache.org/jira/browse/ARROW-16907 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration Reporter: Neal Richardson On one run on ARROW-16510: {code} brew install -v --HEAD apache-arrow # for testing brew install minio shell: /usr/local/bin/bash -e {0} env: ARROW_GLIB_FORMULA: ./arrow/dev/tasks/homebrew-formulae/apache-arrow-glib.rb Warning: apache-arrow HEAD-65a6929 is already installed and up-to-date. To reinstall HEAD, run: brew reinstall apache-arrow {code} But, 65a6929 is the SHA of apache/arrow@master, not the SHA of the commit being tested: https://github.com/ursacomputing/crossbow/runs/7055249700?check_suite_focus=true#step:3:17 I tried to force an uninstall and then a reinstall, but that errored differently: https://github.com/ursacomputing/crossbow/runs/7056474511?check_suite_focus=true#step:4:29 Note that the revision pinning logic does work correctly in the autobrew version of this. -- This message was sent by Atlassian Jira (v8.20.7#820007)
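One possible direction, sketched below under assumptions: the FORK/BRANCH variables and the exact head declaration inside the formula are hypothetical, not taken from the CI job. The idea is to rewrite the formula's HEAD source to point at the commit under test (the way the autobrew scripts pin the revision) and force a clean install so a cached HEAD build of apache/arrow@master cannot be reused.
{code}
# Sketch only: point the formula's HEAD at the commit under test before installing.
# FORK and BRANCH are placeholders; the formula's head URL/branch declaration may
# differ from what these sed patterns assume.
FORMULA=arrow/dev/tasks/homebrew-formulae/apache-arrow.rb
sed -i.bak "s|https://github.com/apache/arrow.git|https://github.com/${FORK}/arrow.git|" "${FORMULA}"
sed -i.bak "s|branch: \"master\"|branch: \"${BRANCH}\"|" "${FORMULA}"

# Avoid the "already installed and up-to-date" short-circuit on a stale master build.
brew uninstall --force apache-arrow || true
brew install -v --HEAD "${FORMULA}"
{code}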
[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups
[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert On updated ARROW-16904: -- Description: The following code produces a non-deterministic result when computing the minimum value of a sequence of 1e6 integers. {code:java} sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 100,000 arrow::write_parquet( data.frame(val = 1:1e5), "test.parquet") # find minimum value arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 1,000,000 arrow::write_parquet( data.frame(val = 1:1e6), "test.parquet") # find minimum value arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() {code} The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) all 100 times. The second 100 simulations, using numbers 1 to 1e6, find the minimum value (1) only 65 out of 100 times; the other runs return 131073, 262145, and 393217 (each one past a multiple of 131072 = 2^17, consistent with row-group boundaries), 25, 8, and 2 times respectively. {code:java} . 1 100 . 1 131073 262145 393217 65 25 8 2 {code} was: The following code produces a non-deterministic result when computing the minimum value of a sequence of 1e6 integers. {code:java} sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 100,000 arrow::write_parquet( data.frame(val = 1:1e5), "test.parquet") arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 1,000,000 arrow::write_parquet( data.frame(val = 1:1e6), "test.parquet") arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() {code} The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) all 100 times. The second 100 simulations, using numbers 1 to 1e6, find the minimum value (1) only 65 out of 100 times; the other runs return 131073, 262145, and 393217, 25, 8, and 2 times respectively. {code:java} . 1 100 . 1 131073 262145 393217 65 25 8 2 {code} > [C++] min/max not deterministic if Parquet files have multiple row groups > - > > Key: ARROW-16904 > URL: https://issues.apache.org/jira/browse/ARROW-16904 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 > Environment: $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 20.04.4 LTS > Release: 20.04 > Codename: focal >Reporter: Robert On >Priority: Blocker > Fix For: 9.0.0 > > > > The following code produces a non-deterministic result when computing the minimum > value of a sequence of 1e6 integers.
> {code:java} > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 100,000 > arrow::write_parquet( > data.frame(val = 1:1e5), "test.parquet") > # find minimum value > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 1,000,000 > arrow::write_parquet( > data.frame(val = 1:1e6), "test.parquet") > # find minimum value > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > {code} > The first 100 simulations, using numbers 1 to 1e5, find the minimum > value (1) all 100 times. > The second 100 simulations, using numbers 1 to 1e6, find the minimum > value (1) only 65 out of 100 times; the other runs return 131073, 262145, and > 393217, 25, 8, and 2 times respectively. > {code:java} > . 1 > 100 > . 1 131073 262145 393217 > 65 25 8 2 {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
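A quick way to probe the row-group connection from R (a sketch, not a fix; it relies only on write_parquet()'s chunk_size argument to control how many row groups end up in the file):
{code}
library(arrow)
library(dplyr)

# Same data as above, written once as a single row group and once as ten.
write_parquet(data.frame(val = 1:1e6), "one_group.parquet", chunk_size = 1e6)
write_parquet(data.frame(val = 1:1e6), "ten_groups.parquet", chunk_size = 1e5)

min_of <- function(path) {
  open_dataset(path) %>%
    summarise(min_val = min(val)) %>%
    collect() %>%
    pull(min_val)
}

# If the non-determinism is tied to multiple row groups, the single-group file
# should return 1 on every run, while the multi-group file may not.
table(replicate(100, min_of("one_group.parquet")))
table(replicate(100, min_of("ten_groups.parquet")))
{code}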
[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16692: --- Priority: Blocker (was: Major) > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some examples with the new / more cleaned-up taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time it ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files / constructed datasets and haven't > been able to replicate it yet. One thing that might be important: > {{pickup_location_id}} is all NAs / nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16692: --- Fix Version/s: 9.0.0 > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Fix For: 9.0.0 > > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some examples with the new / more cleaned-up taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time it ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files / constructed datasets and haven't > been able to replicate it yet. One thing that might be important: > {{pickup_location_id}} is all NAs / nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16906) [C++][CI] Enable ARROW_GCS on MinGW workflows
Neal Richardson created ARROW-16906: --- Summary: [C++][CI] Enable ARROW_GCS on MinGW workflows Key: ARROW-16906 URL: https://issues.apache.org/jira/browse/ARROW-16906 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration Reporter: Neal Richardson See discussion at https://github.com/apache/arrow/pull/13404#issuecomment-1166353926 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups
[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558825#comment-17558825 ] Neal Richardson commented on ARROW-16904: - That's not good. I am pretty sure this is the same issue as ARROW-16807. I can reproduce it with the starwars data example used there: {code} > replicate(100, ds %>% summarize(min(height, na.rm = TRUE)) %>% pull()) [1] 79 79 66 66 79 79 66 79 66 66 66 66 66 79 66 66 66 66 79 94 88 79 66 79 79 [26] 66 66 79 66 66 66 66 66 66 66 79 79 66 79 79 66 88 79 66 66 94 66 66 66 79 [51] 66 66 66 66 79 66 66 66 79 66 94 79 79 79 66 79 66 79 79 66 79 66 79 66 88 [76] 88 66 66 66 66 66 66 66 66 66 66 66 79 79 66 66 79 66 66 66 66 66 66 66 66 {code} > [C++] min/max not deterministic if Parquet files have multiple row groups > - > > Key: ARROW-16904 > URL: https://issues.apache.org/jira/browse/ARROW-16904 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 > Environment: $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 20.04.4 LTS > Release:20.04 > Codename: focal >Reporter: Robert On >Priority: Blocker > Fix For: 9.0.0 > > > > The following code produces non-deterministic result for getting the minimum > value of a sequence of 1e6 integers. > {code:java} > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 100,000 > arrow::write_parquet( > data.frame(val = 1:1e5), "test.parquet") > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 1,000,000 > arrow::write_parquet( > data.frame(val = 1:1e6), "test.parquet") > > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > {code} > The first 100 simulations using numbers 1 to 1e5 is able to find the minimum > number (1) all 100 times. > The second 100 simulations using numbers 1 to 1e6 only finds the minimum > number (1) 65 out of 100 times. It finds near multiples of 131073, 25, 8, and > 2 times respectively. > {code:java} > . 1 > 100 > . 1 131073 262145 393217 > 65 25 8 2 {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups
[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16904: Priority: Blocker (was: Critical) > [C++] min/max not deterministic if Parquet files have multiple row groups > - > > Key: ARROW-16904 > URL: https://issues.apache.org/jira/browse/ARROW-16904 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 > Environment: $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 20.04.4 LTS > Release:20.04 > Codename: focal >Reporter: Robert On >Priority: Blocker > > > The following code produces non-deterministic result for getting the minimum > value of a sequence of 1e6 integers. > {code:java} > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 100,000 > arrow::write_parquet( > data.frame(val = 1:1e5), "test.parquet") > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 1,000,000 > arrow::write_parquet( > data.frame(val = 1:1e6), "test.parquet") > > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > {code} > The first 100 simulations using numbers 1 to 1e5 is able to find the minimum > number (1) all 100 times. > The second 100 simulations using numbers 1 to 1e6 only finds the minimum > number (1) 65 out of 100 times. It finds near multiples of 131073, 25, 8, and > 2 times respectively. > {code:java} > . 1 > 100 > . 1 131073 262145 393217 > 65 25 8 2 {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups
[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16904: Fix Version/s: 9.0.0 > [C++] min/max not deterministic if Parquet files have multiple row groups > - > > Key: ARROW-16904 > URL: https://issues.apache.org/jira/browse/ARROW-16904 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 > Environment: $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 20.04.4 LTS > Release:20.04 > Codename: focal >Reporter: Robert On >Priority: Blocker > Fix For: 9.0.0 > > > > The following code produces non-deterministic result for getting the minimum > value of a sequence of 1e6 integers. > {code:java} > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 100,000 > arrow::write_parquet( > data.frame(val = 1:1e5), "test.parquet") > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 1,000,000 > arrow::write_parquet( > data.frame(val = 1:1e6), "test.parquet") > > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > {code} > The first 100 simulations using numbers 1 to 1e5 is able to find the minimum > number (1) all 100 times. > The second 100 simulations using numbers 1 to 1e6 only finds the minimum > number (1) 65 out of 100 times. It finds near multiples of 131073, 25, 8, and > 2 times respectively. > {code:java} > . 1 > 100 > . 1 131073 262145 393217 > 65 25 8 2 {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups
[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16904: Summary: [C++] min/max not deterministic if Parquet files have multiple row groups (was: dplyr summarise using min/max aggregate function non-deterministic for large number of elements) > [C++] min/max not deterministic if Parquet files have multiple row groups > - > > Key: ARROW-16904 > URL: https://issues.apache.org/jira/browse/ARROW-16904 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 > Environment: $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 20.04.4 LTS > Release:20.04 > Codename: focal >Reporter: Robert On >Priority: Critical > > > The following code produces non-deterministic result for getting the minimum > value of a sequence of 1e6 integers. > {code:java} > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 100,000 > arrow::write_parquet( > data.frame(val = 1:1e5), "test.parquet") > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > sapply(1:100, function(x) { > # create parquet file with a val column with numbers 1 to 1,000,000 > arrow::write_parquet( > data.frame(val = 1:1e6), "test.parquet") > > arrow::open_dataset("test.parquet") %>% > dplyr::summarise(min_val = min(val)) %>% > dplyr::collect() %>% dplyr::pull(min_val) > }) %>% table() > {code} > The first 100 simulations using numbers 1 to 1e5 is able to find the minimum > number (1) all 100 times. > The second 100 simulations using numbers 1 to 1e6 only finds the minimum > number (1) 65 out of 100 times. It finds near multiples of 131073, 25, 8, and > 2 times respectively. > {code:java} > . 1 > 100 > . 1 131073 262145 393217 > 65 25 8 2 {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16905) Table.to_pandas() fails for dictionary encoded columns with an is_null partition_expression
Thomas Newton created ARROW-16905: - Summary: Table.to_pandas() fails for dictionary encoded columns with an is_null partition_expression Key: ARROW-16905 URL: https://issues.apache.org/jira/browse/ARROW-16905 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 8.0.0 Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3 Reporter: Thomas Newton Attachments: reproduce_null_dictionary_issue.zip Minimal steps to reproduce: I attached a `.zip` file containing a Python script and a test parquet file. Running this Python script reproduces the issue. The steps taken to reproduce:
# Create a test parquet file with one column containing only nulls.
# Create a parquet fragment from this file, adding a `partition_expression` with an `is_null` guarantee on this fragment.
# Create a `FileSystemDataset` from this fragment, setting the schema to be a dictionary column.
# Call `.to_table().to_pandas()` on the resulting pyarrow dataset. You will get the following error:
{code:java} File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in validate_categories raise ValueError("Categorical categories cannot be null") ValueError: Categorical categories cannot be null {code}
My understanding of why this doesn't work:
# There are 2 ways of dictionary-encoding nulls, `mask` and `encode`, described in the [pyarrow docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions]. PyArrow supports both, but pandas categoricals only support the `mask` style. Arguably the real issue here is that pandas should support `encode`-style categoricals.
# When you provide an `.is_null` guarantee on a fragment, Arrow will not actually read the data. It knows the type from the schema, we've guaranteed the values are all null, and it can get the length from the parquet metadata, so it has everything it needs.
# Instead of reading the data it uses the [Null ArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc]. For dictionary-type columns I believe that calls [this DictionaryArray constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93], which appears to create the dictionary in the `encode` style.
Would it be possible to make this configurable? It seems like the `mask` style of dictionary encoding is the default for the rest of PyArrow, and it would solve the pandas compatibility issue. I appreciate this is probably an extremely niche issue, but my options for a workaround are looking pretty horrible. -- This message was sent by Atlassian Jira (v8.20.7#820007)
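For readers without the attachment, the four steps above look roughly like the sketch below. This is an assumption about what the attached script does; the column name, file name, and local-filesystem plumbing are placeholders, not taken from the zip.
{code}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs
import pyarrow.parquet as pq

# 1. A parquet file whose only column is entirely null.
pq.write_table(
    pa.table({"col": pa.array([None] * 10, type=pa.string())}),
    "all_null.parquet")

# 2. A fragment for that file carrying an is_null guarantee on the column.
local = pyarrow.fs.LocalFileSystem()
fmt = ds.ParquetFileFormat()
fragment = fmt.make_fragment(
    "all_null.parquet", filesystem=local,
    partition_expression=ds.field("col").is_null())

# 3. A FileSystemDataset whose schema declares the column as dictionary-encoded.
schema = pa.schema([pa.field("col", pa.dictionary(pa.int32(), pa.string()))])
dataset = ds.FileSystemDataset([fragment], schema, fmt, filesystem=local)

# 4. to_table() succeeds, but the pandas conversion raises
#    ValueError: Categorical categories cannot be null
dataset.to_table().to_pandas()
{code}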
[jira] [Assigned] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied
[ https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-16753: Assignee: David Rauschenbach > [C++] LocalFileSystem cannot list Linux directory recursively when permission > to subdirectory contents are denied > - > > Key: ARROW-16753 > URL: https://issues.apache.org/jira/browse/ARROW-16753 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 > Environment: Ubuntu 20.04 LTS >Reporter: David Rauschenbach >Assignee: David Rauschenbach >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > The following code to list my root directory fails: > > {code:java} > FileSelector file_selector; > file_selector.base_dir = "/"; > file_selector.allow_not_found = true; > file_selector.recursive = true; > auto result = fs.GetFileInfo(file_selector);{code} > The result.ok() value returns {+}false{+}, and then result.status().message() > returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. > An examination of the /run directory (which /var/run symlinks to) shows: > > {code:java} > $ ls -al /run > drwxr-xr-x 35 root root 1040 Jun 6 06:11 . > drwxr-xr-x 20 root root 4096 May 20 12:42 .. > ... > drwxr-x--- 2 root root 60 Jun 4 12:14 wpa_supplicant{code} > And then attempting to list this directory reveals: > > {code:java} > $ ls -al /run/wpa_supplicant/ > ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code} > > As a user of LocalFileSystem, I should be able to list all of the files that > I have access to. > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
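For what it's worth, C++17's std::filesystem exposes this exact behavior as an option on its recursive iterator. The snippet below is a standalone illustration of the desired "skip what you cannot open" semantics, not Arrow's LocalFileSystem implementation.
{code}
#include <filesystem>
#include <iostream>

int main() {
  namespace fs = std::filesystem;
  // skip_permission_denied makes the recursive walk ignore subdirectories the
  // process cannot open (e.g. /run/wpa_supplicant) instead of failing outright.
  const auto options = fs::directory_options::skip_permission_denied;
  std::error_code ec;
  for (fs::recursive_directory_iterator it("/", options, ec), end;
       it != end && !ec; it.increment(ec)) {
    std::cout << it->path().string() << "\n";
  }
  return ec ? 1 : 0;
}
{code}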
[jira] [Updated] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied
[ https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16753: --- Labels: pull-request-available (was: ) > [C++] LocalFileSystem cannot list Linux directory recursively when permission > to subdirectory contents are denied > - > > Key: ARROW-16753 > URL: https://issues.apache.org/jira/browse/ARROW-16753 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 > Environment: Ubuntu 20.04 LTS >Reporter: David Rauschenbach >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The following code to list my root directory fails: > > {code:java} > FileSelector file_selector; > file_selector.base_dir = "/"; > file_selector.allow_not_found = true; > file_selector.recursive = true; > auto result = fs.GetFileInfo(file_selector);{code} > The result.ok() value returns {+}false{+}, and then result.status().message() > returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. > An examination of the /run directory (which /var/run symlinks to) shows: > > {code:java} > $ ls -al /run > drwxr-xr-x 35 root root 1040 Jun 6 06:11 . > drwxr-xr-x 20 root root 4096 May 20 12:42 .. > ... > drwxr-x--- 2 root root 60 Jun 4 12:14 wpa_supplicant{code} > And then attempting to list this directory reveals: > > {code:java} > $ ls -al /run/wpa_supplicant/ > ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code} > > As a user of LocalFileSystem, I should be able to list all of the files that > I have access to. > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16904) dplyr summarise using min/max aggregate function non-deterministic for large number of elements
Robert On created ARROW-16904: - Summary: dplyr summarise using min/max aggregate function non-deterministic for large number of elements Key: ARROW-16904 URL: https://issues.apache.org/jira/browse/ARROW-16904 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 8.0.0 Environment: $ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.4 LTS Release: 20.04 Codename: focal Reporter: Robert On The following code produces a non-deterministic result when computing the minimum value of a sequence of 1e6 integers. {code:java} sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 100,000 arrow::write_parquet( data.frame(val = 1:1e5), "test.parquet") arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() sapply(1:100, function(x) { # create parquet file with a val column with numbers 1 to 1,000,000 arrow::write_parquet( data.frame(val = 1:1e6), "test.parquet") arrow::open_dataset("test.parquet") %>% dplyr::summarise(min_val = min(val)) %>% dplyr::collect() %>% dplyr::pull(min_val) }) %>% table() {code} The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1) all 100 times. The second 100 simulations, using numbers 1 to 1e6, find the minimum value (1) only 65 out of 100 times; the other runs return 131073, 262145, and 393217, 25, 8, and 2 times respectively. {code:java} . 1 100 . 1 131073 262145 393217 65 25 8 2 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream
[ https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16878: Description: On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it in the arrow build. A better solution would be to put google-cloud-cpp in rtools-packages so we don't have to build it every time. There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so either we'd have to make one up for rtools-packages, or we use the bundled google-cloud-cpp in our cmake and see if we can put as many of its dependencies in rtools-packages to ease the build. Either way, we'd want to start by adding its dependencies. https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json exists in MINGW-packages and could be brought over, but I don't think it's a big deal if it is bundled. https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD exists and could be brought over, but note that it uses C++17. That doesn't seem to be a hard requirement, at least for what we're using, since we're building it with C++11. was: On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it in the arrow build. A better solution would be to put google-cloud-cpp in rtools-packages so we don't have to build it every time. There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so either we'd have to make one up for rtools-packages, or we use the bundled google-cloud-cpp in our cmake and see if we can put as many of its dependencies in rtools-packages to ease the build. https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json exists in MINGW-packages and could be brought over, but I don't think it's a big deal if it is bundled. https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD exists and could be brought over, but note that it uses C++17. That doesn't seem to be a hard requirement, at least for what we're using, since we're building it with C++11. > [R] Move Windows GCS dependency building upstream > - > > Key: ARROW-16878 > URL: https://issues.apache.org/jira/browse/ARROW-16878 > Project: Apache Arrow > Issue Type: New Feature > Components: Packaging, R >Reporter: Neal Richardson >Priority: Major > Fix For: 9.0.0 > > > On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it > in the arrow build. A better solution would be to put google-cloud-cpp in > rtools-packages so we don't have to build it every time. > There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so > either we'd have to make one up for rtools-packages, or we use the bundled > google-cloud-cpp in our cmake and see if we can put as many of its > dependencies in rtools-packages to ease the build. Either way, we'd want to > start by adding its dependencies. > https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json > exists in MINGW-packages and could be brought over, but I don't think it's a > big deal if it is bundled. > https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD > exists and could be brought over, but note that it uses C++17. That doesn't > seem to be a hard requirement, at least for what we're using, since we're > building it with C++11. -- This message was sent by Atlassian Jira (v8.20.7#820007)
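If the route of bundling google-cloud-cpp while pre-building its heavier dependencies is taken, the first step might look roughly like the sketch below. The package names follow the MINGW-packages naming and are assumptions; rtools-packages would need its own ports of abseil-cpp and (optionally) nlohmann-json.
{code}
# Sketch only: extend the existing depends=() array in ci/scripts/PKGBUILD so the
# bundled google-cloud-cpp build can link against prebuilt dependencies.
depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp"
         "${MINGW_PACKAGE_PREFIX}-abseil-cpp"     # assumed port to rtools-packages
         "${MINGW_PACKAGE_PREFIX}-nlohmann-json"  # optional; bundling is also fine
         "${MINGW_PACKAGE_PREFIX}-curl"
         "${MINGW_PACKAGE_PREFIX}-libutf8proc"
         "${MINGW_PACKAGE_PREFIX}-re2"
         "${MINGW_PACKAGE_PREFIX}-thrift")
{code}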
[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream
[ https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16878: Description: On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it in the arrow build. A better solution would be to put google-cloud-cpp in rtools-packages so we don't have to build it every time. There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so either we'd have to make one up for rtools-packages, or we use the bundled google-cloud-cpp in our cmake and see if we can put as many of its dependencies in rtools-packages to ease the build. https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json exists in MINGW-packages and could be brought over, but I don't think it's a big deal if it is bundled. https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD exists and could be brought over, but note that it uses C++17. That doesn't seem to be a hard requirement, at least for what we're using, since we're building it with C++11. was: On ARROW-16510, I made some progress on this but had to back out the changes. There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so either we'd have to make one up for rtools-packages, or we use the bundled google-cloud-cpp in our cmake and see if we can put as many of its dependencies in rtools-packages to ease the build. https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json exists in MINGW-packages and could be brought over, but I don't think it's a big deal if it is bundled. https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD exists and could be brought over, but note that it uses C++17. It may be that we have to bump arrow up to C++17 for this to work anyway, judging from the undefined symbols errors I got (see below). https://github.com/msys2/MINGW-packages/pull/11758 and https://github.com/apache/arrow/pull/13407 suggest maybe so. That should work ok for rtools >= 40, but we'll see what other problems it brings. In case it's relevant, https://github.com/r-windows/rtools-packages/blob/master/mingw-w64-grpc/PKGBUILD exists in rtools-packages already, and grpc depends on abseil, but note how it handles abseil. It doesn't use all of the same parts of abseil that google-cloud-cpp does, so maybe that's fine there but not here? Here's the diff I backed out of ARROW-16510, and below that is the undefined-symbol messages from the build failure. There's something up with libcurl too that I don't understand because I added it. 
{code} diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD index b9b0194f5c8c..566ec881e404 100644 --- a/ci/scripts/PKGBUILD +++ b/ci/scripts/PKGBUILD @@ -25,6 +25,7 @@ arch=("any") url="https://arrow.apache.org/"; license=("Apache-2.0") depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp" + "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build "${MINGW_PACKAGE_PREFIX}-libutf8proc" "${MINGW_PACKAGE_PREFIX}-re2" "${MINGW_PACKAGE_PREFIX}-thrift" @@ -79,11 +80,13 @@ build() { export PATH="/C/Rtools${MINGW_PREFIX/mingw/mingw_}/bin:$PATH" export CPPFLAGS="${CPPFLAGS} -I${MINGW_PREFIX}/include" export LIBS="-L${MINGW_PREFIX}/libs" +export ARROW_GCS=OFF export ARROW_S3=OFF export ARROW_WITH_RE2=OFF # Without this, some dataset functionality segfaults export CMAKE_UNITY_BUILD=ON else +export ARROW_GCS=ON export ARROW_S3=ON export ARROW_WITH_RE2=ON # Without this, some compute functionality segfaults in tests @@ -101,6 +104,7 @@ build() { -DARROW_CSV=ON \ -DARROW_DATASET=ON \ -DARROW_FILESYSTEM=ON \ +-DARROW_GCS="${ARROW_GCS}" \ -DARROW_HDFS=OFF \ -DARROW_JEMALLOC=OFF \ -DARROW_JSON=ON \ diff --git a/ci/scripts/r_windows_build.sh b/ci/scripts/r_windows_build.sh index 89d5737a09bd..3334eab8663a 100755 --- a/ci/scripts/r_windows_build.sh +++ b/ci/scripts/r_windows_build.sh @@ -87,7 +87,7 @@ if [ -d mingw64/lib/ ]; then # These may be from https://dl.bintray.com/rtools/backports/ cp $MSYS_LIB_DIR/mingw64/lib/lib{thrift,snappy}.a $DST_DIR/${RWINLIB_LIB_DIR}/x64 # These are from https://dl.bintray.com/rtools/mingw{32,64}/ - cp $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a $DST_DIR/lib/x64 + cp $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/x64 fi # Same for the 32-bit versions @@ -97,7 +97,7 @@ if [ -d mingw32/lib/ ]; then mkdir -p $DST_DIR/lib/i386 mv mingw32/lib/*.a $DST_DIR/${RWINLIB_LIB_DIR}/i386 cp $MSYS_LIB_DIR/mingw32/lib/lib{thrift,snappy}.a $DST_DIR/${RWINLIB_LIB_DIR}/i386 - cp $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a $DST_DIR/lib/i386 + cp $MSYS_LIB_DIR/mingw32/lib/lib
[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream
[ https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-16878: Summary: [R] Move Windows GCS dependency building upstream (was: [R] Add GCS to Windows builds) > [R] Move Windows GCS dependency building upstream > - > > Key: ARROW-16878 > URL: https://issues.apache.org/jira/browse/ARROW-16878 > Project: Apache Arrow > Issue Type: New Feature > Components: Packaging, R >Reporter: Neal Richardson >Priority: Major > Fix For: 9.0.0 > > > On ARROW-16510, I made some progress on this but had to back out the changes. > There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so > either we'd have to make one up for rtools-packages, or we use the bundled > google-cloud-cpp in our cmake and see if we can put as many of its > dependencies in rtools-packages to ease the build. > https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json > exists in MINGW-packages and could be brought over, but I don't think it's a > big deal if it is bundled. > https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD > exists and could be brought over, but note that it uses C++17. > It may be that we have to bump arrow up to C++17 for this to work anyway, > judging from the undefined symbols errors I got (see below). > https://github.com/msys2/MINGW-packages/pull/11758 and > https://github.com/apache/arrow/pull/13407 suggest maybe so. That should work > ok for rtools >= 40, but we'll see what other problems it brings. > In case it's relevant, > https://github.com/r-windows/rtools-packages/blob/master/mingw-w64-grpc/PKGBUILD > exists in rtools-packages already, and grpc depends on abseil, but note how > it handles abseil. It doesn't use all of the same parts of abseil that > google-cloud-cpp does, so maybe that's fine there but not here? > Here's the diff I backed out of ARROW-16510, and below that is the > undefined-symbol messages from the build failure. There's something up with > libcurl too that I don't understand because I added it. 
> {code} > diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD > index b9b0194f5c8c..566ec881e404 100644 > --- a/ci/scripts/PKGBUILD > +++ b/ci/scripts/PKGBUILD > @@ -25,6 +25,7 @@ arch=("any") > url="https://arrow.apache.org/"; > license=("Apache-2.0") > depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp" > + "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build > "${MINGW_PACKAGE_PREFIX}-libutf8proc" > "${MINGW_PACKAGE_PREFIX}-re2" > "${MINGW_PACKAGE_PREFIX}-thrift" > @@ -79,11 +80,13 @@ build() { > export PATH="/C/Rtools${MINGW_PREFIX/mingw/mingw_}/bin:$PATH" > export CPPFLAGS="${CPPFLAGS} -I${MINGW_PREFIX}/include" > export LIBS="-L${MINGW_PREFIX}/libs" > +export ARROW_GCS=OFF > export ARROW_S3=OFF > export ARROW_WITH_RE2=OFF > # Without this, some dataset functionality segfaults > export CMAKE_UNITY_BUILD=ON >else > +export ARROW_GCS=ON > export ARROW_S3=ON > export ARROW_WITH_RE2=ON > # Without this, some compute functionality segfaults in tests > @@ -101,6 +104,7 @@ build() { > -DARROW_CSV=ON \ > -DARROW_DATASET=ON \ > -DARROW_FILESYSTEM=ON \ > +-DARROW_GCS="${ARROW_GCS}" \ > -DARROW_HDFS=OFF \ > -DARROW_JEMALLOC=OFF \ > -DARROW_JSON=ON \ > diff --git a/ci/scripts/r_windows_build.sh b/ci/scripts/r_windows_build.sh > index 89d5737a09bd..3334eab8663a 100755 > --- a/ci/scripts/r_windows_build.sh > +++ b/ci/scripts/r_windows_build.sh > @@ -87,7 +87,7 @@ if [ -d mingw64/lib/ ]; then ># These may be from https://dl.bintray.com/rtools/backports/ >cp $MSYS_LIB_DIR/mingw64/lib/lib{thrift,snappy}.a > $DST_DIR/${RWINLIB_LIB_DIR}/x64 ># These are from https://dl.bintray.com/rtools/mingw{32,64}/ > - cp > $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a > $DST_DIR/lib/x64 > + cp > $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a > $DST_DIR/lib/x64 > fi > > # Same for the 32-bit versions > @@ -97,7 +97,7 @@ if [ -d mingw32/lib/ ]; then >mkdir -p $DST_DIR/lib/i386 >mv mingw32/lib/*.a $DST_DIR/${RWINLIB_LIB_DIR}/i386 >cp $MSYS_LIB_DIR/mingw32/lib/lib{thrift,snappy}.a > $DST_DIR/${RWINLIB_LIB_DIR}/i386 > - cp > $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,utf8proc,re2,aws*}.a > $DST_DIR/lib/i386 > + cp > $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a > $DST_DIR/lib/i386 > fi > > # Do the same also for ucrt64 > @@ -105,7 +105,7 @@ if [ -d ucrt64/lib/ ]; then >ls $MSYS_LIB_DIR/ucrt64/lib/ >mkdir -p $DST_DIR/lib/x64-ucrt >mv ucrt64/lib/*.a $DS