[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488306#comment-17488306 ]
Dewey Dunnington commented on ARROW-15081:
------------------------------------------

There was another user who reported an issue with count on a parquet file that seems to have been fixed in the development version (which is about to be released to CRAN). Perhaps ARROW-15201 is the same issue? If it is not, when I try to reproduce the above I get an error (see below). Is there a more recent bucket with the files we can use to reproduce?

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

server <- arrow::s3_bucket(
  "ebird",
  endpoint_override = "minio.cirrus.carlboettiger.info"
)

path <- server$path("Oct-2021/observations")
path$ls()
#> Error: IOError: Path does not exist 'ebird/Oct-2021/observations'
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector)

path <- server$path("partitioned")
path$ls()
#> Error: IOError: Path does not exist 'ebird/partitioned'
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector)
{code}

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
> Key: ARROW-15081
> URL: https://issues.apache.org/jira/browse/ARROW-15081
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Carl Boettiger
> Assignee: Weston Pace
> Priority: Major
>
> The below should be a reproducible crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird",endpoint_override =
> "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash{code}
> I have attempted to split this large (~100 GB parquet file) into some smaller
> files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> obs$ls() # observe, multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files have also been created by arrow, btw, from a single
> large csv file provided by the original data provider (eBird). Unfortunately
> generating the partitioned versions is cumbersome as the data is very
> unevenly distributed, there's few columns that can avoid creating 1000s of
> parquet partition files and even so the bulk of the 1-billion rows fall
> within the same group. But all the same I think this is a bug as there's no
> indication why arrow cannot handle a single 100GB parquet file I think?).
>
> Let me know if I can provide more info! I'm testing in R with latest CRAN
> version of arrow on a machine with 200 GB RAM.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)