[ 
https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488306#comment-17488306
 ] 

Dewey Dunnington commented on ARROW-15081:
------------------------------------------

Another user reported an issue with count() on a Parquet file that seems to 
have been fixed in the development version (which is about to be released to 
CRAN). Perhaps ARROW-15201 is the same issue?

If it is not: when I try to reproduce the above, I get an error (see below). Is 
there a more recent bucket with the files that we can use to reproduce?

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

server <- arrow::s3_bucket(
  "ebird",
  endpoint_override = "minio.cirrus.carlboettiger.info"
)

path <- server$path("Oct-2021/observations")
path$ls()
#> Error: IOError: Path does not exist 'ebird/Oct-2021/observations'
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913  collector.Finish(this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275  impl_->Walk(select, base_path.bucket, base_path.key, &results)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341  base_fs_->GetFileInfo(selector)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341  base_fs_->GetFileInfo(selector)

path <- server$path("partitioned")
path$ls()
#> Error: IOError: Path does not exist 'ebird/partitioned'
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913  collector.Finish(this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275  impl_->Walk(select, base_path.bucket, base_path.key, &results)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341  base_fs_->GetFileInfo(selector)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341  base_fs_->GetFileInfo(selector)
{code}


> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash{code}
> I have attempted to split this large (~100 GB) parquet file into some smaller 
> files, which helps: 
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe, multiple parquet files now
> obs %>% count() 
>  {code}
> (These parquet files were also created by arrow, btw, from a single 
> large CSV file provided by the original data provider (eBird). Unfortunately, 
> generating the partitioned versions is cumbersome because the data is very 
> unevenly distributed: few columns can be partitioned on without creating 
> 1000s of parquet partition files, and even so the bulk of the 1 billion rows 
> fall within the same group. All the same, I think this is a bug, as there is 
> no indication of why arrow cannot handle a single 100 GB parquet file.) 
>  
> Let me know if I can provide more info! I'm testing in R with latest CRAN 
> version of arrow on a machine with 200 GB RAM. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
