[ 
https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-14653:
------------------------------------

    Assignee: Nicola Crane

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Nicola Crane
>            Priority: Major
>             Fix For: 7.0.0
>
>
> I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this 
> because I want to preview the dataset before I try anything with it that's 
> going to be more computationally expensive.
> {code:r}
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset 
> folder, and it seems to work fine when the total file size is below roughly 
> 600MB but hangs if it's above that.  This might not even be what the actual 
> issue is, but I'm struggling to narrow it down beyond adding extra files to 
> the equation.
> I've tried running with the C++ debugger attached, but again, it just 
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available 
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of 
> the files fine, but when using all of them, the session hangs.
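> A rough sketch of how a subset under a given total size could be tested is 
> below.  {{files_up_to()}} is a hypothetical helper, not an arrow function, 
> and the 600MB cut-off is only the approximate threshold described above.  It 
> assumes the CSV files sit directly in the dataset folder and relies on 
> {{open_dataset()}} accepting a character vector of file paths.
> {code:r}
> library(arrow)
> library(dplyr)
>
> # Hypothetical helper: return the CSV files in `dir` (in sorted order) whose
> # cumulative size on disk stays at or below `max_bytes`.
> files_up_to <- function(dir, max_bytes) {
>   paths <- sort(list.files(dir, pattern = "\\.csv$", full.names = TRUE))
>   paths[cumsum(file.size(paths)) <= max_bytes]
> }
>
> # Assumed cut-off: roughly the size at which the hang starts to appear.
> subset_files <- files_up_to("../../data/nyc-raw/", 600 * 1024^2)
>
> # Only open the dataset if at least one file fits under the cut-off.
> if (length(subset_files) > 0) {
>   open_dataset(subset_files, format = "csv") %>%
>     head(1) %>%
>     collect() %>%
>     print()
> }
> {code}
> Raising or lowering the cut-off changes how many files are included, which 
> may help show whether total size or one particular file triggers the hang.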



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
