[jira] [Commented] (ARROW-14653) [R] head() hangs on CSV datasets > 600MB

Weston Pace (Jira) Wed, 10 Nov 2021 12:08:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441929#comment-17441929
 ]


Weston Pace commented on ARROW-14653:
-------------------------------------

Since you've got it running with the debugger, can you do the following? 
(These instructions are for gdb; I assume lldb has something similar.)


 * Run it with the debugger
 * Get it into a hang state (let it sit there long enough that you're pretty 
sure it should have finished)
 * Press Ctrl-C to interrupt the program and return to the debugger prompt
 * Run the command "thread apply all bt"

This will generate a ton of output (it prints a backtrace of every thread), 
which would be helpful in diagnosing the hang.
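
For reference, the steps above might look roughly like the session below. 
This is only a sketch: it assumes R is started from a terminal through its 
"-d" debugger option and that the reproducing code is pasted at the R prompt. 
The logging commands and the file name backtrace.txt are optional additions 
to capture the output in a file (newer gdb releases spell the second one 
"set logging enabled on").

{code}
$ R -d gdb
(gdb) run
> # paste the code that hangs, wait, then press Ctrl-C
(gdb) set logging file backtrace.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
{code}

With lldb the equivalent should be to start R with "R -d lldb" and, after 
interrupting, run "thread backtrace all" (or "bt all") instead of 
"thread apply all bt".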

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>
> I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this 
> because I want to preview my dataset before I try to do anything with it 
> that's going to be more computationally expensive.
> {code:r}
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset 
> folder, and it seems to work fine when the total file size is below ~600MB 
> but hangs above that.  This might not even be the actual issue, but I'm 
> struggling to narrow it down beyond adding extra files to the equation.
> I've tried running with the C++ debugger attached, but again, it just 
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available 
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of 
> the files fine, but when using all of them, the session hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
