[ https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441929#comment-17441929 ]
Weston Pace commented on ARROW-14653:
-------------------------------------
Since you've got it running with the debugger, can you do the following? (These
instructions are for gdb; I assume lldb has something similar.)
* Run it with the debugger
* Get it into a hang state (let it sit there long enough that you're pretty
sure it should have finished)
* Press Ctrl-C to interrupt the program and return to the debugger prompt
* Run the command "thread apply all bt"
This will generate a ton of output (it prints a backtrace of every running
thread) which would be helpful in diagnosing the hang.
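
If it is easier to attach to a session that is already hung, the same
backtraces could probably be captured non-interactively. The following is only
a sketch under a few assumptions: dump_backtraces is a made-up helper name, the
PID of the hung R process has to be looked up by you (for example with ps),
and attaching to a running process may need elevated permissions on some
systems.

```shell
# Hypothetical helper (not part of Arrow or gdb): attach gdb to a running
# process, print a backtrace of every thread, and save the output to a file.
dump_backtraces() {
  # $1: PID of the hung R process; $2: optional output file.
  if [ -z "${1:-}" ]; then
    echo "usage: dump_backtraces PID [OUTPUT_FILE]" >&2
    return 1
  fi
  # -p attaches to the running process, -batch exits once the commands have
  # run, and -ex executes the given gdb command.
  gdb -p "$1" -batch -ex "thread apply all bt" > "${2:-backtraces.txt}" 2>&1
}
```

For example, "dump_backtraces 12345 hang.txt" (12345 being a placeholder PID)
would write the backtraces to hang.txt, which can then be attached to the
issue.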
> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
> Key: ARROW-14653
> URL: https://issues.apache.org/jira/browse/ARROW-14653
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Nicola Crane
> Priority: Major
>
> I'm calling {{head()}} on a CSV dataset containing CSV files. I'm doing this
> as I want to preview my dataset before I try to do anything with it that's
> going to be more expensive computationally.
> {code:r}
> library(arrow)
> library(dplyr)
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset
> folder, and it seems to work fine when my total file size is below roughly
> 600MB but hangs if it's above that. This might not even be what the actual
> issue is, but I'm struggling to narrow it down beyond adding extra files to
> the equation.
> I've tried running with the C++ debugger attached, but again, it just
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of
> the files fine, but when using all of them, the session hangs.