[ 
https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442878#comment-17442878
 ] 

Weston Pace commented on ARROW-14653:
-------------------------------------

Thanks for this information.  This is a nested deadlock issue.  The 8 CPU 
threads are inappropriately blocked.  The easiest workaround is to use the 
asynchronous scanner here.  I thought we had worked out some delicate ways to 
avoid this kind of thing in the synchronous scanner, but it seems those don't 
work for head().

 

We could either fix the synchronous scanner to handle this correctly, return an 
error if the synchronous scanner is used with head, or silently switch 
use_threads to false when this situation is encountered.

 

However, I'd rather not do too much here.  The synchronous scanner should be 
going away (soon) and so any fix will be throwaway.

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>
> I'm calling {{head()}} on a CSV dataset containing CSV files.  I'm doing this 
> as I want to preview my dataset before I try to do anything with it that's 
> going to be more expensive computationally.
> {code:r}
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset 
> folder, and it seems to work fine when the total file size is under roughly 
> 600 MB but hangs if it's above that.  This might not even be what the actual 
> issue is, but I'm struggling to narrow it down beyond adding extra files to 
> the equation.
> I've tried running with the C++ debugger attached, but again, it just 
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available 
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of 
> the files fine, but when using all of them, the session hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
