[jira] [Commented] (ARROW-12161) [C++] Async streaming CSV reader deadlocking when being run synchronously from datasets

Weston Pace (Jira) Tue, 30 Mar 2021 20:30:12 -0700


    [ https://issues.apache.org/jira/browse/ARROW-12161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312026#comment-17312026 ]


Weston Pace commented on ARROW-12161:
-------------------------------------

There are a few choices here.  If the fix is going to take some time, I'd 
recommend reverting ARROW-11887 in the meantime.  I've created 
[https://github.com/apache/arrow/pull/9859] as a convenience.

Option #1: Leave ARROW-11887 out and include it as part of ARROW-7001.

-Pros: No wasted work

-Cons: Makes ARROW-7001 an even larger change.

 

Option #2: Pull part of ARROW-7001 forward as a patch.  Basically, add 
`supports_async()` and `ExecuteAsync()` to `ScanTask`, then modify 
`Scanner::ToTable` so that it creates a task group (for synchronous scan 
tasks) AND collects a set of futures (for async scan tasks).  It then awaits 
both, one after the other.  This should avoid the nested dataset issue.  
I prototyped this today and it should work, but it needs some polish, which 
I could do tomorrow.

-Pros: Makes ARROW-7001 a smaller change

-Cons: Potentially delays ARROW-7001 review while this is worked through / Some 
wasted work.
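
To make Option #2 concrete, here is a rough sketch of the shape I have in mind.  It is not the prototype and does not use the real Arrow classes: `SketchScanTask`, `SyncTask`, `AsyncTask` and `SketchToTable` are hypothetical names, a scan result is reduced to a row count, and `std::async` / `std::future` stand in for Arrow's task group and futures.  Because `std::async` spawns its own threads, the sketch does not model the fixed-size CPU executor; it only shows that the waiting moves out of the scan tasks and into the top-level caller.

```cpp
#include <future>
#include <memory>
#include <vector>

// Hypothetical stand-in for a scan task; NOT the real arrow::dataset::ScanTask.
// A scan "result" is just a row count to keep the sketch small.
struct SketchScanTask {
  virtual ~SketchScanTask() = default;
  virtual bool supports_async() const = 0;
  virtual int Execute() = 0;                    // blocking path
  virtual std::future<int> ExecuteAsync() = 0;  // non-blocking path
};

// A task that only knows how to run synchronously.
struct SyncTask : SketchScanTask {
  explicit SyncTask(int rows) : rows_(rows) {}
  bool supports_async() const override { return false; }
  int Execute() override { return rows_; }
  std::future<int> ExecuteAsync() override {
    // Fallback: run synchronously and hand back an already-completed future.
    std::promise<int> promise;
    promise.set_value(Execute());
    return promise.get_future();
  }
  int rows_;
};

// A task with a native async path (e.g. the async streaming CSV reader).
struct AsyncTask : SketchScanTask {
  explicit AsyncTask(int rows) : rows_(rows) {}
  bool supports_async() const override { return true; }
  // This is the "call async and wait" wrapper that is unsafe when nested.
  int Execute() override { return ExecuteAsync().get(); }
  std::future<int> ExecuteAsync() override {
    int rows = rows_;
    return std::async(std::launch::async, [rows] { return rows; });
  }
  int rows_;
};

// Sketch of the modified ToTable: sync tasks go to a task group, async tasks
// contribute futures, and only this top-level function blocks on either.
int SketchToTable(const std::vector<std::unique_ptr<SketchScanTask>>& tasks) {
  std::vector<std::future<int>> task_group;     // stand-in for a task group
  std::vector<std::future<int>> async_futures;  // futures from async tasks
  for (const auto& task : tasks) {
    if (task->supports_async()) {
      // Never call the blocking wrapper here; just collect the future.
      async_futures.push_back(task->ExecuteAsync());
    } else {
      SketchScanTask* raw = task.get();
      task_group.push_back(
          std::async(std::launch::async, [raw] { return raw->Execute(); }));
    }
  }
  int total_rows = 0;
  // Await the task group first, then the collected futures.
  for (auto& finished : task_group) total_rows += finished.get();
  for (auto& finished : async_futures) total_rows += finished.get();
  return total_rows;
}
```

The point of the split is that `AsyncTask::Execute()` (the blocking wrapper) is never called from inside the task group, so no pooled thread sits waiting on a future.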

> [C++] Async streaming CSV reader deadlocking when being run synchronously 
> from datasets
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-12161
>                 URL: https://issues.apache.org/jira/browse/ARROW-12161
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> ARROW-11887 added async to the streaming CSV reader.  In order to keep 
> backwards compatibility the old sync API simply calls the async API and waits 
> for it to finish.  However, that wait cannot happen safely in a "nested" 
> context (e.g. dataset reading).
> For example, imagine two cores.  The dataset read launches two CSV scans.  
> Each scan occupies a core waiting for a future.  Those futures are being 
> filled by I/O threads.  The I/O threads finish and go to transfer.  The 
> transfer cannot happen because the CPU executor is filled.
> This will be fixed as part of ARROW-7001, but that is still some way off.  An 
> easier change might be to take some of the 7001 changes and include them as 
> part of the 11887 feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
