[ 
https://issues.apache.org/jira/browse/ARROW-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363929#comment-17363929
 ] 

Weston Pace commented on ARROW-13088:
-------------------------------------

This may be obvious, but unfortunately it isn't "just" a matter of using 
OptionalParallelForAsync.  For example, you can't simply replace 
"return OptionalParallelFor(...)" with 
"return OptionalParallelForAsync(...).result()".

 

You will need to make sure you never call ".result()" or ".Wait()" on an 
unfinished future from within a thread pool task.  Instead, you will have to 
return the future all the way up the chain rather than block on it.  This can 
end up being quite tricky.
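
To illustrate the general problem, here is a minimal sketch in Python using 
only the standard library's concurrent.futures.  It is an analogy, not Arrow 
code: the function names (square, blocking_outer, nonblocking_outer, 
run_demo), the single-worker pool and the 0.5 second timeout are all 
assumptions made for the example.  The timeout exists only so the blocking 
variant reports a stall instead of hanging forever.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def square(x):
    # Hypothetical inner unit of work.
    return x * x


def blocking_outer(pool, values):
    # Anti-pattern: this runs as a pool task and then waits on futures that
    # need a worker from the same pool. With every worker busy waiting,
    # the inner tasks can never start.
    inner = [pool.submit(square, v) for v in values]
    try:
        return [f.result(timeout=0.5) for f in inner]
    except FutureTimeout:
        # Without the timeout this call would block indefinitely.
        return "stalled"


def nonblocking_outer(pool, values):
    # Hand the unfinished futures back to the caller instead of waiting here,
    # so the worker is released and can run the inner tasks.
    return [pool.submit(square, v) for v in values]


def run_demo():
    # One worker makes the starvation easy to reproduce (an assumption of
    # this sketch; any pool with fewer free workers than waiters behaves
    # the same way).
    with ThreadPoolExecutor(max_workers=1) as pool:
        stalled = pool.submit(blocking_outer, pool, [1, 2, 3]).result()
        inner_futures = pool.submit(nonblocking_outer, pool, [1, 2, 3]).result()
        # Only the caller, which is not a pool thread, blocks on the results.
        values = [f.result() for f in inner_futures]
    return stalled, values


if __name__ == "__main__":
    print(run_demo())
```

In the second variant the only thread that ever blocks is the caller outside 
the pool, which is the property the futures need to keep as they are passed 
up the chain.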

 

> Dataset API calls with nested parallelism does not progress
> -----------------------------------------------------------
>
>                 Key: ARROW-13088
>                 URL: https://issues.apache.org/jira/browse/ARROW-13088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 3.0.0, 4.0.1
>            Reporter: Jayjeet Chakraborty
>            Priority: Major
>
> Launching nested threads using the C++ ThreadPool causes the Dataset API 
> (C++/Python using the SyncScanner/AsyncScanner) to stop progressing. The 
> execution just hangs and keeps waiting. Examples of nested threading:
>  # Force turn on parallel column reads in ParquetFileFormat by tweaking the 
> code 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L327]
>  to get into nested parallelism (for every scan task launched in parallel to 
> spawn threads internally). 
>  # Create a simple dataset containing Parquet files of count X (where X >= 
> No. of logical cores). 
>  # {{Just do a dataset.to_table() call.}}
> The execution should stop and go into a kind of a sleep state immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
