[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556480#comment-17556480
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1160693399

   @shangxinli Thank you for the review! I'll address these comments asap.
   I am reviewing the thread pool and its initialization. IMO, it is better if 
there is no default initialization of the pool and the calling 
application/framework does so explicitly. One side effect of the default 
initialization is that the pool is created unnecessarily even if async is off. 
Also, if an application, shades and includes another copy of the library (or 
transitively, many more), then one more thread pool gets created for every 
version of the library included. 
   It is probably a better idea to allow the thread pool to be assigned as a 
per instance variable. The calling application can then decide to use a single 
pool for all instances or a new one per instance whichever use case is better 
for their performance.
   Finally, some large scale testing has revealed a possible resource leak. I'm 
looking into addressing it. 




> Implement async IO for Parquet file reader
> ------------------------------------------
>
>                 Key: PARQUET-2149
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified) - 
>       - For every column -> Read from storage in 8MB blocks -> Read all 
> uncompressed pages into output queue 
>       - From output queues -> (downstream ) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked 
> until the data has been read. Because a large part of the time spent is 
> waiting for data from storage, threads are idle and CPU utilization is really 
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So 
> For Column _i_ -> reading one chunk until end, from storage -> intermediate 
> output queue -> read one uncompressed page until end -> output queue -> 
> (downstream ) decompression + decoding
> Note that this can be made completely self contained in ParquetFileReader and 
> downstream implementations like Iceberg and Spark will automatically be able 
> to take advantage without code change as long as the ParquetFileReader apis 
> are not changed. 
> In past work with async io  [Drill - async page reader 
> |https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]
>  , I have seen 2x-3x improvement in reading speed for Parquet files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to