[ 
https://issues.apache.org/jira/browse/PARQUET-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917185#comment-16917185
 ] 

ASF GitHub Bot commented on PARQUET-1641:
-----------------------------------------

samarthjain commented on pull request #670: PARQUET-1641 Do not cache 
decompressors in CodecFactory if the decompressor is not meant to be pooled
URL: https://github.com/apache/parquet-mr/pull/670
 
 
   Decompressors like GZipDecompressor are not meant to be pooled. However, the 
CodecFactory was still caching them. With this change we now check whether the 
decompressor's class is annotated with @DoNotPool; if it is, we do not cache the 
decompressor in CodecFactory.
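The check described above can be sketched in plain Java. Everything below is simplified and hypothetical: the real CodecFactory keys its cache by codec name and checks Hadoop's @DoNotPool annotation, while this sketch defines its own @DoNotPool and keys the cache by class, just to show the pattern:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's @DoNotPool annotation.
@Retention(RetentionPolicy.RUNTIME)
@interface DoNotPool {}

interface Decompressor {}

// A decompressor that is safe to cache and reuse.
class PooledDecompressor implements Decompressor {}

// A decompressor that must NOT be cached (as with gzip in Hadoop).
@DoNotPool
class UnpooledDecompressor implements Decompressor {}

// Simplified factory: only the @DoNotPool check matters here.
class SimpleCodecFactory {
    private final Map<Class<? extends Decompressor>, Decompressor> cache = new HashMap<>();

    Decompressor getDecompressor(Class<? extends Decompressor> type) {
        // Skip the cache entirely when the class carries @DoNotPool.
        if (type.isAnnotationPresent(DoNotPool.class)) {
            return create(type);
        }
        return cache.computeIfAbsent(type, t -> create(t));
    }

    private Decompressor create(Class<? extends Decompressor> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

public class CodecFactoryDemo {
    public static void main(String[] args) {
        SimpleCodecFactory factory = new SimpleCodecFactory();
        // Cached type: the same instance comes back on every call.
        System.out.println(factory.getDecompressor(PooledDecompressor.class)
                == factory.getDecompressor(PooledDecompressor.class)); // true
        // @DoNotPool type: a fresh instance on every call.
        System.out.println(factory.getDecompressor(UnpooledDecompressor.class)
                == factory.getDecompressor(UnpooledDecompressor.class)); // false
    }
}
```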
   
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Parquet pages for different columns cannot be read in parallel 
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1641
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1641
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Priority: Major
>              Labels: pull-request-available
>
> All ColumnChunkPageReader instances use the same decompressor. 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]
> {code:java}
> BytesInputDecompressor decompressor =
>     options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
> return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
> {code}
> The CodecFactory caches the decompressor for every codec type, returning the 
> same instance on every getDecompressor(codecName) call. See the caching 
> happening here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]
> {code:java}
> @Override
> public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
>   BytesDecompressor decomp = decompressors.get(codecName);
>   if (decomp == null) {
>     decomp = createDecompressor(codecName);
>     decompressors.put(codecName, decomp);
>   }
>   return decomp;
> }
> {code}
>  
> If multiple threads try to read pages belonging to different columns, they run 
> into thread-safety issues. This prevents applications from increasing the 
> throughput of parquet reads by parallelizing page reads.
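To make the hazard in the quoted description concrete, here is a toy, deterministic interleaving run on a single thread, simulating what two concurrent column readers could do to one shared instance. `ToyDecompressor` is purely illustrative and is not the Hadoop or Parquet API; real decompressors are stateful in a similar way between setInput() and decompress():

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a stateful decompressor that keeps its input
// between setInput() and decompress(), like real codec decompressors.
class ToyDecompressor {
    private byte[] input;

    void setInput(byte[] in) {
        this.input = in;
    }

    // "Decompression" is the identity here; the shared state is the point.
    byte[] decompress() {
        return input.clone();
    }
}

public class SharedDecompressorDemo {
    public static void main(String[] args) {
        // Both column readers received the SAME cached instance.
        ToyDecompressor shared = new ToyDecompressor();

        // Reader A sets its input...
        shared.setInput("pageOfColumnA".getBytes(StandardCharsets.UTF_8));
        // ...but reader B runs in between and overwrites it.
        shared.setInput("pageOfColumnB".getBytes(StandardCharsets.UTF_8));

        // Reader A now decompresses and silently gets column B's bytes.
        String result = new String(shared.decompress(), StandardCharsets.UTF_8);
        System.out.println(result); // prints "pageOfColumnB", not "pageOfColumnA"
    }
}
```

A per-reader (or pooled-per-use) decompressor removes this interleaving entirely, which is what parallel page reads require.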



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
