[ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843722#comment-15843722 ]
ASF GitHub Bot commented on DRILL-5207:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/723#discussion_r98269034

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java ---
    @@ -41,26 +42,33 @@
     import java.io.IOException;
     import java.nio.ByteBuffer;
     import java.util.concurrent.Callable;
    +import java.util.concurrent.ConcurrentLinkedQueue;
     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Future;
    +import java.util.concurrent.LinkedBlockingQueue;
     import java.util.concurrent.TimeUnit;

     import static org.apache.parquet.column.Encoding.valueOf;

     class AsyncPageReader extends PageReader {
       static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);

    -  private ExecutorService threadPool;
    -  private Future<ReadStatus> asyncPageRead;
    +  private long queueSize;
    +  private LinkedBlockingQueue<ReadStatus> pageQueue;
    +  private ConcurrentLinkedQueue<Future<Boolean>> asyncPageRead;
    +  private long totalPageValuesRead = 0;

       AsyncPageReader(ColumnReader<?> parentStatus, FileSystem fs, Path path,
           ColumnChunkMetaData columnChunkMetaData) throws ExecutionSetupException {
         super(parentStatus, fs, path, columnChunkMetaData);
    -    if (threadPool == null) {
    +    if (threadPool == null & asyncPageRead == null) {
    --- End diff --

    Thanks for catching this!


> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.10
>
>
> The parquet reader's async page reader is not quite efficiently pipelined. The default size of the disk read buffer is 4MB while the page reader reads ~1MB at a time. The Parquet decode is also processing 1MB at a time. This means the disk is idle while the data is being processed. Reducing the buffer to 1MB will reduce the time the processing thread waits for the disk read thread.
> Additionally, since the data to process a page may be more or less than 1MB, a queue of pages will help so that the disk scan does not block (until the queue is full) while waiting for the processing thread.
> Additionally, the BufferedDirectBufInputStream class reads from disk as soon as it is initialized. Since this is called at setup time, this increases the setup time for the query, and query execution does not begin until this is completed.
> There are a few other inefficiencies - options are read every time a page reader is created. Reading options can be expensive.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
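The queue-based pipelining the issue describes (and that the diff's `LinkedBlockingQueue<ReadStatus>` field hints at) is a standard bounded producer/consumer pattern. A minimal standalone sketch of the idea follows; `PageQueueDemo`, `readAndProcess`, and the simplified `ReadStatus` here are illustrative stand-ins, not Drill's actual classes:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the DRILL-5207 pipelining idea: a disk-read thread (producer)
// fills a bounded queue of page-read results while the decode loop (consumer)
// drains it. The bound keeps the reader from running arbitrarily far ahead of
// the processor; neither side blocks unless the queue is full or empty.
public class PageQueueDemo {

  // Simplified stand-in for Drill's ReadStatus: the result of one page read.
  static class ReadStatus {
    final int pageId;
    ReadStatus(int pageId) { this.pageId = pageId; }
  }

  static int readAndProcess(int totalPages, int queueCapacity) throws Exception {
    // Bounded queue: the producer blocks once it is queueCapacity pages ahead.
    BlockingQueue<ReadStatus> pageQueue = new LinkedBlockingQueue<>(queueCapacity);
    ExecutorService threadPool = Executors.newSingleThreadExecutor();

    // Producer: simulates the async disk-read thread queueing pages in order.
    Future<Boolean> reader = threadPool.submit(() -> {
      for (int i = 0; i < totalPages; i++) {
        pageQueue.put(new ReadStatus(i)); // blocks only when the queue is full
      }
      return Boolean.TRUE;
    });

    // Consumer: the decode loop takes pages as they become available.
    int processed = 0;
    for (int i = 0; i < totalPages; i++) {
      ReadStatus page = pageQueue.take(); // blocks only when the queue is empty
      if (page.pageId == i) {             // FIFO queue preserves page order
        processed++;
      }
    }
    reader.get(10, TimeUnit.SECONDS);     // surface any producer failure
    threadPool.shutdown();
    return processed;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(readAndProcess(100, 4)); // prints 100
  }
}
```

With a capacity of 4, the reader stays at most four pages ahead of the decoder, which matches the issue's goal: the disk thread keeps working while pages are processed, but memory use stays bounded.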