[ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843722#comment-15843722 ]
ASF GitHub Bot commented on DRILL-5207:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/723#discussion_r98269034

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java ---
    @@ -41,26 +42,33 @@
     import java.io.IOException;
     import java.nio.ByteBuffer;
     import java.util.concurrent.Callable;
    +import java.util.concurrent.ConcurrentLinkedQueue;
     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Future;
    +import java.util.concurrent.LinkedBlockingQueue;
     import java.util.concurrent.TimeUnit;

     import static org.apache.parquet.column.Encoding.valueOf;

     class AsyncPageReader extends PageReader {
       static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);

    -  private ExecutorService threadPool;
    -  private Future<ReadStatus> asyncPageRead;
    +  private long queueSize;
    +  private LinkedBlockingQueue<ReadStatus> pageQueue;
    +  private ConcurrentLinkedQueue<Future<Boolean>> asyncPageRead;
    +  private long totalPageValuesRead = 0;

       AsyncPageReader(ColumnReader<?> parentStatus, FileSystem fs, Path path,
           ColumnChunkMetaData columnChunkMetaData) throws ExecutionSetupException {
         super(parentStatus, fs, path, columnChunkMetaData);
    -    if (threadPool == null) {
    +    if (threadPool == null & asyncPageRead == null) {
    --- End diff --

    Thanks for catching this!


> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.10
>
>
> The parquet reader's async page reader is not quite efficiently pipelined. The default size of the disk read buffer is 4MB while the page reader reads ~1MB at a time. The Parquet decode is also processing 1MB at a time. This means the disk is idle while the data is being processed. Reducing the buffer to 1MB will reduce the time the processing thread waits for the disk read thread.
> Additionally, since the data to process a page may be more or less than 1MB, a queue of pages will help so that the disk scan does not block (until the queue is full) while waiting for the processing thread.
> Additionally, the BufferedDirectBufInputStream class reads from disk as soon as it is initialized. Since this is called at setup time, this increases the setup time for the query, and query execution does not begin until this is completed.
> There are a few other inefficiencies - options are read every time a page reader is created. Reading options can be expensive.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
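The queue-based pipelining the issue describes (and that the diff's `LinkedBlockingQueue<ReadStatus>` field hints at) is a standard bounded producer/consumer pattern. A minimal standalone sketch of the idea follows; `PageQueueDemo`, `readAndProcess`, and the simplified `ReadStatus` here are illustrative stand-ins, not Drill's actual classes:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the DRILL-5207 pipelining idea: a disk-read thread (producer)
// fills a bounded queue of page-read results while the decode loop (consumer)
// drains it. The bound keeps the reader from running arbitrarily far ahead of
// the processor; neither side blocks unless the queue is full or empty.
public class PageQueueDemo {

  // Simplified stand-in for Drill's ReadStatus: the result of one page read.
  static class ReadStatus {
    final int pageId;
    ReadStatus(int pageId) { this.pageId = pageId; }
  }

  static int readAndProcess(int totalPages, int queueCapacity) throws Exception {
    // Bounded queue: the producer blocks once it is queueCapacity pages ahead.
    BlockingQueue<ReadStatus> pageQueue = new LinkedBlockingQueue<>(queueCapacity);
    ExecutorService threadPool = Executors.newSingleThreadExecutor();

    // Producer: simulates the async disk-read thread queueing pages in order.
    Future<Boolean> reader = threadPool.submit(() -> {
      for (int i = 0; i < totalPages; i++) {
        pageQueue.put(new ReadStatus(i)); // blocks only when the queue is full
      }
      return Boolean.TRUE;
    });

    // Consumer: the decode loop takes pages as they become available.
    int processed = 0;
    for (int i = 0; i < totalPages; i++) {
      ReadStatus page = pageQueue.take(); // blocks only when the queue is empty
      if (page.pageId == i) {             // FIFO queue preserves page order
        processed++;
      }
    }
    reader.get(10, TimeUnit.SECONDS);     // surface any producer failure
    threadPool.shutdown();
    return processed;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(readAndProcess(100, 4)); // prints 100
  }
}
```

With a capacity of 4, the reader stays at most four pages ahead of the decoder, which matches the issue's goal: the disk thread keeps working while pages are processed, but memory use stays bounded.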