Riza Suminto has uploaded a new patch set (#14) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin by iterating over the column streams. If a column stream is
eligible for async IO, the scanner creates a ColumnRange, registers a
ScannerContext::Stream for that ORC stream, and starts the stream.

We modify HdfsOrcScanner::ScanRangeInputStream::read to check whether
there is a matching ColumnRange for the given offset and length. If so,
the read continues through HdfsOrcScanner::ColumnRange::read.

We leverage the existing async IO methods of the HdfsParquetScanner
class for the initial memory allocations, and move related methods such
as DivideReservationBetweenColumns and ComputeIdealReservation up to
the HdfsColumnarScanner class.

Currently, there are corner cases where the planner might underestimate
the number of async IO streams for a table. For example, a
"select count(*)" over a complex type column might have an empty
desc_.getSlots() in HdfsScanNode.computeMinColumnMemReservations, yet
HdfsOrcScanner::StartColumnReading later sees a couple of streams that
are eligible for async IO. In this situation, HdfsOrcScanner tries to
increase the reservation by 8KB (min_buffer_size) for each eligible
stream. Once a reservation increment fails, it reads the rest of the
streams synchronously.
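The ColumnRange dispatch described above can be sketched roughly as
follows. This is a simplified, hypothetical model for illustration, not
the actual Impala code: `ColumnRange` here keeps only offset/length
bookkeeping, and `FindColumnRange` is an invented helper standing in for
the matching check done inside ScanRangeInputStream::read.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplification of HdfsOrcScanner::ColumnRange: the byte
// range of one ORC column stream that was registered for async IO.
struct ColumnRange {
  int64_t offset;  // start of the column stream within the file
  int64_t length;  // length of the stream in bytes
  bool Contains(int64_t off, int64_t len) const {
    return off >= offset && off + len <= offset + length;
  }
};

// Returns the ColumnRange that fully covers the requested read at
// [off, off + len), or nullptr if no range matches and the read must
// fall back to the synchronous path.
const ColumnRange* FindColumnRange(
    const std::vector<ColumnRange>& ranges, int64_t off, int64_t len) {
  for (const ColumnRange& r : ranges) {
    if (r.Contains(off, len)) return &r;
  }
  return nullptr;
}
```

A read that matches a registered range is served from the already
started ScannerContext::Stream; anything else (and every stream once a
reservation increment fails) goes through the synchronous read path.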
To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) of a full TPC-DS run at 10 TB scale on 19
executors, with varying ORC_ASYNC_READ and DISABLE_DATA_CACHE options,
as follows:

+--------------------------+----------------------+---------------------+
| Total time (ms)          | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false | 3511075              | 3484736             |
| DISABLE_DATA_CACHE=true  | 5243337              | 4370095             |
+--------------------------+----------------------+---------------------+

+--------------------------+----------------------+---------------------+
| Geomean (ms)             | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false | 12786.58042          | 12454.80365         |
| DISABLE_DATA_CACHE=true  | 23081.10888          | 16692.31512         |
+--------------------------+----------------------+---------------------+

Testing:
- Passed core tests.
Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 495 insertions(+), 216 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/14
--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>