Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin by iterating over the column streams. If a column stream is
eligible for async IO, the scanner creates a ColumnRange, registers a
ScannerContext::Stream for that ORC stream, and starts the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
read continues through HdfsOrcScanner::ColumnRange::read.

We leverage the existing async IO methods of the HdfsParquetScanner class
for initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to the
HdfsColumnarScanner class.

The planner calculates the memory reservation differently for async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide it as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will be
split into four 1MB reservations because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH).
This splitting is required because each async IO stream needs to start
with an 8KB (min_buffer_size) initial memory reservation.
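The per-stream split described above can be illustrated with a small
sketch. This is not the planner code from the patch (which lives in
HdfsScanNode.java); the function name SplitColumnReservation, the
constant kMinBufferSize, and the choice to clamp each share up to the
8KB minimum are all assumptions made for illustration. The stream count
is passed in by the caller rather than derived from the column type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed value of min_buffer_size, taken from the commit message (8KB).
constexpr int64_t kMinBufferSize = 8 * 1024;

// Hypothetical helper: divides a column's estimated reservation evenly
// across its estimated number of ORC streams. Clamping each share up to
// kMinBufferSize is an assumption of this sketch, reflecting the stated
// requirement that every async IO stream starts with at least 8KB.
std::vector<int64_t> SplitColumnReservation(int64_t column_bytes,
                                            int num_streams) {
  assert(num_streams > 0);
  const int64_t per_stream =
      std::max(column_bytes / num_streams, kMinBufferSize);
  return std::vector<int64_t>(num_streams, per_stream);
}
```

Under these assumptions, a string column with a 4MB estimate and four
streams yields four 1MB shares, matching the example in the commit
message, while a 16KB estimate over four streams is raised to 8KB per
stream.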

To show the improvement from ORC async IO, we compare the total time and
geomean (in milliseconds) of a full TPC-DS 10 TB run on 19 executors,
with varying ORC_ASYNC_READ and DISABLE_DATA_CACHE options, as follows:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 | 3511075          | 3484736          |
| DISABLE_DATA_CACHE=1 | 5243337          | 4370095          |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 | 12786.58042      | 12454.80365      |
| DISABLE_DATA_CACHE=1 | 23081.10888      | 16692.31512      |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Reviewed-on: http://gerrit.cloudera.org:8080/15370
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/request-ranges.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
19 files changed, 720 insertions(+), 279 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 32
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>