Hello Thomas Tauber-Marshall,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#10).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
......................................................................

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that survive
conjunct evaluation and may reference disk I/O buffers directly. This
means that perf should not regress for the following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants
to help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:

  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the extra
memcpy() calls required.
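For readers who want a picture of where the extra memcpy() calls come
from, here is a minimal sketch of the copy-out idea described above. It
is not the code in this patch: StringRef, Pool and FilterAndCopyOut are
hypothetical stand-ins invented for illustration, and the predicate
plays the role of conjunct evaluation. The point it tries to show is
that only strings in surviving rows are copied, so a selective scan
pays for few copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a string slot: a pointer and a length that may
// point straight into an I/O buffer. Not Impala's actual string type.
struct StringRef {
  const char* ptr;
  std::size_t len;
};

// Hypothetical stand-in for a memory pool that owns the copied bytes.
class Pool {
 public:
  char* Allocate(std::size_t n) {
    chunks_.emplace_back(new char[n]);
    bytes_allocated_ += n;
    return chunks_.back().get();
  }
  std::size_t bytes_allocated() const { return bytes_allocated_; }

 private:
  std::vector<std::unique_ptr<char[]>> chunks_;
  std::size_t bytes_allocated_ = 0;
};

// Evaluates `keep` on each row first, then copies only the surviving
// strings into pool-owned memory so they no longer reference the buffer.
template <typename Pred>
std::vector<StringRef> FilterAndCopyOut(const std::vector<StringRef>& rows,
                                        Pred keep, Pool* pool) {
  assert(pool != nullptr);
  std::vector<StringRef> out;
  for (const StringRef& row : rows) {
    if (!keep(row)) continue;  // Filtered out: no allocation, no memcpy.
    char* copy = pool->Allocate(row.len);
    if (row.len > 0) std::memcpy(copy, row.ptr, row.len);
    out.push_back(StringRef{copy, row.len});
  }
  return out;
}
```

In this sketch the cost of the copy scales with the bytes that pass the
predicate, which is consistent with the predicate query below not
regressing while the unfiltered single-column query does.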

Selecting one string from a table with multiple columns:
  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:
  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:
  select min(l_comment) from biglineitem_comment where length(l_comment) > 10000;
  1.691 -> 1.449

Selecting all columns:
  select min(l_orderkey), min(l_partkey), min(l_suppkey),
      min(l_linenumber), min(l_quantity), min(l_extendedprice),
      min(l_discount), min(l_tax), min(l_returnflag),
      min(l_linestatus), min(l_shipdate), min(l_commitdate),
      min(l_receiptdate), min(l_shipinstruct), min(l_shipmode),
      min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 331 insertions(+), 56 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/10
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>