Adar Dembo has submitted this change and it was merged.

Change subject: tpch: improve encodings and compression
......................................................................

tpch: improve encodings and compression

Previously all of the columns had been hard-coded to 'PLAIN' encoding.
This is no longer our default, nor would we recommend it for the types
of data used in the TPCH dataset.

This switches to default encodings everywhere, and also enables LZ4
compression on the "Comment" column.

The reduction in data size is as follows:

original:
  size: 993MB
  median scan time for TPCH1 query: 0.8685 sec

with LZ4 'comment':
  size: 901MB (1.1x compression vs original)
  scan time: unaffected (query does not read comment column)

with LZ4 'comment' and new encodings:
  size: 342MB (2.9x compression vs original)
  median scan time: 0.8488 sec

Per the above, the on-disk size is reduced by almost 3x and the scan
performance is improved by a couple percent (perhaps within the realm of
measurement error). This workload is small enough to be fully
RAM-resident, but in a larger dataset which is disk-bound on reads, the
space reduction should yield a corresponding improvement in scan
performance.

Change-Id: I168eb1c4ff619556f6879a20fe335a6158d0e81b
Reviewed-on: http://gerrit.cloudera.org:8080/5689
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <a...@cloudera.com>
---
M src/kudu/benchmarks/tpch/tpch-schemas.h
1 file changed, 9 insertions(+), 8 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/5689
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I168eb1c4ff619556f6879a20fe335a6158d0e81b
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
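
As a quick sanity check on the figures quoted in the commit message above, the compression ratios and the scan-time change can be recomputed from the reported sizes and timings. This is only an illustrative sketch: the constant and function names are made up for this example and are not part of the Kudu source tree.

```python
# Figures copied from the commit message (sizes in MB, times in seconds).
ORIGINAL_MB = 993
LZ4_COMMENT_MB = 901
LZ4_AND_ENCODINGS_MB = 342
ORIGINAL_SCAN_SEC = 0.8685
NEW_SCAN_SEC = 0.8488


def compression_ratio(original_mb, reduced_mb):
    """Return how many times smaller the reduced data set is."""
    return original_mb / reduced_mb


def percent_improvement(before, after):
    """Return the relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


lz4_ratio = compression_ratio(ORIGINAL_MB, LZ4_COMMENT_MB)
full_ratio = compression_ratio(ORIGINAL_MB, LZ4_AND_ENCODINGS_MB)
scan_gain = percent_improvement(ORIGINAL_SCAN_SEC, NEW_SCAN_SEC)

print(f"LZ4 'comment' only:      {lz4_ratio:.1f}x")
print(f"LZ4 + default encodings: {full_ratio:.1f}x")
print(f"Scan time improvement:   {scan_gain:.1f}%")
```

The recomputed ratios round to the 1.1x and 2.9x stated in the message, and the scan-time difference comes out at a little over 2%, consistent with "a couple percent".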