[ https://issues.apache.org/jira/browse/IMPALA-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071379#comment-17071379 ]
ASF subversion and git services commented on IMPALA-3766:
---------------------------------------------------------

Commit ebbe52b4bed944d3012e3679dc984827ce11d5a8 in impala's branch refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ebbe52b ]

IMPALA-3766: optionally compress spilled data

Enabled via --disk_spill_compression_codec, which uses the same
syntax as the compression_codec query option. Recommended codecs
are LZ4 and ZSTD. ZSTD supports specifying a compression level.

The compression is done in TmpFileMgr using a temporary buffer.
Allocation of disk space is reworked slightly so that the
allocation can happen after compression.

The default power-of-two disk block sizes would lead to a lot of
internal fragmentation, so a new strategy for free space management,
similar to that used in the data cache, can be used with
--disk_spill_punch_holes=true. TmpFileMgr will allocate a range of
the actual compressed size and punch holes in the file for each
range that is no longer needed.

UncompressedWriteIoBytes is added to the buffer pool profiles, so
that you can see what degree of compression is achieved. Typically
I saw ratios of 2-3x for LZ4 and ZSTD (with LZ4 toward the lower
end and ZSTD toward the higher end).

Limitations:
The management of the compression buffer memory could be improved.
Ideally it would be integrated with the buffer pool and use the
buffer pool allocator instead of being done "on the side". We would
probably want to do this before making this the default, for
resource management and performance reasons (doing a malloc()
directly does not use the caching supported by the buffer pool).

Testing:
* Run buffer pool spilling tests with different combinations of
  the new options.
* Extend existing TmpFileMgr tests for file space allocation to
  run with hole punching enabled.
* Switch a couple of spilling tests to use the new option.
* Add a metrics test to check for scratch leaks.
* Enable the new options by default for end-to-end dockerized
  tests to get additional coverage.
* Add a unit test where allocating compression memory fails,
  both on the read and write path.
* Ran a single-node stress test on TPC-DS SF 1 and TPC-H SF 10

The peak compression buffer usage was ~40MB.

Perf:
I ran this spilling query using an SSD as the scratch disk:

  set mem_limit=200m;
  select count(distinct l_partkey) from tpch30_parquet.lineitem;

The time taken for the second run of each query was:
No compression: 19.59s
LZ4: 18.56s
ZSTD: 20.59s

Change-Id: I9c08ff9504097f0fee8c32316c5c150136abe659
Reviewed-on: http://gerrit.cloudera.org:8080/15454
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Bikramjeet Vig <bikramjeet....@cloudera.com>


> Optionally compress spilled data before writing it to disk
> ----------------------------------------------------------
>
>                 Key: IMPALA-3766
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3766
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Impala 2.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Tim Armstrong
>            Priority: Minor
>              Labels: performance
>
> Evaluate compressing the buffers before writing them to disk for spilling
> operators.
> Applying LZ4 on row batches before sending them over the network as part of
> exchange provides around 2x compression.
> {code}
>          - BytesSent: 612.87 MB (642635712)
>          - NetworkThroughput(*): 1.88 GB/sec
>          - OverallThroughput: 1.21 GB/sec
>          - PeakMemoryUsage: 51.00 KB (52224)
>          - RowsReturned: 360.00K (360000)
>          - SerializeBatchTime: 176.002ms
>          - TransmitDataRPCTime: 319.005ms
>          - UncompressedRowBatchSize: 1.47 GB (1573356320)
> {code}
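
As a rough check on the "around 2x" figure in the issue description, the ratio can be recomputed from the two byte counters in the profile snippet above. This is only an illustrative Python sketch: the helper name compression_ratio is hypothetical and not part of Impala, and the inputs are the raw byte counts shown in parentheses for UncompressedRowBatchSize and BytesSent.

```python
def compression_ratio(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """Return uncompressed size divided by compressed size.

    Values above 1.0 mean the data shrank. Hypothetical helper for
    illustration only; it is not an Impala API.
    """
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return uncompressed_bytes / compressed_bytes


# Raw byte counts copied from the exchange profile in the issue description.
UNCOMPRESSED_ROW_BATCH_SIZE = 1573356320  # UncompressedRowBatchSize
BYTES_SENT = 642635712                    # BytesSent

ratio = compression_ratio(UNCOMPRESSED_ROW_BATCH_SIZE, BYTES_SENT)
print(f"LZ4 exchange compression ratio: {ratio:.2f}x")
```

For these counters the ratio comes out at a little over 2.4x, which sits inside the 2-3x range reported in the commit message for spilled data. Presumably the same arithmetic would apply to spilling by dividing UncompressedWriteIoBytes by the number of bytes actually written to scratch, though which profile counter holds the latter is an assumption here.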