[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd

[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhaojing Yu updated HUDI-2928:
    Fix Version/s: 0.13.0 (was: 0.12.1)

Key: HUDI-2928
URL: https://issues.apache.org/jira/browse/HUDI-2928
Project: Apache Hudi
Issue Type: Improvement
Components: performance, storage-management
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Priority: Critical
Labels: pull-request-available
Fix For: 0.13.0
Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png

With Gzip as the default, we currently prioritize compression ratio (storage cost) at the expense of:

* Compute on the write path: roughly 30% of the compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see attachments).
* Compute on the read path, and therefore query latency: queries scanning large datasets are likely to be compression-/CPU-bound, since Gzip throughput is 3-4x lower than Snappy or Zstd (example: https://stackoverflow.com/a/56410326/3520840).

P.S. Spark switched its default compression algorithm to Snappy a while ago (https://github.com/apache/spark/pull/12256).

EDIT: We should actually evaluate putting in Zstd (https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/) instead of Snappy. It has compression ratios comparable to Gzip while delivering much better performance (see attachment image-2021-12-03-13-13-02-892.png).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
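For anyone who wants to experiment ahead of a default change, Hudi exposes the Parquet codec as a write config. The key below appears in Hudi's configuration reference, but whether a zstd value is accepted depends on the Hudi, Parquet, and Hadoop versions in use; treat this as a sketch to verify against your deployment, not a confirmed setting:

```
# Per-writer override of the Parquet compression codec
# (default has historically been gzip; zstd support varies by version)
hoodie.parquet.compression.codec=zstd
```

The same key can be passed as a datasource option on a Spark write, alongside the usual Hudi table options.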
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Sprint: (was: 2022/09/05) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2928: -- Sprint: 2022/09/05 (was: 2022/09/19) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2928: -- Story Points: 2 > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Sprint: 2022/09/19 (was: 2022/08/22) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-2928: -- Fix Version/s: 0.12.1 (was: 0.12.0) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Sprint: Hudi-Sprint-Jan-10, 2022/08/08 (was: Hudi-Sprint-Jan-10) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2928: -- Sprint: Hudi-Sprint-Jan-10 (was: Hudi-Sprint-Jan-10, 2022/05/02) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Priority: Critical (was: Blocker) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Sprint: Hudi-Sprint-Jan-10, 2022/05/02 (was: Hudi-Sprint-Jan-10) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Component/s: performance storage-management Epic Link: HUDI-3249 > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Task > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Issue Type: Improvement (was: Task) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Fix Version/s: 0.12.0 > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Improvement > Components: performance, storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2928: - Fix Version/s: (was: 0.11.0) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Sprint: Hudi-Sprint-Jan-10 (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2928: - Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18 (was: Hudi-Sprint-Jan-10) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2928: - Status: Open (was: In Progress) > Evaluate rebasing Hudi's default compression from Gzip to Zstd > -- > > Key: HUDI-2928 > URL: https://issues.apache.org/jira/browse/HUDI-2928 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot > 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png > > > Currently, having Gzip as a default we prioritize Compression/Storage cost at > the expense of > * Compute (on the {+}write-path{+}): about *30%* of Compute burned during > bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) > * Compute (on the {+}read-path{+}), as well as queries Latencies: queries > scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put > is *3-4x* less than Snappy, Zstd, > [EX|https://stackoverflow.com/a/56410326/3520840]) > P.S Spark switched its default compression algorithm to Snappy [a while > ago|https://github.com/apache/spark/pull/12256]. > > *EDIT* > We should actually evaluate putting in > [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > instead of Snappy. It has compression ratios comparable to Gzip, while > bringing in much better performance: > !image-2021-12-03-13-13-02-892.png! > [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Status: In Progress (was: Open)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Sprint: Hudi-Sprint-Jan-10
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Status: Open (was: In Progress)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Status: In Progress (was: Open)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2928: - Status: Open (was: Patch Available)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2928: - Priority: Blocker (was: Major)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Attachment: Screen Shot 2021-12-06 at 11.49.05 AM.png
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Status: Patch Available (was: In Progress)
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Description: updated to append the link [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] after the benchmark image.
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-2928: - Labels: pull-request-available (was: )
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Fix Version/s: 0.11.0
[ https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2928: -- Description: updated ("Gzip t/put is *3-4x* less than Snappy's" became "Gzip t/put is *3-4x* less than Snappy, Zstd") Summary: Evaluate rebasing Hudi's default compression from Gzip to Zstd (was: Evaluate rebasing Hudi's default compression from Gzip to Snappy)