[
https://issues.apache.org/jira/browse/SPARK-33978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun reassigned SPARK-33978:
-
Assignee: Dongjoon Hyun
> Support ZSTD compression in ORC data source
> ---
>
> Key: SPARK-33978
> URL: https://issues.apache.org/jira/browse/SPARK-33978
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
>
> h3. What changes were proposed in this pull request?
> This PR aims to support ZSTD compression in the ORC data source.
> h3. Why are the changes needed?
> Apache ORC 1.6 supports ZSTD compression, which produces more compact files
> and reduces storage cost.
> *BEFORE*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
> java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
> {code}
> *AFTER*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
> {code}
> {code:java}
> $ orc-tools meta /tmp/zstd
> Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
> Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
> File Version: 0.12 with ORC_14
> Rows: 1
> Compression: ZSTD
> Compression size: 262144
> Calendar: Julian/Gregorian
> Type: struct<id:bigint>
> Stripe Statistics:
> Stripe 1:
> Column 0: count: 1 hasNull: false
> Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
> File Statistics:
> Column 0: count: 1 hasNull: false
> Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
> Stripes:
> Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
> Stream: column 0 section ROW_INDEX start: 3 length 11
> Stream: column 1 section ROW_INDEX start: 14 length 24
> Stream: column 1 section DATA start: 38 length 6
> Encoding column 0: DIRECT
> Encoding column 1: DIRECT_V2
> File length: 230 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
> User Metadata:
> org.apache.spark.version=3.2.0
> {code}
>
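The BEFORE/AFTER behavior quoted above can be illustrated with a small, self-contained Scala sketch. It is not Spark's actual source: the object name OrcCodecSketch, the two maps, and the resolve helper are hypothetical, the case-insensitive lookup is an assumption, and the right-hand values are assumed to correspond to ORC compression kind names. The "before" set of short names is taken from the error message in the description.

```scala
import java.util.Locale

// Hypothetical sketch of short-name codec resolution; not Spark's code.
object OrcCodecSketch {
  // Short names listed as available in the BEFORE error message.
  val before: Map[String, String] = Map(
    "uncompressed" -> "NONE",
    "none" -> "NONE",
    "lzo" -> "LZO",
    "snappy" -> "SNAPPY",
    "zlib" -> "ZLIB")

  // With the proposed change, "zstd" is accepted as well.
  val after: Map[String, String] = before + ("zstd" -> "ZSTD")

  // Resolve a user-supplied name (case-insensitive by assumption),
  // failing with a message shaped like the one in the description.
  def resolve(codecs: Map[String, String], name: String): String = {
    val key = name.toLowerCase(Locale.ROOT)
    codecs.getOrElse(key, {
      // Sorted so the message is deterministic in this sketch.
      val available = codecs.keys.toSeq.sorted.mkString(", ")
      throw new IllegalArgumentException(
        s"Codec [$key] is not available. Available codecs are $available.")
    })
  }
}
```

With these definitions, resolve(OrcCodecSketch.before, "zstd") throws an IllegalArgumentException, while resolve(OrcCodecSketch.after, "zstd") returns the ZSTD kind, mirroring the two spark-shell sessions in the description.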