[ https://issues.apache.org/jira/browse/SPARK-33978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-33978:
----------------------------------
    Description: 
h3. What changes were proposed in this pull request?

This PR aims to support ZSTD compression in the ORC data source.

h3. Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression, which generates more compact files and reduces storage cost.

*BEFORE*
{code:java}
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
{code}

*AFTER*
{code:java}
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
{code}
{code:java}
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
{code}

> Support ZSTD compression in ORC data source
> -------------------------------------------
>
>                 Key: SPARK-33978
>                 URL: https://issues.apache.org/jira/browse/SPARK-33978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
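The BEFORE error in the description reads like the result of a short-name lookup against a fixed list of codecs. The following is a minimal Scala sketch of that kind of check, not Spark's actual implementation. The object name OrcCodecLookupSketch, the resolve helper, and the before/after tables are hypothetical. The short names are taken from the error message; the upper-case targets are an assumption modelled on the naming that orc-tools prints (Compression: ZSTD).

```scala
object OrcCodecLookupSketch {
  // Short names accepted before the change, in the order the error message lists them.
  // The right-hand values are assumed codec identifiers, used here for illustration only.
  val before: Seq[(String, String)] = Seq(
    "uncompressed" -> "NONE",
    "lzo"          -> "LZO",
    "snappy"       -> "SNAPPY",
    "zlib"         -> "ZLIB",
    "none"         -> "NONE"
  )

  // After the change, "zstd" is one more accepted short name.
  val after: Seq[(String, String)] = before :+ ("zstd" -> "ZSTD")

  // Resolve a user-supplied codec name case-insensitively, or fail with a
  // message that lists every accepted short name.
  def resolve(name: String, table: Seq[(String, String)]): String = {
    val key = name.toLowerCase(java.util.Locale.ROOT)
    table.collectFirst { case (k, v) if k == key => v }.getOrElse {
      val available = table.map(_._1).mkString(", ")
      throw new IllegalArgumentException(
        s"Codec [$key] is not available. Available codecs are $available.")
    }
  }

  def main(args: Array[String]): Unit = {
    // With the extended table the lookup succeeds.
    println(resolve("zstd", after))
    // With the old table the same request is rejected.
    try resolve("zstd", before)
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```

In this sketch, supporting a new codec is a one-entry change to the lookup table; the real change also depends on the bundled ORC version providing ZSTD, as the description notes for Apache ORC 1.6.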