[ https://issues.apache.org/jira/browse/SPARK-33978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun reassigned SPARK-33978:
-------------------------------------

    Assignee: Dongjoon Hyun

> Support ZSTD compression in ORC data source
> -------------------------------------------
>
>                 Key: SPARK-33978
>                 URL: https://issues.apache.org/jira/browse/SPARK-33978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>
> h3. What changes were proposed in this pull request?
> This PR aims to support ZSTD compression in the ORC data source.
>
> h3. Why are the changes needed?
> Apache ORC 1.6 supports ZSTD compression, which generates more compact files
> and reduces storage cost.
>
> *BEFORE*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
> java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
> {code}
>
> *AFTER*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
> {code}
> {code:java}
> $ orc-tools meta /tmp/zstd
> Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
> Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
> File Version: 0.12 with ORC_14
> Rows: 1
> Compression: ZSTD
> Compression size: 262144
> Calendar: Julian/Gregorian
> Type: struct<id:bigint>
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 1 hasNull: false
>     Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
>
> File Statistics:
>   Column 0: count: 1 hasNull: false
>   Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
>
> Stripes:
>   Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
>     Stream: column 0 section ROW_INDEX start: 3 length 11
>     Stream: column 1 section ROW_INDEX start: 14 length 24
>     Stream: column 1 section DATA start: 38 length 6
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>
> File length: 230 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
>
> User Metadata:
>   org.apache.spark.version=3.2.0
> {code}