[ 
https://issues.apache.org/jira/browse/SPARK-33978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33978:
-------------------------------------

    Assignee: Dongjoon Hyun

> Support ZSTD compression in ORC data source
> -------------------------------------------
>
>                 Key: SPARK-33978
>                 URL: https://issues.apache.org/jira/browse/SPARK-33978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>
> h3. What changes were proposed in this pull request?
> This PR aims to support ZSTD compression in the ORC data source.
> h3. Why are the changes needed?
> Apache ORC 1.6 supports ZSTD compression, which produces more compact files and 
> reduces storage cost.
> *BEFORE*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
>  java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
> {code}
> *AFTER*
> {code:java}
> scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") 
> {code}
> {code:java}
>  $ orc-tools meta /tmp/zstd 
>  Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
>  Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
>  File Version: 0.12 with ORC_14
>  Rows: 1
>  Compression: ZSTD
>  Compression size: 262144
>  Calendar: Julian/Gregorian
>  Type: struct<id:bigint>
> Stripe Statistics:
>  Stripe 1:
>  Column 0: count: 1 hasNull: false
>  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
> File Statistics:
>  Column 0: count: 1 hasNull: false
>  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
> Stripes:
>  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
>  Stream: column 0 section ROW_INDEX start: 3 length 11
>  Stream: column 1 section ROW_INDEX start: 14 length 24
>  Stream: column 1 section DATA start: 38 length 6
>  Encoding column 0: DIRECT
>  Encoding column 1: DIRECT_V2
> File length: 230 bytes
>  Padding length: 0 bytes
>  Padding ratio: 0%
> User Metadata:
>  org.apache.spark.version=3.2.0
> {code}
>  
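
The AFTER behavior described above can be checked with a round trip: write with the `compression` option set to `zstd`, then read the files back. The following is only a minimal sketch, assuming a Spark build that includes this change (3.2.0 per the metadata above) on the classpath. The object name `OrcZstdRoundTrip`, the helper `roundTrip`, and the temporary output path are hypothetical and not part of the issue.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object OrcZstdRoundTrip {
  // Writes `rows` ids as ZSTD-compressed ORC under `path`
  // and returns the number of rows read back from it.
  def roundTrip(spark: SparkSession, path: String, rows: Long): Long = {
    spark.range(rows).write.option("compression", "zstd").orc(path)
    spark.read.orc(path).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-zstd-round-trip")
      .master("local[*]")
      .getOrCreate()
    try {
      // Use a fresh, not-yet-existing directory; the default save mode
      // fails if the target path already exists.
      val path = Files.createTempDirectory("orc-zstd").resolve("out").toString
      println(s"rows read back: ${roundTrip(spark, path, 10L)}")
    } finally {
      spark.stop()
    }
  }
}
```

On a build without this change, the write is expected to fail with the `IllegalArgumentException` shown in the BEFORE block. The codec recorded in the written files can still be confirmed with `orc-tools meta`, as in the issue description.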


