[ 
https://issues.apache.org/jira/browse/SPARK-33978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33978:
----------------------------------
    Description: 
h3. What changes were proposed in this pull request?

This PR aims to support ZSTD compression in the ORC data source.
h3. Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression, which produces more compact files and 
reduces storage cost.

*BEFORE*
{code:java}
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
{code}
*AFTER*
{code:java}
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") 
{code}
{code:java}
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
  org.apache.spark.version=3.2.0
{code}
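
The BEFORE error above is what you would expect from a lookup of the short codec name against a fixed list of accepted names. The following is a minimal, self-contained sketch of that kind of validation, not Spark's actual implementation: the object {{OrcCodecSketch}}, the method {{resolve}}, and the maps {{before}} and {{after}} are hypothetical names introduced only for illustration. The accepted names and the message text are taken from the error shown above.

```scala
import java.util.Locale
import scala.collection.immutable.ListMap

// Hypothetical sketch of short-name codec validation; not Spark's actual code.
object OrcCodecSketch {
  // Short names accepted before the change, in the order listed by the error message.
  val before: ListMap[String, String] = ListMap(
    "uncompressed" -> "NONE",
    "lzo" -> "LZO",
    "snappy" -> "SNAPPY",
    "zlib" -> "ZLIB",
    "none" -> "NONE")

  // After the change, "zstd" is accepted as well.
  val after: ListMap[String, String] = before + ("zstd" -> "ZSTD")

  // Resolve a user-supplied name case-insensitively, or fail with the list of valid names.
  def resolve(name: String, codecs: Map[String, String]): String = {
    val key = name.toLowerCase(Locale.ROOT)
    codecs.getOrElse(key, throw new IllegalArgumentException(
      s"Codec [$key] is not available. Available codecs are ${codecs.keys.mkString(", ")}."))
  }
}
```

With this sketch, {{OrcCodecSketch.resolve("zstd", OrcCodecSketch.before)}} throws an {{IllegalArgumentException}}, while the same call against {{OrcCodecSketch.after}} returns the codec name.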
 

> Support ZSTD compression in ORC data source
> -------------------------------------------
>
>                 Key: SPARK-33978
>                 URL: https://issues.apache.org/jira/browse/SPARK-33978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Dongjoon Hyun
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
