[jira] [Updated] (SPARK-48359) Built-in functions for Zstd compression and decompression

Xi Lyu (Jira) Mon, 20 May 2024 13:27:03 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-48359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xi Lyu updated SPARK-48359:
---------------------------
    Description: 
Some users rely on UDFs for Zstd compression and decompression, which performs 
poorly. Native built-in functions would improve performance by compressing and 
decompressing entirely within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * `input`: The binary value to compress or decompress.
 * `level`: Optional integer argument that sets the compression level, which 
controls the trade-off between compression speed and compression ratio. The 
default level is 3. Valid values: 1 to 22 inclusive.
 * `streaming_mode`: Optional boolean argument that specifies whether to 
compress in streaming mode.
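
The argument contract described above can be sketched in plain Python. This is only an illustration of the documented defaults and range, not the Spark implementation: the names `ZstdCompressArgs` and `resolve_zstd_compress_args` are hypothetical, raising an error for an out-of-range level is an assumption, and the `streaming_mode` default of false is inferred from the examples rather than stated in the description.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LEVEL = 3              # stated in the description
MIN_LEVEL, MAX_LEVEL = 1, 22   # stated valid range, inclusive


@dataclass(frozen=True)
class ZstdCompressArgs:
    """Resolved optional arguments of zstd_compress (hypothetical helper type)."""
    level: int = DEFAULT_LEVEL
    streaming_mode: bool = False  # assumption: default inferred from the examples


def resolve_zstd_compress_args(level: Optional[int] = None,
                               streaming_mode: Optional[bool] = None) -> ZstdCompressArgs:
    """Apply the documented default and range check to the optional arguments."""
    resolved_level = DEFAULT_LEVEL if level is None else level
    if not MIN_LEVEL <= resolved_level <= MAX_LEVEL:
        # Assumption: the description does not say how an invalid level is reported.
        raise ValueError(
            f"level must be between {MIN_LEVEL} and {MAX_LEVEL}, got {resolved_level}"
        )
    return ZstdCompressArgs(resolved_level, bool(streaming_mode))
```

With no arguments the sketch resolves to level 3 in non-streaming mode, which corresponds to calling `zstd_compress(input)` with only the required argument.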

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> SELECT string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input");
  NULL
{code}
These three built-in functions are also available in Python and Scala.
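
To see how the two `zstd_compress` outputs in the examples differ, the following sketch decodes them and reads the Zstandard frame header using only the Python standard library. It follows the frame layout in the Zstandard format specification (RFC 8878); the helper name `parse_zstd_frame_header` is hypothetical, and the sketch inspects headers only, it does not decompress anything or reflect how Spark implements the functions.

```python
import base64
import struct

ZSTD_MAGIC = 0xFD2FB528  # Zstandard frame magic number, stored little-endian


def parse_zstd_frame_header(frame: bytes) -> dict:
    """Read the frame header fields that differ between the two example outputs."""
    if len(frame) < 5:
        raise ValueError("too short to be a Zstandard frame")
    (magic,) = struct.unpack_from("<I", frame, 0)
    if magic != ZSTD_MAGIC:
        raise ValueError("missing Zstandard magic number")

    descriptor = frame[4]                       # Frame_Header_Descriptor byte
    fcs_flag = descriptor >> 6                  # Frame_Content_Size_flag (bits 7-6)
    single_segment = bool(descriptor & 0x20)    # Single_Segment_flag (bit 5)
    dict_id_size = (0, 1, 2, 4)[descriptor & 0x03]

    pos = 5
    if not single_segment:
        pos += 1                                # skip Window_Descriptor
    pos += dict_id_size                         # skip Dictionary_ID, if any

    # Size in bytes of the Frame_Content_Size field.
    if fcs_flag == 0:
        fcs_size = 1 if single_segment else 0
    else:
        fcs_size = (2, 4, 8)[fcs_flag - 1]

    content_size = None
    if fcs_size:
        content_size = int.from_bytes(frame[pos:pos + fcs_size], "little")
        if fcs_size == 2:
            content_size += 256                 # 2-byte sizes are stored offset by 256
    return {"single_segment": single_segment, "content_size": content_size}


# Base64 outputs copied from the examples above.
ONE_SHOT = base64.b64decode("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")
STREAMING = base64.b64decode("KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=")

print(parse_zstd_frame_header(ONE_SHOT))   # {'single_segment': True, 'content_size': 130}
print(parse_zstd_frame_header(STREAMING))  # {'single_segment': False, 'content_size': None}
```

In these two samples, the non-streaming frame records the uncompressed size (130 bytes, that is, 10 repetitions of the 13-character string), while the streaming frame omits it. Whether that holds for every input is not stated in the description.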


> Built-in functions for Zstd compression and decompression
> ---------------------------------------------------------
>
>                 Key: SPARK-48359
>                 URL: https://issues.apache.org/jira/browse/SPARK-48359
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Xi Lyu
>            Priority: Major
>              Labels: pull-request-available
>
> Some users rely on UDFs for Zstd compression and decompression, which 
> performs poorly. Native built-in functions would improve performance by 
> compressing and decompressing entirely within the JVM.
>  
> Now, we are introducing three new built-in functions:
> {code:java}
> zstd_compress(input: binary [, level: int [, streaming_mode: bool]])
> zstd_decompress(input: binary)
> try_zstd_decompress(input: binary)
> {code}
> where
>  * `input`: The binary value to compress or decompress.
>  * `level`: Optional integer argument that sets the compression level, which 
> controls the trade-off between compression speed and compression ratio. The 
> default level is 3. Valid values: 1 to 22 inclusive.
>  * `streaming_mode`: Optional boolean argument that specifies whether to 
> compress in streaming mode.
> Examples:
> {code:sql}
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
>   KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
>   KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> > SELECT string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
>   Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark
> > SELECT zstd_decompress(zstd_compress("Apache Spark"));
>   Apache Spark
> > SELECT try_zstd_decompress("invalid input");
>   NULL
> {code}
> These three built-in functions are also available in Python and Scala.



