yaooqinn commented on PR #46672: URL: https://github.com/apache/spark/pull/46672#issuecomment-2122158202
> The compress methods in MySQL and SQL Server only accept one argument and users can't specify the compression algorithm or compression level.

Besides, the compression algorithm used in [MySQL's `compress` is not specified](https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress:~:text=a%20binary%20string.-,This%20function%20requires%20MySQL%20to%20have%20been%20compiled%20with%20a%20compression%20library%20such%20as%20zlib.%20Otherwise%2C%20the%20return%20value%20is%20always%20NULL,-.%20The%20return%20value), and [SQL Server only uses gzip](https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16#:~:text=using%20the%20Gzip%20algorithm), which differs from our case. Reusing the same name may confuse users who are familiar with those databases when they call the `compress` function in Apache Spark. A parameter with a default value would let users specify the algorithm and level (see the sketch at the end of this comment); the default could be either hard-coded or configurable via a session conf.

If `zstd` is replaced or dropped someday, we would have to remove these functions first, which is a breaking change. I understand that this is unlikely to happen for `zstd`. But if we add compression functions with the same naming pattern for the other existing codecs, does that possibility increase? And when we add a new codec, do we need to add similar functions for self-consistency? Will that increase the maintenance cost?

> Looking at our [SQL Function Reference](https://spark.apache.org/docs/latest/api/sql/#built-in-functions), there is no precedent for integrating multiple algorithms into one SQL function, which might make the functions more complicated to use. Following the naming convention like aes_encrypt, url_encode and regexp_replace, this function is named zstd_compress, including the algorithm name.

Most of the existing SQL functions are derived from other systems: Apache Hive, PostgreSQL, MySQL, etc. AFAIK, Spark itself does not have such a naming convention; "supported by many other modern platforms" or "defined in ANSI" are the rules we have mostly used when adding new SQL functions.
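To make the "parameter with a default value" idea concrete, here is a minimal Spark SQL sketch. The `aes_encrypt` calls are real (its optional `mode` argument defaults to `'GCM'` when omitted); the `compress(expr[, codec[, level]])` signature, the `spark.sql.defaultCompressionCodec` conf name, and the `events`/`payload` identifiers are hypothetical, shown only to illustrate the shape such a function could take:

```sql
-- Existing precedent: aes_encrypt already folds an optional algorithm/mode
-- parameter into one function, with a default ('GCM') when it is omitted.
SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));        -- GCM by default
SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop', 'CBC')); -- mode overridden

-- Hypothetical generic signature (NOT implemented; illustration only):
--   compress(expr [, codec [, level]])
-- The default codec could be hard-coded or read from a session conf, e.g.
--   SET spark.sql.defaultCompressionCodec = zstd;   -- hypothetical conf
SELECT compress(payload) FROM events;             -- uses the default codec
SELECT compress(payload, 'zstd', 3) FROM events;  -- explicit codec and level
```

Under this shape, dropping or adding a codec only changes the set of accepted `codec` values, not the function surface, which is the maintenance concern raised above.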