yaooqinn commented on PR #46672:
URL: https://github.com/apache/spark/pull/46672#issuecomment-2122158202

   > The compress methods in MySQL and SQL Server only accept one argument, and users can't specify the compression algorithm or compression level. Besides, the compression algorithm used in [MySQL's compress is not specified](https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress:~:text=a%20binary%20string.-,This%20function%20requires%20MySQL%20to%20have%20been%20compiled%20with%20a%20compression%20library%20such%20as%20zlib.%20Otherwise%2C%20the%20return%20value%20is%20always%20NULL,-.%20The%20return%20value), and [SQL Server only uses gzip](https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16#:~:text=using%20the%20Gzip%20algorithm), which is different from our case. This may cause confusion for users who are familiar with other databases when using a compress function in Apache Spark if we reuse the same name.
   
   A parameter with a default value can achieve this. The default value can be either hard-coded or made configurable via a session conf.
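
   For illustration, a minimal sketch of what that could look like (the function name `compress`, the `codec`/`level` arguments, and the conf key below are hypothetical, not an existing Spark API):

   ```sql
   -- One function; the codec argument is optional and falls back to a default.
   SELECT compress(col);              -- default codec (hard-coded or from conf)
   SELECT compress(col, 'zstd');      -- codec chosen explicitly
   SELECT compress(col, 'zstd', 3);   -- optional compression level

   -- Or resolve the default from a session conf (hypothetical key):
   SET spark.sql.defaultCompressionCodec=zstd;
   SELECT compress(col);              -- now defaults to zstd
   ```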
   
   If `zstd` is replaced/dropped someday, we'd have to remove these functions first, which would be a breaking change. I understand that this is unlikely to happen for `zstd`. But if we add compression functions with the same naming pattern for the other existing compression codecs, does the possibility increase? And when we add a new codec, do we need to add similar functions for self-consistency? Will that increase the maintenance cost?
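
   To make the concern concrete: per-codec naming grows with every codec we support (all names below other than `zstd_compress` are hypothetical), while a single parameterized function absorbs new codecs without new names:

   ```sql
   -- One function pair per codec, growing with each codec added:
   SELECT zstd_compress(col);     -- this PR
   SELECT gzip_compress(col);     -- hypothetical follow-up
   SELECT lz4_compress(col);      -- hypothetical follow-up
   SELECT snappy_compress(col);   -- hypothetical follow-up

   -- Versus one function, where a new codec is just a new argument value:
   SELECT compress(col, 'lz4');
   ```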
     
   
   > Looking at our [SQL Function Reference](https://spark.apache.org/docs/latest/api/sql/#built-in-functions), there is no precedent for integrating multiple algorithms into one SQL function, which might make the functions more complicated to use. Following the naming convention of aes_encrypt, url_encode, and regexp_replace, this function is named zstd_compress, including the algorithm name.
   
   
   Most of the existing SQL functions are derived from other systems, such as Apache Hive, Postgres, and MySQL. AFAIK, Spark currently does not have such a naming convention of its own; 'supported by many other modern platforms' or 'defined in ANSI' are the rules we have mostly used for adding new SQL functions.
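
   For reference on the `aes_encrypt` comparison above: if I recall its signature correctly, Spark's own `aes_encrypt` already folds multiple algorithm modes into one function through an optional argument with a default (GCM), rather than separate per-mode function names:

   ```sql
   -- aes_encrypt(expr, key[, mode[, padding]]): the mode is an optional argument.
   SELECT base64(aes_encrypt('Spark', '0000111122223333'));          -- default mode (GCM)
   SELECT base64(aes_encrypt('Spark', '0000111122223333', 'ECB'));   -- mode chosen explicitly
   ```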
   
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

