yaooqinn commented on PR #46672: URL: https://github.com/apache/spark/pull/46672#issuecomment-2122158202
> The compress methods in MySQL and SQL Server only accept one argument and users can't specify the compression algorithm or compression level.

Besides, the compression algorithm used in [MySQL's `compress` is not specified](https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress:~:text=a%20binary%20string.-,This%20function%20requires%20MySQL%20to%20have%20been%20compiled%20with%20a%20compression%20library%20such%20as%20zlib.%20Otherwise%2C%20the%20return%20value%20is%20always%20NULL,-.%20The%20return%20value), and [SQL Server only uses gzip](https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16#:~:text=using%20the%20Gzip%20algorithm), which differs from our case. Reusing the same name may confuse users who are familiar with those databases when they call the `compress` function in Apache Spark. A parameter with a default value would let users specify the algorithm and level (see the sketch at the end of this comment); the default could be either hard-coded or configurable via a session conf.

If `zstd` is replaced or dropped someday, we would have to remove these functions first, which is a breaking change. I understand that this is unlikely to happen for `zstd`. But if we add compression functions with the same naming pattern for the other existing codecs, does that possibility increase? And when we add a new codec, do we need to add similar functions for self-consistency? Will that increase the maintenance cost?

> Looking at our [SQL Function Reference](https://spark.apache.org/docs/latest/api/sql/#built-in-functions), there is no precedent for integrating multiple algorithms into one SQL function, which might make the functions more complicated to use. Following the naming convention like aes_encrypt, url_encode and regexp_replace, this function is named zstd_compress, including the algorithm name.

Most of the existing SQL functions are derived from other systems: Apache Hive, PostgreSQL, MySQL, etc. AFAIK, Spark itself does not have such a naming convention; "supported by many other modern platforms" or "defined in ANSI" are the rules we have mostly used when adding new SQL functions.
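To make the "parameter with a default value" idea concrete, here is a minimal Spark SQL sketch. The `aes_encrypt` calls are real (its optional `mode` argument defaults to `'GCM'` when omitted); the `compress(expr[, codec[, level]])` signature, the `spark.sql.defaultCompressionCodec` conf name, and the `events`/`payload` identifiers are hypothetical, shown only to illustrate the shape such a function could take:

```sql
-- Existing precedent: aes_encrypt already folds an optional algorithm/mode
-- parameter into one function, with a default ('GCM') when it is omitted.
SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));        -- GCM by default
SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop', 'CBC')); -- mode overridden

-- Hypothetical generic signature (NOT implemented; illustration only):
--   compress(expr [, codec [, level]])
-- The default codec could be hard-coded or read from a session conf, e.g.
--   SET spark.sql.defaultCompressionCodec = zstd;   -- hypothetical conf
SELECT compress(payload) FROM events;             -- uses the default codec
SELECT compress(payload, 'zstd', 3) FROM events;  -- explicit codec and level
```

Under this shape, dropping or adding a codec only changes the set of accepted `codec` values, not the function surface, which is the maintenance concern raised above.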