Yes, right. We need to choose the best approach case by case.
The following refinement of the runtime warning also sounds reasonable to me.

> Maybe we should only log the warning once per Spark application.
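As an illustration only (hypothetical helper names, not Spark's actual logging API), the "once per application" idea could look like a small deduplicating wrapper:

```python
import logging

logger = logging.getLogger("deprecation")

# Keys of deprecation messages already emitted in this application run;
# used to log each distinct warning at most once.
_warned_keys = set()

def warn_once(key, message):
    """Log `message` the first time `key` is seen; stay silent afterwards.

    Returns True if the warning was actually logged this time.
    """
    if key in _warned_keys:
        return False
    _warned_keys.add(key)
    logger.warning(message)
    return True
```

In Spark itself this would naturally live behind its own logging utilities; the point is only the per-process deduplication.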

Bests,
Dongjoon.

On Tue, Feb 18, 2020 at 1:13 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I don't know what's the best way to deprecate an SQL function. Runtime
> warning can be annoying if it keeps coming out. Maybe we should only log
> the warning once per Spark application.
>
> On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Thank you for feedback, Wenchen, Maxim, and Takeshi.
>>
>>     1. "we would also promote the SQL standard TRIM syntax"
>>     2. "we could output a warning with a notice about the order of
>> parameters"
>>     3. "it looks nice to make these (two parameters) trim functions
>> unrecommended in future releases"
>>
>> Yes, in case of reverting SPARK-28093, we had better deprecate these
>> two-parameter SQL function invocations, because this parameter order is
>> really esoteric and even inconsistent within Spark itself (the Scala API
>> also follows the PostgreSQL/Presto style, like the following.)
>>
>>     def trim(e: Column, trimString: String)
>>
>> If we keep this situation in the 3.0.0 release (a major release), it means
>> Apache Spark will carry this behavior forever.
>> And it becomes worse and worse over time because it leads more users to
>> fall into this trap.
>>
>> There are a few ways to deprecate. I believe (3) is the proper next step
>> in case of reverting, because (2) is infeasible and (1) is insufficient
>> since it is silent when we revert SPARK-28093. We need a non-silent
>> (noisy) option in this case. Technically, (3) can be done in
>> `Analyzer.ResolveFunctions`.
>>
>>     1. Documentation-only: remove the examples and add a migration guide.
>>     2. Compile-time warning by annotation: this is not an option for a SQL
>> function invoked in a SQL string.
>>     3. Runtime warning with a directional guide, e.g.,
>>        log.warn("... USE TRIM(trimStr FROM str) INSTEAD")
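To make option (3) concrete, here is a hedged sketch (hypothetical names, not the real `Analyzer.ResolveFunctions` code) of how function resolution could flag the two-argument form with a directional message:

```python
# TRIM-family functions whose two-argument form is being deprecated.
DEPRECATED_TWO_ARG = {"TRIM", "LTRIM", "RTRIM"}

def check_deprecated_trim(func_name, num_args, warn):
    """Call `warn` with a directional message when a two-argument
    TRIM-family function is resolved. Returns True if a warning fired."""
    name = func_name.upper()
    if name in DEPRECATED_TWO_ARG and num_args == 2:
        warn(f"{name} with two parameters is deprecated; "
             "use TRIM(trimStr FROM str) instead")
        return True
    return False
```

Combined with once-per-application deduplication, this would keep the warning noticeable without being noisy.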
>>
>> What do you think about (3)?
>>
>> Bests,
>> Dongjoon.
>>
>> On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <linguin....@gmail.com>
>> wrote:
>>
>>> The revert looks fine to me for keeping the compatibility.
>>> Also, I think the different orders between the systems easily lead to
>>> mistakes, so, as Wenchen suggested, it would be nice to make these
>>> two-parameter trim functions unrecommended in future releases:
>>> https://github.com/apache/spark/pull/27540#discussion_r377682518
>>> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <maxim.g...@databricks.com>
>>> wrote:
>>>
>>>> Also if we look at possible combination of trim parameters:
>>>> 1. foldable srcStr + foldable trimStr
>>>> 2. foldable srcStr + non-foldable trimStr
>>>> 3. non-foldable srcStr + foldable trimStr
>>>> 4. non-foldable srcStr + non-foldable trimStr
>>>>
>>>> Case #2 seems rare, and case #3 is probably the most common. Whenever we
>>>> see the second case, we could output a warning with a notice about the
>>>> order of parameters.
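A minimal sketch of that heuristic (hypothetical predicate names; in Catalyst this would inspect `Expression.foldable` on each argument):

```python
def swapped_trim_args_suspected(src_foldable, trim_foldable):
    """Return True for case #2 above: a foldable (literal) srcStr paired
    with a non-foldable trimStr, which is rare in practice and usually
    means the caller swapped the two arguments."""
    return src_foldable and not trim_foldable
```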
>>>>
>>>> Maxim Gekk
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks, Inc.
>>>>
>>>>
>>>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> It's unfortunate that we don't have a clear document about
>>>>> breaking changes (I'm working on it, BTW). I believe the general guidance
>>>>> is: *avoid breaking changes unless we have to*. For example, a previous
>>>>> result was so broken that we had to fix it, moving to Scala 2.12 forced
>>>>> us to break some APIs, etc.
>>>>>
>>>>> For this particular case, do we have to switch the parameter order?
>>>>> It's different from some systems, and the order was not decided
>>>>> explicitly, but I don't think these are strong reasons. People coming
>>>>> from an RDBMS should use the SQL standard TRIM syntax more often. People
>>>>> using prior Spark versions should have figured out the parameter order
>>>>> of Spark TRIM (there was no document) and adjusted their queries. There
>>>>> is no standard that defines the parameter order of the TRIM function.
>>>>>
>>>>> In the long term, we would also promote the SQL standard TRIM syntax.
>>>>> I don't see much benefit in "fixing" the parameter order that is worth
>>>>> a breaking change.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>>>> and the TRIM(trimStr FROM str) syntax.
>>>>>>
>>>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>>>
>>>>>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> I'm sending this email because the Apache Spark committers had
>>>>>>> better share a consistent point of view for the upcoming PRs. And a
>>>>>>> community policy is the way to lead the community members transparently
>>>>>>> and clearly for the long-term good.
>>>>>>>
>>>>>>> First of all, I want to emphasize that, like Apache Spark 2.0.0,
>>>>>>> Apache Spark 3.0.0 is going to achieve many improvements in SQL areas.
>>>>>>>
>>>>>>> Especially, we invested a lot of effort to improve SQL ANSI
>>>>>>> compliance and the related test coverage for the future. One simple
>>>>>>> example is the following.
>>>>>>>
>>>>>>>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>>>>>>>
>>>>>>> With this support, we are able to move away from the esoteric
>>>>>>> Spark functions `TRIM/LTRIM/RTRIM`.
>>>>>>> (Note that the above syntax is the ANSI standard, but we also have
>>>>>>> these additional non-standard functions.)
>>>>>>>
>>>>>>> Here is the summary of the current status.
>>>>>>>
>>>>>>>     +------------------------+-------------+---------------+
>>>>>>>     | SQL Processing Engine  | TRIM Syntax | TRIM Function |
>>>>>>>     +------------------------+-------------+---------------+
>>>>>>>     | Spark 3.0.0-preview/2  |      O      |       O       |
>>>>>>>     | PostgreSQL             |      O      |       O       |
>>>>>>>     | Presto                 |      O      |       O       |
>>>>>>>     | MySQL                  |      O(*)   |       X       |
>>>>>>>     | Oracle                 |      O      |       X       |
>>>>>>>     | MsSQL                  |      O      |       X       |
>>>>>>>     | Teradata               |      O      |       X       |
>>>>>>>     | Hive                   |      X      |       X       |
>>>>>>>     | Spark 1.6 ~ 2.2        |      X      |       X       |
>>>>>>>     | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
>>>>>>>     +------------------------+-------------+---------------+
>>>>>>>     (*) MySQL doesn't allow multiple trim characters.
>>>>>>>     (**) Spark 2.3 ~ 2.4 provide the function with a different
>>>>>>> parameter order.
>>>>>>>
>>>>>>> Here is the illustrative example of the problem.
>>>>>>>
>>>>>>>     postgres=# SELECT trim('yxTomxx', 'xyz');
>>>>>>>     btrim
>>>>>>>     -------
>>>>>>>     Tom
>>>>>>>
>>>>>>>     presto:default> SELECT trim('yxTomxx', 'xyz');
>>>>>>>     _col0
>>>>>>>     -------
>>>>>>>     Tom
>>>>>>>
>>>>>>>     spark-sql> SELECT trim('yxTomxx', 'xyz');
>>>>>>>     z
>>>>>>>
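The surprising `z` comes from the swapped argument order: Spark 2.3 ~ 2.4 effectively treat the first argument as the set of characters to strip. A plain Python sketch of character-set trimming makes this concrete:

```python
def trim_chars(src, trim_str):
    """Strip any characters in trim_str from both ends of src
    (PostgreSQL/Presto-style TRIM(srcStr, trimStr) semantics)."""
    chars = set(trim_str)
    start, end = 0, len(src)
    while start < end and src[start] in chars:
        start += 1
    while end > start and src[end - 1] in chars:
        end -= 1
    return src[start:end]

# PostgreSQL/Presto order: strip 'x', 'y', 'z' from 'yxTomxx'
print(trim_chars('yxTomxx', 'xyz'))   # Tom
# Spark 2.3 ~ 2.4 swapped the roles, effectively computing:
print(trim_chars('xyz', 'yxTomxx'))   # z
```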
>>>>>>> Here is our history to fix the above issue.
>>>>>>>
>>>>>>>     [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order
>>>>>>> issue
>>>>>>>     1. https://github.com/apache/spark/pull/24902
>>>>>>>        (Merged 2019-06-18 for v3.0.0, 3.0.0-preview and
>>>>>>> 3.0.0-preview2 released.)
>>>>>>>     2. https://github.com/apache/spark/pull/24907
>>>>>>>        (Merged 2019-06-20 for v2.3.4, but reverted)
>>>>>>>     3. https://github.com/apache/spark/pull/24908
>>>>>>>        (Merged 2019-06-21 for v2.4.4, but reverted)
>>>>>>>
>>>>>>> (2) and (3) were reverted before releases because we didn't want to
>>>>>>> fix that in the maintenance releases. Please see the following 
>>>>>>> references
>>>>>>> of the decision.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/24908#issuecomment-504799028
>>>>>>> (2.3)
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/24907#issuecomment-504799021
>>>>>>> (2.4)
>>>>>>>
>>>>>>> Now, there are some requests to revert SPARK-28093 and to keep these
>>>>>>> esoteric functions for backward compatibility, for the following reasons.
>>>>>>>
>>>>>>>     > Reordering function parameters to match another system,
>>>>>>>     > for a method that is otherwise working correctly,
>>>>>>>     > sounds exactly like a cosmetic change to me.
>>>>>>>
>>>>>>>     > How can we silently change the parameter of an existing SQL
>>>>>>> function?
>>>>>>>     > I don't think this is a correctness issue as the SQL standard
>>>>>>>     > doesn't say that the function signature has to be
>>>>>>> trim(srcStr, trimStr).
>>>>>>>
>>>>>>> These concerns and points of view make sense.
>>>>>>>
>>>>>>> My concerns are the following.
>>>>>>>
>>>>>>>     1. These kinds of esoteric differences are what people negatively
>>>>>>> call `vendor lock-in`.
>>>>>>>        - It's difficult for new users to understand.
>>>>>>>        - It's hard to migrate between Apache Spark and the others.
>>>>>>>     2. Although we did our best, Apache Spark SQL has not always been
>>>>>>> good enough.
>>>>>>>     3. We need to make improvements in the future releases.
>>>>>>>
>>>>>>> In short, we can either keep the 3.0.0-preview behavior here or revert
>>>>>>> SPARK-28093 in order to keep this vendor lock-in behavior for
>>>>>>> backward compatibility.
>>>>>>>
>>>>>>> What I want is to build a consistent point of view in this category
>>>>>>> for the upcoming PR reviews.
>>>>>>>
>>>>>>> If we have a clear policy, we can save future community efforts in
>>>>>>> many ways.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
