I think the general philosophy here should be that Python is the most liberal,
supporting either a Column object or a literal value. It's also super useful to
support column names, but then we need to decide what happens for a string
argument: is a string a literal string value, or a column name?
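
For example (a hedged sketch; the DataFrame and its contents are hypothetical),
the same kind of string means different things in different positions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Hello",)], ["name"])

    # Here the string is a column name: shorthand for F.max(F.col("name")).
    df.select(F.max("name"))

    # Here the second argument is a literal value, not a column name:
    # instr() searches for the substring "ell" inside the column.
    df.select(F.instr(df.name, "ell"))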

On Mon, Mar 04, 2019 at 6:00 AM, André Mello < ame...@palantir.com > wrote:

> Hey everyone,
>
> Progress has been made with PR #23882
> (https://github.com/apache/spark/pull/23882), and it is now in a state
> where it could be merged into master.
>
> This is what we’re doing for now:
> 
> * PySpark *will* support strings consistently throughout its API (see the
>   sketch below);
>   * Arguably string support makes syntax closer to SQL and Scala, where you
>     can use similar shorthands to specify columns, and the general direction
>     of the PySpark API has been to be consistent with those other two;
>   * This is a small, additive change that will not break anything;
>   * The reason support was not there in the first place is that the code
>     generating these functions was originally designed for aggregators, which
>     all support column names, but it was also used for other functions (e.g.
>     lower, abs) that do not, so the omission appears unintentional.
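>
> As a quick illustration of the first point (a hypothetical session; the
> column name is made up), the change makes the two call styles below
> equivalent, where previously the string form failed for functions like
> lower():
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql import functions as F
>
>     spark = SparkSession.builder.getOrCreate()
>     df = spark.createDataFrame([("MiXeD",)], ["s"])
>
>     df.select(F.lower(F.col("s")))  # always worked
>     df.select(F.lower("s"))         # works after this change; previously the
>                                     # raw string reached the JVM and errored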
>
> We are NOT going to:
> 
> * Make any code changes in Scala;
>   * This requires first deciding whether string support is desirable at all;
> * Decide whether or not strings should be supported in the Scala API;
>   * This requires a larger discussion, and the above changes are independent
>     of it;
> * Make PySpark support Column objects where it currently only supports
>   strings (e.g. the multi-argument version of drop(); see the sketch below);
>   * Converting from Column to column name is not something the API does
>     right now, so this is a stronger change;
>   * This can be considered separately.
> * Do anything with R for now.
>   * Anyone is free to take this on, but I have no experience with R.
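>
> To make the drop() point concrete (a minimal sketch against a throwaway
> DataFrame):
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql import functions as F
>
>     spark = SparkSession.builder.getOrCreate()
>     df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])
>
>     df.drop("a", "b")                # fine: multiple column names as strings
>     df.drop(F.col("a"))              # fine: a single Column object
>     df.drop(F.col("a"), F.col("b"))  # TypeError: the multi-argument form
>                                      # accepts only strings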
>
> If you folks agree with this, let us know, so we can move forward with the
> merge.
>
> Best.
>
> -- André.
>
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Monday, 25 February 2019 at 00:49
> *To:* Felix Cheung < felixcheun...@hotmail.com >
> *Cc:* dev < dev@spark.apache.org >, Sean Owen < sro...@gmail.com >, André
> Mello < asmello...@gmail.com >
> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
> functions
>
> The challenge with the Scala/Java API in the past is that when there are
> multiple parameters, it'd lead to an explosion of function overloads: if
> every parameter accepts either a Column or a String, a two-parameter
> function already needs four overloads.
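>
> In Python, by contrast, a single function can normalize its arguments (a
> hedged sketch; _to_col and concat2 are made-up names, not Spark API):
>
>     from pyspark.sql import Column
>     from pyspark.sql import functions as F
>
>     def _to_col(c):
>         # Accept either a Column or a column name string.
>         return c if isinstance(c, Column) else F.col(c)
>
>     def concat2(a, b):
>         # One dynamically-typed function covers the 2^2 = 4 overloads
>         # (Column/Column, Column/String, String/Column, String/String)
>         # that a statically-typed API would need.
>         return F.concat(_to_col(a), _to_col(b))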
>
> On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com >
> wrote:
>>
>> I hear three topics in this thread
>>
>> 1. I don’t think we should remove string. Column and string can both be
>> “type safe”. And I would agree we don’t *need* to break API compatibility
>> here.
>>
>> 2. Gaps in the Python API. Extending on #1, we should definitely be
>> consistent and add string as a parameter type wherever it is missing.
>>
>> 3. Scala API for strings - hard to say, but it makes sense if only for
>> consistency. Though I can also see the argument for Column-only in Scala.
>> Strings might be more natural in Python and much less significant in Scala
>> because of the $"foo" notation.
>>
>> (My 2 c)
>>
>> *From:* Sean Owen < sro...@gmail.com >
>> *Sent:* Sunday, February 24, 2019 6:59 AM
>> *To:* André Mello
>> *Cc:* dev
>> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
>> functions
>>
>> I just commented on the PR -- I personally don't think it's worth
>> removing support for, say, max("foo") over max(col("foo")) or
>> max($"foo") in Scala. We can make breaking changes in Spark 3 but this
>> seems like it would unnecessarily break a lot of code. The string arg
>> is more concise in Python and I can't think of cases where it's
>> particularly ambiguous or confusing; on the contrary it's more natural
>> coming from SQL.
>> 
>> What we do have are inconsistencies and errors in support of string vs
>> Column as fixed in the PR. I was surprised to see that
>> df.select(abs("col")) throws an error while df.select(sqrt("col"))
>> doesn't. I think that's easy to fix on the Python side. Really I think
>> the question is: do we need to add methods like "def abs(String)" and
>> more in Scala? That would remain inconsistent even if the PySpark side
>> is fixed.
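>>
>> Concretely (a minimal sketch of the pre-fix behavior; the DataFrame is
>> made up):
>>
>>     from pyspark.sql import SparkSession
>>     from pyspark.sql import functions as F
>>
>>     spark = SparkSession.builder.getOrCreate()
>>     df = spark.createDataFrame([(4.0,)], ["col"])
>>
>>     df.select(F.sqrt("col"))  # works: the name is resolved to a Column
>>     df.select(F.abs("col"))   # failed before the fix: the raw string went
>>                               # to the JVM, which has no abs(String) overload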
>> 
>> On Sun, Feb 24, 2019 at 8:54 AM André Mello < asmello...@gmail.com > wrote:
>> 
>> >
>> > # Context
>> >
>> > This comes from [SPARK-26979], which became PR #23879 and then PR
>> > #23882. The following reflects all the findings made so far.
>> >
>> > # Description
>> >
>> > Currently, in the Scala API, some SQL functions have two overloads, one
>> > taking a string that names the column to be operated on, the other
>> > taking a proper Column object. This allows for two patterns of calling
>> > these functions, which is a source of inconsistency and confusion for
>> > new users, since it is hard to predict which functions will accept a
>> > column name and which will not.
>> >
>> > The PySpark API partially solves this problem by internally converting
>> > the argument to a Column object prior to passing it through to the
>> > underlying JVM implementation. This allows for a consistent use of
>> > name literals across the API, except for a few violations:
>> >
>> > - lower()
>> > - upper()
>> > - abs()
>> > - bitwiseNOT()
>> > - ltrim()
>> > - rtrim()
>> > - trim()
>> > - ascii()
>> > - base64()
>> > - unbase64()
>> >
>> > These violations happen because for a subset of the SQL functions,
>> > PySpark uses a functional mechanism (`_create_function`) to directly
>> > call the underlying JVM equivalent by name, thus skipping the
>> > conversion step. In most cases the column name pattern still works
>> > because the Scala API has its own support for string arguments, but
>> > the aforementioned functions are also exceptions there.
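>> >
>> > For reference, here is a simplified sketch of that mechanism (close to,
>> > but not exactly, the Spark source at the time):
>> >
>> >     from pyspark import SparkContext
>> >     from pyspark.sql.column import Column
>> >
>> >     def _create_function(name, doc=""):
>> >         """Create a wrapper that forwards to the JVM function `name`."""
>> >         def _(col):
>> >             sc = SparkContext._active_spark_context
>> >             # A Column is unwrapped, but a plain string passes through
>> >             # untouched -- so it only works when the Scala side happens
>> >             # to have a String overload.
>> >             jc = getattr(sc._jvm.functions, name)(
>> >                 col._jc if isinstance(col, Column) else col)
>> >             return Column(jc)
>> >         _.__name__ = name
>> >         _.__doc__ = doc
>> >         return _
>> >
>> > The fix routes the argument through the same name-to-Column conversion
>> > the rest of the API already uses before crossing into the JVM.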
>> >
>> > My proposal was to solve this problem by adding the string support
>> > where it was missing in the PySpark API. Since this is a purely
>> > additive change, it doesn't break past code. Additionally, I find the
>> > API sugar to be a positive feature, since code like `max("foo")` is
>> > more concise and readable than `max(col("foo"))`. It adheres to the
>> > DRY philosophy and is consistent with Python's preference for
>> > readability over type protection.
>> >
>> > However, upon submission of the PR, a discussion was started about
>> > whether it wouldn't be better to entirely deprecate string support
>> > instead - in particular with major release 3.0 in mind. The reasoning,
>> > as I understood it, was that this approach is more explicit and type
>> > safe, which is preferred in Java/Scala, plus it reduces the API
>> > surface area - and the Python API should be consistent with the others
>> > as well.
>> >
>> > Upon request by @HyukjinKwon I'm submitting this matter for discussion
>> > by this mailing list.
>> >
>> > # Summary
>> >
>> > There is a problem with inconsistency in the Scala/Python SQL API,
>> > where sometimes you can use a column name string as a proxy, and
>> > sometimes you have to use a proper Column object. To solve it there
>> > are two approaches - to remove the string support entirely, or to add
>> > it where it is missing. Which approach is best?
>> >
>> > Hope this is clear.
>> >
>> > -- André.
>> >