The challenge with the Scala/Java API in the past is that when there are multiple parameters, it would lead to an explosion of function overloads.
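For illustration, a single dynamically-typed Python function can cover every Column/string combination, while a statically-typed Scala API offering the same flexibility needs one overload per combination - 2^n overloads for n such parameters. A minimal sketch (the helpers `_to_col` and `my_concat` are hypothetical, not part of the PySpark API):

```python
from pyspark.sql.column import Column
from pyspark.sql.functions import col, concat

def _to_col(c):
    # Accept either a Column or a column-name string (hypothetical helper).
    return c if isinstance(c, Column) else col(c)

def my_concat(a, b):
    # One Python function handles ("a", "b"), (col("a"), "b"),
    # ("a", col("b")) and (col("a"), col("b")); a statically-typed
    # Scala equivalent would need four String/Column overloads.
    return concat(_to_col(a), _to_col(b))
```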
On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

I hear three topics in this thread:

1. I don't think we should remove string. Column and string can both be "type safe", and I would agree we don't *need* to break API compatibility here.

2. Gaps in the Python API. Extending on #1, we should definitely be consistent and add string as a param where it is missing.

3. Scala API for string - hard to say, but it makes sense if for nothing but consistency. Though I can also see the argument for Column-only in Scala. String might be more natural in Python and much less significant in Scala because of the $"foo" notation.

(My 2c)

From: Sean Owen <sro...@gmail.com>
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python, and I can't think of cases where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL.

What we do have are inconsistencies and errors in support of string vs. Column, as fixed in the PR. I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side. Really I think the question is: do we need to add methods like "def abs(String)" and more in Scala? That would remain inconsistent even if the PySpark side is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:

# Context

This comes from [SPARK-26979], which became PR #23879 and then PR #23882. The following reflects all the findings made so far.

# Description

Currently, in the Scala API, some SQL functions have two overloads, one taking a string that names the column to be operated on, the other taking a proper Column object. This allows for two patterns of calling these functions, which is a source of inconsistency and generates confusion for new users, since it is hard to predict which functions will accept a column name and which will not.

The PySpark API partially solves this problem by internally converting the argument to a Column object before passing it through to the underlying JVM implementation. This allows name literals to be used consistently across the API, except for a few violations:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

These violations happen because for a subset of the SQL functions, PySpark uses a functional mechanism (`_create_function`) to directly call the underlying JVM equivalent by name, thus skipping the conversion step. In most cases the column-name pattern still works because the Scala API has its own support for string arguments, but the functions listed above are exceptions there as well.

My proposal was to solve this problem by adding the string support where it was missing in the PySpark API, as sketched below.
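For concreteness, here is a simplified sketch of the mechanism in question. It approximates the `_create_function` helper in pyspark/sql/functions.py; the name `_create_function_over_column` is illustrative of the proposed fix rather than the exact code in the patch:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_function(name, doc=""):
    """Approximation of the current behaviour: a string argument is
    passed through to the JVM unchanged, so the call only works when
    the Scala side happens to provide a matching String overload."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

def _create_function_over_column(name, doc=""):
    """Proposed behaviour (name illustrative): always convert the
    argument to a Column first, so both Column objects and column-name
    strings work for every function."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

lower = _create_function_over_column("lower")  # lower("foo") now works
```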
Since this is a purely additive change, it doesn't break existing code. Additionally, I find the API sugar to be a positive feature, since code like `max("foo")` is more concise and readable than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent with Python's preference for readability over type protection.

However, upon submission of the PR, a discussion was started about whether it wouldn't be better to deprecate string support entirely instead - in particular with the major 3.0 release in mind. The reasoning, as I understood it, was that this approach is more explicit and type safe, which is preferred in Java/Scala, and that it reduces the API surface area - and the Python API should be consistent with the other languages as well.

Upon request by @HyukjinKwon I'm submitting this matter for discussion by this mailing list.

# Summary

There is an inconsistency problem in the Scala/Python SQL API, where sometimes you can use a column-name string as a proxy and sometimes you have to use a proper Column object. There are two approaches to solving it - remove the string support entirely, or add it where it is missing. Which approach is best?

Hope this is clear.

-- André.