The challenge with the Scala/Java API in the past is that when there are multiple parameters, it would lead to an explosion of function overloads.
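For illustration, a single dynamically-typed Python function can cover every Column/string combination, while a statically-typed Scala API offering the same flexibility needs one overload per combination - 2^n overloads for n such parameters. A minimal sketch (the helpers `_to_col` and `my_concat` are hypothetical, not part of the PySpark API):

```python
from pyspark.sql.column import Column
from pyspark.sql.functions import col, concat

def _to_col(c):
    # Accept either a Column or a column-name string (hypothetical helper).
    return c if isinstance(c, Column) else col(c)

def my_concat(a, b):
    # One Python function handles ("a", "b"), (col("a"), "b"),
    # ("a", col("b")) and (col("a"), col("b")); a statically-typed
    # Scala equivalent would need four String/Column overloads.
    return concat(_to_col(a), _to_col(b))
```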
On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

I hear three topics in this thread:

1. I don't think we should remove string. Column and string can both be "type safe", and I would agree we don't *need* to break API compatibility here.

2. Gaps in the Python API. Extending on #1, we should definitely be consistent and add string as a param where it is missing.

3. Scala API for string - hard to say, but it makes sense if for nothing but consistency. Though I can also see the argument for Column-only in Scala. String might be more natural in Python and much less significant in Scala because of the $"foo" notation.

(My 2c)

From: Sean Owen <sro...@gmail.com>
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python, and I can't think of cases where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL.

What we do have are inconsistencies and errors in support of string vs. Column, as fixed in the PR. I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side. Really I think the question is: do we need to add methods like "def abs(String)" and more in Scala? That would remain inconsistent even if the PySpark side is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:

# Context

This comes from [SPARK-26979], which became PR #23879 and then PR #23882. The following reflects all the findings made so far.

# Description

Currently, in the Scala API, some SQL functions have two overloads, one taking a string that names the column to be operated on, the other taking a proper Column object. This allows for two patterns of calling these functions, which is a source of inconsistency and generates confusion for new users, since it is hard to predict which functions will accept a column name and which will not.

The PySpark API partially solves this problem by internally converting the argument to a Column object before passing it through to the underlying JVM implementation. This allows name literals to be used consistently across the API, except for a few violations:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

These violations happen because for a subset of the SQL functions, PySpark uses a functional mechanism (`_create_function`) to directly call the underlying JVM equivalent by name, thus skipping the conversion step. In most cases the column-name pattern still works because the Scala API has its own support for string arguments, but the functions listed above are exceptions there as well.

My proposal was to solve this problem by adding the string support where it was missing in the PySpark API, as sketched below.
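For concreteness, here is a simplified sketch of the mechanism in question. It approximates the `_create_function` helper in pyspark/sql/functions.py; the name `_create_function_over_column` is illustrative of the proposed fix rather than the exact code in the patch:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_function(name, doc=""):
    """Approximation of the current behaviour: a string argument is
    passed through to the JVM unchanged, so the call only works when
    the Scala side happens to provide a matching String overload."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

def _create_function_over_column(name, doc=""):
    """Proposed behaviour (name illustrative): always convert the
    argument to a Column first, so both Column objects and column-name
    strings work for every function."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

lower = _create_function_over_column("lower")  # lower("foo") now works
```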
Since this is a purely additive change, it doesn't break existing code. Additionally, I find the API sugar to be a positive feature, since code like `max("foo")` is more concise and readable than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent with Python's preference for readability over type protection.

However, upon submission of the PR, a discussion was started about whether it wouldn't be better to deprecate string support entirely instead - in particular with the major 3.0 release in mind. The reasoning, as I understood it, was that this approach is more explicit and type safe, which is preferred in Java/Scala, and that it reduces the API surface area - and the Python API should be consistent with the other languages as well.

Upon request by @HyukjinKwon I'm submitting this matter for discussion by this mailing list.

# Summary

There is an inconsistency problem in the Scala/Python SQL API, where sometimes you can use a column-name string as a proxy and sometimes you have to use a proper Column object. There are two approaches to solving it - remove the string support entirely, or add it where it is missing. Which approach is best?

Hope this is clear.

-- André.