I think the general philosophy here should be that Python is the most liberal, supporting a Column object or a literal value. It's also super useful to support column names, but we need to decide what happens for a string argument: is a string passed in a literal string value, or a column name?
On Mon, Mar 04, 2019 at 6:00 AM, André Mello <ame...@palantir.com> wrote:

> Hey everyone,
>
> Progress has been made with PR #23882 (https://github.com/apache/spark/pull/23882), and it is now in a state where it could be merged with master.
>
> This is what we're doing for now:
>
> * PySpark *will* support strings consistently throughout its API.
>   * Arguably string support makes syntax closer to SQL and Scala, where you can use similar shorthands to specify columns, and the general direction of the PySpark API has been to be consistent with those other two;
>   * This is a small, additive change that will not break anything;
>   * The reason support was not there in the first place was because the code that generated functions was originally designed for aggregators, which all support column names, but it was being used for other functions (e.g. lower, abs) that did not, so it seems like it was not intentional.
>
> We are NOT going to:
>
> * Make any code changes in Scala;
>   * This requires first deciding if string support is desirable or not;
> * Decide whether or not strings should be supported in the Scala API;
>   * This requires a larger discussion and the above changes are independent of this;
> * Make PySpark support Column objects where it currently only supports strings (e.g. the multi-argument version of drop());
>   * Converting from Column to column name is not something the API does right now, so this is a stronger change;
>   * This can be considered separately.
> * Do anything with R for now.
>   * Anyone is free to take this on, but I have no experience with R.
>
> If you folks agree with this, let us know, so we can move forward with the merge.
>
> Best.
>
> -- André.
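The drop() point above runs in the reverse direction: supporting Column there means recovering a column name from a Column object, which the API does not do today. A hypothetical sketch of what that would involve — the toy `Column` below carries a plain name, which is exactly the assumption that makes this easy here and a stronger change in the real API, where a Column can wrap an arbitrary expression:

```python
class Column:
    """Toy Column that is guaranteed to carry a plain column name."""
    def __init__(self, name):
        self.name = name

def drop(columns, *to_drop):
    """Toy multi-argument drop() accepting names or Column objects."""
    names = set()
    for c in to_drop:
        # Column -> name is the conversion today's API lacks; it only
        # works here because the toy Column always holds a bare name.
        names.add(c.name if isinstance(c, Column) else c)
    return [c for c in columns if c not in names]

print(drop(["a", "b", "c"], "b", Column("c")))  # ['a']
```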
> From: Reynold Xin <r...@databricks.com>
> Date: Monday, 25 February 2019 at 00:49
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: dev <dev@spark.apache.org>, Sean Owen <sro...@gmail.com>, André Mello <asmello...@gmail.com>
> Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions
>
> The challenge with the Scala/Java API in the past is that when there are multiple parameters, it'd lead to an explosion of function overloads.
>
> On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
>> I hear three topics in this thread:
>>
>> 1. I don't think we should remove string. Column and string can both be "type safe", and I would agree we don't *need* to break API compatibility here.
>>
>> 2. Gaps in the Python API. Extending on #1, we should definitely be consistent and add string as a param where it is missing.
>>
>> 3. Scala API for string - hard to say, but it makes sense if for nothing but consistency. Though I can also see the argument for Column only in Scala. String might be more natural in Python and much less significant in Scala because of the $"foo" notation.
>>
>> (My 2c)
>>
>> From: Sean Owen <sro...@gmail.com>
>> Sent: Sunday, February 24, 2019 6:59 AM
>> To: André Mello
>> Cc: dev
>> Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions
>>
>> I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code.
>> The string arg is more concise in Python and I can't think of cases where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL.
>>
>> What we do have are inconsistencies and errors in support of string vs. Column, as fixed in the PR. I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side. Really, I think the question is: do we need to add methods like "def abs(String)" and more in Scala? That would remain inconsistent even if the PySpark side is fixed.
>>
>> On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:
>>
>>> # Context
>>>
>>> This comes from [SPARK-26979], which became PR #23879 and then PR #23882. The following reflects all the findings made so far.
>>>
>>> # Description
>>>
>>> Currently, in the Scala API, some SQL functions have two overloads: one taking a string that names the column to be operated on, the other taking a proper Column object. This allows for two patterns of calling these functions, which is a source of inconsistency and generates confusion for new users, since it is hard to predict which functions will take a column name and which will not.
>>> The PySpark API partially solves this problem by internally converting the argument to a Column object prior to passing it through to the underlying JVM implementation. This allows for a consistent use of name literals across the API, except for a few violations:
>>>
>>> - lower()
>>> - upper()
>>> - abs()
>>> - bitwiseNOT()
>>> - ltrim()
>>> - rtrim()
>>> - trim()
>>> - ascii()
>>> - base64()
>>> - unbase64()
>>>
>>> These violations happen because for a subset of the SQL functions, PySpark uses a functional mechanism (`_create_function`) to directly call the underlying JVM equivalent by name, thus skipping the conversion step. In most cases the column name pattern still works because the Scala API has its own support for string arguments, but the aforementioned functions are also exceptions there.
>>>
>>> My proposal was to solve this problem by adding the string support where it was missing in the PySpark API. Since this is a purely additive change, it doesn't break past code. Additionally, I find the API sugar to be a positive feature, since code like `max("foo")` is more concise and readable than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent with Python's preference for readability over type protection.
>>>
>>> However, upon submission of the PR, a discussion was started about whether it wouldn't be better to entirely deprecate string support instead - in particular with major release 3.0 in mind. The reasoning, as I understood it, was that this approach is more explicit and type safe, which is preferred in Java/Scala, plus it reduces the API surface area - and the Python API should be consistent with the others as well.
>>>
>>> Upon request by @HyukjinKwon, I'm submitting this matter for discussion by this mailing list.
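The mechanism described above can be illustrated with a toy model of the function generator — the JVM call and all names here are stand-ins, not Spark's actual code. The original generator forwarded its argument untouched, so a string only worked when the Scala side happened to accept strings; inserting the string-to-Column conversion into the generator is the additive fix:

```python
class Column:
    """Toy stand-in for pyspark.sql.Column."""
    def __init__(self, expr):
        self.expr = expr

def _jvm_abs(c):
    # Stand-in for the JVM-side abs(), which accepts only a Column.
    if not isinstance(c, Column):
        raise TypeError("Column expected")
    return Column("abs(%s)" % c.expr)

def _create_function_old(jvm_fn):
    # Original generator: passes the argument through unconverted,
    # so a name string fails unless the Scala side also takes strings.
    return lambda c: jvm_fn(c)

def _create_function_fixed(jvm_fn):
    # Additive fix: convert a name string to a Column first.
    return lambda c: jvm_fn(c if isinstance(c, Column) else Column(c))

abs_old = _create_function_old(_jvm_abs)
abs_fixed = _create_function_fixed(_jvm_abs)

print(abs_fixed("foo").expr)        # abs(foo)
print(abs_old(Column("foo")).expr)  # abs(foo)
# abs_old("foo") raises TypeError - the inconsistency users hit
```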
>>> # Summary
>>>
>>> There is a problem with inconsistency in the Scala/Python SQL API, where sometimes you can use a column name string as a proxy, and sometimes you have to use a proper Column object. To solve it there are two approaches: remove the string support entirely, or add it where it is missing. Which approach is best?
>>>
>>> Hope this is clear.
>>>
>>> -- André.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org