Fwd: CRAN submission SparkR 2.3.3

2019-02-24 Thread Shivaram Venkataraman
FYI, here is the note from CRAN from submitting 2.3.3. There were some
minor issues with the package description file in our CRAN submission.
We are discussing this with the CRAN team, and Felix also has a patch
to address it for upcoming releases.

One thing I was wondering: if there have not been too many changes
since 2.3.3, how much effort would it be to cut a 2.3.4 with just this
change?

Thanks
Shivaram

-- Forwarded message -
From: Uwe Ligges 
Date: Sun, Feb 17, 2019 at 12:28 PM
Subject: Re: CRAN submission SparkR 2.3.3
To: Shivaram Venkataraman , CRAN



Thanks, but see below.


On 17.02.2019 18:46, CRAN submission wrote:
> [This was generated from CRAN.R-project.org/submit.html]
>
> The following package was uploaded to CRAN:
> ===
>
> Package Information:
> Package: SparkR
> Version: 2.3.3
> Title: R Frontend for Apache Spark

Perhaps omit the redundant R?

Please single quote software names.


> Author(s): Shivaram Venkataraman [aut, cre], Xiangrui Meng [aut], Felix
>Cheung [aut], The Apache Software Foundation [aut, cph]
> Maintainer: Shivaram Venkataraman 
> Depends: R (>= 3.0), methods
> Suggests: knitr, rmarkdown, testthat, e1071, survival
> Description: Provides an R Frontend for Apache Spark.

Please single quote software names and give a web reference in the form
 to Apache Spark.
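
For illustration, a Description revised along these lines might read as
follows (the exact wording would come from the follow-up patch):

    Description: Provides an R frontend for 'Apache Spark'
        <http://spark.apache.org/>.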

Best,
Uwe Ligges




> License: Apache License (== 2.0)
>
>
> The maintainer confirms that he or she
> has read and agrees to the CRAN policies.
>
> =
>
> Original content of DESCRIPTION file:
>
> Package: SparkR
> Type: Package
> Version: 2.3.3
> Title: R Frontend for Apache Spark
> Description: Provides an R Frontend for Apache Spark.
> Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
>  email = "shiva...@cs.berkeley.edu"),
>   person("Xiangrui", "Meng", role = "aut",
>  email = "m...@databricks.com"),
>   person("Felix", "Cheung", role = "aut",
>  email = "felixche...@apache.org"),
>   person(family = "The Apache Software Foundation", role = 
> c("aut", "cph")))
> License: Apache License (== 2.0)
> URL: http://www.apache.org/ http://spark.apache.org/
> BugReports: http://spark.apache.org/contributing.html
> SystemRequirements: Java (== 8)
> Depends: R (>= 3.0), methods
> Suggests: knitr, rmarkdown, testthat, e1071, survival
> Collate: 'schema.R' 'generics.R' 'jobj.R' 'column.R' 'group.R' 'RDD.R'
>  'pairRDD.R' 'DataFrame.R' 'SQLContext.R' 'WindowSpec.R'
>  'backend.R' 'broadcast.R' 'catalog.R' 'client.R' 'context.R'
>  'deserialize.R' 'functions.R' 'install.R' 'jvm.R'
>  'mllib_classification.R' 'mllib_clustering.R' 'mllib_fpm.R'
>  'mllib_recommendation.R' 'mllib_regression.R' 'mllib_stat.R'
>  'mllib_tree.R' 'mllib_utils.R' 'serialize.R' 'sparkR.R'
>  'stats.R' 'streaming.R' 'types.R' 'utils.R' 'window.R'
> RoxygenNote: 6.1.1
> VignetteBuilder: knitr
> NeedsCompilation: no
> Packaged: 2019-02-04 15:40:09 UTC; spark-rm
> Author: Shivaram Venkataraman [aut, cre],
>Xiangrui Meng [aut],
>Felix Cheung [aut],
>The Apache Software Foundation [aut, cph]
> Maintainer: Shivaram Venkataraman 
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Reynold Xin
The challenge with the Scala/Java API in the past is that when there are
multiple parameters, it leads to an explosion of function overloads.

On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> 
> I hear three topics in this thread
> 
> 
> 1. I don’t think we should remove string. Column and string can both be
> “type safe”. And I would agree we don’t *need* to break API compatibility
> here.
> 
> 
> 2. Gaps in the Python API. Extending on #1, we should definitely be
> consistent and add string as a param where it is missing.
> 
> 
> 3. Scala API for string - hard to say, but it makes sense if only for
> consistency. Though I can also see the argument for Column-only in Scala.
> String might be more natural in Python and much less significant in Scala
> because of the $”foo” notation.
> 
> 
> (My 2 c)
> 
> 
> 
>  
> *From:* Sean Owen <sro...@gmail.com>
> *Sent:* Sunday, February 24, 2019 6:59 AM
> *To:* André Mello
> *Cc:* dev
> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
> functions
>  
> I just commented on the PR -- I personally don't think it's worth
> removing support for, say, max("foo") over max(col("foo")) or
> max($"foo") in Scala. We can make breaking changes in Spark 3 but this
> seems like it would unnecessarily break a lot of code. The string arg
> is more concise in Python and I can't think of cases where it's
> particularly ambiguous or confusing; on the contrary it's more natural
> coming from SQL.
> 
> What we do have are inconsistencies and errors in support of string vs
> Column as fixed in the PR. I was surprised to see that
> df.select(abs("col")) throws an error while df.select(sqrt("col"))
> doesn't. I think that's easy to fix on the Python side. Really I think
> the question is: do we need to add methods like "def abs(String)" and
> more in Scala? that would remain inconsistent even if the Pyspark side
> is fixed.
> 
> On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:
> >
> > # Context
> >
> > This comes from [SPARK-26979], which became PR #23879 and then PR
> > #23882. The following reflects all the findings made so far.
> >
> > # Description
> >
> > Currently, in the Scala API, some SQL functions have two overloads,
> > one taking a string that names the column to be operated on, the other
> > taking a proper Column object. This allows for two patterns of calling
> > these functions, which is a source of inconsistency and generates
> > confusion for new users, since it is hard to predict which functions
> > will take a column name or not.
> >
> > The PySpark API partially solves this problem by internally converting
> > the argument to a Column object prior to passing it through to the
> > underlying JVM implementation. This allows for a consistent use of
> > name literals across the API, except for a few violations:
> >
> > - lower()
> > - upper()
> > - abs()
> > - bitwiseNOT()
> > - ltrim()
> > - rtrim()
> > - trim()
> > - ascii()
> > - base64()
> > - unbase64()
> >
> > These violations happen because for a subset of the SQL functions,
> > PySpark uses a functional mechanism (`_create_function`) to directly
> > call the underlying JVM equivalent by name, thus skipping the
> > conversion step. In most cases the column name pattern still works
> > because the Scala API has its own support for string arguments, but
> > the aforementioned functions are also exceptions there.
> >
> > My proposal was to solve this problem by adding the string support
> > where it was missing in the PySpark API. Since this is a purely
> > additive change, it doesn't break past code. Additionally, I find the
> > API sugar to be a positive feature, since code like `max("foo")` is
> > more concise and readable than `max(col("foo"))`. It adheres to the
> > DRY philosophy and is consistent with Python's preference for
> > readability over type protection.
> >
> > However, upon submission of the PR, a discussion was started about
> > whether it wouldn't be better to entirely deprecate string support
> > instead - in particular with major release 3.0 in mind. The reasoning,
> > as I understood it, was that this approach is more explicit and type
> > safe, which is preferred in Java/Scala, plus it reduces the API
> > surface area - and the Python API should be consistent with the others
> > as well.
> >
> > Upon request by @HyukjinKwon I'm submitting this matter for discussion
> > by this mailing list.
> >
> > # Summary
> >
> > There is a problem with inconsistency in the Scala/Python SQL API,
> > where sometimes you can use a column name string as a proxy, and
> > sometimes you have to use a proper Column object. To solve it there
> > are two approaches - to remove the string support entirely, or to add
> > it where it is missing. Which approach is best?
> >
> > Hope this is clear.
> >
> > -- André.
> >
> > 

Re: [DISCUSS] SPIP: Relational Cache

2019-02-24 Thread Reynold Xin
How is this different from materialized views?

On Sun, Feb 24, 2019 at 3:44 PM Daoyuan Wang  wrote:

> Hi everyone,
>
> We'd like to discuss our proposal for a Spark relational cache in this
> thread. Spark has a native command for RDD caching, but the use of the
> CACHE command in Spark SQL is limited: the cache cannot be reused across
> sessions, and we have to rewrite queries by hand to make use of an
> existing cache.
> To resolve this, we have done some initial work on the following:
>
>  1. allow users to persist the cache on HDFS in Parquet format.
>  2. rewrite user queries in Catalyst to utilize any existing cache (on
> HDFS, or defined in memory in the current session) where possible.
>
> I have created a JIRA ticket (
> https://issues.apache.org/jira/browse/SPARK-26764) for this and attached
> an official SPIP document.
>
> Thanks for taking a look at the proposal.
>
> Best Regards,
> Daoyuan
>


[DISCUSS] SPIP: Relational Cache

2019-02-24 Thread Daoyuan Wang
Hi everyone,


We'd like to discuss our proposal for a Spark relational cache in this thread.
Spark has a native command for RDD caching, but the use of the CACHE command in
Spark SQL is limited: the cache cannot be reused across sessions, and we have to
rewrite queries by hand to make use of an existing cache.
To resolve this, we have done some initial work on the following:


 1. allow users to persist the cache on HDFS in Parquet format.
 2. rewrite user queries in Catalyst to utilize any existing cache (on HDFS or
defined in memory in the current session) where possible.
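
To make the limitation concrete, here is a rough sketch against today's public
APIs (the table name and HDFS path are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The in-memory cache is scoped to the current session: another
    # SparkSession cannot reuse it, and downstream queries must name the
    # cached view explicitly.
    spark.sql("CACHE TABLE sales_agg AS "
              "SELECT region, sum(amount) AS total FROM sales GROUP BY region")

    # Manual workaround today: persist the result to Parquet on HDFS and
    # rewrite downstream queries by hand to read from that path instead.
    spark.table("sales_agg").write.mode("overwrite") \
        .parquet("hdfs:///warehouse/cache/sales_agg")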


I have created a JIRA ticket (https://issues.apache.org/jira/browse/SPARK-26764)
for this and attached an official SPIP document.


Thanks for taking a look at the proposal.


Best Regards,
Daoyuan

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Felix Cheung
I hear three topics in this thread

1. I don’t think we should remove string. Column and string can both be “type 
safe”. And I would agree we don’t *need* to break API compatibility here.

2. Gaps in the Python API. Extending on #1, we should definitely be consistent
and add string as a param where it is missing.

3. Scala API for string - hard to say, but it makes sense if only for
consistency. Though I can also see the argument for Column-only in Scala. String
might be more natural in Python and much less significant in Scala because of
the $”foo” notation.

(My 2 c)



From: Sean Owen 
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth
removing support for, say, max("foo") over max(col("foo")) or
max($"foo") in Scala. We can make breaking changes in Spark 3 but this
seems like it would unnecessarily break a lot of code. The string arg
is more concise in Python and I can't think of cases where it's
particularly ambiguous or confusing; on the contrary it's more natural
coming from SQL.

What we do have are inconsistencies and errors in support of string vs
Column as fixed in the PR. I was surprised to see that
df.select(abs("col")) throws an error while df.select(sqrt("col"))
doesn't. I think that's easy to fix on the Python side. Really I think
the question is: do we need to add methods like "def abs(String)" and
more in Scala? that would remain inconsistent even if the Pyspark side
is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello  wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR
> #23882. The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads,
> one taking a string that names the column to be operated on, the other
> taking a proper Column object. This allows for two patterns of calling
> these functions, which is a source of inconsistency and generates
> confusion for new users, since it is hard to predict which functions
> will take a column name or not.
>
> The PySpark API partially solves this problem by internally converting
> the argument to a Column object prior to passing it through to the
> underlying JVM implementation. This allows for a consistent use of
> name literals across the API, except for a few violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because for a subset of the SQL functions,
> PySpark uses a functional mechanism (`_create_function`) to directly
> call the underlying JVM equivalent by name, thus skipping the
> conversion step. In most cases the column name pattern still works
> because the Scala API has its own support for string arguments, but
> the aforementioned functions are also exceptions there.
>
> My proposal was to solve this problem by adding the string support
> where it was missing in the PySpark API. Since this is a purely
> additive change, it doesn't break past code. Additionally, I find the
> API sugar to be a positive feature, since code like `max("foo")` is
> more concise and readable than `max(col("foo"))`. It adheres to the
> DRY philosophy and is consistent with Python's preference for
> readability over type protection.
>
> However, upon submission of the PR, a discussion was started about
> whether it wouldn't be better to entirely deprecate string support
> instead - in particular with major release 3.0 in mind. The reasoning,
> as I understood it, was that this approach is more explicit and type
> safe, which is preferred in Java/Scala, plus it reduces the API
> surface area - and the Python API should be consistent with the others
> as well.
>
> Upon request by @HyukjinKwon I'm submitting this matter for discussion
> by this mailing list.
>
> # Summary
>
> There is a problem with inconsistency in the Scala/Python SQL API,
> where sometimes you can use a column name string as a proxy, and
> sometimes you have to use a proper Column object. To solve it there
> are two approaches - to remove the string support entirely, or to add
> it where it is missing. Which approach is best?
>
> Hope this is clear.
>
> -- André.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Sean Owen
Sure, though I don't read anyone here as making those statements. Let's
assume good intent and take "foo should happen" as "my opinion as a member
of the community, which is not solely up to me, is that foo should happen".
I understand it's possible for a person to present their opinion as carrying
more weight than it does; this whole style of decision making assumes good
actors and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  wrote:
> There is nothing wrong with individuals advocating for what they think should 
> or should not be in Spark 3.0, nor should anyone shy away from explaining why 
> they think delaying the release for some reason is or isn't a good idea. What 
> is a problem, or is at least something that I have a problem with, are 
> declarative, pseudo-authoritative statements that 3.0 (or some other release) 
> will or won't contain some feature, API, etc. or that some issue is or is not 
> blocker or worth delaying for. When the PMC has not voted on such issues, I'm 
> often left thinking, "Wait... what? Who decided that, or where did that 
> decision come from?"

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
>
> I’m not quite sure what you mean here.
>

I'll try to explain once more, then I'll drop it since continuing the rest
of the discussion in this thread is more important than getting
side-tracked.

There is nothing wrong with individuals advocating for what they think
should or should not be in Spark 3.0, nor should anyone shy away from
explaining why they think delaying the release for some reason is or isn't
a good idea. What is a problem, or is at least something that I have a
problem with, are declarative, pseudo-authoritative statements that 3.0 (or
some other release) will or won't contain some feature, API, etc. or that
some issue is or is not blocker or worth delaying for. When the PMC has not
voted on such issues, I'm often left thinking, "Wait... what? Who decided
that, or where did that decision come from?"

On Sun, Feb 24, 2019 at 1:27 PM Ryan Blue  wrote:

> Thanks to Matt for his philosophical take. I agree.
>
> The intent is to set a common goal, so that we work toward getting v2 in a
> usable state as a community. Part of that is making choices to get it done
> on time, which we have already seen on this thread: setting out more
> clearly what we mean by “DSv2” and what we think we can get done on time.
>
> I don’t mean to say that we should commit to a plan that *requires* a
> delay to the next release (which describes the goal better than 3.0 does).
> But we should commit to making sure the goal is met, acknowledging that
> this is one of the most important efforts for many people that work in this
> community.
>
> I think it would help to clarify what this commitment means, at least to
> me:
>
>1. What it means: the community will seriously consider delaying the
>next release if this isn’t done by our initial deadline.
>2. What it does not mean: delaying the release no matter what happens.
>
> In the event that this feature isn’t done on time, it would be up to the
> community to decide what to do. But in the mean time, I think it is healthy
> to set a goal and work toward it. (I am not making a distinction between
> PMC and community here.)
>
> I think this commitment is a good idea for the same reason why we set
> other goals: to hold ourselves accountable. When one sets a New Year's
> resolution to drop 10 pounds, it isn’t that the hope or intent wasn’t there
> before. It is about having a (self-imposed) constraint that helps you make
> hard choices: cake now or meet my goal?
>
> “Spark 3.0 has many other major features as well, delaying the release has
> significant cost and we should try our best to not let it happen.”
>
> I agree with Wenchen here. No one wants to actually delay the release. We
> just want to push ourselves to make some tough decisions, using that delay
> as a motivating factor.
>
> The fact that some entity other than the PMC thinks that Spark 3.0 should
> contain certain new features or that it will be costly to them if 3.0 does
> not contain those features is not dispositive.
>
> I’m not quite sure what you mean here. While I am representing my
> employer, I am bringing up this topic as a member of the community, to
> suggest a direction for the community to take, and I fully accept that the
> decision is up to the community. I think it is reasonable to candidly state
> how this matters; that context informs the discussion.
>
> On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra 
> wrote:
>
>>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity?
>>>
>>
>> I'll answer indirectly since pointing fingers isn't really my intent. In
>> the absence of a PMC vote, I react negatively to individuals making new
>> declarative policy statements or statements to the effect that Spark
>> 3.0 will (or will not) include these features..., or that it will be too
>> costly to do something. Maybe these are innocent shorthand that leave off a
>> clarifying "in my opinion" or "according to the current state of JIRA" or
>> some such.
>>
>> My points are simply that nobody other than the PMC has an authoritative
>> say on such matters, and if we are at a point where the community needs
>> some definitive guidance, then we need PMC involvement and a vote. That's
>> not intended to preclude or terminate community discussion, because that
>> is, indeed, lovely to see.
>>
>> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:
>>
>>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity? The PMC is the thing that says a thing is a release,
>>> sure, but this discussion is properly a community one. And here we are,
>>> this is lovely to see.
>>>
>>> (May I remind everyone to casually, sometime, browse the large list of
>>> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>>>
>>> I can't speak to specific decisions here, but, I see:
>>>
>>> Spark 3 doesn't have a release date. Notionally it's 6 months after
>>> Spark 2.4 (Nov 2018). It'd be reasonable to plan for a little more time.
>>> Can we 

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Ryan Blue
Thanks to Matt for his philosophical take. I agree.

The intent is to set a common goal, so that we work toward getting v2 in a
usable state as a community. Part of that is making choices to get it done
on time, which we have already seen on this thread: setting out more
clearly what we mean by “DSv2” and what we think we can get done on time.

I don’t mean to say that we should commit to a plan that *requires* a delay
to the next release (which describes the goal better than 3.0 does). But we
should commit to making sure the goal is met, acknowledging that this is
one of the most important efforts for many people that work in this
community.

I think it would help to clarify what this commitment means, at least to me:

   1. What it means: the community will seriously consider delaying the
   next release if this isn’t done by our initial deadline.
   2. What it does not mean: delaying the release no matter what happens.

In the event that this feature isn’t done on time, it would be up to the
community to decide what to do. But in the mean time, I think it is healthy
to set a goal and work toward it. (I am not making a distinction between
PMC and community here.)

I think this commitment is a good idea for the same reason why we set other
goals: to hold ourselves accountable. When one sets a New Year's resolution
to drop 10 pounds, it isn’t that the hope or intent wasn’t there before. It
is about having a (self-imposed) constraint that helps you make hard
choices: cake now or meet my goal?

“Spark 3.0 has many other major features as well, delaying the release has
significant cost and we should try our best to not let it happen.”

I agree with Wenchen here. No one wants to actually delay the release. We
just want to push ourselves to make some tough decisions, using that delay
as a motivating factor.

The fact that some entity other than the PMC thinks that Spark 3.0 should
contain certain new features or that it will be costly to them if 3.0 does
not contain those features is not dispositive.

I’m not quite sure what you mean here. While I am representing my employer,
I am bringing up this topic as a member of the community, to suggest a
direction for the community to take, and I fully accept that the decision
is up to the community. I think it is reasonable to candidly state how this
matters; that context informs the discussion.

On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra 
wrote:

>> To your other message: I already see a number of PMC members here. Who's
>> the other entity?
>>
>
> I'll answer indirectly since pointing fingers isn't really my intent. In
> the absence of a PMC vote, I react negatively to individuals making new
> declarative policy statements or statements to the effect that Spark
> 3.0 will (or will not) include these features..., or that it will be too
> costly to do something. Maybe these are innocent shorthand that leave off a
> clarifying "in my opinion" or "according to the current state of JIRA" or
> some such.
>
> My points are simply that nobody other than the PMC has an authoritative
> say on such matters, and if we are at a point where the community needs
> some definitive guidance, then we need PMC involvement and a vote. That's
> not intended to preclude or terminate community discussion, because that
> is, indeed, lovely to see.
>
> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:
>
>> To your other message: I already see a number of PMC members here. Who's
>> the other entity? The PMC is the thing that says a thing is a release,
>> sure, but this discussion is properly a community one. And here we are,
>> this is lovely to see.
>>
>> (May I remind everyone to casually, sometime, browse the large list of
>> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>>
>> I can't speak to specific decisions here, but, I see:
>>
>> Spark 3 doesn't have a release date. Notionally it's 6 months after Spark
>> 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we
>> throw out... June 2019, and I update the website? It can slip but that
>> gives a concrete timeframe around which to plan. What can comfortably get
>> in by June 2019?
>>
>> Agreement that "DSv2" is going into Spark 3, for some definition of DSv2
>> that's probably roughly Matt's list.
>>
>> Changes that can't go into a minor release (API changes, etc) must by
>> definition go into Spark 3.0. Agree those first and do those now. Delay
>> Spark 3 until they're done and prioritize accordingly.
>> Changes that can go into a minor release can go into 3.1, if needed.
>> This has been in discussion long enough that I think whatever design(s)
>> are on the table for DSv2 now are as close as one is going to get. The
>> perfect is the enemy of the good.
>>
>> Aside from throwing out a date, I probably just restated what everyone
>> said. But I was 'summoned' :)
>>
>> On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra 
>> wrote:
>>
>>> However, as other people mentioned, Spark 3.0 has many other major
 

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Sean Owen
I just commented on the PR -- I personally don't think it's worth
removing support for, say, max("foo") over max(col("foo")) or
max($"foo") in Scala. We can make breaking changes in Spark 3 but this
seems like it would unnecessarily break a lot of code. The string arg
is more concise in Python and I can't think of cases where it's
particularly ambiguous or confusing; on the contrary it's more natural
coming from SQL.

What we do have are inconsistencies and errors in support of string vs
Column as fixed in the PR. I was surprised to see that
df.select(abs("col")) throws an error while df.select(sqrt("col"))
doesn't. I think that's easy to fix on the Python side. Really I think
the question is: do we need to add methods like "def abs(String)" and
more in Scala? that would remain inconsistent even if the Pyspark side
is fixed.
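
For reference, a minimal way to see the current behavior (a quick sketch
against Spark 2.x; the DataFrame here is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import abs, col, sqrt  # shadows built-in abs here

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5).toDF("x")

    df.select(sqrt("x"))      # works: the Scala side has a sqrt(String) overload
    df.select(abs(col("x")))  # works: Column argument
    df.select(abs("x"))       # error: no abs(String) overload on the Scala side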

On Sun, Feb 24, 2019 at 8:54 AM André Mello  wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR
> #23882. The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads,
> one taking a string that names the column to be operated on, the other
> taking a proper Column object. This allows for two patterns of calling
> these functions, which is a source of inconsistency and generates
> confusion for new users, since it is hard to predict which functions
> will take a column name or not.
>
> The PySpark API partially solves this problem by internally converting
> the argument to a Column object prior to passing it through to the
> underlying JVM implementation. This allows for a consistent use of
> name literals across the API, except for a few violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because for a subset of the SQL functions,
> PySpark uses a functional mechanism (`_create_function`) to directly
> call the underlying JVM equivalent by name, thus skipping the
> conversion step. In most cases the column name pattern still works
> because the Scala API has its own support for string arguments, but
> the aforementioned functions are also exceptions there.
>
> My proposal was to solve this problem by adding the string support
> where it was missing in the PySpark API. Since this is a purely
> additive change, it doesn't break past code. Additionally, I find the
> API sugar to be a positive feature, since code like `max("foo")` is
> more concise and readable than `max(col("foo"))`. It adheres to the
> DRY philosophy and is consistent with Python's preference for
> readability over type protection.
>
> However, upon submission of the PR, a discussion was started about
> whether it wouldn't be better to entirely deprecate string support
> instead - in particular with major release 3.0 in mind. The reasoning,
> as I understood it, was that this approach is more explicit and type
> safe, which is preferred in Java/Scala, plus it reduces the API
> surface area - and the Python API should be consistent with the others
> as well.
>
> Upon request by @HyukjinKwon I'm submitting this matter for discussion
> by this mailing list.
>
> # Summary
>
> There is a problem with inconsistency in the Scala/Python SQL API,
> where sometimes you can use a column name string as a proxy, and
> sometimes you have to use a proper Column object. To solve it there
> are two approaches - to remove the string support entirely, or to add
> it where it is missing. Which approach is best?
>
> Hope this is clear.
>
> -- André.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread André Mello
# Context

This comes from [SPARK-26979], which became PR #23879 and then PR
#23882. The following reflects all the findings made so far.

# Description

Currently, in the Scala API, some SQL functions have two overloads,
one taking a string that names the column to be operated on, the other
taking a proper Column object. This allows for two patterns of calling
these functions, which is a source of inconsistency and generates
confusion for new users, since it is hard to predict which functions
will take a column name or not.

The PySpark API partially solves this problem by internally converting
the argument to a Column object prior to passing it through to the
underlying JVM implementation. This allows for a consistent use of
name literals across the API, except for a few violations:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

These violations happen because for a subset of the SQL functions,
PySpark uses a functional mechanism (`_create_function`) to directly
call the underlying JVM equivalent by name, thus skipping the
conversion step. In most cases the column name pattern still works
because the Scala API has its own support for string arguments, but
the aforementioned functions are also exceptions there.

My proposal was to solve this problem by adding the string support
where it was missing in the PySpark API. Since this is a purely
additive change, it doesn't break past code. Additionally, I find the
API sugar to be a positive feature, since code like `max("foo")` is
more concise and readable than `max(col("foo"))`. It adheres to the
DRY philosophy and is consistent with Python's preference for
readability over type protection.
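
To make this concrete, here is a simplified sketch of the kind of change I
have in mind (the helper names follow PySpark's internals, but the exact fix
is whatever the PR settles on): route every argument through the existing
string-to-Column conversion instead of passing it to the JVM as-is.

    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column

    def _create_function(name, doc=""):
        """Create a wrapper for the JVM function `name` (simplified sketch)."""
        def _(col):
            sc = SparkContext._active_spark_context
            # _to_java_column accepts either a Column or a column-name string,
            # so the generated function behaves uniformly for both call styles.
            jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
            return Column(jc)
        _.__name__ = name
        _.__doc__ = doc
        return _

With a change like this, `abs("foo")` and `abs(col("foo"))` resolve to the
same JVM call, and the functions listed above stop being special cases on the
Python side (the Scala-side gaps are a separate question).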

However, upon submission of the PR, a discussion was started about
whether it wouldn't be better to entirely deprecate string support
instead - in particular with major release 3.0 in mind. The reasoning,
as I understood it, was that this approach is more explicit and type
safe, which is preferred in Java/Scala, plus it reduces the API
surface area - and the Python API should be consistent with the others
as well.

Upon request by @HyukjinKwon I'm submitting this matter for discussion
by this mailing list.

# Summary

There is a problem with inconsistency in the Scala/Python SQL API,
where sometimes you can use a column name string as a proxy, and
sometimes you have to use a proper Column object. To solve it there
are two approaches - to remove the string support entirely, or to add
it where it is missing. Which approach is best?

Hope this is clear.

-- André.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org