Re: SPIP: Spark on Kubernetes

2017-09-01 Thread Reynold Xin
Anirudh (or somebody else familiar with spark-on-k8s),

Can you create a short plan on how we would integrate and do code review to
merge the project? If the diff is too large it'd be difficult to review and
merge in one shot. Once we have a plan we can create subtickets to track
the progress.



On Thu, Aug 31, 2017 at 5:21 PM, Anirudh Ramanathan 
wrote:

> The proposal is in the process of being updated to include the details on
> the testing that we have, which Imran pointed out.
> Please expect an update on SPARK-18278.
>
> Mridul had a couple of points as well, about exposing an SPI, and we've
> been exploring that to ascertain the effort involved.
> That effort is separate and fairly long-term, and we should have a working
> group of representatives from all cluster managers to make progress on it.
> A proposal regarding this will be in SPARK-19700.
>
> This vote has passed.
> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
> -1 votes.
>
> Thanks all!
>
> +1 votes (binding):
> Reynold Xin
> Matei Zaharia
> Marcelo Vanzin
> Mark Hamstra
>
> +1 votes (non-binding):
> Anirudh Ramanathan
> Erik Erlandson
> Ilan Filonenko
> Sean Suchter
> Kimoon Kim
> Timothy Chen
> Will Benton
> Holden Karau
> Seshu Adunuthula
> Daniel Imberman
> Shubham Chopra
> Jiri Kremser
> Yinan Li
> Andrew Ash
> 李书明
> Gary Lucas
> Ismael Mejia
> Jean-Baptiste Onofré
> Alexander Bezzubov
> duyanghao
> elmiko
> Sudarshan Kadambi
> Varun Katta
> Matt Cheah
> Edward Zhang
> Vaquar Khan
>
>
>
>
>
> On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin  wrote:
>
>> This has passed, hasn't it?
>>
>> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
>> wrote:
>>
>>> The Spark on Kubernetes effort has been developed separately in a fork and
>>> linked back from the Apache Spark project as an experimental backend.
>>> We're ~6 months in and have had 5 releases.
>>>
>>>    - 2 Spark versions maintained (2.1 and 2.2)
>>>    - Extensive integration testing and refactoring efforts to maintain
>>>      code quality
>>>    - Developer- and user-facing documentation
>>>    - 10+ consistent code contributors from different organizations
>>>      involved in actively maintaining and using the project, with several
>>>      more members involved in testing and providing feedback
>>>    - The community has delivered several talks on Spark-on-Kubernetes,
>>>      generating lots of feedback from users
>>>    - In addition to these, we've seen efforts spawn off such as:
>>>       - HDFS on Kubernetes, with Locality and Performance Experiments
>>>       - Kerberized access to HDFS from Spark running on Kubernetes
>>>
>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>
>>>- +1: Yeah, let's go forward and implement the SPIP.
>>>- +0: Don't really care.
>>>- -1: I don't think this is a good idea because of the following
>>>technical reasons.
>>>
>>> If any further clarification is desired on the design or the
>>> implementation, please feel free to ask questions or provide feedback.
>>>
>>>
>>> SPIP: Kubernetes as a Native Cluster Manager
>>>
>>> Full Design Doc: link
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>
>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>
>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash,
>>> Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>> Background and Motivation
>>>
>>> Containerization and cluster management technologies are constantly
>>> evolving in the cluster computing world. Apache Spark currently implements
>>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>>> its own standalone cluster manager. In 2014, Google announced development
>>> of Kubernetes, which has its own unique feature
>>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>>> seen contributions from over 1300 contributors with over 5 commits.
>>> Kubernetes has cemented itself as a core player in the cluster computing
>>> world, and cloud-computing providers such as Google Container Engine,
>>> Google Compute Engine, Amazon Web S

Spark 2.2.0 - Odd Hive SQL Warnings

2017-09-01 Thread Don Drake
I'm in the process of migrating a few applications from Spark 2.1.1 to
Spark 2.2.0, and so far the transition has been smooth.  One odd thing is
that when I query a Hive table that I do not own, but have read access to, I
get a very long WARNING with a stack trace that basically says I do not
have permission to ALTER the table.

As you can see, I'm just doing a SELECT on the table.  Everything works,
but this stack trace is a little concerning.  Does anyone know what is going on?


I'm using a downloaded binary (spark-2.2.0-bin-hadoop2.6) on CDH 5.10.1.

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143

scala> spark.sql("select * from test.my_table")
17/09/01 15:40:30 WARN HiveExternalCatalog: Could not alter schema of table `test`.`my_table` in a Hive compatible way. Updating Hive metastore in Spark SQL specific format.
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.sql.hive.client.Shim_v0_12.alterTable(HiveShim.scala:399)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply$mcV$sp(HiveClientImpl.scala:461)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply(HiveClientImpl.scala:457)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply(HiveClientImpl.scala:457)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
        at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
        at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:457)
        at org.apache.spark.sql.hive.client.HiveClient$class.alterTable(HiveClient.scala:87)
        at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:79)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply$mcV$sp(HiveExternalCatalog.scala:636)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply(HiveExternalCatalog.scala:627)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply(HiveExternalCatalog.scala:627)
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
        at org.apache.spark.sql.hive.HiveExternalCatalog.alterTableSchema(HiveExternalCatalog.scala:627)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog.updateCatalogSchema(HiveMetastoreCatalog.scala:267)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$inferIfNeeded(HiveMetastoreCatalog.scala:251)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:195)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:194)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:194)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:187)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog.withTableCreationLock(HiveMetastoreCatalog.scala:54)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:187)
        at org.apache.spark.sql.hive.RelationConversions.org$apache$spark$sql$hive$RelationConversions$$convert(HiveStrategies.scala:199)
        at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:219)
        at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:208)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transf
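
The trace goes through HiveMetastoreCatalog.inferIfNeeded and
HiveExternalCatalog.alterTableSchema, which suggests Spark 2.2's case-sensitive
schema inference is attempting to write the inferred schema back to the
metastore and failing on tables the user cannot ALTER. As a hedged sketch only
(the config below and its effect here are an assumption about Spark 2.2
behavior, not a confirmed fix for this environment, and it assumes the usual
`spark` session in the shell), disabling the write-back before the query might
avoid the ALTER attempt:

  # Hedged sketch: assumes the warning comes from schema-inference write-back.
  # INFER_ONLY keeps the inferred schema for the session but skips the
  # metastore update (and the ALTER call) that appears to trigger the warning.
  spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
  spark.sql("select * from test.my_table")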

Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
That would be awesome. I’m not sure whether we want 3.0 to be right after 2.3 
(I guess this Scala issue is one reason to start discussing that), but even if 
we do, I imagine that wouldn’t be out for at least 4-6 more months after 2.3, 
and that’s a long time to go without Scala 2.12 support. If we decide to do 2.4 
next instead, that’s even longer.

Matei

> On Sep 1, 2017, at 1:52 AM, Sean Owen  wrote:
> 
> OK, what I'll do is focus on some changes that can be merged to master 
> without impacting the 2.11 build (e.g. putting kafka-0.8 behind a profile, 
> maybe, or adding the 2.12 REPL). Anything that is breaking, we can work on in 
> a series of open PRs, or maybe a branch, yea. It's unusual but might be 
> worthwhile.
> 
> On Fri, Sep 1, 2017 at 9:44 AM Matei Zaharia  wrote:
> If the changes aren’t that hard, I think we should also consider building a 
> Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely seen 
> concerns from some large Scala users that Spark isn’t supporting 2.12 soon 
> enough. I thought SPARK-14220 was blocked mainly because the changes are 
> hard, but if not, maybe we can release such a branch sooner.
> 
> Matei
> 
> > On Aug 31, 2017, at 3:59 AM, Sean Owen  wrote:
> >
> > I don't think there's a target. The changes aren't all that hard (see the 
> > SPARK-14220 umbrella) but there are some changes that are hard or 
> > impossible without changing key APIs, as far as we can see. That would 
> > suggest 3.0.
> >
> > One motivation I have here for getting it as far as possible otherwise is 
> > so people could, if they wanted, create a 2.12 build themselves without 
> > much work even if it were not supported upstream. This particular change is 
> > a lot of the miscellaneous stuff you'd have to fix to get to that point.
> >
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Felix Cheung
+1 on this, and I like the suggestion of specifying the type in string form.

Would it be correct to assume there will be a data type check, for example that the
returned pandas DataFrame's column data types match what is specified? We have
seen quite a few issues and confusions with that in R.

Would it make sense to have a more generic decorator name so that it could also
be usable for other efficient vectorized formats in the future? Or do we
anticipate the decorator being format-specific, with more added in the future?
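
To make the type-check question concrete, a minimal sketch of such a check
follows (the mapping and helper name below are hypothetical, not part of the
proposed API):

  import numpy as np
  import pandas as pd

  # Hypothetical mapping from a declared Spark SQL type name to the numpy
  # dtype we would expect the UDF's pandas.Series result to carry.
  _EXPECTED_NUMPY_DTYPE = {
      "long": np.dtype("int64"),
      "double": np.dtype("float64"),
  }

  def check_result_dtype(result, declared_type):
      # Raise a clear error if the returned Series does not match the
      # declared return type, instead of failing later during conversion.
      expected = _EXPECTED_NUMPY_DTYPE[declared_type]
      if result.dtype != expected:
          raise TypeError(
              "pandas_udf declared return type %r (expected numpy dtype %s) "
              "but the UDF produced dtype %s"
              % (declared_type, expected, result.dtype))

  check_result_dtype(pd.Series([1.0, 2.0]), "double")   # passes
  # check_result_dtype(pd.Series([1, 2]), "double")     # would raise TypeError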


From: Reynold Xin 
Sent: Friday, September 1, 2017 5:16:11 AM
To: Takuya UESHIN
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Ok, thanks.

+1 on the SPIP for scope etc


On API details (will deal with in code reviews as well but leaving a note here 
in case I forget)

1. I would suggest having the API also accept the data type specification in string
form. It is usually simpler to say "long" than "LongType()".

2. Think about what error message to show when the number of rows doesn't match at
runtime.


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
mailto:ues...@happy-camper.st>> wrote:
Yes, aggregation is out of scope for now.
I think we should continue discussing aggregation on the JIRA and we will
add it later separately.

Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
mailto:r...@databricks.com>> wrote:
Is the idea that aggregation is out of scope for the current effort and we will
add it later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
mailto:ues...@happy-camper.st>> wrote:
Hi all,

We've been discussing support for vectorized UDFs in Python, and we have almost
reached a consensus on the APIs, so I'd like to summarize and call for a vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for 
vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190


Proposed API

We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs,
which take one or more pandas.Series, or a single integer value (the length of
the input) in the case of 0-parameter UDFs. The return value should be a
pandas.Series of the specified type, and the length of the returned value should
be the same as that of the input.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
  return v1 + v2

or we can define as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similar to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use as:

  @pandas_udf(LongType())
  def f0(size):
  return pd.Series(1).repeat(size)

  df.select(f0())



The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical 
reasons.

Thanks!

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin



--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Ok, thanks.

+1 on the SPIP for scope etc


On API details (will deal with in code reviews as well but leaving a note
here in case I forget)

1. I would suggest having the API also accept the data type specification in
string form. It is usually simpler to say "long" than "LongType()".

2. Think about what error message to show when the number of rows doesn't match
at runtime.
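
As a rough sketch of the two points (both are assumptions about what the API
could look like, not behavior that exists at the time of this thread):

  from pyspark.sql.functions import pandas_udf
  from pyspark.sql.types import DoubleType

  # Point 1: accept either a DataType instance or a string-form type name.
  @pandas_udf(DoubleType())        # explicit DataType instance
  def plus(v1, v2):
      return v1 + v2

  @pandas_udf("double")            # hypothetical equivalent string form
  def plus_str(v1, v2):
      return v1 + v2

  # Point 2: a hypothetical runtime check producing a clear error message
  # when the UDF returns a different number of rows than it received.
  def _check_result_length(result, expected):
      if len(result) != expected:
          raise RuntimeError(
              "pandas_udf returned %d rows, but %d were expected"
              % (len(result), expected))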


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
wrote:

> Yes, aggregation is out of scope for now.
> I think we should continue discussing aggregation on the JIRA and we will
> add it later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:
>
>> Is the idea that aggregation is out of scope for the current effort and we
>> will add it later?
>>
>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been discussing support for vectorized UDFs in Python, and we have
>>> almost reached a consensus on the APIs, so I'd like to summarize and call
>>> for a vote.
>>>
>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>> for vectorized UDAFs or Window operations.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>
>>>
>>> *Proposed API*
>>>
>>> We introduce a @pandas_udf decorator (or annotation) to define
>>> vectorized UDFs, which take one or more pandas.Series, or a single
>>> integer value (the length of the input) in the case of 0-parameter UDFs.
>>> The return value should be a pandas.Series of the specified type, and the
>>> length of the returned value should be the same as that of the input.
>>>
>>> We can define vectorized UDFs as:
>>>
>>>   @pandas_udf(DoubleType())
>>>   def plus(v1, v2):
>>>   return v1 + v2
>>>
>>> or we can define as:
>>>
>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>
>>> We can use it similar to row-by-row UDFs:
>>>
>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>
>>> As for 0-parameter UDFs, we can define and use as:
>>>
>>>   @pandas_udf(LongType())
>>>   def f0(size):
>>>   return pd.Series(1).repeat(size)
>>>
>>>   df.select(f0())
>>>
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Takuya UESHIN
Yes, aggregation is out of scope for now.
I think we should continue discussing aggregation on the JIRA and we will
add it later separately.

Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:

> Is the idea that aggregation is out of scope for the current effort and we
> will add it later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
> wrote:
>
>> Hi all,
>>
>> We've been discussing support for vectorized UDFs in Python, and we have
>> almost reached a consensus on the APIs, so I'd like to summarize and call
>> for a vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs, which take one or more pandas.Series, or a single
>> integer value (the length of the input) in the case of 0-parameter UDFs.
>> The return value should be a pandas.Series of the specified type, and the
>> length of the returned value should be the same as that of the input.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>   return v1 + v2
>>
>> or we can define as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use it similar to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>   return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-01 Thread Reynold Xin
Why does ordering matter here for sort vs filter? The source should be able
to handle it in whatever way it wants (which is almost always filter
beneath sort I'd imagine).

The only ordering that'd matter in the current set of pushdowns is limit:
it should always mean the root of the pushed-down tree.
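
To make the ordering point concrete, here is a tiny plain-Python illustration
(not Spark or Data Source API code) of why a pushed-down limit has to sit at
the root, above filter and sort:

  rows = [5, 1, 4, 2, 3]

  # Limit applied at the root, after filter and sort: the intended semantics.
  at_root = sorted(r for r in rows if r > 1)[:2]        # [2, 3]

  # Limit applied before filter and sort: a different (wrong) answer.
  too_early = sorted(r for r in rows[:2] if r > 1)      # [5]

  assert at_root != too_early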


On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan  wrote:

> > Ideally also getting sort orders _after_ getting filters.
>
> Yea we should have a deterministic order when applying various push downs,
> and I think filter should definitely go before sort. This is one of the
> details we can discuss during PR review :)
>
> On Fri, Sep 1, 2017 at 9:19 AM, James Baker  wrote:
>
>> I think that makes sense. I didn't understand backcompat was the primary
>> driver. I actually don't care right now about aggregations on the
>> datasource I'm integrating with - I just care about receiving all the
>> filters (and ideally also the desired sort order) at the same time. I am
>> mostly fine with anything else; but getting filters at the same time is
>> important for me, and doesn't seem overly contentious? (e.g. it's
>> compatible with datasources v1). Ideally also getting sort orders _after_
>> getting filters.
>>
>> That said, an unstable api that gets me the query plan would be
>> appreciated by plenty I'm sure :) (and would make my implementation more
>> straightforward - the state management is painful atm).
>>
>> James
>>
>> On Wed, 30 Aug 2017 at 14:56 Reynold Xin  wrote:
>>
>>> Sure that's good to do (and as discussed earlier a good compromise might
>>> be to expose an interface for the source to decide which part of the
>>> logical plan they want to accept).
>>>
>>> To me everything is about cost vs benefit.
>>>
>>> In my mind, the biggest issue with the existing data source API is
>>> backward and forward compatibility. All the data sources written for Spark
>>> 1.x broke in Spark 2.x. And that's one of the biggest values v2 can bring.
>>> To me it's far more important to have data sources implemented in 2017 to
>>> be able to work in 2027, in Spark 10.x.
>>>
>>> You are basically arguing for creating a new API that is capable of
>>> doing arbitrary expression, aggregation, and join pushdowns (you only
>>> mentioned aggregation so far, but I've talked to enough database people
>>> that I know once Spark gives them aggregation pushdown, they will come back
>>> for join pushdown). We can do that using unstable APIs, and creating stable
>>> APIs would be extremely difficult (still doable, just would take a long
>>> time to design and implement). As mentioned earlier, it basically involves
>>> creating a stable representation for all of the logical plan, which is a lot of
>>> work. I think we should still work towards that (for other reasons as
>>> well), but I'd consider that out of scope for the current one. Otherwise
>>> we'd not release something probably for the next 2 or 3 years.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 30, 2017 at 11:50 PM, James Baker 
>>> wrote:
>>>
 I guess I was more suggesting that by coding up the powerful mode as
 the API, it becomes easy for someone to layer an easy mode beneath it to
 enable simpler datasources to be integrated (and that simple mode should be
 the out of scope thing).

 Taking a small step back here, one of the places where I think I'm
 missing some context is in understanding the target consumers of these
 interfaces. I've done some amount (though likely not enough) of research
 about the places where people have had issues of API surface in the past -
 the concrete tickets I've seen have been based on Cassandra integration
 where you want to indicate clustering, and SAP HANA where they want to push
 down more complicated queries through Spark. This proposal supports the
 former, but the amount of change required to support clustering in the
 current API is not obviously high - whilst the current proposal for V2
 seems to make it very difficult to add support for pushing down plenty of
 aggregations in the future (I've found the question of how to add GROUP BY
 to be pretty tricky to answer for the current proposal).

 Googling around for implementations of the current PrunedFilteredScan,
 I basically find a lot of databases, which seems reasonable - SAP HANA,
 ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people
 who've used (some of) these connectors and the sticking point has generally
 been that Spark needs to load a lot of data out in order to solve
 aggregations that can be very efficiently pushed down into the datasources.

 So, with this proposal it appears that we're optimising towards making
 it easy to write one-off datasource integrations, with some amount of
 pluggability for people who want to do more complicated things (the most
 interesting being bucketing integration). However, my guess is that this
 isn't what the current major 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Is the idea that aggregation is out of scope for the current effort and we will
add it later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN  wrote:

> Hi all,
>
> We've been discussing support for vectorized UDFs in Python, and we have
> almost reached a consensus on the APIs, so I'd like to summarize and call
> for a vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for
> vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> *Proposed API*
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized
> UDFs, which take one or more pandas.Series, or a single integer value (the
> length of the input) in the case of 0-parameter UDFs. The return value should
> be a pandas.Series of the specified type, and the length of the returned
> value should be the same as that of the input.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>   return v1 + v2
>
> or we can define as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it similar to row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>   return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


[SS] New numSavedStates metric for StateStoreRestoreExec for saved state?

2017-09-01 Thread Jacek Laskowski
Hi,

I've just been reviewing StateStoreRestoreExec [1] and wondering how to
know whether a saved state was available for a key. It has a numOutputRows
metric [2], but that only gives the number of aggregations from the child
operator and seems to say nothing about whether state was
available for an aggregation.

What do you think about adding a numSavedStates metric to
StateStoreRestoreExec? Or is there a way to find this out already
(perhaps in the web UI)?

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L186

[2] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L206

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Sean Owen
OK, what I'll do is focus on some changes that can be merged to master
without impacting the 2.11 build (e.g. putting kafka-0.8 behind a profile,
maybe, or adding the 2.12 REPL). Anything that is breaking, we can work on
in a series of open PRs, or maybe a branch, yea. It's unusual but might be
worthwhile.

On Fri, Sep 1, 2017 at 9:44 AM Matei Zaharia 
wrote:

> If the changes aren’t that hard, I think we should also consider building
> a Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely
> seen concerns from some large Scala users that Spark isn’t supporting 2.12
> soon enough. I thought SPARK-14220 was blocked mainly because the changes
> are hard, but if not, maybe we can release such a branch sooner.
>
> Matei
>
> > On Aug 31, 2017, at 3:59 AM, Sean Owen  wrote:
> >
> > I don't think there's a target. The changes aren't all that hard (see
> the SPARK-14220 umbrella) but there are some changes that are hard or
> impossible without changing key APIs, as far as we can see. That would
> suggest 3.0.
> >
> > One motivation I have here for getting it as far as possible otherwise
> is so people could, if they wanted, create a 2.12 build themselves without
> much work even if it were not supported upstream. This particular change is
> a lot of the miscellaneous stuff you'd have to fix to get to that point.
> >
>
>


Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
If the changes aren’t that hard, I think we should also consider building a 
Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely seen 
concerns from some large Scala users that Spark isn’t supporting 2.12 soon 
enough. I thought SPARK-14220 was blocked mainly because the changes are hard, 
but if not, maybe we can release such a branch sooner.

Matei

> On Aug 31, 2017, at 3:59 AM, Sean Owen  wrote:
> 
> I don't think there's a target. The changes aren't all that hard (see the 
> SPARK-14220 umbrella) but there are some changes that are hard or impossible 
> without changing key APIs, as far as we can see. That would suggest 3.0.
> 
> One motivation I have here for getting it as far as possible otherwise is so 
> people could, if they wanted, create a 2.12 build themselves without much 
> work even if it were not supported upstream. This particular change is a lot 
> of the miscellaneous stuff you'd have to fix to get to that point.
> 
> On Thu, Aug 31, 2017 at 11:04 AM Saisai Shao  wrote:
> Hi Sean,
> 
> Do we have a planned target version for Scala 2.12 support? Several other
> projects, like Zeppelin and Livy, which rely on the Spark REPL, also require
> changes to support Scala 2.12.
> 
> Thanks
> Jerry
> 
> On Thu, Aug 31, 2017 at 5:55 PM, Sean Owen  wrote:
> No, this doesn't let Spark build and run on 2.12. It makes changes that will 
> be required though, the ones that are really no loss to the current 2.11 
> build.
> 
> 
> On Thu, Aug 31, 2017, 10:48 Denis Bolshakov  wrote:
> Hello,
> 
> Sounds amazing. Are there any improvements in benchmarks?
> 
> 
> On 31 August 2017 at 12:25, Sean Owen  wrote:
> Calling attention to the question of Scala 2.12 again for a moment. I'd like to 
> make a modest step towards support. Have a look again, if you would, at 
> SPARK-14280:
> 
> https://github.com/apache/spark/pull/18645
> 
> This is a lot of the change for 2.12 that doesn't break 2.11, and really 
> doesn't add any complexity. It's mostly dependency updates and clarifying 
> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL, 
> etc., are not here.
> 
> So, this still doesn't result in a working 2.12 build but it's most of the 
> miscellany that will be required.
> 
> I'd like to merge it but wanted to flag it for feedback as it's not trivial.
> 
> 
> 
> -- 
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.de...@gmail.com
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org