Re: SPIP: Spark on Kubernetes

2017-09-01 Thread Reynold Xin
Anirudh (or somebody else familiar with spark-on-k8s), can you create a short plan for how we would integrate and code-review the project in order to merge it? If the diff is too large, it'd be difficult to review and merge in one shot. Once we have a plan we can create subtickets to track the progress.

Spark 2.2.0 - Odd Hive SQL Warnings

2017-09-01 Thread Don Drake
I'm in the process of migrating a few applications from Spark 2.1.1 to Spark 2.2.0 and so far the transition has been smooth. One odd thing is that when I query a Hive table that I do not own, but have read access to, I get a very long WARNING with a stack trace that basically says I do not have

Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
That would be awesome. I’m not sure whether we want 3.0 to be right after 2.3 (I guess this Scala issue is one reason to start discussing that), but even if we do, I imagine that wouldn’t be out for at least 4-6 more months after 2.3, and that’s a long time to go without Scala 2.12 support. If

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Felix Cheung
+1 on this, and I like the suggestion of type in string form. Would it be correct to assume there will be a data type check, for example that the returned pandas data frame column data types match what is specified? We have seen quite a few issues and some confusion with that in R. Would it make sense to

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Ok, thanks. +1 on the SPIP for scope etc. On API details (will deal with in code reviews as well but leaving a note here in case I forget): 1. I would suggest having the API also accept data type specification in string form. It is usually simpler to say "long" than "LongType()". 2. Think about

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Takuya UESHIN
Yes, the aggregation is out of scope for now. I think we should continue discussing the aggregation at JIRA and we will be adding those later separately. Thanks. On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin wrote: > Is the idea aggregate is out of scope for the current

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-01 Thread Reynold Xin
Why does ordering matter here for sort vs filter? The source should be able to handle it in whatever way it wants (which is almost always filter beneath sort I'd imagine). The only ordering that'd matter in the current set of pushdowns is limit - it should always mean the root of the pushded

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Is the idea aggregate is out of scope for the current effort and we will be adding those later? On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN wrote: > Hi all, > > We've been discussing to support vectorized UDFs in Python and we almost > got a consensus about the APIs, so

[SS] New numSavedStates metric for StateStoreRestoreExec for saved state?

2017-09-01 Thread Jacek Laskowski
Hi, I was just reviewing StateStoreRestoreExec [1] and have been wondering how to know whether state was available for a key. It has a numOutputRows metric [2], but that gives the number of aggregations from the child operator only and seems to say nothing about whether state was available for an

Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Sean Owen
OK, what I'll do is focus on some changes that can be merged to master without impacting the 2.11 build (e.g. putting kafka-0.8 behind a profile, maybe, or adding the 2.12 REPL). Anything that is breaking, we can work on in a series of open PRs, or maybe a branch, yea. It's unusual but might be

Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
If the changes aren’t that hard, I think we should also consider building a Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely seen concerns from some large Scala users that Spark isn’t supporting 2.12 soon enough. I thought SPARK-14220 was blocked mainly because the changes

[VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Takuya UESHIN
Hi all, We've been discussing support for vectorized UDFs in Python and have almost reached a consensus on the APIs, so I'd like to summarize and call for a vote. Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.