Re: data source api v2 refactoring

2018-08-31 Thread Jungtaek Lim
Nice suggestion, Reynold, and great news that Wenchen succeeded in prototyping! One thing I would like to confirm is how continuous mode works with such an abstraction. Would continuous mode also be abstracted with Stream, with createScan providing an unbounded Scan? Thanks, Jungtaek Lim

Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Yuanjian Li
Hi Matt, Thanks for the great document and proposal. I want to +1 the reliable shuffle data idea and give some feedback. I think a reliable shuffle service based on DFS is necessary in Spark, especially when running Spark jobs in an unstable environment. For example, while mixed deploying Spark

[Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Matt Cheah
Hi everyone, I filed SPARK-25299 to promote discussion on how we can improve the shuffle operation in Spark. The basic premise is to discuss the ways we can leverage distributed storage to improve the reliability and isolation of Spark’s shuffle architecture. A few designs and a full
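A pluggable shuffle-storage backend of the kind SPARK-25299 discusses could be sketched as below. This is a minimal illustration under my own naming assumptions (ShuffleStorage, write_block, read_block) — not the design from Matt's document — and the in-memory dict simply stands in for a distributed file system.

```python
# Hypothetical sketch of a pluggable shuffle-storage backend. The interface
# shape and names are assumptions for illustration, not the SPARK-25299
# design; the in-memory dict stands in for a DFS.


class ShuffleStorage:
    def write_block(self, shuffle_id, map_id, reduce_id, data):
        raise NotImplementedError

    def read_block(self, shuffle_id, map_id, reduce_id):
        raise NotImplementedError


class InMemoryShuffleStorage(ShuffleStorage):
    """Stand-in for a DFS-backed implementation (e.g. one file per
    shuffle/map/reduce id on distributed storage)."""

    def __init__(self):
        self._blocks = {}

    def write_block(self, shuffle_id, map_id, reduce_id, data):
        self._blocks[(shuffle_id, map_id, reduce_id)] = bytes(data)

    def read_block(self, shuffle_id, map_id, reduce_id):
        # Returns None when the block was never written.
        return self._blocks.get((shuffle_id, map_id, reduce_id))


storage = InMemoryShuffleStorage()
storage.write_block(0, 1, 2, b"payload")
round_trip = storage.read_block(0, 1, 2)  # b"payload"
missing = storage.read_block(9, 9, 9)     # None
```

The point of the abstraction is that a reducer fetches blocks from the storage layer rather than from the map-side executor, so losing an executor no longer loses its shuffle output.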

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Ryan Blue
+1 I think this is a great suggestion. I agree a bit with Sean, but I think it is really about mapping these questions into some of the existing structure. These are a great way to think about projects, but they're general and it would help to rephrase them for a software project, like Matei's

Re: data source api v2 refactoring

2018-08-31 Thread Ryan Blue
Thanks, Reynold! I think your API sketch looks great. I appreciate having the Table level in the abstraction to plug into as well. I think this makes it clear what everything does, particularly having the Stream level that represents a configured (by ScanConfig) streaming read and can act as a

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Jules Damji
+1 One could argue that the litany of questions is really a double-click on the essence: why, what, and how. These three interrogatives ought to be the essence and distillation of any proposal or technical exposition. Cheers Jules Sent from my iPhone Pardon the dumb thumb typos :) > On Aug

Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Marcelo Vanzin
I think there still might be an active job publishing stuff. Here's a pretty recent build from master: https://dist.apache.org/repos/dist/dev/spark/2.4.0-SNAPSHOT-2018_08_31_12_02-32da87d-docs/_site/index.html But it seems only docs are being published, which makes me think it's those builds

Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Matei Zaharia
If we actually build stuff nightly in Jenkins, it wouldn’t hurt to publish them IMO. It helps more people try master and test it. > On Aug 31, 2018, at 1:28 PM, Sean Owen wrote: > > There are some builds there, but they're not recent: > > https://people.apache.org/~pwendell/spark-nightly/ >

Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Sean Owen
There are some builds there, but they're not recent: https://people.apache.org/~pwendell/spark-nightly/ We can either get the jobs running again, or just knock this on the head and remove it. Anyone know how to get it running again and want to? I have a feeling Shane knows if anyone. Or does

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Matei Zaharia
I like this as well. Regarding “cost”, I think the equivalent concept for us is impact on the rest of the project (say maintenance cost down the line or whatever). This could be captured in the “risks” too, but it’s a slightly different concept. We should probably just clarify what we mean with

Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Cody Koeninger
Just got a question about this on the user list as well. Worth removing that link to pwendell's directory from the docs? On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski wrote: > Hi, > > http://spark.apache.org/developer-tools.html#nightly-builds reads: > >> Spark nightly packages are

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Cody Koeninger
+1 to Sean's comment On Fri, Aug 31, 2018 at 2:48 PM, Reynold Xin wrote: > Yup all good points. One way I've done it in the past is to have an appendix > section for design sketch, as an expansion to the question "- What is new in > your approach and why do you think it will be successful?" > >

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Reynold Xin
Yup all good points. One way I've done it in the past is to have an appendix section for design sketch, as an expansion to the question "- What is new in your approach and why do you think it will be successful?" On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin wrote: > I like the questions

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Marcelo Vanzin
I like the questions (aside maybe from the cost one which perhaps does not matter much here), especially since they encourage explaining things in a more plain language than generally used by specs. But I don't think we can ignore design aspects; it's been my observation that a good portion of

Re: TimSort bug

2018-08-31 Thread Reynold Xin
Thanks for looking into this, Sean! Loved the tl;dr. On Fri, Aug 31, 2018 at 12:28 PM Sean Owen wrote: > TL;DR - We already had the fix from SPARK-5984. The delta from the current > JDK implementation to Spark's looks actually inconsequential. No action > required AFAICT. > > On Fri, Aug 31,

Re: TimSort bug

2018-08-31 Thread Sean Owen
TL;DR - We already had the fix from SPARK-5984. The delta from the current JDK implementation to Spark's looks actually inconsequential. No action required AFAICT. On Fri, Aug 31, 2018 at 12:30 PM Sean Owen wrote: > I looked into this, because it sure sounds like a similar issue from a few >

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Sean Owen
Looks good. From the existing template at https://spark.apache.org/improvement-proposals.html I might keep points about design sketch, API, and non goals. And we don't need a cost section. On Fri, Aug 31, 2018, 1:23 PM Reynold Xin wrote: > I helped craft the current SPIP template >

[discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Reynold Xin
I helped craft the current SPIP template last year. I was recently (re-)introduced to the Heilmeier Catechism, a set of questions DARPA developed to evaluate proposals. The set of questions are: - What are you trying to do? Articulate your

Re: TimSort bug

2018-08-31 Thread Sean Owen
I looked into this, because it sure sounds like a similar issue from a few years ago that was fixed in https://issues.apache.org/jira/browse/SPARK-5984 The change in that JIRA actually looks almost identical to the change mentioned in the JDK bug:

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Reynold Xin
Both ahead of time, or just in time. Just like a normal Spark closure. On Fri, Aug 31, 2018 at 10:18 AM Nihar Sheth wrote: > Hi @rxin, > > Just to make sure I understand your desired use case, are you suggesting a > way (for the driver) to call, at any given time, a general method that can >

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Nihar Sheth
Hi @rxin, Just to make sure I understand your desired use case, are you suggesting a way (for the driver) to call, at any given time, a general method that can be defined ahead of time on the executors? On Thu, Aug 30, 2018 at 11:11 PM, Reynold Xin wrote: > I actually had a similar use case a

Re: Upgrade SBT to the latest

2018-08-31 Thread Ted Yu
+1 Original message From: Sean Owen Date: 8/31/18 6:40 AM (GMT-08:00) To: Darcy Shen Cc: dev@spark.apache.org Subject: Re: Upgrade SBT to the latest Certainly worthwhile. I think this should target Spark 3, which should come after 2.4, which is itself already just about

Re: mllib + SQL

2018-08-31 Thread Sean Owen
My $0.02 -- this isn't worthwhile. Yes, there are ML-in-SQL tools. I'm thinking of MADlib for example. I think these hold over from days when someone's only interface to a data warehouse was SQL, and so there had to be SQL-language support for invoking ML jobs. There was no programmatic

Re: Upgrade SBT to the latest

2018-08-31 Thread Sean Owen
Certainly worthwhile. I think this should target Spark 3, which should come after 2.4, which is itself already just about ready to test and release. On Fri, Aug 31, 2018 at 8:16 AM Darcy Shen wrote: > > SBT 1.x is ready for a long time. > > We may spare some time upgrading sbt for Spark. > > An

Upgrade SBT to the latest

2018-08-31 Thread Darcy Shen
SBT 1.x has been ready for a long time. We should spare some time to upgrade sbt for Spark. An umbrella JIRA, like the one for Scala 2.12, should be created.

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
BTW, I can contribute if there is already an effort going on somewhere. On Fri, Aug 31, 2018 at 3:35 PM Hemant Bhanawat wrote: > We allow our users to interact with spark cluster using SQL queries only. > That's easy for them. MLLib does not have SQL extensions and we cannot > expose it to our

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
We allow our users to interact with the Spark cluster using SQL queries only. That's easy for them. MLlib does not have SQL extensions, so we cannot expose it to our users. SQL extensions could further accelerate MLlib's adoption. See https://cloud.google.com/bigquery/docs/bigqueryml-intro. Hemant
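To make the BigQuery ML comparison concrete: a statement such as `CREATE MODEL m OPTIONS(model_type='...') AS SELECT ...` would be parsed into a structure that a planner could then map onto an MLlib estimator. The sketch below is purely illustrative — a toy regex parser under my own naming assumptions (CreateModel, parse_create_model), not an existing Spark SQL extension.

```python
# Hypothetical sketch: parse a BigQuery-ML-style CREATE MODEL statement into
# a structure a planner could hand to an MLlib estimator. Toy illustration
# only; not an existing Spark SQL extension.
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateModel:
    name: str
    model_type: str
    query: str


_CREATE_MODEL = re.compile(
    r"\s*CREATE\s+MODEL\s+(\w+)\s+"
    r"OPTIONS\s*\(\s*model_type\s*=\s*'([^']+)'\s*\)\s+"
    r"AS\s+(.+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_create_model(sql):
    """Return a CreateModel for a matching statement, else None."""
    match = _CREATE_MODEL.fullmatch(sql)
    if match is None:
        return None
    name, model_type, query = match.groups()
    return CreateModel(name=name, model_type=model_type, query=query.strip())


stmt = "CREATE MODEL churn OPTIONS(model_type='logistic_reg') AS SELECT label, f1 FROM t"
parsed = parse_create_model(stmt)
```

A real extension would of course plug into the SQL parser rather than use a regex, but the shape of the mapping — model name, estimator type, training query — is the same.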

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Ted Yu
+1 Original message From: Reynold Xin Date: 8/30/18 11:11 PM (GMT-08:00) To: Felix Cheung Cc: dev Subject: Re: SPIP: Executor Plugin (SPARK-24918) I actually had a similar use case a while ago, but not entirely the same. In my use case, Spark is already up, but I want to

TimSort bug

2018-08-31 Thread Reynold Xin
“As a byproduct of our study, we uncover a bug in the Java implementation that can cause the sorting method to fail during the execution.” http://drops.dagstuhl.de/opus/volltexte/2018/9467/ This might impact Spark since we took the Java based TimSort implementation. I have seen in the wild

Re: [DISCUSS] move away from python doctests

2018-08-31 Thread Hyukjin Kwon
IMHO, one thing we should consider before this is refactoring all the PySpark tests so that they pair up with their corresponding main code. Currently, we put all those unit tests into a few files, which makes the tests hard to follow. On Fri, Aug 31, 2018 at 2:05 PM, Felix Cheung wrote: > +1 on what Li said. > >

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Lars Francke
+1 On Fri, Aug 31, 2018 at 8:11 AM, Reynold Xin wrote: > I actually had a similar use case a while ago, but not entirely the same. > In my use case, Spark is already up, but I want to make sure all existing > (and new) executors run some specific code. Can we update the API to > support that? I

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Reynold Xin
I actually had a similar use case a while ago, but not entirely the same. In my use case, Spark is already up, but I want to make sure all existing (and new) executors run some specific code. Can we update the API to support that? I think that's doable if we split the design into two: one is the
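Reynold's suggested split — plugin code that runs both ahead of time and just in time — can be simulated in a few lines. This is a hypothetical in-memory sketch: the names echo the SPIK-24918 ExecutorPlugin idea, but the registry logic and method signatures are my own assumptions, not the proposed Spark API.

```python
# Hypothetical simulation of the "both ahead of time and just in time" idea:
# plugin init runs on every executor, whether the plugin is installed before
# or after the executor starts. All names are illustrative assumptions.


class ExecutorPlugin:
    def init(self, executor_id):
        raise NotImplementedError

    def shutdown(self, executor_id):
        raise NotImplementedError


class ExecutorRegistry:
    def __init__(self):
        self._plugins = []
        self._executors = []

    def executor_started(self, executor_id):
        # New executor: run every already-installed plugin on it.
        self._executors.append(executor_id)
        for plugin in self._plugins:
            plugin.init(executor_id)

    def install_plugin(self, plugin):
        # Late-installed plugin: run it on all existing executors too.
        self._plugins.append(plugin)
        for executor_id in self._executors:
            plugin.init(executor_id)


class CountingPlugin(ExecutorPlugin):
    def __init__(self):
        self.initialized = set()

    def init(self, executor_id):
        self.initialized.add(executor_id)

    def shutdown(self, executor_id):
        self.initialized.discard(executor_id)


registry = ExecutorRegistry()
registry.executor_started("exec-1")  # executor up before the plugin exists

plugin = CountingPlugin()
registry.install_plugin(plugin)      # "just in time": covers exec-1
registry.executor_started("exec-2")  # "ahead of time": covers exec-2
```

The two code paths in the registry correspond to the two halves of the design split Reynold describes: static installation at executor startup versus dynamic installation on an already-running cluster.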

Re: [DISCUSS] move away from python doctests

2018-08-31 Thread Felix Cheung
+1 on what Li said. And +1 on getting more coverage in unit tests. However, oftentimes we deliberately omit Python unit tests when the Python “wrapper” is trivial. This is what I've learned over the years from the previous PySpark maintainers. Admittedly, gaps are there.

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Felix Cheung
+1 From: Mridul Muralidharan Sent: Wednesday, August 29, 2018 1:27:27 PM To: dev@spark.apache.org Subject: Re: SPIP: Executor Plugin (SPARK-24918) +1 I left a couple of comments in NiharS's PR, but this is very useful to have in spark ! Regards, Mridul On Fri,

data source api v2 refactoring

2018-08-31 Thread Reynold Xin
I spent some time last week looking at the current data source v2 apis, and I thought we should be a bit more buttoned up in terms of the abstractions and the guarantees Spark provides. In particular, I feel we need the following levels of "abstractions", to fit the use cases in Spark, from batch,
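The layered abstraction this thread converges on (Table, ScanConfig, Scan, and a Stream level for streaming, per Reynold's sketch and Ryan's and Jungtaek's replies) can be illustrated with a small self-contained sketch. The class names come from the thread; the method shapes and the in-memory implementation are my own assumptions, not the actual Spark data source v2 API.

```python
# Hypothetical sketch of the layered data source v2 abstraction discussed in
# this thread. Names follow the thread; shapes are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanConfig:
    """A fixed, fully-resolved read configuration."""
    columns: tuple


class Scan:
    """One bounded read: a whole batch, or a single micro-batch."""

    def __init__(self, rows):
        self.rows = rows


class Stream:
    """A configured streaming read that hands out bounded Scans per offset
    range (micro-batch mode). Continuous mode would instead hand out one
    unbounded Scan -- the open question Jungtaek raises above."""

    def __init__(self, data):
        self._data = data

    def latest_offset(self):
        return len(self._data)

    def create_scan(self, start, end):
        return Scan(self._data[start:end])


class InMemoryTable:
    """Tiny in-memory Table to make the layering concrete."""

    def __init__(self, data):
        self._data = data

    def create_scan_config(self, columns=()):
        return ScanConfig(columns=tuple(columns))

    def create_scan(self, config):    # batch: one bounded Scan
        return Scan(list(self._data))

    def create_stream(self, config):  # streaming: a sequence of Scans
        return Stream(self._data)


table = InMemoryTable(["a", "b", "c"])
config = table.create_scan_config()
batch_rows = table.create_scan(config).rows   # ["a", "b", "c"]
stream = table.create_stream(config)
micro_rows = stream.create_scan(0, 2).rows    # ["a", "b"]
```

The key property the thread emphasizes is that a Stream is a *configured* read — the ScanConfig is fixed once — so each micro-batch Scan it produces is guaranteed to see a consistent schema and pushdown state.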