Re: Kill Spark Streaming JOB from Spark UI or Yarn

2017-08-27 Thread Matei Zaharia
The batches should all have the same application ID, so use that one. You can 
also find the application in the YARN UI to terminate it from there.
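
For reference, a rough sketch of the two usual options (not from the original 
thread; it assumes a StreamingContext named ssc and a YARN submission):

  // Option 1: stop the job from inside the application, e.g. from a shutdown
  // hook or a small control endpoint you expose. Both flags are part of the
  // public StreamingContext API.
  ssc.stop(stopSparkContext = true, stopGracefully = true)

  // Option 2: from the command line, find the single application ID that
  // covers all the batches and kill it through YARN:
  //   yarn application -list
  //   yarn application -kill <applicationId>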

Matei

> On Aug 27, 2017, at 10:27 AM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> I am new to spark streaming and not able to find an option to kill it after 
> starting spark streaming context.
> 
> The Streaming tab doesn't have an option to kill it.
> 
> The Jobs tab doesn't have an option to kill it either.
> 
> 
> 
> If scheduled on YARN, how do I kill it when spark-submit is running in the 
> background and I have no easy way to find the YARN application ID? Do the 
> batches have separate YARN application IDs or the same one?
> 
> Thanks,
> Asmath


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Matei Zaharia
+1 from me as well.

Matei

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 
> On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  wrote:
> +1 (non-binding)
> 
> 
> Looking forward to using it as part of an Apache Spark release, instead of a 
> Standalone cluster deployed on top of k8s.
> 
> 
> --
> Alex
> 
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:
> +1 (non-binding)
> 
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, Big kudos for the guys who created and continue working on this.
> 
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>  wrote:
> > From our perspective, we have invested heavily in Kubernetes as our cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Matei Zaharia
Hi everyone,

The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as 
committers. Join me in congratulating both of them and thanking them for their 
contributions to the project!

Matei
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: 
http://spark.apache.org/third-party-projects.html

Matei

> On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft  
> wrote:
> 
> There’s a number of real-world open source Spark applications in the sciences:
> 
> genomics:
> 
> github.com/bigdatagenomics/adam <— core is scala, has py/r wrappers
> https://github.com/broadinstitute/gatk <— core is java
> https://github.com/hail-is/hail <— core is scala, mostly used through python 
> wrappers
> 
> neuroscience:
> 
> https://github.com/thunder-project/thunder#using-with-spark <— pyspark
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
>> On Jul 25, 2017, at 8:09 AM, Jörn Franke  wrote:
>> 
>> Continuous integration (Travis, jenkins) and reporting on unit tests, 
>> integration tests etc for each source code version.
>> 
>> On 25. Jul 2017, at 16:58, Adaryl Wakefield  
>> wrote:
>> 
>>> ci+reporting? I’ve never heard of that term before. What is that?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>  
>>>  
>>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>>> Sent: Tuesday, July 25, 2017 8:31 AM
>>> To: Adaryl Wakefield 
>>> Cc: user@spark.apache.org
>>> Subject: Re: real world spark code
>>>  
>>> Look for the ones that have unit and integration tests as well as a 
>>> ci+reporting on code quality.
>>>  
>>> All the others are just toy examples. Well should be :)
>>> 
>>> On 25. Jul 2017, at 01:08, Adaryl Wakefield  
>>> wrote:
>>> 
>>> Anybody know of publicly available GitHub repos of real world Spark 
>>> applications written in scala?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Fwd: Testing Apache Spark with JDK 9 Early Access builds

2017-07-14 Thread Matei Zaharia
FYI, the JDK group at Oracle is reaching out to see whether anyone
wants to test with JDK 9 and give them feedback. Just contact them
directly if you'd like to.


-- Forwarded message --
From: dalibor topic 
Date: Wed, Jul 12, 2017 at 3:16 AM
Subject: Testing Apache Spark with JDK 9 Early Access builds
To: ma...@cs.stanford.edu
Cc: Rory O'Donnell 


Hi Matei,

As part of evaluating how far along various popular open source
projects are regarding testing with upcoming JDK releases, I thought
I'd reach out to you about adding your projects to the Quality
Outreach [1][2] effort that Rory (CC:ed, as the OpenJDK Quality Group
Lead) leads.

Through that effort, we're trying to encourage more community testing
of JDK Early Access (EA) builds, and assist those projects that
participate in filing, tracking and (hopefully) resolving issues they
find along the way. Currently, about 80 FOSS projects participate in
the effort.

I'm curious whether you have had a chance to test your projects with JDK 9, if
you have run into any showstopper issues, and if so, if you have filed
any issues against the JDK discovered while testing with JDK 9 (or an
earlier release).

Last but not least, I'd be curious if you'd be interested in joining
the Quality Outreach effort with your projects. Rory can fill you in
on the details of how it all works.

cheers,
dalibor topic

[1] https://wiki.openjdk.java.net/display/quality/Quality+Outreach
[2] 
https://wiki.openjdk.java.net/download/attachments/21430310/TheWisdomOfCrowdTestingOpenJDK.pdf
--
 Dalibor Topic | Principal Product Manager
Phone: +494089091214  | Mobile: +491737185961


ORACLE Deutschland B.V. & Co. KG | Kühnehöfe 5 | 22761 Hamburg

ORACLE Deutschland B.V. & Co. KG
Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Jan Schultheiss, Val Maher

 Oracle is committed to developing
practices and products that help protect the environment

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Are release docs part of a release?

2017-06-08 Thread Matei Zaharia
I agree that it seems completely fine to update the web version of the docs 
after a release. What would not be fine is updating the downloadable package 
for it without another vote (and another release number). When people voted on 
a release, they voted that we should put up that package as "spark-2.1.0.tgz" 
or whatever, which is not changing here.

At the same time though, even though docs on the website *can* be updated 
after, it might not be smart to release something until the docs *are* fully in 
sync with the new features -- otherwise users might get confused.

Matei

> On Jun 8, 2017, at 3:25 PM, Ryan Blue  wrote:
> 
> I've never thought that docs are strictly part of a release and can't be 
> updated outside of one. Javadoc jars are included in releases with jars, but 
> that's more because they are produced by the build and are tied to the source 
> code. There is plenty of other documentation that isn't normally included in 
> a release, like the project's web pages and wiki content. I think the 
> expectation is for that to be continuously updated. So my interpretation is 
> that the release artifacts in the document you're quoting from are the source 
> code and convenience binaries. There's definitely room for interpretation 
> here, but I don't think it would be a problem as long as we do something 
> reasonable.
> 
> On Tue, Jun 6, 2017 at 2:15 AM, Sean Owen  > wrote:
> That's good, but, I think we should agree on whether release docs are part of 
> a release. It's important to reasoning about releases.
> 
> To be clear, you're suggesting that, say, right now you are OK with updating 
> this page with a few more paragraphs? 
> http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html 
>   Even 
> though those paragraphs can't be in the released 2.1.0 doc source?
> 
> First, what is everyone's understanding of the answer?
> 
> The only official guidance I can find is 
> http://www.apache.org/legal/release-policy.html#distribute-other-artifacts 
>  
> , which suggests that docs need to be released similarly, with signatures. 
> Not quite the same question, but strongly implies they're treated like any 
> other source that is released with a vote.
> 
> --
> 
> WHAT ARE THE REQUIREMENTS TO DISTRIBUTE OTHER ARTIFACTS IN ADDITION TO THE 
> SOURCE PACKAGE? 
> 
> ASF releases typically contain additional material together with the source 
> package. This material may include documentation concerning the release but 
> must contain LICENSE and NOTICE files. As mentioned above, these artifacts 
> must be signed by a committer with a detached signature if they are to be 
> placed in the project's distribution directory.
> 
> Again, these artifacts may be distributed only if they contain LICENSE and 
> NOTICE files. For example, the Java artifact format is based on a compressed 
> directory structure and those projects wishing to distribute jars must place 
> LICENSE and NOTICE files in the META-INF directory within the jar.
> 
> Nothing in this section is meant to supersede the requirements defined here 
>  and here 
> 
>  that all releases be primarily based on a signed source package.
> 
> 
> On Tue, Jun 6, 2017 at 9:50 AM Nick Pentreath  > wrote:
> The website updates for ML QA (SPARK-20507) are not actually critical as the 
> project website certainly can be updated separately from the source code 
> guide and is not part of the release to be voted on. In future that 
> particular work item for the QA process could be marked down in priority, and 
> is definitely not a release blocker.
> 
> In any event I just resolved SPARK-20507, as I don't believe any website 
> updates are required for this release anyway. That fully resolves the ML QA 
> umbrella (SPARK-20499).
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: Uploading PySpark 2.1.1 to PyPi

2017-05-29 Thread Matei Zaharia
Didn't we want to upload 2.1.1 too? What is the local version string problem?

Matei

> On May 26, 2017, at 10:11 AM, Xiao Li  wrote:
> 
> Hi, Holden,
> 
> That sounds good to me! 
> 
> Thanks,
> 
> Xiao
> 
> 2017-05-23 16:32 GMT-07:00 Holden Karau  >:
> An account already exists, the PMC has the info for it. I think we will need 
> to wait for the 2.2 artifacts to do the actual PyPI upload because of the 
> local version string in 2.2.1, but rest assured this isn't something I've 
> lost track of.
> 
> On Wed, May 24, 2017 at 12:11 AM Xiao Li  > wrote:
> Hi, Holden, 
> 
> Based on the PR, https://github.com/pypa/packaging-problems/issues/90 
>  , the limit has been 
> increased to 250MB. 
> 
> Just wondering if we can publish PySpark to PyPI now? Have you created the 
> account? 
> 
> Thanks,
> 
> Xiao Li
> 
> 
> 
> 2017-05-12 11:35 GMT-07:00 Sameer Agarwal  >:
> Holden,
> 
> Thanks again for pushing this forward! Out of curiosity, did we get an 
> approval from the PyPi folks?
> 
> Regards,
> Sameer
> 
> On Mon, May 8, 2017 at 11:44 PM, Holden Karau  > wrote:
> So I have a PR to add this to the release process documentation - I'm waiting 
> on the necessary approvals from PyPi folks before I merge that incase 
> anything changes as a result of the discussion (like uploading to the legacy 
> host or something). As for conda-forge, it's not something we need to do, but 
> I'll add a note about pinging them when we make a new release so their users 
> can keep up to date easily. The parent JIRA for PyPi related tasks is 
> SPARK-18267 :)
> 
> 
> On Mon, May 8, 2017 at 6:22 PM cloud0fan  > wrote:
> Hi Holden,
> 
> Thanks for working on it! Do we have a JIRA ticket to track this? We should
> make it part of the release process in all the following Spark releases, and
> it will be great if we have a JIRA ticket to record the detailed steps of
> doing this and even automate it.
> 
> Thanks,
> Wenchen
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Uploading-PySpark-2-1-1-to-PyPi-tp21531p21532.html
>  
> 
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag 
> -- 
> Cell : 425-233-8271 
> Twitter: https://twitter.com/holdenkarau 



Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Matei Zaharia
More specifically, many user applications that link to Spark also linked to 
Akka as a library (e.g. say you want to write a service that receives requests 
from Akka and runs them on Spark). In that case, you'd have two conflicting 
versions of the Akka library in the same JVM.
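
As a purely hypothetical illustration (the artifact names are real, the 
versions and scenario are just examples), a pre-2.0 build.sbt could end up 
like this, with Spark bringing in the Akka it used for RPC while the 
application pulled in its own:

  libraryDependencies ++= Seq(
    // Spark 1.x depended on its own Akka build for RPC
    "org.apache.spark"  %% "spark-core" % "1.6.3",
    // the application's service layer wants a newer Akka -- now two versions
    // of the same library have to coexist in one JVM
    "com.typesafe.akka" %% "akka-actor" % "2.4.12"
  )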

Matei

> On May 7, 2017, at 2:24 PM, Mark Hamstra  wrote:
> 
> The point is that Spark's prior usage of Akka was limited enough that it 
> could fairly easily be removed entirely instead of forcing particular 
> architectural decisions on Spark's users.
> 
> On Sun, May 7, 2017 at 1:14 PM, geoHeil  > wrote:
> Thank you! 
> In the issue they outline that hard wired dependencies were the problem.
> But wouldn't one want to not directly accept the messages from an actor but 
> have Kafka as a failsafe intermediary?
> 
> zero323 [via Apache Spark Developers List] <[hidden email] 
> > schrieb am So., 7. Mai 
> 2017 um 21:17 Uhr:
> https://issues.apache.org/jira/browse/SPARK-5293 
> 
> 
> 
> On 05/07/2017 08:59 PM, geoHeil wrote:
> 
> > Hi, 
> > 
> > I am curious why spark (with 2.0 completely) removed any akka dependencies 
> > for RPC and switched entirely to (as far as I know) Netty.
> > 
> > regards, 
> > Georg 
> > 
> > 
> > 
> > -- 
> > View this message in context: 
> > http://apache-spark-developers-list.1001551.n3.nabble.com/Why-did-spark-switch-from-AKKA-to-net-tp21522.html
> >  
> > 
> > Sent from the Apache Spark Developers List mailing list archive at 
> > Nabble.com. 
> > 
> > -
> > To unsubscribe e-mail: [hidden email] 
> >  
> >
> 
> 
> - 
> To unsubscribe e-mail: [hidden email] 
>  
> 
> 
> 
> 



Re: SPIP docs are live

2017-03-16 Thread Matei Zaharia
Yup, thanks everyone and Cody in particular for putting this together. I think 
it will help a lot.

Matei

> On Mar 16, 2017, at 1:57 PM, Joseph Bradley  wrote:
> 
> Awesome!  Thanks for pushing this through, Cody.
> Joseph
> 
> On Sun, Mar 12, 2017 at 1:18 AM, Sean Owen  > wrote:
> http://spark.apache.org/improvement-proposals.html 
> 
> 
> (Thanks Cody!)
> 
> We should use this process where appropriate now, and we can refine it 
> further if needed.
> 
> 
> 
> -- 
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
>  



Re: Handling questions in the mailing lists

2016-11-06 Thread Matei Zaharia
Even for the mailing list, I'd love to have a short set of instructions on how 
to submit your questions (maybe on http://spark.apache.org/community.html or 
maybe in the welcome email when you subscribe). It would be great if someone 
added that. After all, we have such instructions for contributing PRs, for 
example.

Matei

> On Nov 6, 2016, at 11:09 PM, assaf.mendelson  wrote:
> 
> There are other options as well. For example hosting an answerhub 
> (www.answerhub.com) or other similar separate Q&A 
> service.
> 
> BTW, I believe the main issue is not how opinionated people are but who is 
> answering questions.
> 
> Today there are already people asking (and getting answers) on SO (including 
> myself). The problem is that many people do not go to SO.
> 
> The problem I see is how to “bump” up questions which are not being answered 
> to someone more likely to be able to answer them. Simple questions can be 
> answered by many people, many of them even newbies who ran into the issue 
> themselves.
> 
> The main issue is that the more complex the question, the less people there 
> are who can answer it and those people’s bandwidth is already clogged by 
> other questions.
> 
> We could for example try to create tags on SO for “basic questions”, 
> “medium”, “advanced”. Provide guidelines to ask first on basic, if not 
> answered after X days then add the medium tag etc. Downvote people who don’t 
> go by the process. This would mean that committers for example can look at 
> advanced only tag and have a manageable number of questions they can help 
> with while others can answer medium and basic.
> 
>  
> 
> I agree that some things are not good for SO. Basically stuff which asks for 
> opinion is such but most cases in the mailing list are either “how do I solve 
> this bug” or “how do I do X”. Either of those two are good for SO.
> 
>  
> 
>  
> 
> Assaf.
> 
>  
> 
>  
> 
>  
> 
> From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden email] 
> ] 
> Sent: Monday, November 07, 2016 8:33 AM
> To: Mendelson, Assaf
> Subject: Re: Handling questions in the mailing lists
> 
>  
> 
> This is an excellent point. If we do go ahead and feature SO as a way for 
> users to ask questions more prominently, as someone who knows SO very well, 
> would you be willing to help write a short guideline (ideally the shorter the 
> better, which makes it hard) to direct what goes to user@ and what goes to SO?
> 
>  
> 
>  
> 
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz <[hidden email] 
> > wrote:
> 
> Damn, I always thought that mailing list is only for nice and welcoming 
> people and there is nothing to do for me here >:)
> 
> To be serious though, there are many questions on the users list which would 
> fit just fine on SO but it is not true in general. There are dozens of 
> questions which are to broad, opinion based, ask for external resources and 
> so on. If you want to direct users to SO you have to help them to decide if 
> it is the right channel. Otherwise it will just create a really bad 
> experience for both seeking help and active answerers. Former ones will be 
> downvoted and bashed, latter ones will have to deal with handling all the 
> junk and the number of active Spark users with moderation privileges is 
> really low (with only Massg and me being able to directly close duplicates).
> 
> Believe me, I've seen this before.
> 
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
> 
> You have substantially underestimated how opinionated people can be on 
> mailing lists too :)
> 
> On Sunday, November 6, 2016, Maciej Szymkiewicz <[hidden email] 
> > wrote:
> 
> You have to remember that Stack Overflow crowd (like me) is highly 
> opinionated, so many questions, which could be just fine on the mailing list, 
> will be quickly downvoted and / or closed as off-topic. Just saying...
> 
> -- 
> Best, 
> Maciej
>  
> 
> On 11/07/2016 04:03 AM, Reynold Xin wrote:
> 
> OK I've checked on the ASF member list (which is private so there is no 
> public archive).
> 
>  
> 
> It is not against any ASF rule to recommend StackOverflow as a place for 
> users to ask questions. I don't think we can or should delete the existing 
> user@spark list either, but we can certainly make SO more visible than it is.
> 
>  
> 
>  
> 
>  
> 
> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin <[hidden email] 
> > wrote:
> 
> Actually after talking with more ASF members, I believe the only policy is 
> that development decisions have to be made and announced on ASF properties 
> (dev list or jira), but user questions don't have to. 
> 
>  
> 
> I'm going to double check this. If it is true, I would actually recommend us 
> moving entirely over the Q part of the user list to stackoverflow, or at 
> least make that the recommended way rather than the existing user list which 
> is not very scalable. 
> 
> 
> 
> On Wednesday, November 2, 2016, Nicholas Chammas <[hidden 

Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current 
release candidate: 
https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E
 . You can try that right now if you want from the staging Maven repo shown 
there. The vote looks likely to pass so an actual release should hopefully also 
be out soon.
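
For what it's worth, here is a minimal sketch of what the new source looks 
like against the 2.0.2 release candidate, run from spark-shell (broker and 
topic names are placeholders):

  // spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2
  val lines = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")  // placeholder brokers
    .option("subscribe", "my-topic")                   // placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  // Write each micro-batch to the console just to see data flowing.
  val query = lines.writeStream
    .format("console")
    .outputMode("append")
    .start()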

Matei

> On Nov 6, 2016, at 5:25 PM, shyla deshpande  wrote:
> 
> Hi Jaya!
> 
> Thanks for the reply. Structured streaming works fine for me with the socket 
> text stream. I think structured streaming with a Kafka source is not yet supported.
> 
> Please if anyone has got it working with kafka source, please provide me some 
> sample code or direction.
> 
> Thanks
> 
> 
> On Sun, Nov 6, 2016 at 5:17 PM, Jayaradha Natarajan  > wrote:
> Shyla!
> 
> Check
> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>  
> 
> 
> Thanks,
> Jayaradha
> 
> On Sun, Nov 6, 2016 at 5:13 PM, shyla  > wrote:
> I am trying to do Structured Streaming with Kafka Source. Please let me know
> where I can find some sample code for this. Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Kafka-Source-does-it-work-tp19748.html
>  
> 
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 



Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Matei Zaharia
It might be useful to ask Apache Infra whether they have any information on 
these (e.g. what do their own spam metrics say, do they get any feedback from 
Google, etc). Unfortunately mailing lists seem to be less and less well 
supported by most email providers.

Matei

> On Nov 2, 2016, at 6:48 AM, Pete Robbins  wrote:
> 
> I have gmail filters to add labels and skip inbox for anything sent to 
> dev@spark user@spark etc but still get the occasional message marked as spam
> 
> 
> On Wed, 2 Nov 2016 at 08:18 Sean Owen  > wrote:
> I couldn't figure out why I was missing a lot of dev@ announcements, and have 
> just realized hundreds of messages to dev@ over the past month or so have 
> been marked as spam for me by Gmail. I have no idea why but it's usually 
> messages from Michael and Reynold, but not all of them. I'll see replies to 
> the messages but not the original. Who knows. I can make a filter. I just 
> wanted to give a heads up in case anyone else has been silently missing a lot 
> of messages.



Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Matei Zaharia
Just to comment on this, I'm generally against removing these types of things 
unless they create a substantial burden on project contributors. It doesn't 
sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might, but then of 
course we need to wait for 2.12 to be out and stable.

In general, this type of stuff only hurts users, and doesn't have a huge impact 
on Spark contributors' productivity (sure, it's a bit unpleasant, but that's 
life). If we break compatibility this way too quickly, we fragment the user 
community, and then either people have a crappy experience with Spark because 
their corporate IT doesn't yet have an environment that can run the latest 
version, or worse, they create more maintenance burden for us because they ask 
for more patches to be backported to old Spark versions (1.6.x, 2.0.x, etc). 
Python in particular is pretty fundamental to many Linux distros.

In the future, rather than just looking at when some software came out, it may 
be good to have some criteria for when to drop support for something. For 
example, if there are really nice libraries in Python 2.7 or Java 8 that we're 
missing out on, that may be a good reason. The maintenance burden for multiple 
Scala versions is definitely painful but I also think we should always support 
the latest two Scala releases.

Matei

> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
> 
> I created a JIRA ticket to track this: 
> https://issues.apache.org/jira/browse/SPARK-18138 
> 
> 
> 
> 
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran  > wrote:
> 
>> On 27 Oct 2016, at 10:03, Sean Owen > > wrote:
>> 
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like to 
>> add that to a list of things that will begin to be unsupported 6 months from 
>> now.
>> 
> 
> If you go to java 8 only, then hadoop 2.6+ is mandatory. 
> 
> 
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers > > wrote:
>> that sounds good to me
>> 
>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin > > wrote:
>> We can do the following concrete proposal:
>> 
>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr 
>> 2017).
>> 
>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the 
>> deprecation of Java 7 / Scala 2.10 support.
>> 
>> (a) It should appear in release notes, documentations that mention how to 
>> build Spark
>> 
>> (b) and a warning should be shown every time SparkContext is started using 
>> Scala 2.10 or Java 7.
>> 
> 
> 



Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Both Spark Streaming and Structured Streaming preserve locality for operator 
state actually. They only reshuffle state if a cluster node fails or if the 
load becomes heavily imbalanced and it's better to launch a task on another 
node and load the state remotely.
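
To make "operator state" concrete, here is a small sketch (not from this 
thread) of per-key state in the DStream API, assuming events is a 
DStream[(String, Int)]; the state for each key normally stays with the 
partition that owns it, as described above:

  import org.apache.spark.streaming.{State, StateSpec}

  // Keep a running count per key in the streaming state store.
  val updateCount = (key: String, value: Option[Int], state: State[Long]) => {
    val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }

  val counts = events.mapWithState(StateSpec.function(updateCount))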

Matei

> On Oct 19, 2016, at 9:38 PM, Abhishek R. Singh 
> <abhis...@tetrationanalytics.com> wrote:
> 
> It's not so much about latency, actually. The bigger rub for me is that the 
> state has to be reshuffled every micro/mini-batch (unless I am not 
> understanding it right - spark 2.0 state model i.e.).
> 
> Operator model avoids it by preserving state locality. Event time processing 
> and state purging are the other essentials (which are thankfully getting 
> addressed).
> 
> Any guidance on (timelines for) expected exit from alpha state would also be 
> greatly appreciated.
> 
> -Abhishek-
> 
>> On Oct 19, 2016, at 5:36 PM, Matei Zaharia <matei.zaha...@gmail.com 
>> <mailto:matei.zaha...@gmail.com>> wrote:
>> 
>> I'm also curious whether there are concerns other than latency with the way 
>> stuff executes in Structured Streaming (now that the time steps don't have 
>> to act as triggers), as well as what latency people want for various apps.
>> 
>> The stateful operator designs for streaming systems aren't inherently 
>> "better" than micro-batching -- they lose a lot of stuff that is possible in 
>> Spark, such as load balancing work dynamically across nodes, speculative 
>> execution for stragglers, scaling clusters up and down elastically, etc. 
>> Moreover, Spark itself could execute the current model with much lower 
>> latency. The question is just what combinations of latency, throughput, 
>> fault recovery, etc to target.
>> 
>> Matei
>> 
>>> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com 
>>> <mailto:amitsel...@gmail.com>> wrote:
>>> 
>>> 
>>> 
>>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman 
>>> <shiva...@eecs.berkeley.edu <mailto:shiva...@eecs.berkeley.edu>> wrote:
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark. However we haven't yet
>>> figure out all the parts in integrating this with the rest of
>>> structured streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>> 
>>> On a related note - are there other problems users face with
>>> micro-batch other than latency ?
>>> I think that the fact that they serve as an output trigger is a problem, 
>>> but Structured Streaming seems to resolve this now.  
>>> 
>>> Thanks
>>> Shivaram
>>> 
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com <mailto:mich...@databricks.com>> wrote:
>>> > I know people are seriously thinking about latency.  So far that has not
>>> > been the limiting factor in the users I've been working with.
>>> >
>>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org 
>>> > <mailto:c...@koeninger.org>> wrote:
>>> >>
>>> >> Is anyone seriously thinking about alternatives to microbatches?
>>> >>
>>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>> >> <mich...@databricks.com <mailto:mich...@databricks.com>> wrote:
>>> >> > Anything that is actively being designed should be in JIRA, and it 
>>> >> > seems
>>> >> > like you found most of it.  In general, release windows can be found on
>>> >> > the
>>> >> > wiki.
>>> >> >
>>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>>> >> > mentioned.
>>> >> > It may also include some of the following.
>>> >> >
>>> >> > The items I'd like to start thinking about next are:
>>> >> >  - Evicting state from the store based on event time watermarks
>>> >> >  - Sessionization (grouping together related events by key / eventTime)
>>> >> >  - Improvements to the query planner (remove some of the restrictions 
>>> >> > on
>>> >> > what queries can be run).
>>> >> >
>>> >> > This is roughly in order based on what I've been hearing users hit the most.

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Yeah, as Shivaram pointed out, there have been research projects that looked at 
it. Also, Structured Streaming was explicitly designed to not make 
microbatching part of the API or part of the output behavior (tying triggers to 
it). However, when people begin working on that is a function of demand 
relative to other features. I don't think we can commit to one plan before 
exploring more options, but basically there is Shivaram's project, which adds a 
few new concepts to the scheduler, and there's the option to reduce control 
plane latency in the current system, which hasn't been heavily optimized yet 
but should be doable (lots of systems can handle 10,000s of RPCs per second).

Matei

> On Oct 19, 2016, at 9:20 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> I don't think it's just about what to target - if you could target 1ms 
> batches, without harming 1 second or 1 minute batches why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it. If 
>  scheduling batches at an order of magnitude or two lower latency is the 
> strategy, and that's actually feasible, that's great. But I haven't seen that 
> clear direction, and this is by no means a recent issue.
> 
> 
> On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> I'm also curious whether there are concerns other than latency with the way 
> stuff executes in Structured Streaming (now that the time steps don't have to 
> act as triggers), as well as what latency people want for various apps.
> 
> The stateful operator designs for streaming systems aren't inherently 
> "better" than micro-batching -- they lose a lot of stuff that is possible in 
> Spark, such as load balancing work dynamically across nodes, speculative 
> execution for stragglers, scaling clusters up and down elastically, etc. 
> Moreover, Spark itself could execute the current model with much lower 
> latency. The question is just what combinations of latency, throughput, fault 
> recovery, etc to target.
> 
> Matei
> 
>> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com 
>> <mailto:amitsel...@gmail.com>> wrote:
>> 
>> 
>> 
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman 
>> <shiva...@eecs.berkeley.edu <mailto:shiva...@eecs.berkeley.edu>> wrote:
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figure out all the parts in integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>> 
>> On a related note - are there other problems users face with
>> micro-batch other than latency ?
>> I think that the fact that they serve as an output trigger is a problem, but 
>> Structured Streaming seems to resolve this now.  
>> 
>> Thanks
>> Shivaram
>> 
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>> <mich...@databricks.com <mailto:mich...@databricks.com>> wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org 
>> > <mailto:c...@koeninger.org>> wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >> <mich...@databricks.com <mailto:mich...@databricks.com>> wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it seems
>> >> > like you found most of it.  In general, release windows can be found on
>> >> > the
>> >> > wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>> >> > mentioned.
>> >> > It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key / eventTime)
>> >> >  - Improvements to the query planner (remove some of the restrictions on
>> >> > what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit the most.

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
I'm also curious whether there are concerns other than latency with the way 
stuff executes in Structured Streaming (now that the time steps don't have to 
act as triggers), as well as what latency people want for various apps.

The stateful operator designs for streaming systems aren't inherently "better" 
than micro-batching -- they lose a lot of stuff that is possible in Spark, such 
as load balancing work dynamically across nodes, speculative execution for 
stragglers, scaling clusters up and down elastically, etc. Moreover, Spark 
itself could execute the current model with much lower latency. The question is 
just what combinations of latency, throughput, fault recovery, etc to target.

Matei

> On Oct 19, 2016, at 2:18 PM, Amit Sela  wrote:
> 
> 
> 
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman 
> > wrote:
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figure out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
> 
> On a related note - are there other problems users face with
> micro-batch other than latency ?
> I think that the fact that they serve as an output trigger is a problem, but 
> Structured Streaming seems to resolve this now.  
> 
> Thanks
> Shivaram
> 
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
> > wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger  > > wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >> > wrote:
> >> > Anything that is actively being designed should be in JIRA, and it seems
> >> > like you found most of it.  In general, release windows can be found on
> >> > the
> >> > wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key / eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions on
> >> > what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor  >> > >
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information of what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when / where
> >> >> to
> >> >> start using it more seriously (or where to invest in workarounds and
> >> >> where
> >> >> to wait). I couldn't find a good place where such planning discussed
> >> >> for 2.1
> >> >> (like, for example ML and SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >>
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations
> >> >>  
> >> >> ),
> >> >> like no support for multiple aggregations levels, joins are strictly to
> >> >> a
> >> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> >> >> (like
> >> >> no sink for interactive queries) etc etc
> >> >> I'm also aware of some changes that have landed in master, like the new
> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> >> metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded
> >> >> with a preference to a four-month cycles, with likely code freeze
> >> >> pretty
> >> >> soon (end of October). So I believe the scope for 2.1 should likely
> >> >> quite
> >> >> clear to some, and that 2.2 planning should likely be starting about
> >> >> now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286  | Email: 
> >> >> ofir.ma...@equalum.io 
> >> >
> >> >
> >
> >
> 
> 

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-18 Thread Matei Zaharia
Is there any way to tie wiki accounts with JIRA accounts? I found it weird that 
they're not tied at the ASF.

Otherwise, moving this into the docs might make sense.

Matei

> On Oct 18, 2016, at 6:19 AM, Cody Koeninger  wrote:
> 
> +1 to putting docs in one clear place.
> 
> On Oct 18, 2016 6:40 AM, "Sean Owen"  > wrote:
> I'm OK with that. The upside to the wiki is that it can be edited directly 
> outside of a release cycle. However, in practice I find that the wiki is 
> rarely changed. To me it also serves as a place for information that isn't 
> exactly project documentation like "powered by" listings.
> 
> In a way I'd like to get rid of the wiki to have one less place for docs, 
> that doesn't have the same accessibility (I don't know who can give edit 
> access), and doesn't have a review process.
> 
> For now I'd settle for bringing over a few key docs like the one you mention. 
> I spent a little time a while ago removing some duplication across the wiki 
> and project docs and think there's a bit more than could be done.
> 
> 
> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau  > wrote:
> Right now the wiki isn't particularly accessible to updates by external 
> contributors. We've already got a contributing to spark page which just links 
> to the wiki - how about if we just move the wiki contents over? This way 
> contributors can contribute to our documentation about how to contribute 
> probably helping clear up points of confusion for new contributors which the 
> rest of us may be blind to.
> 
> If we do this we would probably want to update the wiki page to point to the 
> documentation generated from markdown. It would also mean that the results of 
> any update to the contributing guide take a full release cycle to be visible. 
> Another alternative would be opening up the wiki to a broader set of people.
> 
> I know a lot of people are probably getting ready for Spark Summit EU (and I 
> hope to catch up with some of y'all there) but I figured this a relatively 
> minor proposal.
> -- 
> Cell : 425-233-8271 
> Twitter: https://twitter.com/holdenkarau 


Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Well, I think there are a few things here that don't make sense. First, why 
should only committers submit SIPs? Development in the project should be open 
to all contributors, whether they're committers or not. Second, I think 
unrealistic goals can be found just by inspecting the goals, and I'm not super 
worried that we'll accept a lot of SIPs that are then infeasible -- we can then 
submit new ones. But this depends on whether you want this process to be a 
"design doc lite", where people also agree on implementation strategy, or just 
a way to agree on goals. This is what I asked earlier about PRDs vs design docs 
(and I'm open to either one but I'd just like clarity). Finally, both as a user 
and designer of software, I always want to give feedback on APIs, so I'd really 
like a culture of having those early. People don't argue about prettiness when 
they discuss APIs, they argue about the core concepts to expose in order to 
meet various goals, and then they're stuck maintaining those for a long time.

Matei

> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Users instead of people, sure.  Committers and contributors are (or at least 
> should be) a subset of users.
> 
> Non goals, sure. I don't care what the name is, but we need to clearly say 
> e.g. 'no we are not maintaining compatibility with XYZ right now'.
> 
> API, what I care most about is whether it allows me to accomplish the goals. 
> Arguing about how ugly or pretty it is can be saved for design/ 
> implementation imho.
> 
> Strategy, this is necessary because otherwise goals can be out of line with 
> reality.  Don't propose goals you don't have at least some idea of how to 
> implement.
> 
> Rejected strategies, given that committers are the only ones I'm saying should 
> formally submit SPARKLIs or SIPs, if they put junk in a required section then 
> slap them down for it and tell them to fix it.
> 
> 
> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> Yup, this is the stuff that I found unclear. Thanks for clarifying here, but 
> we should also clarify it in the writeup. In particular:
> 
> - Goals needs to be about user-facing behavior ("people" is broad)
> 
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up one 
> of these and say "Spark's developers have officially rejected X, which our 
> awesome system has".
> 
> - For user-facing stuff, I think you need a section on API. Virtually all 
> other *IPs I've seen have that.
> 
> - I'm still not sure why the strategy section is needed if the purpose is to 
> define user-facing behavior -- unless this is the strategy for setting the 
> goals or for defining the API. That sounds squarely like a design doc issue. 
> In some sense, who cares whether the proposal is technically feasible right 
> now? If it's infeasible, that will be discovered later during design and 
> implementation. Same thing with rejected strategies -- listing some of those 
> is definitely useful sometimes, but if you make this a *required* section, 
> people are just going to fill it in with bogus stuff (I've seen this happen 
> before).
> 
> Matei
> 
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org 
> > <mailto:c...@koeninger.org>> wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >  
> > <https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md>
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these aren't handcuffs.

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Yup, this is the stuff that I found unclear. Thanks for clarifying here, but we 
should also clarify it in the writeup. In particular:

- Goals needs to be about user-facing behavior ("people" is broad)

- I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up one of 
these and say "Spark's developers have officially rejected X, which our awesome 
system has".

- For user-facing stuff, I think you need a section on API. Virtually all other 
*IPs I've seen have that.

- I'm still not sure why the strategy section is needed if the purpose is to 
define user-facing behavior -- unless this is the strategy for setting the 
goals or for defining the API. That sounds squarely like a design doc issue. In 
some sense, who cares whether the proposal is technically feasible right now? 
If it's infeasible, that will be discovered later during design and 
implementation. Same thing with rejected strategies -- listing some of those is 
definitely useful sometimes, but if you make this a *required* section, people 
are just going to fill it in with bogus stuff (I've seen this happen before).

Matei

> On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> So to focus the discussion on the specific strategy I'm suggesting,
> documented at
> 
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> 
> "Goals: What must this allow people to do, that they can't currently?"
> 
> Is it unclear that this is focusing specifically on people-visible behavior?
> 
> Rejected goals -  are important because otherwise people keep trying
> to argue about scope.  Of course you can change things later with a
> different SIP and different vote, the point is to focus.
> 
> Use cases - are something that people are going to bring up in
> discussion.  If they aren't clearly documented as a goal ("This must
> allow me to connect using SSL"), they should be added.
> 
> Internal architecture - if the people who need specific behavior are
> implementers of other parts of the system, that's fine.
> 
> Rejected strategies - If you have none of these, you have no evidence
> that the proponent didn't just go with the first thing they had in
> mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these
> aren't handcuffs.  The goals are the contract, the strategy is
> evidence that contract can actually be met.
> 
> Design docs - I'm not touching design docs.  The markdown file I
> linked specifically says of the strategy section "This is not a full
> design document."  Is this unclear?  Design docs can be worked on
> obviously, but that's not what I'm concerned with here.
> 
> 
> 
> 
> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Hi Cody,
>> 
>> I think this would be a lot more concrete if we had a more detailed template
>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are  they
>> a way to solicit feedback on the user-facing behavior or on the internals?
>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>> opposed to how.
>> 
>> In particular, here are some things that you may or may not consider in
>> scope for SIPs:
>> 
>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>> user-visible behavior (e.g. "system supports SQL window functions" or
>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>> goals" because some of them might become goals later, so we're not
>> definitively rejecting them.
>> 
>> - Public API: Probably should be included in most SIPs unless it's too large
>> to fully specify then (e.g. "let's add an ML library").
>> 
>> - Use cases: I usually find this very useful in PRDs to better communicate
>> the goals.
>> 
>> - Internal architecture: This is usually *not* a thing users can easily
>> comment on and it sounds more like a design doc item. Of course it's
>> important to show that the SIP is feasible to implement. One exception,
>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>> if somebody wants to refactor Spark's query optimizer or something).
>> 
>> - Rejected strategies: I personally wouldn't put this, because what's the
>> point of voting to reject a strategy before you've really begun designing
>> and implementing something? What if you discover that the strategy is
>> actually better when you start doing stuff?

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Yup, but the example you gave is for alternatives about *user-facing behavior*, 
not implementation. The current SIP doc describes "strategy" more as 
implementation strategy. I'm just saying there are different possible goals for 
these types of docs.

BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also 
require a reference implementation. This is a bit different from what Cody had 
in mind, I think.

Matei

> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> Rejected strategies: I personally wouldn’t put this, because what’s the point 
> of voting to reject a strategy before you’ve really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
> I would guess the point is to document alternatives that were discussed and 
> rejected, so that later on people can be pointed to that discussion and the 
> devs don’t have to repeat themselves unnecessarily every time someone comes 
> along and asks “Why didn’t you do this other thing?” That doesn’t mean a 
> rejected proposal can’t later be revisited and the SIP can’t be updated.
> 
> For reference from the Python community, PEP 492 
> <https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement Proposal 
> for adding async and await syntax and “first-class” coroutines to Python, has 
> a section on rejected ideas 
> <https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new syntax. 
> It captures a summary of what the devs discussed, but it doesn’t mean the PEP 
> can’t be updated and a previously rejected proposal can’t be revived.
> 
> At least in the Python community, a PEP serves not just as formal starting 
> point for a proposal (the “real” starting point is usually a discussion on 
> python-ideas or python-dev), but also as documentation of what was agreed on 
> and a living “spec” of sorts. So PEPs sometimes get updated years after they 
> are approved when revisions are agreed upon. PEPs are also intended for wide 
> consumption, vs. bug tracker issues which the broader Python dev community 
> are not expected to follow closely.
> 
> Dunno if we want to follow a similar pattern for Spark, since the project’s 
> needs are different. But the Python community has used PEPs to help organize 
> and steer development since 2000; there are plenty of examples there we can 
> probably take inspiration from.
> 
> By the way, can we call these things something other than Spark Improvement 
> Proposals? The acronym, SIP, conflicts with Scala SIPs 
> <http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark 
> communities have a lot of overlap, we don’t want, for example, names like 
> “SIP-10” to have an ambiguous meaning.
> 
> Nick
> 
> 
> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> Hi Cody,
> 
> I think this would be a lot more concrete if we had a more detailed template 
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are  they a 
> way to solicit feedback on the user-facing behavior or on the internals? 
> "Goals" can cover both things. I've been thinking of SIPs more as Product 
> Requirements Docs (PRDs), which focus on *what* a code change should do as 
> opposed to how.
> 
> In particular, here are some things that you may or may not consider in scope 
> for SIPs:
> 
> - Goals and non-goals: This is definitely in scope, and IMO should focus on 
> user-visible behavior (e.g. "system supports SQL window functions" or "system 
> continues working if one node fails"). BTW I wouldn't say "rejected goals" 
> because some of them might become goals later, so we're not definitively 
> rejecting them.
> 
> - Public API: Probably should be included in most SIPs unless it's too large 
> to fully specify then (e.g. "let's add an ML library").
> 
> - Use cases: I usually find this very useful in PRDs to better communicate 
> the goals.
> 
> - Internal architecture: This is usually *not* a thing users can easily 
> comment on and it sounds more like a design doc item. Of course it's 
> important to show that the SIP is feasible to implement. One exception, 
> however, is that I think we'll have some SIPs primarily on internals (e.g. if 
> somebody wants to refactor Spark's query optimizer or something).
> 
> - Rejected strategies: I personally wouldn't put this, because what's the 
> point of voting to reject a strategy before you've really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
>

Re: Spark Improvement Proposals

2016-10-08 Thread Matei Zaharia
Sounds good. Just to comment on the compatibility part:

> I meant changing public user interfaces.  I think the first design is
> unlikely to be right, because it's done at a time when you have the
> least information.  As a user, I find it considerably more frustrating
> to be unable to use a tool to get my job done, than I do having to
> make minor changes to my code in order to take advantage of features.
> I've seen committers be seriously reluctant to allow changes to
> @experimental code that are needed in order for it to really work
> right.  You need to be able to iterate, and if people on both sides of
> the fence aren't going to respect that some newer apis are subject to
> change, then why even mark them as such?
> 
> Ideally a finished SIP should give me a checklist of things that an
> implementation must do, and things that it doesn't need to do.
> Contributors/committers should be seriously discouraged from putting
> out a version 0.1 that doesn't have at least a prototype
> implementation of all those things, especially if they're then going
> to argue against interface changes necessary to get the rest of
> the things done in the 0.2 version.

Experimental APIs and alpha components are indeed supposed to be changeable 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy). 
Maybe people are being too conservative in some cases, but I do want to note 
that regardless of what precise policy we try to write down, this type of issue 
will ultimately be a judgment call. Is it worth making a small cosmetic change 
in an API that's marked experimental, but has been used widely for a year? 
Perhaps not. Is it worth making it in something one month old, or even in an 
older API as we move to 2.0? Maybe yes. I think we should just discuss each one 
(start an email thread if resolving it on JIRA is too complex) and perhaps be 
more religious about making things non-experimental when we think they're done.

Matei


> 
> 
> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>> I like the lightweight proposal to add a SIP label.
>> 
>> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> track the list of major changes, but that never really materialized due to
>> the overhead. Adding a SIP label on major JIRAs and then link to them
>> prominently on the Spark website makes a lot of sense.
>> 
>> 
>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>> 
>>> For the improvement proposals, I think one major point was to make them
>>> really visible to users who are not contributors, so we should do more than
>>> sending stuff to dev@. One very lightweight idea is to have a new type of
>>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>>> http://spark.apache.org. I also like the idea of SIP and design doc
>>> templates (in fact many projects have them).
>>> 
>>> Matei
>>> 
>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <r...@databricks.com> wrote:
>>> 
>>> I called Cody last night and talked about some of the topics in his email.
>>> It became clear to me Cody genuinely cares about the project.
>>> 
>>> Some of the frustrations come from the success of the project itself
>>> becoming very "hot", and it is difficult to get clarity from people who
>>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>>> to scaling an engineering team in a successful startup: old processes that
>>> worked well might not work so well when it gets to a certain size, cultures
>>> can get diluted, building culture vs building process, etc.
>>> 
>>> I also really like to have a more visible process for larger changes,
>>> especially major user facing API changes. Historically we upload design docs
>>> for major changes, but it is not always consistent, and it is difficult to ensure
>>> the quality of the docs, due to the volunteering nature of the organization.
>>> 
>>> Some of the more concrete ideas we discussed focus on building a culture
>>> to improve clarity:
>>> 
>>> - Process: Large changes should have design docs posted on JIRA. One thing
>>> Cody and I didn't discuss but an idea that just came to me is we should
>>> create a design doc template for the project and ask everybody to follow.
>>> The design doc template should also explicitly list goals and non-goals, to
>>> make design doc more consistent.
>>> 
>>> - Process: Email dev@ to solicit feedback. We have done this with some
>>> changes, but again very inconsistent.

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-08 Thread Matei Zaharia
This makes a lot of sense; just to comment on a few things:

> - More committers
> Just looking at the ratio of committers to open tickets, or committers
> to contributors, I don't think you have enough human power.
> I realize this is a touchy issue.  I don't have dog in this fight,
> because I'm not on either coast nor in a big company that views
> committership as a political thing.  I just think you need more people
> to do the work, and more diversity of viewpoint.
> It's unfortunate that the Apache governance process involves giving
> someone all the keys or none of the keys, but until someone really
> starts screwing up, I think it's better to err on the side of
> accepting hard-working people.

This is something the PMC is actively discussing. Historically, we've added 
committers when people contributed a new module or feature, basically to the 
point where other developers are asking them to review changes in that area 
(https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-BecomingaCommitter).
 For example, we added the original authors of GraphX when we merged in GraphX, 
the authors of new ML algorithms, etc. However, there's a good argument that 
some areas are simply not covered well now and we should add people there. 
Also, as the project has grown, there are also more people who focus on smaller 
fixes and are nonetheless contributing a lot.

> - Each major area of the code needs at least one person who cares
> about it that is empowered with a vote, otherwise decisions get made
> that don't make technical sense.
> I don't know if anyone with a vote is shepherding GraphX (or maybe
> it's just dead), the Mesos relationship has always been weird, no one
> with a vote really groks Kafka.
> marmbrus and zsxwing are getting there quickly on the Kafka side, and
> I appreciate it, but it's been bad for a while.
> Because I don't have any political power, my response to seeing things
> that I know are technically dangerous has been to yell really loud
> until someone listens, which sucks for everyone involved.
> I already apologized to Michael privately; Ryan, I'm sorry, it's not about 
> you.
> This seems pretty straightforward to fix, if politically awkward:
> those people exist, just give them a vote.
> Failing that, listen the first or second time they say something not
> the third or fourth, and if it doesn't make sense, ask.

Just as a note here -- it's true that some areas are not super well covered, 
but I also hope to avoid a situation where people have to yell to be listened 
to. I can't say anything about *all* technical discussions we've ever had, but 
historically, people have been able to comment on the design of many things 
without yelling. This is actually important because a culture of having to yell 
can drive away contributors. So it's awesome that you yelled about the Kafka 
source stuff, but at the same time, hopefully we make these types of things 
work without yelling. This would be a problem even if there were committers 
with more expertise in each area -- what if someone disagrees with the 
committers?

Matei


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread Matei Zaharia
I like this idea of asking them. BTW, one other thing we can do *provided the 
JIRAs are eventually under control* is to create a filter for old JIRAs that 
have not received a response in X amount of time and have the system 
automatically email the dev list with this report every month. Then everyone 
can see the list of items and maybe be reminded to take care to clean it up. 
This only works if the list is manageable and you actually want to read all of 
it.

Matei
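
For concreteness, a report along those lines could be backed by a saved JIRA filter or a
small script against the public JIRA REST API. A rough, untested sketch (the JQL, the
90-day threshold, and the result limit are all illustrative, not a decided policy;
emailing the output to dev@ is left out):

    import java.net.URLEncoder
    import scala.io.Source

    object StaleJiraReport {
      def main(args: Array[String]): Unit = {
        // Unresolved SPARK issues untouched for 90+ days, oldest first.
        val jql = "project = SPARK AND resolution = Unresolved AND updated <= -90d ORDER BY updated ASC"
        val url = "https://issues.apache.org/jira/rest/api/2/search" +
          "?jql=" + URLEncoder.encode(jql, "UTF-8") +
          "&fields=summary,updated&maxResults=100"
        // The endpoint returns JSON with an "issues" array; printed raw to keep the sketch short.
        println(Source.fromURL(url).mkString)
      }
    }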

> On Oct 8, 2016, at 9:01 AM, Cody Koeninger  wrote:
> 
> Yeah, I've interacted with other projects that used that system and it was 
> pleasant.
> 
> 1. "this is getting closed cause its stale, let us know if thats a problem"
> 2. "actually that matters to us"
> 3. "ok well leave it open"
> 
> I'd be fine with totally automating step 1 as long as a human was involved at 
> step 2 and 3
> 
> 
> On Saturday, October 8, 2016, assaf.mendelson  > wrote:
> I don’t really have much experience with large open source projects but I 
> have some experience with having lots of issues with no one handling them. 
> Automation proved a good solution in my experience, but one thing that I 
> found which was really important is giving people a chance to say “don’t 
> close this please”.
> 
> Basically, before closing you can send an email to the reporter (and 
> probably people who are watching the issue) and tell them this is going to be 
> closed. Allow them an option to ping back saying “don’t close this please” 
> which would ping committers for input (as if there were 5+ votes as described 
> by Nick).
> 
> The main reason for this is that many times people find solutions and the 
> issue does become stale but at other times, the issue is still important, it 
> is just that no one noticed it because of the noise of other issues.
> 
> Thanks,
> 
> Assaf.
> 
>  
> 
>  
> 
>  
> 
> From: Nicholas Chammas [via Apache Spark Developers List] [mailto:ml-node+ 
> [hidden email] 
> ] 
> Sent: Saturday, October 08, 2016 12:42 AM
> To: Mendelson, Assaf
> Subject: Re: Improving volunteer management / JIRAs (split from Spark 
> Improvement Proposals thread)
> 
>  
> 
> I agree with Cody and others that we need some automation — or at least an 
> adjusted process — to help us manage organic contributions better.
> 
> The objections about automated closing being potentially abrasive are 
> understood, but I wouldn’t accept that as a defeat for automation. Instead, 
> it seems like a constraint we should impose on any proposed solution: Make 
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut it, 
> and I don’t think adding committers will ever be a sufficient solution to 
> this particular problem.
> 
> To me, it seems like we need a way to filter out viable contributions with 
> community support from other contributions when it comes to deciding that 
> automated action is appropriate. Our current tooling isn’t perfect, but 
> perhaps we can leverage it to create such a filter.
> 
> For example, consider the following strawman proposal for how to cut down on 
> the number of pending but unviable proposals, and simultaneously help 
> contributors organize to promote viable proposals and get the attention of 
> committers:
> 
> 1.  Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been 
> updated in 20+ days (or D+ days, if you prefer).
> 
> 2.  Depending on the level of community support, either close the item or 
> ping specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+ votes 
> (or V+ votes), ping committers for input. (For PRs, you could count comments 
> from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less than 
> V votes, close it with a gentle message asking the contributor to solicit 
> support from either the community or a committer, and try again later.
> c. If the JIRA/PR has input from a committer or committers, ping them for an 
> update.
> 
> This is just a rough idea. The point is that when contributors have stale 
> proposals that they don’t close, committers need to take action. A little 
> automation to selectively bring contributions to the attention of committers 
> can perhaps help them manage the backlog of stale contributions. The 
> “selective” part is implemented in this strawman proposal by using JIRA votes 
> as a crude proxy for when the community is interested in something, but it 
> could be anything.
> 
> Also, this doesn’t have to be used just to clear out stale proposals. Once 
> the initial backlog is trimmed down, you could set D to 5 days and use this 
> as a regular way to bring contributions to the attention of committers.
> 
> I dunno if people think 

Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
For the improvement proposals, I think one major point was to make them really 
visible to users who are not contributors, so we should do more than sending 
stuff to dev@. One very lightweight idea is to have a new type of JIRA called a 
SIP and have a link to a filter that shows all such JIRAs from 
http://spark.apache.org. I also like the idea of SIP and design doc templates 
(in fact many projects have them).

Matei

> On Oct 7, 2016, at 10:38 AM, Reynold Xin <r...@databricks.com> wrote:
> 
> I called Cody last night and talked about some of the topics in his email. It 
> became clear to me Cody genuinely cares about the project.
> 
> Some of the frustrations come from the success of the project itself becoming 
> very "hot", and it is difficult to get clarity from people who don't dedicate 
> all their time to Spark. In fact, it is in some ways similar to scaling an 
> engineering team in a successful startup: old processes that worked well 
> might not work so well when it gets to a certain size, cultures can get 
> diluted, building culture vs building process, etc.
> 
> I also really like to have a more visible process for larger changes, 
> especially major user facing API changes. Historically we upload design docs 
> for major changes, but it is not always consistent, and it is difficult to ensure 
> the quality of the docs, due to the volunteering nature of the organization.
> 
> Some of the more concrete ideas we discussed focus on building a culture to 
> improve clarity:
> 
> - Process: Large changes should have design docs posted on JIRA. One thing 
> Cody and I didn't discuss but an idea that just came to me is we should 
> create a design doc template for the project and ask everybody to follow. The 
> design doc template should also explicitly list goals and non-goals, to make 
> design doc more consistent.
> 
> - Process: Email dev@ to solicit feedback. We have done this with some 
> changes, but again very inconsistent. Just posting something on JIRA isn't 
> sufficient, because there are simply too many JIRAs and the signal gets lost 
> in the noise. While this is generally impossible to enforce because we can't 
> force all volunteers to conform to a process (or they might not even be aware 
> of this),  those who are more familiar with the project can help by emailing 
> the dev@ when they see something that hasn't been.
> 
> - Culture: The design doc author(s) should be open to feedback. A design doc 
> should serve as the base for discussion and is by no means the final design. 
> Of course, this does not mean the author has to accept every feedback. They 
> should also be comfortable accepting / rejecting ideas on technical grounds.
> 
> - Process / Culture: For major ongoing projects, it can be useful to have 
> some monthly Google hangouts that are open to the world. I am actually not 
> sure how well this will work, because of the volunteering nature and we need 
> to adjust for timezones for people across the globe, but it seems worth 
> trying.
> 
> - Culture: Contributors (including committers) should be more direct in 
> setting expectations, including whether they are working on a specific issue, 
> whether they will be working on a specific issue, and whether an issue or pr 
> or jira should be rejected. Most people I know in this community are nice and 
> don't enjoy telling other people no, but it is often more annoying to a 
> contributor to not know anything than getting a no.
> 
> 
> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> 
> Love the idea of a more visible "Spark Improvement Proposal" process that 
> solicits user input on new APIs. For what it's worth, I don't think 
> committers are trying to minimize their own work -- every committer cares 
> about making the software useful for users. However, it is always hard to get 
> user input and so it helps to have this kind of process. I've certainly 
> looked at the *IPs a lot in other software I use just to see the biggest 
> things on the roadmap.
> 
> When you're talking about "changing interfaces", are you talking about public 
> or internal APIs? I do think many people hate changing public APIs and I 
> actually think that's for the best of the project. That's a technical debate, 
> but basically, the worst thing when you're using a piece of software is that 
> the developers constantly ask you to rewrite your app to update to a new 
> version (and thus benefit from bug fixes, etc). Cue anyone who's used 
> Protobuf, or Guava. The "let's get everyone to change their code this 
> release" model works well within a single large company, but doesn't work 
> well for a community, which is why nearly all *very* widely used programming 
> interfaces (I'm talking things like Java standard library, Windows API, etc) 
> almost *never* break backwards compatibility. All this is done within reason 
> though, e.g. we do change things in major releases (2.x, 3.x, etc).
> 
> 



Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
dn't be ignored.

I agree about empowering people interested here to contribute, but I'm 
wondering, do you think there are technical things that people don't want to 
work on, or is it a matter of what there's been time to do? Everyone I know 
does want great Kafka support, event time, etc, it's just a question of working 
out the details and of course of getting the coding done. This is also an area 
where I'd love to see more contributions -- in the past, people have dome 
similar-scale contributions in other areas (e.g. better integration with Hive, 
on-the-wire encryption, etc).

FWIW, I think there are three things going on with streaming.

1) Structured Streaming, which is meant to provide a much higher-level new API. 
This was meant from the beginning to include event time, various complex forms 
of windows, and great data source and sink support in a unified framework. It's 
also, IMHO, much simpler than most existing APIs for this stuff (i.e. look at 
the number of concepts you have to learn for those versus for this). However, 
this project is still very early on -- only the bare minimum API came out in 
2.0. It's marked as alpha and it's precisely the type of system where I'd 
expect the API to improve in response to feedback. As with other APIs, such as 
Spark SQL's SchemaRDD and DataFrame, I think it's good to get it in front of 
*users* quickly and receive feedback -- even developers discussing among 
themselves can't anticipate all user needs.

2) Adding things in Spark Streaming. I haven't personally worked much on this 
lately, but it is a very reasonable thing that I'd love to see the project do 
to help current users. For example, consider adding an aggregate-by-event-time 
operator to Spark Streaming (it can be done using mapWithState -- see the sketch below, after point 3), or a 
sessionization operator, etc.

3) Another thing that I think is possible is just lowering the latency of both 
Spark Streaming and Structured Streaming by 10x -- a few folks at Berkeley have 
been working on this 
(https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/). 
Happy to fork off a thread about how to do it. Their current system requires 
some new concepts in the Spark scheduler, but from measuring stuff it also 
seems that you can get somewhere with less intensive changes (most of the 
overhead is in RPCs, not in the scheduling logic or task execution).
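
A rough sketch of the aggregate-by-event-time idea from point 2, built on the existing
mapWithState operator. Everything concrete here (the socket source, record format, window
size, and checkpoint path) is illustrative rather than a proposed design:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object EventTimeCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("event-time-counts")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/event-time-counts")   // mapWithState requires checkpointing

        // Hypothetical input: lines of "<eventTimeMillis> <word>" from a socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Key each record by the start of its 1-minute *event-time* window.
        val windowMs = 60 * 1000L
        val keyed = lines.map { line =>
          val Array(ts, _) = line.split(" ", 2)
          ((ts.toLong / windowMs) * windowMs, 1L)
        }

        // Keep a running count per event-time window; late records simply update
        // the state of their (old) window whenever they arrive.
        def update(windowStart: Long, one: Option[Long], state: State[Long]): (Long, Long) = {
          val total = state.getOption.getOrElse(0L) + one.getOrElse(0L)
          state.update(total)
          (windowStart, total)
        }

        val countsByWindow = keyed.mapWithState(StateSpec.function(update _))
        countsByWindow.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }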

> - Jira
> Concretely, automate closing stale jiras after X amount of time.  It's
> really surprising to me how much reluctance a community of programmers
> have shown towards automating their own processes around stuff like
> this (not to mention automatic code formatting of modified files).  I
> understand the arguments against. but the current alternative doesn't
> work.
> Concretely, clearly reject and close jiras.  I have a backlog of 50+
> kafka jiras, many of which are irrelevant at this point, but I do not
> feel that I have the political power to close them.
> Concretely, make it clear who is working on something.  This can be as
> simple as just "I'm working on this", assign it to me, if I don't
> follow up in X amount of time, close it or reassign.  That doesn't
> mean there can't be competing work, but it does mean those people
> should talk to each other.  Conversely, if committers currently don't
> have time to work on something that is important, make that clear in
> the ticket.

Definitely agree with marking who's working on something early on, and timing 
it out if inactive. For closing JIRAs, I think the best way I've seen is for 
people to go through them once in a while. Automated closing is too impersonal 
IMO -- if I opened a JIRA on a project and nobody looked at it and that 
happened to me, I'd actively feel ignored. If you do that, you'll see people on 
stage saying "I reported a bug for Spark and some bot just closed it after 3 
months", which is not ideal.

Matei


> 
> 
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <so...@cloudera.com 
> <mailto:so...@cloudera.com>> wrote:
> > Suggestion actions way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <matei.zaha...@gmail.com 
> > <mailto:matei.zaha...@gmail.com>>
> > wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source for
> >> it didn't have as much design on JIRA. Nonetheless, this component is still
> >> early on and there's still a lot of time to change it, which is happening.
> >
> >
> > It's hard to drive design discussions in OSS. Even when diligently
> > publishing design docs, the doc happens after brainstorming, and that
> > happens inside someone's head or in chats.
> >
> > The lazy consensus model that works for small changes doesn't work well
> > here. If a committer wants a change, that change will basically be m

Re: Spark Improvement Proposals

2016-10-06 Thread Matei Zaharia
Hey Cody,

Thanks for bringing these things up. You're talking about quite a few different 
things here, but let me get to them each in turn.

1) About technical / design discussion -- I fully agree that everything big 
should go through a lot of review, and I like the idea of a more formal way to 
propose and comment on larger features. So far, all of this has been done 
through JIRA, but as a start, maybe marking JIRAs as large (we often use 
Umbrella for this) and also opening a thread on the list about each such JIRA 
would help. For Structured Streaming in particular, FWIW, there was a pretty 
complete doc on the proposed semantics at 
https://issues.apache.org/jira/browse/SPARK-8360 since March. But it's true 
that other things such as the Kafka source for it didn't have as much design on 
JIRA. Nonetheless, this component is still early on and there's still a lot of 
time to change it, which is happening.

2) About what people say at Reactive Summit -- there will always be trolls, but 
just ignore them and build a great project. Those of us involved in the project 
for a while have long seen similar stuff, e.g. a prominent company saying Spark 
doesn't scale past 100 nodes when there were many documented instances to the 
contrary, and the best answer is to just make the project better. This same 
company, if you read their website now, recommends Apache Spark for most 
anything. For streaming in particular, there is a lot of confusion because many 
of the concepts aren't well-defined (e.g. what is "at least once", etc), and 
it's also a crowded space. But Spark Streaming prioritizes a few things that it 
does very well: correctness (you can easily tell what the app will do, and it 
does the same thing despite failures), ease of programming (which also requires 
correctness), and scalability. We should of course both explain what it does in 
more places and work on improving it where needed (e.g. adding a higher level 
API with Structured Streaming and built-in primitives for external timestamps).

3) About number and diversity of committers -- the PMC is always working to 
expand these, and you should email people on the PMC (or even the whole list) 
if you have people you'd like to propose. In general I think nearly all 
committers added in the past year were from organizations that haven't long 
been involved in Spark, and the number of committers continues to grow pretty 
fast.

4) Finally, about better organizing JIRA, marking dead issues, etc, this would 
be great and I think we just need a concrete proposal for how to do it. It 
would be best to point to an existing process that someone else has used here 
BTW so that we can see it in action.

Matei

> On Oct 6, 2016, at 7:51 PM, Cody Koeninger  wrote:
> 
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
> 
> But I just got back from the Reactive Summit, and this is what I observed:
> 
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
> 
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
> 
> Right now Spark is suffering from its own success, and I think
> something needs to change.
> 
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
> 
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
> 
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
> 
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> 

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Matei Zaharia
+1

Matei

> On Sep 29, 2016, at 10:59 AM, Herman van Hövell tot Westerflier 
>  wrote:
> 
> +1 (non binding)
> 
> On Thu, Sep 29, 2016 at 10:59 AM, Weiqing Yang  > wrote:
> +1 (non binding)
>  
> RC4 is compiled and tested on the system: CentOS Linux release 7.0.1406 / 
> openjdk 1.8.0_102 / R 3.3.1
>  All tests passed.
>  
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
> -Dpyspark -Dsparkr -DskipTests clean package
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
> -Dpyspark -Dsparkr test
>  
>  
> Best,
> Weiqing 
> 
> On Thu, Sep 29, 2016 at 4:18 AM, Jagadeesan As  > wrote:
> +1 (non binding)
> 
> Ubuntu 14.04.2/openjdk  "1.8.0_72"
> (-Pyarn -Phadoop-2.7 -Psparkr -Pkinesis-asl -Phive-thriftserver)
>  
> Cheers,
> Jagadeesan A S
> 
> 
> 
> From:Ricardo Almeida  >
> To:"dev@spark.apache.org " 
> >
> Date:29-09-16 04:36 PM
> Subject:Re: [VOTE] Release Apache Spark 2.0.1 (RC4)
> 
> 
> 
> +1 (non-binding)
> 
> Built (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn) 
> and tested on:
> - Ubuntu 16.04 / OpenJDK 1.8.0_91
> - CentOS / Oracle Java 1.7.0_55
> 
> No regressions from 2.0.0 found while running our workloads (Python API)
> 
> 
> On 29 September 2016 at 08:10, Reynold Xin  > wrote:
> I will kick it off with my own +1.
> 
> 
> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin  > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a 
> majority of at least 3+1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.1-rc4 
> (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)
> 
> This release candidate resolves 301 issues: 
> https://s.apache.org/spark-2.0.1-jira 
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/ 
> 
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1203/ 
> 
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/ 
> 
> 
> 
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an 
> existing Spark workload and running on this release candidate, then reporting 
> any regressions from 2.0.0.
> 
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already present 
> in 2.0.0, missing features, or bugs related to new features will not 
> necessarily block this release.
> 
> Q: What fix version should I use for patches merging into branch-2.0 from now 
> on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC (i.e. 
> RC5) is cut, I will change the fix version of those patches to 2.0.1.
> 
> 
> 
> 
> 
> 
> 



Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Matei Zaharia
+1

Matei

> On Sep 25, 2016, at 1:25 PM, Josh Rosen  wrote:
> 
> +1
> 
> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai  > wrote:
> +1
> 
> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun  > wrote:
> +1 (non binding)
> 
> RC3 is compiled and tested on the following two systems, too. All tests 
> passed.
> 
> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr
> * CentOS 7.2 / Open JDK 1.8.0_102
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> 
> Cheers,
> Dongjoon
> 
> 
> 
> On Saturday, September 24, 2016, Reynold Xin  > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if a 
> majority of at least 3+1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.1-rc3 
> (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
> 
> This release candidate resolves 290 issues: 
> https://s.apache.org/spark-2.0.1-jira 
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/ 
> 
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1201/ 
> 
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/ 
> 
> 
> 
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an 
> existing Spark workload and running on this release candidate, then reporting 
> any regressions from 2.0.0.
> 
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already present 
> in 2.0.0, missing features, or bugs related to new features will not 
> necessarily block this release.
> 
> Q: What fix version should I use for patches merging into branch-2.0 from now 
> on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC (i.e. 
> RC4) is cut, I will change the fix version of those patches to 2.0.1.
> 
> 
> 



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484732#comment-15484732
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good to me.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>        Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-10 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15480419#comment-15480419
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good, but IMO just keep the current supplemental projects there -- don't 
they fit better into "third-party packages" than "powered by"? I viewed powered 
by as a list of users, similar to https://wiki.apache.org/hadoop/PoweredBy, but 
I guess you're viewing it as a list of software that integrates with Spark.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-09 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479121#comment-15479121
 ] 

Matei Zaharia commented on SPARK-17445:
---

The powered by wiki page is a bit of a mess IMO, so I'd separate out the 
third-party packages from that one. Basically, the powered by page was useful 
when the project was really new and nobody knew who's using it, but right now 
it's a snapshot of the users from back then because few new organizations 
(especially the large ones) list themselves there. Anyway, just linking to this 
wiki page is nice, though I'd try to rename the page to "Third-Party Packages" 
instead of "Supplemental Spark Projects" if it's possible to make the old name 
redirect.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: FileStreamSource source checks path eagerly?

2016-09-08 Thread Matei Zaharia
This source is meant to be used for a shared file system such as HDFS or NFS, 
where both the driver and the workers can see the same folders. There's no 
support in Spark for just working with local files on different workers.

Matei
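
For illustration, a minimal sketch of the behavior discussed below, assuming a Spark 2.x
shell (the paths are illustrative):

    // A concrete path is checked eagerly, so this throws AnalysisException at load()
    // time if /var/logs does not exist on the driver's (default) file system.
    val eager = spark.readStream.format("text").load("/var/logs")

    // A glob pattern defers the check until the query starts, since the matching files
    // may only appear once some upstream job begins writing them.
    val deferred = spark.readStream.format("text").load("/var/logs/*")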

> On Sep 8, 2016, at 2:23 AM, Jacek Laskowski  wrote:
> 
> Hi Steve,
> 
> Thank you for more source-oriented answer. Helped but didn't explain
> the reason for such eagerness. The file(s) might not be on the driver
> but on executors only where the Spark job(s) run. I don't see why
> Spark should check the file(s) regardless of glob pattern being used.
> 
> You see my way of thinking?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Thu, Sep 8, 2016 at 11:20 AM, Steve Loughran  
> wrote:
>> failfast generally means that you find problems sooner rather than later, 
>> and here, potentially, that your code runs but simply returns empty data 
>> without any obvious cue as to what is wrong.
>> 
>> As is always good in OSS, follow those stack trace links to see what they 
>> say:
>> 
>>// Check whether the path exists if it is not a glob pattern.
>>// For glob pattern, we do not check it because the glob pattern 
>> might only make sense
>>// once the streaming job starts and some upstream source starts 
>> dropping data.
>> 
>> If you specify a glob pattern, you'll get the late check at the expense of 
>> the risk of that empty data source if the pattern is wrong. Something like 
>> "/var/log\s" would suffice, as the presence of the backslash is enough for 
>> SparkHadoopUtil.isGlobPath() to conclude that its something for the globber.
>> 
>> 
>>> On 8 Sep 2016, at 07:33, Jacek Laskowski  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm wondering what's the rationale for checking the path option
>>> eagerly in FileStreamSource? My thinking is that until start is called
>>> there's no processing going on that is supposed to happen on executors
>>> (not the driver) with the path available.
>>> 
>>> I could (and perhaps should) use dfs but IMHO that just hides the real
>>> question of the text source eagerness.
>>> 
>>> Please help me understand the rationale of the choice. Thanks!
>>> 
>>> scala> spark.version
>>> res0: String = 2.1.0-SNAPSHOT
>>> 
>>> scala> spark.readStream.format("text").load("/var/logs")
>>> org.apache.spark.sql.AnalysisException: Path does not exist: /var/logs;
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:81)
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:81)
>>> at 
>>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>>> at 
>>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>>> at 
>>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>>> ... 48 elided
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>> 
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-08 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474543#comment-15474543
 ] 

Matei Zaharia commented on SPARK-17445:
---

I think one part you're missing, Josh, is that spark-packages.org *is* an index 
of packages from a wide variety of organizations, where anyone can submit a 
package. Have you looked through it? Maybe there is some concern about which 
third-party index we highlight on the site, but AFAIK there are no other 
third-party package indexes. Nonetheless it would make sense to have a stable 
URL on the Spark homepage that lists them.

BTW, in the past, we also used a wiki page to track them: 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects 
so we could just link to that. The spark-packages site provides some nicer 
functionality though such as letting anyone add a package with just a GitHub 
account, listing releases, etc.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>        Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-07 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-17445:
-

 Summary: Reference an ASF page as the main place to find 
third-party packages
 Key: SPARK-17445
 URL: https://issues.apache.org/jira/browse/SPARK-17445
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia


Some comments and docs like 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
 say to go to spark-packages.org, but since this is a package index maintained 
by a third party, it would be better to reference an ASF page that we can keep 
updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Matei Zaharia
The question is just whether the metadata and instructions involving these 
Maven packages count as sufficient to tell the user that they have different 
licensing terms. For example, our Ganglia package was called spark-ganglia-lgpl 
(so you'd notice it's a different license even from its name), and our Kinesis 
one was called spark-streaming-kinesis-asl, and our docs both mentioned these 
were under different licensing terms. But is that enough? That's the question.

Matei

> On Sep 7, 2016, at 2:05 PM, Cody Koeninger  wrote:
> 
> To be clear, "safe" has very little to do with this.
> 
> It's pretty clear that there's very little risk of the spark module
> for kinesis being considered a derivative work, much less all of
> spark.
> 
> The use limitation in 3.3 that caused the amazon license to be put on
> the apache X list also doesn't have anything to do with a legal safety
> risk here.  Really, what are you going to use a kinesis connector for,
> except for connecting to kinesis?
> 
> 
> On Wed, Sep 7, 2016 at 2:41 PM, Luciano Resende  wrote:
>> 
>> 
>> On Wed, Sep 7, 2016 at 12:20 PM, Mridul Muralidharan 
>> wrote:
>>> 
>>> 
>>> It is good to get clarification, but the way I read it, the issue is
>>> whether we publish it as official Apache artifacts (in maven, etc).
>>> 
>>> Users can of course build it directly (and we can make it easy to do so) -
>>> as they are explicitly agreeing to additional licenses.
>>> 
>>> Regards
>>> Mridul
>>> 
>> 
>> +1, by providing instructions on how the user would build, and attaching the
>> license details on the instructions, we are then safe on the legal aspects
>> of it.
>> 
>> 
>> 
>> --
>> Luciano Resende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Matei Zaharia
I think you should ask legal about how to have some Maven artifacts for these. 
Both Ganglia and Kinesis are very widely used, so it's weird to ask users to 
build them from source. Maybe the Maven artifacts can be marked as being under 
a different license?

In the initial discussion for LEGAL-198, we were told the following:

"If the component that uses this dependency is not required for the rest of 
Spark to function then you can have a subproject to build the component. See 
http://www.apache.org/legal/resolved.html#optional. This means you will have to 
provide instructions for users to enable the optional component (which IMO 
should provide pointers to the licensing)."

It's not clear whether "enable the optional component" means "every user must 
build it from source", or whether we could tell users "here's a Maven 
coordinate you can add to your project if you're okay with the licensing".

Matei
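
For concreteness, the opt-in step being discussed would look something like the following
in a user's own sbt build (the artifact name comes from this thread; the version shown is
only illustrative):

    // Pulls in the Kinesis connector, which depends on Amazon Software License code;
    // adding it is an explicit choice by the user, not something bundled with Spark itself.
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.0"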

> On Sep 7, 2016, at 11:35 AM, Sean Owen  wrote:
> 
> (Credit to Luciano for pointing it out)
> 
> Yes it's clear why the assembly can't be published but I had the same
> question about the non-assembly Kinesis (and ganglia) artifact,
> because the published artifact has no code from Kinesis.
> 
> See the related discussion at
> https://issues.apache.org/jira/browse/LEGAL-198 ; the point I took
> from there is that the Spark Kinesis artifact is optional with respect
> to Spark, but still something published by Spark, and it requires the
> Amazon-licensed code non-optionally.
> 
> I'll just ask that question to confirm or deny.
> 
> (It also has some background on why the Amazon License is considered
> "Category X" in ASF policy due to field of use restrictions. I myself
> take that as read rather than know the details of that decision.)
> 
> On Wed, Sep 7, 2016 at 6:44 PM, Cody Koeninger  wrote:
>> I don't see a reason to remove the non-assembly artifact, why would
>> you?  You're not distributing copies of Amazon licensed code, and the
>> Amazon license goes out of its way not to over-reach regarding
>> derivative works.
>> 
>> This seems pretty clearly to fall in the spirit of
>> 
>> http://www.apache.org/legal/resolved.html#optional
>> 
>> I certainly think the majority of Spark users will still want to use
>> Spark without adding Kinesis
>> 
>> On Wed, Sep 7, 2016 at 3:29 AM, Sean Owen  wrote:
>>> It's worth calling attention to:
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-17418
>>> https://issues.apache.org/jira/browse/SPARK-17422
>>> 
>>> It looks like we need to at least not publish the kinesis *assembly*
>>> Maven artifact because it contains Amazon Software Licensed-code
>>> directly.
>>> 
>>> However there's a reasonably strong reason to believe that we'd have
>>> to remove the non-assembly Kinesis artifact too, as well as the
>>> Ganglia one. This doesn't mean it goes away from the project, just
>>> means it would no longer be published as a Maven artifact. (These have
>>> never been bundled in the main Spark artifacts.)
>>> 
>>> I wanted to give a heads up to see if anyone a) believes this
>>> conclusion is wrong or b) wants to take it up with legal@? I'm
>>> inclined to believe we have to remove them given the interpretation
>>> Luciano has put forth.
>>> 
>>> Sean
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is 
also somewhat of a marketing term, because every system will perform some kind 
of batching. For example, every time you use TCP, the OS and network stack may 
buffer multiple messages together and send them at once; and likewise, 
virtually all streaming engines can batch data internally to achieve higher 
throughput. Furthermore, in all APIs, you can see individual records and 
respond to them one by one. The main question is just what overall performance 
you get (throughput and latency).

Matei

> On Aug 23, 2016, at 4:08 PM, Aseem Bansal  wrote:
> 
> Thanks everyone for clarifying.
> 
> On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal  > wrote:
> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ 
>  and it mentioned that spark 
> streaming actually mini-batch not actual streaming. 
> 
> I have not used streaming and I am not sure what is the difference in the 2 
> terms. Hence could not make a judgement myself.
> 



Re: unsubscribe

2016-08-10 Thread Matei Zaharia
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from 
the address you're subscribed from.

Matei

> On Aug 10, 2016, at 12:48 PM, Sohil Jain  wrote:
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Welcoming Felix Cheung as a committer

2016-08-08 Thread Matei Zaharia
Hi all,

The PMC recently voted to add Felix Cheung as a committer. Felix has been a 
major contributor to SparkR and we're excited to have him join officially. 
Congrats and welcome, Felix!

Matei
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Dropping late date in Structured Streaming

2016-08-06 Thread Matei Zaharia
Yes, a built-in mechanism is planned in future releases. You can also drop it 
using a filter for now but the stateful operators will still keep state for old 
windows.

Matei
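
A minimal, untested sketch of the filter approach (the schema, input path, and thresholds
are illustrative; as far as I can tell, current_timestamp() is evaluated once per
micro-batch in a streaming query). Note that this only drops incoming late records; it
does not free the state already kept for old windows:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("eventTime", TimestampType)
      .add("word", StringType)

    // Hypothetical streaming input of JSON records with an event-time column.
    val events = spark.readStream.schema(schema).json("/data/events")

    val counts = events
      // Drop records whose event time is more than 10 minutes behind processing time.
      .filter($"eventTime" > current_timestamp() - expr("INTERVAL 10 minutes"))
      .groupBy(window($"eventTime", "5 minutes"), $"word")
      .count()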

> On Aug 6, 2016, at 9:40 AM, Amit Sela  wrote:
> 
> I've noticed that when using Structured Streaming with event-time windows 
> (fixed/sliding), all windows are retained. This is clearly how "late" data is 
> handled, but I was wondering if there is some pruning mechanism that I might 
> have missed ? or is this planned in future releases ?
> 
> Thanks,
> Amit


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: renaming "minor release" to "feature release"

2016-07-28 Thread Matei Zaharia
I also agree with this given the way we develop stuff. We don't really want to 
move to possibly-API-breaking major releases super often, but we do have lots 
of large features that come out all the time, and our current name doesn't 
convey that.

Matei

> On Jul 28, 2016, at 4:15 PM, Reynold Xin  wrote:
> 
> Yea definitely. Those are consistent with what is defined here: 
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy 
> 
> 
> The only change I'm proposing is replacing "minor" with "feature".
> 
> 
> On Thu, Jul 28, 2016 at 4:10 PM, Sean Owen  > wrote:
> Although 'minor' is the standard term, the important thing is making
> the nature of the release understood. 'feature release' seems OK to me
> as an additional description.
> 
> Is it worth agreeing on or stating a little more about the theory?
> 
> patch release: backwards/forwards compatible within a minor release,
> generally fixes only
> minor/feature release: backwards compatible within a major release,
> not forward; generally also includes new features
> major release: not backwards compatible and may remove or change
> existing features
> 
> On Thu, Jul 28, 2016 at 3:46 PM, Reynold Xin  > wrote:
> > tl;dr
> >
> > I would like to propose renaming “minor release” to “feature release” in
> > Apache Spark.
> >
> >
> > details
> >
> > Apache Spark’s official versioning policy follows roughly semantic
> > versioning. Each Spark release is versioned as
> > [major].[minor].[maintenance]. That is to say, 1.0.0 and 2.0.0 are both
> > “major releases”, whereas “1.1.0” and “1.3.0” would be minor releases.
> >
> > I have gotten a lot of feedback from users that the word “minor” is
> > confusing and does not accurately describes those releases. When users hear
> > the word “minor”, they think it is a small update that introduces couple
> > minor features and some bug fixes. But if you look at the history of Spark
> > 1.x, here are just a subset of large features added:
> >
> > Spark 1.1: sort-based shuffle, JDBC/ODBC server, new stats library, 2-5X
> > perf improvement for machine learning.
> >
> > Spark 1.2: HA for streaming, new network module, Python API for streaming,
> > ML pipelines, data source API.
> >
> > Spark 1.3: DataFrame API, Spark SQL graduate out of alpha, tons of new
> > algorithms in machine learning.
> >
> > Spark 1.4: SparkR, Python 3 support, DAG viz, robust joins in SQL, math
> > functions, window functions, SQL analytic functions, Python API for
> > pipelines.
> >
> > Spark 1.5: code generation, Project Tungsten
> >
> > Spark 1.6: automatic memory management, Dataset API, ML pipeline persistence
> >
> >
> > So while “minor” is an accurate depiction of the releases from an API
> > compatibiility point of view, we are miscommunicating and doing Spark a
> > disservice by calling these releases “minor”. I would actually call these
> > releases “major”, but then it would be a larger deviation from semantic
> > versioning. I think calling these “feature releases” would be a smaller
> > change and a more accurate depiction of what they are.
> >
> > That said, I’m not attached to the name “feature” and am open to
> > suggestions, as long as they don’t convey the notion of “minor”.
> >
> >
> 



Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and 
will probably be complete in the next few releases, but Spark Streaming will 
continue to exist, because it gives the user more low-level control. It's 
similar to DataFrames vs RDDs (RDDs are the lower-level API for when you want 
control, while DataFrames do more optimizations automatically by restricting 
the computation model).
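
As a small illustration of that analogy, here is the same word count written at both 
levels (toy data and local master, made up for the example):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.master("local[2]").appName("rdd-vs-df").getOrCreate()
  import spark.implicits._

  val words = Seq("spark", "streaming", "spark")

  // RDD API: you spell out the shuffle yourself and keep full control.
  val rddCounts = spark.sparkContext.parallelize(words)
    .map(w => (w, 1)).reduceByKey(_ + _).collect()

  // DataFrame API: declare the grouping and let the optimizer plan the execution.
  val dfCounts = words.toDF("word").groupBy("word").count().collect()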

Matei

> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
> 
> Structured Streaming in 2.0 is declared as alpha - plenty of bits still 
> missing:
>  
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>  
> 
> I assume that it will be declared stable / GA in a future 2.x release, and 
> then it will co-exist with DStream for quite a while before someone will 
> suggest to start a deprecation process that will eventually lead to its 
> removal...
> As a user, I guess we will need to apply judgement about when to switch to 
> Structured Streaming - each of us have a different risk/value tradeoff, based 
> on our specific situation...
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  > wrote:
> Hi guys
> 
> Structure Stream is coming with spark 2.0,  but I noticed that DStream is 
> still here
> 
> What's the future of the DStream, will it be deprecated and removed 
> eventually? Or co-existed with  Structure Stream forever?
> 
> Thanks
> Chang
> 
> 



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Matei Zaharia
+1

Tested on Mac.

Matei

> On Jul 22, 2016, at 11:18 AM, Joseph Bradley  wrote:
> 
> +1
> 
> Mainly tested ML/Graph/R.  Perf tests from Tim Hunter showed minor speedups 
> from 1.6 for common ML algorithms.
> 
> On Thu, Jul 21, 2016 at 9:41 AM, Ricardo Almeida 
> > wrote:
> +1 (non binding)
> 
> Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a standalone 
> cluster
> 
> On 21 July 2016 at 05:24, Reynold Xin  > wrote:
> +1
> 
> 
> On Wednesday, July 20, 2016, Krishna Sankar  > wrote:
> +1 (non-binding, of course)
> 
> 1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
>  mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0)
> 2.0 Spark version is 2.0.0 
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK 
> 2.3. Classification : Decision Tree, Naive Bayes OK
> 2.4. Clustering : KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 3.6 save(above) - sqlContext.parquetFile, 
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT 
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER 
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK (--packages 
> com.databricks:spark-csv_2.10:1.4.0)
> 6.0. DataFrames 
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
> [Dataframe Operations very fast from 11 secs to 3 secs, to 1.8 secs, to 1.5 
> secs! Good work !!!]
> 7.0. GraphX/Scala
> 7.1. Create Graph (small and bigger dataset) OK
> 7.2. Structure APIs - OK
> 7.3. Social Network/Community APIs - OK
> 7.4. Algorithms : PageRank of 2 datasets, aggregateMessages() - OK
> 
> Cheers
> 
> 
> On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes 
> if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.0-rc5 
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
> 
> This release candidate resolves ~2500 issues: 
> https://s.apache.org/spark-2.0.0-jira 
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/ 
> 
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/ 
> 
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/ 
> 
> 
> 
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an 
> existing Spark workload and running on this release candidate, then reporting 
> any regressions from 1.x.
> 
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
> 
> Bugs already present in 1.x, missing features, or bugs related to new 
> features will not necessarily block this release. Note that historically 
> Spark documentation has been published on the website separately from the 
> main release so we do not need to block the release due to documentation 
> errors either.
> 
> 
> 
> 



Re: How to explain SchedulerBackend.reviveOffers()?

2016-06-20 Thread Matei Zaharia
Hi Jacek,

This applies to all schedulers actually -- it just tells Spark to re-check the 
available nodes and possibly launch tasks on them, because a new stage was 
submitted. Then when any node is available, the scheduler will call the 
TaskSetManager with an "offer" for the node.
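
To make the "offer" wording concrete, here is a toy model of that cycle. These are 
not Spark's real classes, just an illustration of what reviveOffers() triggers:

  import scala.collection.mutable

  case class WorkerOffer(executorId: String, freeCores: Int)
  case class Task(id: Int)

  class ToyTaskScheduler {
    private val pending = mutable.Queue(Task(1), Task(2), Task(3), Task(4))
    // Given the currently free resources, pick which pending tasks to launch where.
    def resourceOffers(offers: Seq[WorkerOffer]): Seq[(Task, WorkerOffer)] =
      offers.flatMap { o =>
        (1 to o.freeCores).flatMap { _ =>
          if (pending.nonEmpty) Some((pending.dequeue(), o)) else None
        }
      }
  }

  class ToyBackend(scheduler: ToyTaskScheduler) {
    private val executors = Seq(WorkerOffer("exec-1", 2), WorkerOffer("exec-2", 1))
    // reviveOffers(): re-run the matching of free executors against pending work.
    def reviveOffers(): Unit =
      scheduler.resourceOffers(executors).foreach { case (t, o) =>
        println(s"launching task ${t.id} on ${o.executorId}")
      }
  }

  object ReviveDemo extends App {
    val backend = new ToyBackend(new ToyTaskScheduler)
    backend.reviveOffers()  // e.g. called when a new stage is submitted
    backend.reviveOffers()  // and again later, when resources free up or more work arrives
  }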

Matei

> On Jun 19, 2016, at 11:54 PM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> Whenever I see `backend.reviveOffers()` I'm struggling myself with
> properly explaining what it does. My understanding is that it requests
> a SchedulerBackend (that's responsible for talking to a cluster
> manager) to...that's the moment I'm not sure about.
> 
> How would you explain `backend.reviveOffers()`?
> 
> p.s. I understand that it's somehow related to how Mesos manages
> resources where it offers resources, but can't find anything related
> to `reviving offers` in Mesos docs :(
> 
> Please guide. Thanks!
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337182#comment-15337182
 ] 

Matei Zaharia commented on SPARK-16031:
---

FYI I'll post a PR for this soon.

> Add debug-only socket source in Structured Streaming
> 
>
> Key: SPARK-16031
> URL: https://issues.apache.org/jira/browse/SPARK-16031
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>    Reporter: Matei Zaharia
>    Assignee: Matei Zaharia
>
> This is a debug-only version of SPARK-15842: for tutorials and debugging of 
> streaming apps, it would be nice to have a text-based socket source similar 
> to the one in Spark Streaming. It will clearly be marked as debug-only so 
> that users don't try to run it in production applications, because this type 
> of source cannot provide HA without storing a lot of state in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-16031:
-

 Summary: Add debug-only socket source in Structured Streaming
 Key: SPARK-16031
 URL: https://issues.apache.org/jira/browse/SPARK-16031
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Streaming
Reporter: Matei Zaharia
Assignee: Matei Zaharia


This is a debug-only version of SPARK-15842: for tutorials and debugging of 
streaming apps, it would be nice to have a text-based socket source similar to 
the one in Spark Streaming. It will clearly be marked as debug-only so that 
users don't try to run it in production applications, because this type of 
source cannot provide HA without storing a lot of state in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ 
to say "Apache Spark" instead of just "Spark". Many ASF projects have been 
doing this recently to make it clearer that they are associated with the ASF, 
and indeed the ASF's branding guidelines generally require that projects be 
referred to as "Apache X" in various settings, especially in related commercial 
or open source products (https://www.apache.org/foundation/marks/). If you have 
any kind of site or product that uses Spark logo, it would be great to update 
to this full one.

There are EPS versions of the logo available at 
https://spark.apache.org/images/spark-logo.eps and 
https://spark.apache.org/images/spark-logo-reverse.eps; before using these also 
check https://www.apache.org/foundation/marks/.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ 
to say "Apache Spark" instead of just "Spark". Many ASF projects have been 
doing this recently to make it clearer that they are associated with the ASF, 
and indeed the ASF's branding guidelines generally require that projects be 
referred to as "Apache X" in various settings, especially in related commercial 
or open source products (https://www.apache.org/foundation/marks/). If you have 
any kind of site or product that uses Spark logo, it would be great to update 
to this full one.

There are EPS versions of the logo available at 
https://spark.apache.org/images/spark-logo.eps and 
https://spark.apache.org/images/spark-logo-reverse.eps; before using these also 
check https://www.apache.org/foundation/marks/.

Matei
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Created] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-15879:
-

 Summary: Update logo in UI and docs to add "Apache"
 Key: SPARK-15879
 URL: https://issues.apache.org/jira/browse/SPARK-15879
 Project: Spark
  Issue Type: Task
  Components: Documentation, Web UI
Reporter: Matei Zaharia


We recently added "Apache" to the Spark logo on the website 
(http://spark.apache.org/images/spark-logo.eps) to have it be the full project 
name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Matei Zaharia
Is there any way to remove artifacts from Maven Central? Maybe that would
help clean these things up long-term, though it would create problems for
users who for some reason decide to rely on these previews.

In any case, if people are *really* concerned about this, we should just
put it there. My thought was that it's better for users to do something
special to link to this release (e.g. add a reference to the staging repo)
so that they are more likely to know that it's a special, unstable thing.
Same thing they do to use snapshots.
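
For reference, the "something special" is roughly one extra resolver in the build. A 
hypothetical sbt example (the repository number below is a placeholder, not the real 
preview staging repo):

  resolvers += "Apache Spark staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-preview"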

Matei

On Mon, Jun 6, 2016 at 10:49 AM, Luciano Resende 
wrote:

>
>
> On Mon, Jun 6, 2016 at 10:08 AM, Mark Hamstra 
> wrote:
>
>> I still don't know where this "severely compromised builds of limited
>>> usefulness" thing comes from? what's so bad? You didn't veto its
>>> release, after all.
>>
>>
>> I simply mean that it was released with the knowledge that there are
>> still significant bugs in the preview that definitely would warrant a veto
>> if this were intended to be on a par with other releases.  There have been
>> repeated announcements to that effect, but developers finding the preview
>> artifacts on Maven Central months from now may well not also see those
>> announcements and related discussion.  The artifacts will be very stale and
>> no longer useful for their limited testing purpose, but will persist in the
>> repository.
>>
>>
> A few months from now, why would a developer choose a preview, alpha, beta
> compared to the GA 2.0 release ?
>
> As for the being stale part, this is true for every release anyone put out
> there.
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-04 Thread Matei Zaharia
Personally I'd just put them on the staging repo and link to that on the 
downloads page. It will create less confusion for people browsing Maven Central 
later and wondering which releases are safe to use.

Matei

> On Jun 3, 2016, at 8:22 AM, Mark Hamstra  wrote:
> 
> It's not a question of whether the preview artifacts can be made available on 
> Maven central, but rather whether they must be or should be.  I've got no 
> problems leaving these unstable, transitory artifacts out of the more 
> permanent, canonical repository.
> 
> On Fri, Jun 3, 2016 at 1:53 AM, Steve Loughran  > wrote:
> 
> It's been voted on by the project, so can go up on central
> 
> There's already some JIRAs being filed against it, this is a metric of 
> success as pre-beta of the artifacts.
> 
> The risk of exercising the m2 central option is that people may get 
> expectations that they can point their code at the 2.0.0-preview and then, 
> when a release comes out, simply
> update their dependency; this may/may not be the case. But is it harmful if 
> people do start building and testing against the preview? If it finds 
> problems early, it can only be a good thing
> 
> 
> > On 1 Jun 2016, at 23:10, Sean Owen  > > wrote:
> >
> > I'll be more specific about the issue that I think trumps all this,
> > which I realize maybe not everyone was aware of.
> >
> > There was a long and contentious discussion on the PMC about, among
> > other things, advertising a "Spark 2.0 preview" from Databricks, such
> > as at 
> > https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
> >  
> > 
> >
> > That post has already been updated/fixed from an earlier version, but
> > part of the resolution was to make a full "2.0.0 preview" release in
> > order to continue to be able to advertise it as such. Without it, I
> > believe the PMC's conclusion remains that this blog post / product
> > announcement is not allowed by ASF policy. Hence, either the product
> > announcements need to be taken down and a bunch of wording changed in
> > the Databricks product, or, this needs to be a normal release.
> >
> > Obviously, it seems far easier to just finish the release per usual. I
> > actually didn't realize this had not been offered for download at
> > http://spark.apache.org/downloads.html 
> >  either. It needs to be
> > accessible there too.
> >
> >
> > We can get back in the weeds about what a "preview" release means,
> > but, normal voted releases can and even should be alpha/beta
> > (http://www.apache.org/dev/release.html 
> > ) The culture is, in theory, to
> > release early and often. I don't buy an argument that it's too old, at
> > 2 weeks, when the alternative is having nothing at all to test
> > against.
> >
> > On Wed, Jun 1, 2016 at 5:02 PM, Michael Armbrust  > > wrote:
> >>> I'd think we want less effort, not more, to let people test it? for
> >>> example, right now I can't easily try my product build against
> >>> 2.0.0-preview.
> >>
> >>
> >> I don't feel super strongly one way or the other, so if we need to publish
> >> it permanently we can.
> >>
> >> However, either way you can still test against this release.  You just need
> >> to add a resolver as well (which is how I have always tested packages
> >> against RCs).  One concern with making it permeant is this preview release
> >> is already fairly far behind branch-2.0, so many of the issues that people
> >> might report have already been fixed and that might continue even after the
> >> release is made.  I'd rather be able to force upgrades eventually when we
> >> vote on the final 2.0 release.
> >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > 
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > 
> >
> >
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: dev-h...@spark.apache.org 
> 
> 
> 



Welcoming Yanbo Liang as a committer

2016-06-03 Thread Matei Zaharia
Hi all,

The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a 
super active contributor in many areas of MLlib. Please join me in welcoming 
Yanbo!

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT][VOTE] Removing module maintainer process

2016-05-26 Thread Matei Zaharia
Thanks everyone for voting. With only +1 votes, the vote passes, so I'll update 
the contributor wiki appropriately.

+1 votes:

Matei Zaharia (binding)
Mridul Muralidharan (binding)
Andrew Or (binding)
Sean Owen (binding)
Nick Pentreath (binding)
Tom Graves (binding)
Imran Rashid (binding)
Holden Karau
Owen O'Malley

No 0 or -1 votes.

Matei


> On May 24, 2016, at 12:27 PM, Owen O'Malley <omal...@apache.org> wrote:
> 
> +1 (non-binding)
> 
> I think this is an important step to improve Spark as an Apache project.
> 
> .. Owen
> 
> On Mon, May 23, 2016 at 11:18 AM, Holden Karau <hol...@pigscanfly.ca 
> <mailto:hol...@pigscanfly.ca>> wrote:
> +1 non-binding (as a contributor anything which speed things up is worth a 
> try, and git blame is a good enough substitute for the list when figuring out 
> who to ping on a PR).
> 
> 
> On Monday, May 23, 2016, Imran Rashid <iras...@cloudera.com 
> <mailto:iras...@cloudera.com>> wrote:
> +1 (binding)
> 
> On Mon, May 23, 2016 at 8:13 AM, Tom Graves <tgraves...@yahoo.com.invalid <>> 
> wrote:
> +1 (binding)
> 
> Tom
> 
> 
> On Sunday, May 22, 2016 7:34 PM, Matei Zaharia <matei.zaha...@gmail.com <>> 
> wrote:
> 
> 
> It looks like the discussion thread on this has only had positive replies, so 
> I'm going to call a VOTE. The proposal is to remove the maintainer process in 
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
>  given that it doesn't seem to have had a huge impact on the project, and it 
> can unnecessarily create friction in contributing. We already have +1s from 
> Mridul, Tom, Andrew Or and Imran on that thread.
> 
> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> 
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org <>
> For additional commands, e-mail: dev-h...@spark.apache.org <>
> 
> 
> 
> 
> 
> -- 
> Cell : 425-233-8271 
> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>
> 



Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Matei Zaharia
Just wondering, what is the main use case for the Docker images -- to develop 
apps locally or to deploy a cluster? If the image is really just a script to 
download a certain package name from a mirror, it may be okay to create an 
official one, though it does seem tricky to make it properly use the right 
mirror.

Matei

> On May 25, 2016, at 6:05 PM, Luciano Resende  wrote:
> 
> 
> 
> On Wed, May 25, 2016 at 2:34 PM, Sean Owen  > wrote:
> I don't think the project would bless anything but the standard
> release artifacts since only those are voted on. People are free to
> maintain whatever they like and even share it, as long as it's clear
> it's not from the Apache project.
> 
> 
> +1
> 
> 
> -- 
> Luciano Resende
> http://twitter.com/lresende1975 
> http://lresende.blogspot.com/ 


Re: [VOTE] Removing module maintainer process

2016-05-22 Thread Matei Zaharia
Correction, let's run this for 72 hours, so until 9 PM EST May 25th.

> On May 22, 2016, at 8:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> 
> It looks like the discussion thread on this has only had positive replies, so 
> I'm going to call a VOTE. The proposal is to remove the maintainer process in 
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
>  given that it doesn't seem to have had a huge impact on the project, and it 
> can unnecessarily create friction in contributing. We already have +1s from 
> Mridul, Tom, Andrew Or and Imran on that thread.
> 
> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> 
> Matei


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Removing module maintainer process

2016-05-22 Thread Matei Zaharia
It looks like the discussion thread on this has only had positive replies, so 
I'm going to call a VOTE. The proposal is to remove the maintainer process in 
https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers 
given that it doesn't seem to have had a huge impact on the project, and it 
can unnecessarily create friction in contributing. We already have +1s from 
Mridul, Tom, Andrew Or and Imran on that thread.

I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Matei Zaharia
Hi folks,

Around 1.5 years ago, Spark added a maintainer process for reviewing API and 
architectural changes 
(https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
 to make sure these are seen by people who spent a lot of time on that 
component. At the time, the worry was that changes might go unnoticed as the 
project grows, but there were also concerns that this approach makes the 
project harder to contribute to and less welcoming. Since implementing the 
model, I think that a good number of developers concluded it doesn't make a 
huge difference, so because of these concerns, it may be useful to remove it. 
I've also heard that we should try to keep some other instructions for 
contributors to find the "right" reviewers, so it would be great to see 
suggestions on that. For my part, I'd personally prefer something "automatic", 
such as easily tracking who reviewed each patch and having people look at the 
commit history of the module they want to work on, instead of a list that needs 
to be maintained separately.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all 
the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can 
document this on the website and say that it's mostly for users to find other 
users. Development discussions should happen on the dev mailing list and JIRA 
so that they can easily be archived and found afterward.

Matei

> On May 16, 2016, at 1:06 PM, Dood@ODDO  wrote:
> 
> On 5/16/2016 9:52 AM, Xinh Huynh wrote:
>> I just went to IRC. It looks like the correct channel is #apache-spark.
>> So, is this an "official" chat room for Spark?
>> 
> 
> Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is an 
> official channel on IRC for spark :-)
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how 
we update the docs so that people know to start with the spark.ml classes. 
Right now the docs list spark.mllib first and also seem more comprehensive in 
that area than in spark.ml, so maybe people naturally move towards that.
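
For example, here is a sketch of what "starting with spark.ml" looks like in 1.6 (toy 
data, assuming the sqlContext from spark-shell); this is the kind of DataFrame-based 
entry point the docs should lead with:

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.mllib.linalg.Vectors

  val training = sqlContext.createDataFrame(Seq(
    (1.0, Vectors.dense(0.0, 1.1, 0.1)),
    (0.0, Vectors.dense(2.0, 1.0, -1.0)),
    (1.0, Vectors.dense(0.0, 1.2, -0.5))
  )).toDF("label", "features")

  val model = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(training)
  println(model.coefficients)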

Matei

> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
> 
> Yes, DB (cc'ed) is working on porting the local linear algebra library over 
> (SPARK-13944). There are also frequent pattern mining algorithms we need to 
> port over in order to reach feature parity. -Xiangrui
> 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman 
> > wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml 
> 
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  > wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  > > wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API 
> >> has
> >> been developed under the spark.ml  package, while the 
> >> old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml 
> >> 
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > 
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > 
> >



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how 
we update the docs so that people know to start with the spark.ml classes. 
Right now the docs list spark.mllib first and also seem more comprehensive in 
that area than in spark.ml, so maybe people naturally move towards that.

Matei

> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
> 
> Yes, DB (cc'ed) is working on porting the local linear algebra library over 
> (SPARK-13944). There are also frequent pattern mining algorithms we need to 
> port over in order to reach feature parity. -Xiangrui
> 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman 
> > wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml 
> 
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  > wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  > > wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API 
> >> has
> >> been developed under the spark.ml  package, while the 
> >> old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml 
> >> 
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > 
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > 
> >



[jira] [Assigned] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-14356:
-

Assignee: Matei Zaharia

> Update spark.sql.execution.debug to work on Datasets
> 
>
> Key: SPARK-14356
> URL: https://issues.apache.org/jira/browse/SPARK-14356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Matei Zaharia
>    Assignee: Matei Zaharia
>Priority: Minor
>
> Currently it only works on DataFrame, which seems unnecessarily restrictive 
> for 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-14356:
-

 Summary: Update spark.sql.execution.debug to work on Datasets
 Key: SPARK-14356
 URL: https://issues.apache.org/jira/browse/SPARK-14356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Matei Zaharia
Priority: Minor


Currently it only works on DataFrame, which seems unnecessarily restrictive for 
2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Matei Zaharia
I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the entire 
2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's the default 
version we built with in 1.x. We want to make the transition from 1.x to 2.0 as 
easy as possible. In 2.0, we'll have the default downloads be for Scala 2.11, 
so people will more easily move, but we shouldn't create obstacles that lead to 
fragmenting the community and slowing down Spark 2.0's adoption. I've seen 
companies that stayed on an old Scala version for multiple years because 
switching it, or mixing versions, would affect the company's entire codebase.

Matei

> On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:
> 
> oh wow, had no idea it got ripped out
> 
> On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra  > wrote:
> No, with 2.0 Spark really doesn't use Akka: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
>  
> 
> 
> On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers  > wrote:
> Spark still runs on akka. So if you want the benefits of the latest akka (not 
> saying we do, was just an example) then you need to drop scala 2.10
> 
> On Mar 30, 2016 10:44 AM, "Cody Koeninger"  > wrote:
> I agree with Mark in that I don't see how supporting scala 2.10 for
> spark 2.0 implies supporting it for all of spark 2.x
> 
> Regarding Koert's comment on akka, I thought all akka dependencies
> have been removed from spark after SPARK-7997 and the recent removal
> of external/akka
> 
> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra  > wrote:
> > Dropping Scala 2.10 support has to happen at some point, so I'm not
> > fundamentally opposed to the idea; but I've got questions about how we go
> > about making the change and what degree of negative consequences we are
> > willing to accept.  Until now, we have been saying that 2.10 support will be
> > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial for some
> > Spark users, so abruptly dropping 2.10 support is very likely to delay
> > migration to Spark 2.0 for those users.
> >
> > What about continuing 2.10 support in 2.0.x, but repeatedly making an
> > obvious announcement in multiple places that such support is deprecated,
> > that we are not committed to maintaining it throughout 2.x, and that it is,
> > in fact, scheduled to be removed in 2.1.0?
> >
> > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen  > > wrote:
> >>
> >> (This should fork as its own thread, though it began during discussion
> >> of whether to continue Java 7 support in Spark 2.x.)
> >>
> >> Simply: would like to more clearly take the temperature of all
> >> interested parties about whether to support Scala 2.10 in the Spark
> >> 2.x lifecycle. Some of the arguments appear to be:
> >>
> >> Pro
> >> - Some third party dependencies do not support Scala 2.11+ yet and so
> >> would not be usable in a Spark app
> >>
> >> Con
> >> - Lower maintenance overhead -- no separate 2.10 build,
> >> cross-building, tests to check, esp considering support of 2.12 will
> >> be needed
> >> - Can use 2.11+ features freely
> >> - 2.10 was EOL in late 2014 and Spark 2.x lifecycle is years to come
> >>
> >> I would like to not support 2.10 for Spark 2.x, myself.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> >> 
> >> For additional commands, e-mail: dev-h...@spark.apache.org 
> >> 
> >>
> >
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: dev-h...@spark.apache.org 
> 
> 
> 
> 



Welcoming two new committers

2016-02-08 Thread Matei Zaharia
Hi all,

The PMC has recently added two new Spark committers -- Herman van Hovell and 
Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, adding 
new features, optimizations and APIs. Please join me in welcoming Herman and 
Wenchen.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
They'll be able to run concurrently and share workers / data. Take a look at 
http://spark.apache.org/docs/latest/job-scheduling.html 
<http://spark.apache.org/docs/latest/job-scheduling.html> for how scheduling 
happens across multiple running jobs in the same SparkContext.
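
Roughly what that looks like in an application: two jobs submitted from separate 
threads against the same cached RDD, with FAIR scheduling turned on so they share the 
executors (app name, pool names and data are made up for the example):

  import org.apache.spark.{SparkConf, SparkContext}
  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  val conf = new SparkConf().setAppName("concurrent-jobs").setMaster("local[4]")
    .set("spark.scheduler.mode", "FAIR")
  val sc = new SparkContext(conf)
  val data = sc.parallelize(1 to 1000000).cache()

  val jobA = Future {
    sc.setLocalProperty("spark.scheduler.pool", "poolA")
    data.map(_ * 2).count()
  }
  val jobB = Future {
    sc.setLocalProperty("spark.scheduler.pool", "poolB")
    data.filter(_ % 2 == 0).count()
  }
  println((Await.result(jobA, Duration.Inf), Await.result(jobB, Duration.Inf)))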

Matei

> On Jan 17, 2016, at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> Same rdd means same sparkcontext means same workers
> 
> Cache/persist the rdd to avoid repeated jobs
> 
> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com 
> <mailto:mennou...@gmail.com>> wrote:
> Hi,
> 
> Thank you all for your answers,
> 
> If I correctly understand, actions (in my case foreach) can be run 
> concurrently and simultaneously on the SAME rdd, (which is logical because 
> they are read only object). however, I want to know if the same workers are 
> used for the concurrent analysis ?
> 
> Thank you
> 
> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com 
> <mailto:joder...@gmail.com>>:
> I stand corrected. How considerable are the benefits though? Will the 
> scheduler be able to dispatch jobs from both actions simultaneously (or on a 
> when-workers-become-available basis)?
> 
> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>> wrote:
> we run multiple actions on the same (cached) rdd all the time, i guess in 
> different threads indeed (its in akka)
> 
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> RDDs actually are thread-safe, and quite a few applications use them this 
> way, e.g. the JDBC server.
> 
> Matei
> 
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com 
>> <mailto:joder...@gmail.com>> wrote:
>> 
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions in 
>> parallel? The idea behind RDDs is to provide you with an abstraction for 
>> computing parallel operations on distributed data. Even if you were to call 
>> actions from several threads at once, the individual executors of your spark 
>> environment would still have to perform operations sequentially.
>> 
>> As an alternative, I would suggest to restructure your RDD transformations 
>> to compute the required results in one single operation.
>> 
>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com 
>> <mailto:jcove...@gmail.com>> wrote:
>> Threads
>> 
>> 
>> El viernes, 15 de enero de 2016, Kira <mennou...@gmail.com 
>> <mailto:mennou...@gmail.com>> escribió:
>> Hi,
>> 
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
>> be done ?
>> 
>> Thank you,
>> Regards
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>  
>> <http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com 
>> <http://nabble.com/>.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <>
>> For additional commands, e-mail: user-h...@spark.apache.org <>
>> 
>> 
> 
> 
> 
> 



Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, 
e.g. the JDBC server.
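
A minimal sketch of that: two plain threads firing actions at the same cached RDD 
(toy data and local master, made up for the example):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("two-actions").setMaster("local[4]"))
  val rdd = sc.parallelize(1 to 1000000).cache()

  val t1 = new Thread(new Runnable {
    def run(): Unit = println("sum = " + rdd.map(_.toLong).reduce(_ + _))
  })
  val t2 = new Thread(new Runnable {
    def run(): Unit = println("multiples of 3 = " + rdd.filter(_ % 3 == 0).count())
  })
  t1.start(); t2.start()
  t1.join(); t2.join()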

Matei

> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
> 
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in 
> parallel? The idea behind RDDs is to provide you with an abstraction for 
> computing parallel operations on distributed data. Even if you were to call 
> actions from several threads at once, the individual executors of your spark 
> environment would still have to perform operations sequentially.
> 
> As an alternative, I would suggest to restructure your RDD transformations to 
> compute the required results in one single operation.
> 
> On 15 January 2016 at 06:18, Jonathan Coveney  > wrote:
> Threads
> 
> 
> El viernes, 15 de enero de 2016, Kira  > escribió:
> Hi,
> 
> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
> be done ?
> 
> Thank you,
> Regards
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <>
> For additional commands, e-mail: user-h...@spark.apache.org <>
> 
> 



Re: Compiling only MLlib?

2016-01-15 Thread Matei Zaharia
Have you tried just downloading a pre-built package, or linking to Spark 
through Maven? You don't need to build it unless you are changing code inside 
it. Check out 
http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
 for how to link to it.
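
If you go the linking route, a minimal sbt build is enough and no Spark build is 
needed at all (the project name is made up; the versions match the 1.6.0 / Scala 2.10 
setup you describe):

  name := "mllib-experiments"

  scalaVersion := "2.10.6"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % "1.6.0",
    "org.apache.spark" %% "spark-mllib" % "1.6.0"
  )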

Matei

> On Jan 15, 2016, at 6:13 PM, Colin Woodbury  wrote:
> 
> Hi, I'm very much interested in using Spark's MLlib in standalone programs. 
> I've never used Hadoop, and don't intend to deploy on massive clusters. 
> Building Spark has been an honest nightmare, and I've been on and off it for 
> weeks.
> 
> The build always runs out of RAM on my laptop (4g of RAM, Arch Linux) when I 
> try to build with Scala 2.11 support. No matter how I tweak JVM flags to 
> reduce maximum RAM use, the build always crashes.
> 
> When trying to build Spark 1.6.0 for Scala 2.10 just now, the build had 
> compilation errors. Here is one, as a sample. I've saved the rest:
> 
> [error] 
> /home/colin/building/apache-spark/spark-1.6.0/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:16:
>  object jline is not a member of package tools
> [error] import scala.tools.jline.console.completer._
> 
> It informs me:
> 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :spark-repl_2.10
> 
> I don't feel safe doing that, given that I don't know what my "" are. 
> 
> I've noticed that the build is compiling a lot of things I have no interest 
> in. Is it possible to just compile the Spark core, its tools, and MLlib? I 
> just want to experiment, and this is causing me a  lot of stress.
> 
> Thank you kindly,
> Colin



Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Matei Zaharia
In production, I'd recommend using IAM roles to avoid having keys altogether. 
Take a look at 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
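
With a role attached to the instances, the read itself needs no credentials anywhere. 
A rough sketch (bucket and path are placeholders, and this assumes the s3a 
connector's default credential chain, which falls back to the instance profile when 
no keys are configured):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("s3-read"))
  val lines = sc.textFile("s3a://my-bucket/test/testdata")
  println(lines.count())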

Matei

> On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan 
>  wrote:
> 
> If you are on EMR, these can go into your hdfs site config. And will work 
> with Spark on YARN by default.
> 
> Regards
> Sab
> 
> On 11-Jan-2016 5:16 pm, "Krishna Rao"  > wrote:
> Hi all,
> 
> Is there a method for reading from s3 without having to hard-code keys? The 
> only 2 ways I've found both require this:
> 
> 1. Set conf in code e.g.: 
> sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 
> 
> 2. Set keys in URL, e.g.:
> sc.textFile("s3n://@/bucket/test/testdata")
> 
> 
> Both if which I'm reluctant to do within production code!
> 
> 
> Cheers



[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null

2015-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038058#comment-15038058
 ] 

Matei Zaharia commented on SPARK-10854:
---

Just a note, I saw a log where this happened, and the sequence of events is 
that the executor logs a launchTask callback before registered(). It could be a 
synchronization thing or a problem in the Mesos library.

> MesosExecutorBackend: Received launchTask but executor was null
> ---
>
> Key: SPARK-10854
> URL: https://issues.apache.org/jira/browse/SPARK-10854
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0
> Mesos 0.23.0
> Docker 1.8.1
>Reporter: Kevin Matzen
>Priority: Minor
>
> Sometimes my tasks get stuck in staging.  Here's stdout from one such worker. 
>  I'm running mesos-slave inside a docker container with the host's docker 
> exposed and I'm using Spark's docker support to launch the worker inside its 
> own container.  Both containers are running.  I'm using pyspark.  I can see 
> mesos-slave and java running, but I do not see python running.
> {noformat}
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for 
> [TERM, HUP, INT]
> I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0
> 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but 
> executor was null
> I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave 
> 20150928-044200-1140850698-5050-8-S190
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus
> 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started
> 15/09/28 15:02:10 INFO Remoting: Starting remoting
> 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkExecutor@:56458]
> 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 56458.
> 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b
> 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB
> 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/09/28 15:02:11 INFO Executor: Starting executor ID 
> 20150928-044200-1140850698-5050-8-S190 on host 
> 15/09/28 15:02:11 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431.
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431
> 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: A proposal for Spark 2.0

2015-11-24 Thread Matei Zaharia
 removing or replacing them immediately. That way 2.0 doesn’t 
> have to wait for everything that we want to deprecate to be replaced all at 
> once.
> 
> Nick
> 
> ​
> 
>  
> 
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com 
> <mailto:alexander.ula...@hpe.com>> wrote:
> 
> Parameter Server is a new feature and thus does not match the goal of 2.0 is 
> “to fix things that are broken in the current API and remove certain 
> deprecated APIs”. At the same time I would be happy to have that feature.
> 
>  
> 
> With regards to Machine learning, it would be great to move useful features 
> from MLlib to ML and deprecate the former. Current structure of two separate 
> machine learning packages seems to be somewhat confusing.
> 
> With regards to GraphX, it would be great to deprecate the use of RDD in 
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
> 
>  
> 
> Best regards, Alexander
> 
>  
> 
> From: Nan Zhu [mailto:zhunanmcg...@gmail.com <mailto:zhunanmcg...@gmail.com>] 
> Sent: Thursday, November 12, 2015 7:28 AM
> To: wi...@qq.com <mailto:wi...@qq.com>
> Cc: dev@spark.apache.org <mailto:dev@spark.apache.org>
> Subject: Re: A proposal for Spark 2.0
> 
>  
> 
> Being specific to Parameter Server, I think the current agreement is that PS 
> shall exist as a third-party library instead of a component of the core code 
> base, isn’t?
> 
>  
> 
> Best,
> 
>  
> 
> -- 
> 
> Nan Zhu
> 
> http://codingcat.me <http://codingcat.me/>
>  
> 
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com <mailto:wi...@qq.com> 
> wrote:
> 
> Who has the idea of machine learning? Spark missing some features for machine 
> learning, For example, the parameter server.
> 
>  
> 
>  
> 
> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> 
>  
> 
> I like the idea of popping out Tachyon to an optional component too to reduce 
> the number of dependencies. In the future, it might even be useful to do this 
> for Hadoop, but it requires too many API changes to be worth doing now.
> 
>  
> 
> Regarding Scala 2.12, we should definitely support it eventually, but I don't 
> think we need to block 2.0 on that because it can be added later too. Has 
> anyone investigated what it would take to run on there? I imagine we don't 
> need many code changes, just maybe some REPL stuff.
> 
>  
> 
> Needless to say, but I'm all for the idea of making "major" releases as 
> undisruptive as possible in the model Reynold proposed. Keeping everyone 
> working with the same set of releases is super important.
> 
>  
> 
> Matei
> 
>  
> 
> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com 
> <mailto:so...@cloudera.com>> wrote:
> 
>  
> 
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com 
> <mailto:r...@databricks.com>> wrote:
> 
> to the Spark community. A major release should not be very different from a
> 
> minor release and should not be gated based on new features. The main
> 
> purpose of a major release is an opportunity to fix things that are broken
> 
> in the current API and remove certain deprecated APIs (examples follow).
> 
>  
> 
> Agree with this stance. Generally, a major release might also be a
> 
> time to replace some big old API or implementation with a new one, but
> 
> I don't see obvious candidates.
> 
>  
> 
> I wouldn't mind turning attention to 2.x sooner than later, unless
> 
> there's a fairly good reason to continue adding features in 1.x to a
> 
> 1.7 release. The scope as of 1.6 is already pretty darned big.
> 
>  
> 
>  
> 
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
> 
> it has been end-of-life.
> 
>  
> 
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> 
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
> 
> dropping 2.10. Otherwise it's supported for 2 more years.
> 
>  
> 
>  
> 
> 2. Remove Hadoop 1 support.
> 
>  
> 
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> 
> sort of 'alpha' and 'beta' releases) and even <2.6.
> 
>  
> 
> I'm sure we'll think of a number of other small things -- shading a
> 
> bunch of stuff? reviewing and updating dependencies in light of
> 
> simpler, more recent dependencies to support from Hadoop etc?
> 
>  
> 
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> 
> Pop out any Docker stuff to another repo?
> 
> Continue that same effort for EC2?
> 
> Farming out some of the "external" integrations to another repo (?
> 
> controversial)
> 
>  
> 
> See also anything marked version "2+" in JIRA.
> 
>  
> 
> -
> 
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> For additional commands, e-mail: dev-h...@spark.apache.org 
> <mailto:dev-h...@spark.apache.org>
>  
> 
>  
> 
> -
> 
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> For additional commands, e-mail: dev-h...@spark.apache.org 
> <mailto:dev-h...@spark.apache.org>
>  
> 
>  
> 
>  
> 
>  
> 
> -
> 
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> For additional commands, e-mail: dev-h...@spark.apache.org 
> <mailto:dev-h...@spark.apache.org>
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> 
> 
> 
> 
> 



[jira] [Created] (SPARK-11733) Allow shuffle readers to request data from just one mapper

2015-11-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-11733:
-

 Summary: Allow shuffle readers to request data from just one mapper
 Key: SPARK-11733
 URL: https://issues.apache.org/jira/browse/SPARK-11733
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


This is needed to do broadcast joins. Right now the shuffle reader interface 
takes a range of reduce IDs but fetches from all maps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Spark-Kernel Incubator Proposal

2015-11-13 Thread Matei Zaharia
One question about this from the Spark side: have you considered giving the 
project a different name so that it doesn't sound like a Spark component? Right 
now "Spark Kernel" may be confused with "Spark Core" and things like that. I 
don't see a lot of Apache TLPs with related names, though maybe there's nothing 
wrong with that.

In terms of whether to put this in Apache Spark proper, we can have a 
discussion about it later, but my feeling is that it's not necessary. One 
reason is that this only uses public APIs, and another is that there are also 
other notebook interfaces over Spark (e.g. Zeppelin).

Matei

> On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
> 
> Hello, we would like to start a discussion on accepting the Spark-Kernel,
> a mechanism for applications to interactively and remotely access Apache
> Spark, into the Apache Incubator.
> 
> The proposal is available online at
> https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended
> to this email.
> 
> We are looking for additional mentors to help with this project, and we
> would much appreciate your guidance and advice.
> 
> Thank-you in advance,
> David Fallside
> 
> 
> 
> = Spark-Kernel Proposal =
> 
> == Abstract ==
> Spark-Kernel provides applications with a mechanism to interactively and
> remotely access Apache Spark.
> 
> == Proposal ==
> The Spark-Kernel enables interactive applications to access Apache Spark
> clusters. More specifically:
> * Applications can send code-snippets and libraries for execution by Spark
> * Applications can be deployed separately from Spark clusters and
> communicate with the Spark-Kernel using the provided Spark-Kernel client
> * Execution results and streaming data can be sent back to calling
> applications
> * Applications no longer have to be network connected to the workers on a
> Spark cluster because the Spark-Kernel acts as each application’s proxy
> * Work has started on enabling Spark-Kernel to support languages in
> addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL
> (with SparkSQL)
> 
> == Background & Rationale ==
> Apache Spark provides applications with a fast and general purpose
> distributed computing engine that supports static and streaming data,
> tabular and graph representations of data, and an extensive set of
> machine learning libraries. Consequently, a wide variety of applications
> will be written for Spark and there will be interactive applications that
> require relatively frequent function evaluations, and batch-oriented
> applications that require one-shot or only occasional evaluation.
> 
> Apache Spark provides two mechanisms for applications to connect with
> Spark. The primary mechanism launches applications on Spark clusters using
> spark-submit
> (http://spark.apache.org/docs/latest/submitting-applications.html); this
> requires developers to bundle their application code plus any dependencies
> into JAR files, and then submit them to Spark. A second mechanism is an
> ODBC/JDBC API
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine)
> which enables applications to issue SQL queries against SparkSQL.
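
For concreteness, the first mechanism typically looks like the invocation below;
the class name, jar path, and resource settings are placeholders:

    # Illustrative spark-submit invocation of the kind described above.
    ./bin/spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --executor-memory 2g \
      --num-executors 4 \
      /path/to/my-app-assembly.jar arg1 arg2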
> 
> Our experience when developing interactive applications, such as analytic
> applications and Jupyter Notebooks, to run against Spark was that the
> spark-submit mechanism was overly cumbersome and slow (requiring JAR
> creation and forking processes to run spark-submit), and the SQL interface
> was too limiting and did not offer easy access to components other than
> SparkSQL, such as streaming. The most promising mechanism provided by
> Apache Spark was the command-line shell
> (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell)
> which enabled us to execute code snippets and dynamically control the
> tasks submitted to  a Spark cluster. Spark does not provide the
> command-line shell as a consumable service but it provided us with the
> starting point from which we developed the Spark-Kernel.
> 
> == Current Status ==
> Spark-Kernel was first developed by a small team working on an
> internal-IBM Spark-related project in July 2014. In recognition of its
> likely general utility to Spark users and developers, in November 2014 the
> Spark-Kernel project was moved to GitHub and made available under the
> Apache License V2.
> 
> == Meritocracy ==
> The current developers are familiar with the meritocratic open source
> development process at Apache. As the project has gathered interest at
> GitHub the developers have actively started a process to invite additional
> developers into the project, and we have at least one new developer who is
> ready to contribute code to the project.
> 
> == Community ==
> We started building a community around the Spark-Kernel project when we
> moved it to GitHub about one year ago. Since then we have grown to about
> 70 people, and there are regular requests and suggestions from the
> community. We believe that 

Re: A proposal for Spark 2.0

2015-11-11 Thread Matei Zaharia
I like the idea of popping out Tachyon to an optional component too to reduce 
the number of dependencies. In the future, it might even be useful to do this 
for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't 
think we need to block 2.0 on that because it can be added later too. Has 
anyone investigated what it would take to run on it? I imagine we don't need 
many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as 
undisruptive as possible in the model Reynold proposed. Keeping everyone 
working with the same set of releases is super important.

Matei

> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
> 
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin  wrote:
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
> 
> Agree with this stance. Generally, a major release might also be a
> time to replace some big old API or implementation with a new one, but
> I don't see obvious candidates.
> 
> I wouldn't mind turning attention to 2.x sooner than later, unless
> there's a fairly good reason to continue adding features in 1.x to a
> 1.7 release. The scope as of 1.6 is already pretty darned big.
> 
> 
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>> it has been end-of-life.
> 
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
> dropping 2.10. Otherwise it's supported for 2 more years.
> 
> 
>> 2. Remove Hadoop 1 support.
> 
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> sort of 'alpha' and 'beta' releases) and even <2.6.
> 
> I'm sure we'll think of a number of other small things -- shading a
> bunch of stuff? reviewing and updating dependencies in light of
> simpler, more recent dependencies to support from Hadoop etc?
> 
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Pop out any Docker stuff to another repo?
> Continue that same effort for EC2?
> Farming out some of the "external" integrations to another repo (?
> controversial)
> 
> See also anything marked version "2+" in JIRA.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961567#comment-14961567
 ] 

Matei Zaharia commented on SPARK-:
--

Beyond tuples, you'll also want encoders for other generic classes, such as 
Seq[T]. They're the cleanest mechanism to get the most type info. Also, from a 
software engineering point of view it's nice to avoid a central object where 
you register stuff to allow composition between libraries (basically, see the 
problems that the Kryo registry creates today).
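
For readers who have not seen implicit, composable encoders in practice, here is
a small sketch against the Dataset API as it later shipped; it assumes a
SparkSession-style entry point and is not part of the discussion above:

    import org.apache.spark.sql.SparkSession

    // Top-level case class so the implicit product encoder can be derived for it.
    case class Point(x: Double, y: Double)

    object EncoderSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("encoder-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._ // encoders come from implicits in scope, not a global registry

        // Encoders compose from the element types themselves: tuples, Seq[T], case classes.
        val pairs  = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDS() // Encoder[(Int, Seq[String])]
        val points = Seq(Point(0.0, 1.0), Point(2.0, 3.0)).toDS()  // Encoder[Point]

        points.map(p => p.x + p.y).show()
        pairs.show()
        spark.stop()
      }
    }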

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How to compile Spark with customized Hadoop?

2015-10-09 Thread Matei Zaharia
You can publish your version of Hadoop to your local Maven cache with mvn install 
(just give it a different version number, e.g. 2.7.0a) and then pass that as 
the Hadoop version to Spark's build (see 
http://spark.apache.org/docs/latest/building-spark.html).
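
A minimal sketch of that workflow; the version number and paths are assumptions,
and the exact build flags for your Spark version are listed in building-spark.html:

    # 1) Give the modified Hadoop its own version number and install it into the
    #    local Maven cache (~/.m2/repository).
    cd /path/to/hadoop-src
    mvn versions:set -DnewVersion=2.7.0a
    mvn install -DskipTests

    # 2) Build Spark against that version.
    cd /path/to/spark-src
    ./build/mvn clean package -DskipTests -Dhadoop.version=2.7.0a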

Matei

> On Oct 9, 2015, at 3:10 PM, Dogtail L  wrote:
> 
> Hi all,
> 
> I have modified Hadoop source code, and I want to compile Spark with my 
> modified Hadoop. Do you know how to do that? Great thanks!



[jira] [Commented] (SPARK-9850) Adaptive execution in Spark

2015-09-24 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907518#comment-14907518
 ] 

Matei Zaharia commented on SPARK-9850:
--

Hey Imran, this could make sense, but note that the problem will only happen if 
you have 2000 map *output* partitions, which would've been 2000 reduce tasks 
normally. Otherwise, you can have as many map *tasks* as needed with fewer 
partitions. In most jobs, I'd expect data to get significantly smaller after 
the maps, so we'd catch that. In particular, for choosing between broadcast and 
shuffle joins this should be fine. We can do something different if we suspect 
that there is going to be tons of map output *and* we think there's nontrivial 
planning to be done once we see it.
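
The broadcast-vs-shuffle choice mentioned here boils down to a size check made
after the map stage has run. A toy sketch, with an illustrative threshold and no
relation to Spark internals:

    object JoinChoiceSketch {
      // Once map output sizes are known, an adaptive planner can pick the join
      // strategy instead of committing to one up front. Threshold is illustrative.
      def chooseJoinStrategy(mapOutputBytesPerSide: Seq[Long],
                             broadcastThreshold: Long = 10L * 1024 * 1024): String = {
        if (mapOutputBytesPerSide.min <= broadcastThreshold) "broadcast join"
        else "shuffle join"
      }

      def main(args: Array[String]): Unit = {
        // One side is 2 MB, the other 40 MB: broadcast the small side.
        println(chooseJoinStrategy(Seq(2L << 20, 40L << 20))) // prints "broadcast join"
      }
    }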

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>    Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9852.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>    Reporter: Matei Zaharia
>    Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9852:
-
Summary: Let reduce tasks fetch multiple map output partitions  (was: Let 
HashShuffleFetcher fetch multiple map output partitions)

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>    Reporter: Matei Zaharia
>    Assignee: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-09-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9851.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Support submitting map stages individually in DAGScheduler
> --
>
> Key: SPARK-9851
> URL: https://issues.apache.org/jira/browse/SPARK-9851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>    Reporter: Matei Zaharia
>    Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
If you run on YARN, you can use Kerberos, be authenticated as the right user, 
etc in the same way as MapReduce jobs.
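
A sketch of what that looks like in practice on a kerberized YARN cluster; the
principal, keytab path, and application details are placeholders:

    # spark-submit obtains and renews delegation tokens from the given principal
    # and keytab; the job then runs on YARN as that authenticated user.
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal analyst@EXAMPLE.COM \
      --keytab /etc/security/keytabs/analyst.keytab \
      --class com.example.SecureJob \
      secure-job.jar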

Matei

> On Sep 3, 2015, at 1:37 PM, Daniel Schulz  
> wrote:
> 
> Hi,
> 
> I really enjoy using Spark. An obstacle to sell it to our clients currently 
> is the missing Kerberos-like security on a Hadoop with simple authentication. 
> Are there plans, a proposal, or a project to deliver a Ranger plugin or 
> something similar to Spark. The target is to differentiate users and their 
> privileges when reading and writing data to HDFS? Is Kerberos my only option 
> then?
> 
> Kind regards, Daniel.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
Even simple Spark-on-YARN should run as the user that submitted the job, yes, 
so HDFS ACLs should be enforced. Not sure how it plays with the rest of Ranger.

Matei

> On Sep 3, 2015, at 4:57 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Well if it needs to read from hdfs then it will adhere to the permissions 
> defined there and/or in Ranger. However, I am not aware that you can protect 
> dataframes, tables or streams in general in Spark.
> 
> On Thu, Sep 3, 2015 at 21:47, Daniel Schulz <danielschulz2...@hotmail.com 
> <mailto:danielschulz2...@hotmail.com>> wrote:
> Hi Matei,
> 
> Thanks for your answer.
> 
> My question is regarding simple authenticated Spark-on-YARN only, without 
> Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my 
> HDFS user and only be able to access files I am entitled to read/write? Will 
> it enforce HDFS ACLs and Ranger policies as well?
> 
> Best regards, Daniel.
> 
> > On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com 
> > <mailto:matei.zaha...@gmail.com>> wrote:
> >
> > If you run on YARN, you can use Kerberos, be authenticated as the right 
> > user, etc in the same way as MapReduce jobs.
> >
> > Matei
> >
> >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz <danielschulz2...@hotmail.com 
> >> <mailto:danielschulz2...@hotmail.com>> wrote:
> >>
> >> Hi,
> >>
> >> I really enjoy using Spark. An obstacle to sell it to our clients 
> >> currently is the missing Kerberos-like security on a Hadoop with simple 
> >> authentication. Are there plans, a proposal, or a project to deliver a 
> >> Ranger plugin or something similar to Spark. The target is to 
> >> differentiate users and their privileges when reading and writing data to 
> >> HDFS? Is Kerberos my only option then?
> >>
> >> Kind regards, Daniel.
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> >> <mailto:user-unsubscr...@spark.apache.org>
> >> For additional commands, e-mail: user-h...@spark.apache.org 
> >> <mailto:user-h...@spark.apache.org>
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 



[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9853:


Assignee: Matei Zaharia

 Optimize shuffle fetch of contiguous partition IDs
 --

 Key: SPARK-9853
 URL: https://issues.apache.org/jira/browse/SPARK-9853
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor

 On the map side, we should be able to serve a block representing multiple 
 partition IDs in one block manager request



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-10008.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Shuffle locality can take precedence over narrow dependencies for RDDs with 
 both
 

 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.5.0


 The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
 for RDDs that have both narrow and shuffle dependencies, it can cause them to 
 place tasks based on the shuffle dependency instead of the narrow one. This 
 case is common in iterative join-based algorithms like PageRank and ALS, 
 where one RDD is hash-partitioned and one isn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-10008:
-

Assignee: Matei Zaharia

 Shuffle locality can take precedence over narrow dependencies for RDDs with 
 both
 

 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia
Assignee: Matei Zaharia

 The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
 for RDDs that have both narrow and shuffle dependencies, it can cause them to 
 place tasks based on the shuffle dependency instead of the narrow one. This 
 case is common in iterative join-based algorithms like PageRank and ALS, 
 where one RDD is hash-partitioned and one isn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-10008:
-

 Summary: Shuffle locality can take precedence over narrow 
dependencies for RDDs with both
 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia


The shuffle locality patch made the DAGScheduler aware of shuffle data, but for 
RDDs that have both narrow and shuffle dependencies, it can cause them to place 
tasks based on the shuffle dependency instead of the narrow one. This case is 
common in iterative join-based algorithms like PageRank and ALS, where one RDD 
is hash-partitioned and one isn't.
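
The pattern being described is the classic iterative join, roughly as below; the
input file, partition count, and iteration count are placeholders:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object LocalityExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("locality-example").setMaster("local[*]"))

        // `links` is hash-partitioned and cached; `ranks` is rebuilt every iteration,
        // so the join typically ends up with one narrow dependency (on links) and
        // one shuffle dependency (on ranks) -- the case this issue is about.
        val links = sc.textFile("links.txt")
          .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
          .groupByKey(new HashPartitioner(8))
          .cache()

        var ranks = links.mapValues(_ => 1.0)
        for (_ <- 1 to 10) {
          val contribs = links.join(ranks).flatMap { case (_, (dests, rank)) =>
            dests.map(d => (d, rank / dests.size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }
        ranks.take(10).foreach(println)
        sc.stop()
      }
    }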



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-08-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9851:
-
Summary: Support submitting map stages individually in DAGScheduler  (was: 
Add support for submitting map stages individually in DAGScheduler)

 Support submitting map stages individually in DAGScheduler
 --

 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9923:
-
Labels: Starter  (was: )

 ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
 

 Key: SPARK-9923
 URL: https://issues.apache.org/jira/browse/SPARK-9923
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Trivial
  Labels: Starter

 Not sure why it was made a Long, but every usage assumes it's an Int.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9923:


 Summary: ShuffleMapStage.numAvailableOutputs should be an Int 
instead of Long
 Key: SPARK-9923
 URL: https://issues.apache.org/jira/browse/SPARK-9923
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Trivial


Not sure why it was made a Long, but every usage assumes it's an Int.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-12 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Issue Type: Epic  (was: New Feature)

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: Epic
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Yin Huai
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Assignee: Yin Huai

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Yin Huai
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9851:


Assignee: Matei Zaharia

 Add support for submitting map stages individually in DAGScheduler
 --

 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9852:


 Summary: Let HashShuffleFetcher fetch multiple map output 
partitions
 Key: SPARK-9852
 URL: https://issues.apache.org/jira/browse/SPARK-9852
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9852:


Assignee: Matei Zaharia

 Let HashShuffleFetcher fetch multiple map output partitions
 ---

 Key: SPARK-9852
 URL: https://issues.apache.org/jira/browse/SPARK-9852
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9851:


 Summary: Add support for submitting map stages individually in 
DAGScheduler
 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Attachment: AdaptiveExecutionInSpark.pdf

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9850:


 Summary: Adaptive execution in Spark
 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia


Query planning is one of the main factors in high performance, but the current 
Spark engine requires the execution DAG for a job to be set in advance. Even 
with cost-based optimization, it is hard to know the behavior of data and 
user-defined functions well enough to always get great execution plans. This 
JIRA proposes to add adaptive query execution, so that the engine can change 
the plan for each query as it sees what data earlier stages produced.

We propose adding this to Spark SQL / DataFrames first, using a new API in the 
Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the 
functionality could be extended to other libraries or the RDD API, but that is 
more difficult than adding it in SQL.

I've attached a design doc by Yin Huai and myself explaining how it would work 
in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9853:


 Summary: Optimize shuffle fetch of contiguous partition IDs
 Key: SPARK-9853
 URL: https://issues.apache.org/jira/browse/SPARK-9853
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia
Priority: Minor


On the map side, we should be able to serve a block representing multiple 
partition IDs in one block manager request
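
A hypothetical sketch of the kind of block identifier this implies (illustrative
only, not the actual BlockId hierarchy):

    // One fetch request covering a contiguous run of reduce partitions from a
    // single map output, instead of one request per (map, reduce) pair.
    case class ShuffleBlockRangeId(shuffleId: Int, mapId: Int,
                                   startReduceId: Int, endReduceId: Int) {
      require(startReduceId < endReduceId, "range must be non-empty")
      def name: String = s"shuffle_${shuffleId}_${mapId}_${startReduceId}_${endReduceId}"
      def numPartitions: Int = endReduceId - startReduceId
    }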



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9244.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Increase some default memory limits
 ---

 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor
 Fix For: 1.5.0


 There are a few memory limits that people hit often and that we could make 
 higher, especially now that memory sizes have grown.
 - spark.akka.frameSize: This defaults at 10 but is often hit for map output 
 statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
 so we can just make this larger and still not affect jobs that never sent a 
 status that large.
 - spark.executor.memory: Defaults at 512m, which is really small. We can at 
 least increase it to 1g, though this is something users do need to set on 
 their own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9244) Increase some default memory limits

2015-07-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9244:


 Summary: Increase some default memory limits
 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia


There are a few memory limits that people hit often and that we could make 
higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults at 10 but is often hit for map output 
statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
so we can just make this larger and still not affect jobs that never sent a 
status that large.

- spark.executor.memory: Defaults at 512m, which is really small. We can at 
least increase it to 1g, though this is something users do need to set on their 
own.
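
For reference, a minimal sketch of overriding these two settings explicitly; the
values shown are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryLimitsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("memory-limits-example")
          .setMaster("local[*]")
          .set("spark.akka.frameSize", "128")  // in MB; raise if map output statuses grow large
          .set("spark.executor.memory", "2g")  // per-executor heap
        val sc = new SparkContext(conf)
        println(sc.getConf.get("spark.akka.frameSize"))
        sc.stop()
      }
    }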



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


