Re: Signal/Noise Ratio

2014-02-21 Thread Reynold Xin
FYI, I submitted an ASF INFRA ticket to grant the AMPLab Jenkins
permission to use the GitHub commit status API.

If that goes through, we can configure Jenkins to use the commit status API
without leaving comments on the pull requests.

https://issues.apache.org/jira/browse/INFRA-7367



On Fri, Feb 21, 2014 at 11:14 AM, Ethan Jewett  wrote:

> Thanks for the pointer Aaron. Very helpful.
>
> I won't harp on this any more after this email: my reading is that the main
> concern is archiving discussion, which could be achieved using a separate
> mailing list. Major decisions should clearly happen on the dev list so
> everyone is informed, but I don't see a situation where that hadn't been
> happening anyway (which is why I read the dev list regularly, sometimes
> look at the archives, and am struggling with the Github messages and
> pitying those not using Gmail filters).
>
>
> On Fri, Feb 21, 2014 at 12:51 PM, Aaron Davidson 
> wrote:
>
> > I don't have an official policy to point you to, but Chris Mattmann (our
> > Apache project mentor) summarized some of the points in this thread, and
> > here is the original concern that caused us to make this change:
> >
> >
> >
> http://mail-archives.apache.org/mod_mbox/incubator-general/201402.mbox/%3CCAAS6=7hkCiT093nXVMcUus8Z-5XCDn=cQ5trjN_Kz9ARe9H=r...@mail.gmail.com%3E
> >
> >
> > On Fri, Feb 21, 2014 at 8:08 AM, Ethan Jewett 
> wrote:
> >
> > > Or not off-list. Sorry folks :-) Anyone should feel free to educate me
> > > either on the policy or on mailing list use ;-)
> > >
> > > On Friday, February 21, 2014, Ethan Jewett  wrote:
> > >
> > > > Hi Aaron,
> > > >
> > > > Off-list message here. Can you point me to this policy? Due to some
> > > > previous experiences here, I'm under the impression that it doesn't
> > > exist.
> > > > I can't find it on the Apache website.
> > > >
> > > > Thanks,
> > > > Ethan
> > > >
> > > > On Tuesday, February 18, 2014, Aaron Davidson  > > >
> > > > wrote:
> > > >
> > > >> This is due, unfortunately, to Apache policies that all
> > > >> development-related
> > > >> discussion should take place on the dev list. As we are attempting
> to
> > > >> graduate from an incubating project to an Apache top level project,
> > > there
> > > >> were some concerns raised about GitHub, and the fastest solution to
> > > avoid
> > > >> conflict related to our graduation was to CC dev@ for all GitHub
> > > >> messages.
> > > >> Once our graduation is complete, we may be able to find a less noisy
> > way
> > > >> of
> > > >> dealing with these messages.
> > > >>
> > > >> In the meantime, one simple solution is to filter out all messages
> > that
> > > >> come from g...@git.apache.org and are destined to
> > > >> dev@spark.incubator.apache.org.
> > > >>
> > > >>
> > > >> On Tue, Feb 18, 2014 at 10:04 AM, Gerard Maas <
> gerard.m...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > +1 please.
> > > >> >
> > > >> >
> > > >> > On Tue, Feb 18, 2014 at 6:04 PM, Michael Ernest <
> > > mfern...@cloudera.com
> > > >> > >wrote:
> > > >> >
> > > >> > > +1
> > > >> > >
> > > >> > >
> > > >> > > On Tue, Feb 18, 2014 at 8:24 AM, Heiko Braun <
> > > >> ike.br...@googlemail.com
> > > >> > > >wrote:
> > > >> > >
> > > >> > > >
> > > >> > > >
> > > >> > > > Wouldn't it be better to move the github messages to a
> dedicated
> > > >> email
> > > >> > > > list?
> > > >> > > >
> > > >> > > > Regards, Heiko
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Michael Ernest
> > > >> > > Sr. Solutions Consultant
> > > >> > > West Coast
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>


Re: coding style discussion: explicit return type in public APIs

2014-02-19 Thread Reynold Xin
Mridul,

Can you be more specific in the createFoo example?

def myFunc = createFoo

is disallowed in my guideline. It is invoking a function createFoo, not the
constructor of Foo.
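
To make the distinction concrete, here is a minimal sketch (the class names
are made up for illustration):

trait Foo
class FooImpl extends Foo

// Case 3 as intended: the constructor is invoked directly, so the inferred
// type is exactly the class being constructed.
def newImpl = new FooImpl                 // inferred as FooImpl, same as the class

// Not case 3: calling a factory function. The inferred type is whatever the
// factory happens to return today, so a later change inside createFoo would
// silently change the public signature.
def createFoo = new FooImpl               // inferred as FooImpl, not Foo
def createFooStable: Foo = new FooImpl    // explicit type keeps the API stable
def myFunc = createFoo                    // disallowed by the guideline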




On Wed, Feb 19, 2014 at 10:39 AM, Mridul Muralidharan wrote:

> Without bikeshedding this too much ... It is likely incorrect (not wrong) -
> and rules like this potentially cause things to slip through.
>
> Explicit return type strictly specifies what is being exposed (think in
> face of impl change - createFoo changes in future from Foo to Foo1 or Foo2)
> .. being conservative about how to specify exposed interfaces, imo,
> outweighs potential gains in brevity of code.
> Btw this is a degenerate, contrived example already stretching its use ...
>
> Regards
> Mridul
>
> Regards
> Mridul
> On Feb 19, 2014 1:49 PM, "Reynold Xin"  wrote:
>
> > Yes, the case you brought up is not a matter of readability or style. If
> it
> > returns a different type, it should be declared (otherwise it is just
> > wrong).
> >
> >
> > On Wed, Feb 19, 2014 at 12:17 AM, Mridul Muralidharan  > >wrote:
> >
> > > You are right.
> > > A degenerate case would be :
> > >
> > > def createFoo = new FooImpl()
> > >
> > > vs
> > >
> > > def createFoo: Foo = new FooImpl()
> > >
> > > Former will cause api instability. Reynold, maybe this is already
> > > avoided - and I understood it wrong ?
> > >
> > > Thanks,
> > > Mridul
> > >
> > >
> > >
> > > On Wed, Feb 19, 2014 at 12:44 PM, Christopher Nguyen 
> > > wrote:
> > > > Mridul, IIUUC, what you've mentioned did come to mind, but I deemed
> it
> > > > orthogonal to the stylistic issue Reynold is talking about.
> > > >
> > > > I believe you're referring to the case where there is a specific
> > desired
> > > > return type by API design, but the implementation does not, in which
> > > case,
> > > > of course, one must define the return type. That's an API requirement
> > and
> > > > not just a matter of readability.
> > > >
> > > > We could add this as an NB in the proposed guideline.
> > > >
> > > > --
> > > > Christopher T. Nguyen
> > > > Co-founder & CEO, Adatao <http://adatao.com>
> > > > linkedin.com/in/ctnguyen
> > > >
> > > >
> > > >
> > > > On Tue, Feb 18, 2014 at 10:40 PM, Reynold Xin 
> > > wrote:
> > > >
> > > >> +1 Christopher's suggestion.
> > > >>
> > > >> Mridul,
> > > >>
> > > >> How would that happen? Case 3 requires the method to be invoking the
> > > >> constructor directly. It was implicit in my email, but the return
> type
> > > >> should be the same as the class itself.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Feb 18, 2014 at 10:37 PM, Mridul Muralidharan <
> > mri...@gmail.com
> > > >> >wrote:
> > > >>
> > > >> > Case 3 can be a potential issue.
> > > >> > Current implementation might be returning a concrete class which
> we
> > > >> > might want to change later - making it a type change.
> > > >> > The intention might be to return an RDD (for example), but the
> > > >> > inferred type might be a subclass of RDD - and future changes will
> > > >> > cause signature change.
> > > >> >
> > > >> >
> > > >> > Regards,
> > > >> > Mridul
> > > >> >
> > > >> >
> > > >> > On Wed, Feb 19, 2014 at 11:52 AM, Reynold Xin <
> r...@databricks.com>
> > > >> wrote:
> > > >> > > Hi guys,
> > > >> > >
> > > >> > > Want to bring to the table this issue to see what other members
> of
> > > the
> > > >> > > community think and then we can codify it in the Spark coding
> > style
> > > >> > guide.
> > > >> > > The topic is about declaring return types explicitly in public
> > APIs.
> > > >> > >
> > > >> > > In general I think we should favor explicit type declaration in
> > > public
> > > >> > > APIs. However, I do think there are 3 cases we can avoid the
> > public
> > > API
> > > >> > > definition because in these 3 cases the types are self-evident &
> > > >> > repetitive.
> > > >> > >
> > > >> > > Case 1. toString
> > > >> > >
> > > >> > > Case 2. A method returning a string or a val defining a string
> > > >> > >
> > > >> > > def name = "abcd" // this is so obvious that it is a string
> > > >> > > val name = "edfg" // this too
> > > >> > >
> > > >> > > Case 3. The method or variable is invoking the constructor of a
> > > class
> > > >> and
> > > >> > > return that immediately. For example:
> > > >> > >
> > > >> > > val a = new SparkContext(...)
> > > >> > > implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) =
> new
> > > >> > > AsyncRDDActions(rdd)
> > > >> > >
> > > >> > >
> > > >> > > Thoughts?
> > > >> >
> > > >>
> > >
> >
>


Re: coding style discussion: explicit return type in public APIs

2014-02-19 Thread Reynold Xin
Yes, the case you brought up is not a matter of readability or style. If it
returns a different type, it should be declared (otherwise it is just
wrong).


On Wed, Feb 19, 2014 at 12:17 AM, Mridul Muralidharan wrote:

> You are right.
> A degenerate case would be :
>
> def createFoo = new FooImpl()
>
> vs
>
> def createFoo: Foo = new FooImpl()
>
> Former will cause api instability. Reynold, maybe this is already
> avoided - and I understood it wrong ?
>
> Thanks,
> Mridul
>
>
>
> On Wed, Feb 19, 2014 at 12:44 PM, Christopher Nguyen 
> wrote:
> > Mridul, IIUUC, what you've mentioned did come to mind, but I deemed it
> > orthogonal to the stylistic issue Reynold is talking about.
> >
> > I believe you're referring to the case where there is a specific desired
> > return type by API design, but the implementation does not, in which
> case,
> > of course, one must define the return type. That's an API requirement and
> > not just a matter of readability.
> >
> > We could add this as an NB in the proposed guideline.
> >
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao <http://adatao.com>
> > linkedin.com/in/ctnguyen
> >
> >
> >
> > On Tue, Feb 18, 2014 at 10:40 PM, Reynold Xin 
> wrote:
> >
> >> +1 Christopher's suggestion.
> >>
> >> Mridul,
> >>
> >> How would that happen? Case 3 requires the method to be invoking the
> >> constructor directly. It was implicit in my email, but the return type
> >> should be the same as the class itself.
> >>
> >>
> >>
> >>
> >> On Tue, Feb 18, 2014 at 10:37 PM, Mridul Muralidharan  >> >wrote:
> >>
> >> > Case 3 can be a potential issue.
> >> > Current implementation might be returning a concrete class which we
> >> > might want to change later - making it a type change.
> >> > The intention might be to return an RDD (for example), but the
> >> > inferred type might be a subclass of RDD - and future changes will
> >> > cause signature change.
> >> >
> >> >
> >> > Regards,
> >> > Mridul
> >> >
> >> >
> >> > On Wed, Feb 19, 2014 at 11:52 AM, Reynold Xin 
> >> wrote:
> >> > > Hi guys,
> >> > >
> >> > > Want to bring to the table this issue to see what other members of
> the
> >> > > community think and then we can codify it in the Spark coding style
> >> > guide.
> >> > > The topic is about declaring return types explicitly in public APIs.
> >> > >
> >> > > In general I think we should favor explicit type declaration in
> public
> >> > > APIs. However, I do think there are 3 cases we can avoid the public
> API
> >> > > definition because in these 3 cases the types are self-evident &
> >> > repetitive.
> >> > >
> >> > > Case 1. toString
> >> > >
> >> > > Case 2. A method returning a string or a val defining a string
> >> > >
> >> > > def name = "abcd" // this is so obvious that it is a string
> >> > > val name = "edfg" // this too
> >> > >
> >> > > Case 3. The method or variable is invoking the constructor of a
> class
> >> and
> >> > > return that immediately. For example:
> >> > >
> >> > > val a = new SparkContext(...)
> >> > > implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) = new
> >> > > AsyncRDDActions(rdd)
> >> > >
> >> > >
> >> > > Thoughts?
> >> >
> >>
>


Re: coding style discussion: explicit return type in public APIs

2014-02-18 Thread Reynold Xin
+1 Christopher's suggestion.

Mridul,

How would that happen? Case 3 requires the method to be invoking the
constructor directly. It was implicit in my email, but the return type
should be the same as the class itself.




On Tue, Feb 18, 2014 at 10:37 PM, Mridul Muralidharan wrote:

> Case 3 can be a potential issue.
> Current implementation might be returning a concrete class which we
> might want to change later - making it a type change.
> The intention might be to return an RDD (for example), but the
> inferred type might be a subclass of RDD - and future changes will
> cause signature change.
>
>
> Regards,
> Mridul
>
>
> On Wed, Feb 19, 2014 at 11:52 AM, Reynold Xin  wrote:
> > Hi guys,
> >
> > Want to bring to the table this issue to see what other members of the
> > community think and then we can codify it in the Spark coding style
> guide.
> > The topic is about declaring return types explicitly in public APIs.
> >
> > In general I think we should favor explicit type declaration in public
> > APIs. However, I do think there are 3 cases we can avoid the public API
> > definition because in these 3 cases the types are self-evident &
> repetitive.
> >
> > Case 1. toString
> >
> > Case 2. A method returning a string or a val defining a string
> >
> > def name = "abcd" // this is so obvious that it is a string
> > val name = "edfg" // this too
> >
> > Case 3. The method or variable is invoking the constructor of a class and
> > return that immediately. For example:
> >
> > val a = new SparkContext(...)
> > implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) = new
> > AsyncRDDActions(rdd)
> >
> >
> > Thoughts?
>


Re: coding style discussion: explicit return type in public APIs

2014-02-18 Thread Reynold Xin
Case 2 should probably be expanded to cover most primitive types.
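
For example (illustrative only):

def numPartitions = 8        // obviously an Int
val useCompression = true    // obviously a Boolean
def samplingRatio = 0.75     // obviously a Double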


On Tue, Feb 18, 2014 at 10:22 PM, Reynold Xin  wrote:

> Hi guys,
>
> Want to bring to the table this issue to see what other members of the
> community think and then we can codify it in the Spark coding style guide.
> The topic is about declaring return types explicitly in public APIs.
>
> In general I think we should favor explicit type declaration in public
> APIs. However, I do think there are 3 cases we can avoid the public API
> definition because in these 3 cases the types are self-evident & repetitive.
>
> Case 1. toString
>
> Case 2. A method returning a string or a val defining a string
>
> def name = "abcd" // this is so obvious that it is a string
> val name = "edfg" // this too
>
> Case 3. The method or variable is invoking the constructor of a class and
> return that immediately. For example:
>
> val a = new SparkContext(...)
> implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) = new
> AsyncRDDActions(rdd)
>
>
> Thoughts?
>
>


coding style discussion: explicit return type in public APIs

2014-02-18 Thread Reynold Xin
Hi guys,

I want to bring this issue to the table to see what other members of the
community think, and then we can codify it in the Spark coding style guide.
The topic is declaring return types explicitly in public APIs.

In general I think we should favor explicit type declarations in public
APIs. However, I do think there are 3 cases where we can omit the explicit
return type, because in these cases the type is self-evident and spelling
it out is repetitive.

Case 1. toString

Case 2. A method returning a string or a val defining a string

def name = "abcd" // this is so obvious that it is a string
val name = "edfg" // this too

Case 3. The method or variable invokes the constructor of a class and
returns it immediately. For example:

val a = new SparkContext(...)
implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) = new
AsyncRDDActions(rdd)


Thoughts?


Re: Fast Serialization

2014-02-13 Thread Reynold Xin
The perf difference between that and Kryo is pretty small according to
their own benchmark. However, if they can provide better compatibility than
Kryo, we should definitely give it a shot!

Would you like to do some testing?
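
If someone wants to try it, here is a rough sketch of how the comparison
could be wired up by switching spark.serializer (the FST entry below is
commented out and hypothetical -- it would first need a wrapper implementing
Spark's Serializer trait):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // PairRDDFunctions implicits (needed on older Spark versions)

object SerializerBench {
  def run(serializerClass: String): Long = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("serializer-bench")
      .set("spark.serializer", serializerClass)
    val sc = new SparkContext(conf)
    val start = System.nanoTime()
    // A shuffle-heavy job so serialization cost actually shows up.
    sc.parallelize(1 to 1000000, 8)
      .map(i => (i % 1000, i.toString * 10))
      .groupByKey()
      .count()
    val elapsed = System.nanoTime() - start
    sc.stop()
    elapsed
  }

  def main(args: Array[String]): Unit = {
    println("java: " + run("org.apache.spark.serializer.JavaSerializer"))
    println("kryo: " + run("org.apache.spark.serializer.KryoSerializer"))
    // println("fst:  " + run("my.pkg.FstSerializer"))  // hypothetical wrapper class
  }
}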


On Thu, Feb 13, 2014 at 12:27 AM, Evan Chan  wrote:

> Any interest in adding Fast Serialization (or possibly replacing the
> default of Java Serialization)?
> https://code.google.com/p/fast-serialization/
>
> --
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com  |
>


Re: Could someone with karma to add my userid hsaputra so I could assign issue in https://spark-project.atlassian.net?

2014-02-11 Thread Reynold Xin
I added you to the dev list on jira for spark.


On Tue, Feb 11, 2014 at 2:58 PM, Henry Saputra wrote:

> Hi Guys,
>
> With ASF JIRA still in transfer mode, could someone with permission to
> add my userid "hsaputra" in https://spark-project.atlassian.net so I
> could assign issues and resolve them myself?
>
> CC @pwendell or @rxin
>
>
> Thanks,
>
> - Henry
>


Re: [VOTE] Graduation of Apache Spark from the Incubator

2014-02-10 Thread Reynold Xin
Actually I made a mistake by saying binding.

Just +1 here.


On Mon, Feb 10, 2014 at 10:20 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Nathan, anybody is welcome to VOTE. Thank you.
> Only VOTEs from the Incubator PMC are what is considered "binding", but
> I welcome and will tally all VOTEs provided.
>
> Cheers,
> Chris
>
>
>
>
> -Original Message-
> From: Nathan Kronenfeld 
> Reply-To: "dev@spark.incubator.apache.org"  >
> Date: Monday, February 10, 2014 9:44 PM
> To: "dev@spark.incubator.apache.org" 
> Subject: Re: [VOTE] Graduation of Apache Spark from the Incubator
>
> >Who is allowed to vote on stuff like this?
> >
> >
> >On Mon, Feb 10, 2014 at 11:27 PM, Chris Mattmann
> >wrote:
> >
> >> Hi Everyone,
> >>
> >> This is a new VOTE to decide if Apache Spark should graduate
> >> from the Incubator. Please VOTE on the resolution pasted below
> >> the ballot. I'll leave this VOTE open for at least 72 hours.
> >>
> >> Thanks!
> >>
> >> [ ] +1 Graduate Apache Spark from the Incubator.
> >> [ ] +0 Don't care.
> >> [ ] -1 Don't graduate Apache Spark from the Incubator because..
> >>
> >> Here is my +1 binding for graduation.
> >>
> >> Cheers,
> >> Chris
> >>
> >>  snip
> >>
> >> WHEREAS, the Board of Directors deems it to be in the best
> >> interests of the Foundation and consistent with the
> >> Foundation's purpose to establish a Project Management
> >> Committee charged with the creation and maintenance of
> >> open-source software, for distribution at no charge to the
> >> public, related to fast and flexible large-scale data analysis
> >> on clusters.
> >>
> >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> >> Committee (PMC), to be known as the "Apache Spark Project", be
> >> and hereby is established pursuant to Bylaws of the Foundation;
> >> and be it further
> >>
> >> RESOLVED, that the Apache Spark Project be and hereby is
> >> responsible for the creation and maintenance of software
> >> related to fast and flexible large-scale data analysis
> >> on clusters; and be it further RESOLVED, that the office
> >> of "Vice President, Apache Spark" be and hereby is created,
> >> the person holding such office to serve at the direction of
> >> the Board of Directors as the chair of the Apache Spark
> >> Project, and to have primary responsibility for management
> >> of the projects within the scope of responsibility
> >> of the Apache Spark Project; and be it further
> >> RESOLVED, that the persons listed immediately below be and
> >> hereby are appointed to serve as the initial members of the
> >> Apache Spark Project:
> >>
> >> * Mosharaf Chowdhury 
> >> * Jason Dai 
> >> * Tathagata Das 
> >> * Ankur Dave 
> >> * Aaron Davidson 
> >> * Thomas Dudziak 
> >> * Robert Evans 
> >> * Thomas Graves 
> >> * Andy Konwinski 
> >> * Stephen Haberman 
> >> * Mark Hamstra 
> >> * Shane Huang 
> >> * Ryan LeCompte 
> >> * Haoyuan Li 
> >> * Sean McNamara 
> >> * Mridul Muralidharam 
> >> * Kay Ousterhout 
> >> * Nick Pentreath 
> >> * Imran Rashid 
> >> * Charles Reiss 
> >> * Josh Rosen 
> >> * Prashant Sharma 
> >> * Ram Sriharsha 
> >> * Shivaram Venkataraman 
> >> * Patrick Wendell 
> >> * Andrew Xia 
> >> * Reynold Xin 
> >> * Matei Zaharia 
> >>
> >> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matei Zaharia be
> >> appointed to the office of Vice President, Apache Spark, to
> >> serve in accordance with and subject to the direction of the
> >> Board of Directors and the Bylaws of the Foundation until
> >> death, resignation, retirement, removal or disqualification, or
> >> until a successor is appointed; and be it further
> >>
> >> RESOLVED, that the Apache Spark Project be and hereby is
> >> tasked with the migration and rationalization of the Apache
> >> Incubator Spark podling; and be it further
> >>
> >> RESOLVED, that all responsibilities pertaining to the Apache
> >> Incubator Spark podling encumbered upon the Apache Incubator
> >> Project are hereafter discharged.
> >>
> >> 
> >>
> >>
> >>
> >>
> >
> >
> >--
> >Nathan Kronenfeld
> >Senior Visualization Developer
> >Oculus Info Inc
> >2 Berkeley Street, Suite 600,
> >Toronto, Ontario M5A 4J5
> >Phone:  +1-416-203-3003 x 238
> >Email:  nkronenf...@oculusinfo.com
>
>


Re: [VOTE] Graduation of Apache Spark from the Incubator

2014-02-10 Thread Reynold Xin
+1 (binding)


On Mon, Feb 10, 2014 at 8:56 PM, Henry Saputra wrote:

> +1 (binding)
>
>
> - Henry
>
> On Mon, Feb 10, 2014 at 8:27 PM, Chris Mattmann 
> wrote:
> > Hi Everyone,
> >
> > This is a new VOTE to decide if Apache Spark should graduate
> > from the Incubator. Please VOTE on the resolution pasted below
> > the ballot. I'll leave this VOTE open for at least 72 hours.
> >
> > Thanks!
> >
> > [ ] +1 Graduate Apache Spark from the Incubator.
> > [ ] +0 Don't care.
> > [ ] -1 Don't graduate Apache Spark from the Incubator because..
> >
> > Here is my +1 binding for graduation.
> >
> > Cheers,
> > Chris
> >
> >  snip
> >
> > WHEREAS, the Board of Directors deems it to be in the best
> > interests of the Foundation and consistent with the
> > Foundation's purpose to establish a Project Management
> > Committee charged with the creation and maintenance of
> > open-source software, for distribution at no charge to the
> > public, related to fast and flexible large-scale data analysis
> > on clusters.
> >
> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> > Committee (PMC), to be known as the "Apache Spark Project", be
> > and hereby is established pursuant to Bylaws of the Foundation;
> > and be it further
> >
> > RESOLVED, that the Apache Spark Project be and hereby is
> > responsible for the creation and maintenance of software
> > related to fast and flexible large-scale data analysis
> > on clusters; and be it further RESOLVED, that the office
> > of "Vice President, Apache Spark" be and hereby is created,
> > the person holding such office to serve at the direction of
> > the Board of Directors as the chair of the Apache Spark
> > Project, and to have primary responsibility for management
> > of the projects within the scope of responsibility
> > of the Apache Spark Project; and be it further
> > RESOLVED, that the persons listed immediately below be and
> > hereby are appointed to serve as the initial members of the
> > Apache Spark Project:
> >
> > * Mosharaf Chowdhury 
> > * Jason Dai 
> > * Tathagata Das 
> > * Ankur Dave 
> > * Aaron Davidson 
> > * Thomas Dudziak 
> > * Robert Evans 
> > * Thomas Graves 
> > * Andy Konwinski 
> > * Stephen Haberman 
> > * Mark Hamstra 
> > * Shane Huang 
> > * Ryan LeCompte 
> > * Haoyuan Li 
> > * Sean McNamara 
> > * Mridul Muralidharam 
> > * Kay Ousterhout 
> > * Nick Pentreath 
> > * Imran Rashid 
> > * Charles Reiss 
> > * Josh Rosen 
> > * Prashant Sharma 
> > * Ram Sriharsha 
> > * Shivaram Venkataraman 
> > * Patrick Wendell 
> > * Andrew Xia 
> > * Reynold Xin 
> > * Matei Zaharia 
> >
> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matei Zaharia be
> > appointed to the office of Vice President, Apache Spark, to
> > serve in accordance with and subject to the direction of the
> > Board of Directors and the Bylaws of the Foundation until
> > death, resignation, retirement, removal or disqualification, or
> > until a successor is appointed; and be it further
> >
> > RESOLVED, that the Apache Spark Project be and hereby is
> > tasked with the migration and rationalization of the Apache
> > Incubator Spark podling; and be it further
> >
> > RESOLVED, that all responsibilities pertaining to the Apache
> > Incubator Spark podling encumbered upon the Apache Incubator
> > Project are hereafter discharged.
> >
> > 
> >
> >
> >
>


Re: Proposal: Clarifying minor points of Scala style

2014-02-10 Thread Reynold Xin
+1 on both


On Mon, Feb 10, 2014 at 1:34 AM, Aaron Davidson  wrote:

> There are a few bits of the Scala style that are underspecified by
> both the Scala
> style guide  and our own supplemental
> notes<
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide>.
> Often, this leads to inconsistent formatting within the codebase, so I'd
> like to propose some general guidelines which we can add to the wiki and
> use in the future:
>
> 1) Line-wrapped method return type is indented with two spaces:
> def longMethodName(... long param list ...)
>   : Long = {
>   2
> }
>
> *Justification: *I think this is the most commonly used style in Spark
> today. It's also similar to the "extends" style used in classes, with the
> same justification: it is visually distinguished from the 4-indented
> parameter list.
>
> 2) URLs and code examples in comments should not be line-wrapped.
> Here<
> https://github.com/apache/incubator-spark/pull/557/files#diff-c338f10f3567d4c1d7fec4bf9e2677e1L29
> >is
> an example of the latter.
>
> *Justification*: Line-wrapping can cause confusion when trying to
> copy-paste a URL or command. Can additionally cause IDE issues or,
> avoidably, Javadoc issues.
>
> Any thoughts on these, or additional style issues not explicitly covered in
> either the Scala style guide or Spark wiki?
>


Re: [GitHub] incubator-spark pull request: Improved NetworkReceiver in Spark St...

2014-02-07 Thread Reynold Xin
test


On Fri, Feb 7, 2014 at 3:23 PM, AmplabJenkins  wrote:

> Github user AmplabJenkins commented on the pull request:
>
>
> https://github.com/apache/incubator-spark/pull/559#issuecomment-34518332
>
> All automated tests passed.
> Refer to this link for build results:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12623/
>
>


Re: [GitHub] incubator-spark pull request:

2014-02-07 Thread Reynold Xin
I don't think it does.


On Fri, Feb 7, 2014 at 8:58 PM, Nan Zhu  wrote:

> If we reply these emails, will the reply be posted on pull request
> discussion board automatically?
>
> if yes, that would be very nice
>
> --
> Nan Zhu
>
>
>
> On Friday, February 7, 2014 at 9:23 PM, Henry Saputra wrote:
>
> > I am with Chris on this one.
> >
> > These github notifications are similar to JIRA updates that in most
> > ASF projects are sent to dev@ list, and these are valid messages that
> > contributors in the project should concern about.
> >
> > > Especially the PPMCs (which will be PMCs hopefully soon) need to know
> > about them and become audit trail/ archive of development discussions
> > for ASF.
> >
> > We already have user@ list which targeted for people interested to ask
> > for questions using Spark and should be the proper list for people
> > interested on using Spark.
> >
> > As Matei have said, you can filter these github notifications email
> easily.
> >
> > Thanks,
> >
> >
> > - Henry
> >
> >
> > On Fri, Feb 7, 2014 at 6:02 PM, Chris Mattmann  mattm...@apache.org)> wrote:
> > > Guys this Github discussion seems like dev discussion in which case it
> > > must be
> > > on dev list and not moved - the whole point of this is that
> development,
> > > including
> > > conversations related to it, which are the lifeblood of the project
> should
> > > occur
> > > on the ASF mailing lists.
> > >
> > > Refactoring the lists is one thing for the more automated messages,
> but the
> > > comments below look like Kay commenting on some relevant stuff in which
> > > case
> > > I would argue against (paraphrased) "moving it to some ASF list that
> those
> > > who
> > > care can subscribe to". "Those who care" in this case should be people
> who
> > > care about Kay's comments (which aren't automated commit messages from
> > > some bot;
> > > they are relevant dev comments) in which case "those who care" should
> be
> > > the
> > > PMC.
> > >
> > > My suggestion is if there is a notifications list set up, it can be
> like
> > > for
> > > automated stuff - but *NOT* for dev discussion -- that needs to happen
> on
> > > the
> > > dev lists. If it's on another list, then I would expect periodically
> > > (frequently;
> > > with enough diligence to VOTE on and discuss and contribute to) to see
> that
> > > flushed or summarized on the dev list.
> > >
> > > Cheers,
> > > Chris
> > >
> > >
> > >
> > >
> > > -Original Message-
> > > From: Andrew Ash mailto:and...@andrewash.com)>
> > > Reply-To: "dev@spark.incubator.apache.org (mailto:
> dev@spark.incubator.apache.org)"  dev@spark.incubator.apache.org)>
> > > Date: Friday, February 7, 2014 5:43 PM
> > > To: "dev@spark.incubator.apache.org (mailto:
> dev@spark.incubator.apache.org)"  dev@spark.incubator.apache.org)>
> > > Subject: Re: [GitHub] incubator-spark pull request:
> > >
> > > > +1 on moving this stuff to a separate mailing list. It's Apache
> policy
> > > > that discussion is archived, but it's not policy that it must be
> > > > interleaved with other dev discussion. Let's move it to a
> > > > spark-github-discuss list (or a different name) and people who care
> to see
> > > > it can subscribe.
> > > >
> > > >
> > > > On Fri, Feb 7, 2014 at 5:19 PM, Reynold Xin  r...@databricks.com)> wrote:
> > > >
> > > > > I concur wholeheartedly ...
> > > > >
> > > > >
> > > > > On Fri, Feb 7, 2014 at 4:55 PM, Dean Wampler <
> deanwamp...@gmail.com (mailto:deanwamp...@gmail.com)>
> > > > > wrote:
> > > > >
> > > > > > This SPAM is not doing anyone any good. How about another
> mailing list
> > > > > for
> > > > > > people who want to see this?
> > > > > >
> > > > > > Sent from my rotary phone.
> > > > > >
> > > > > >
> > > > > > > On Feb 7, 2014, at 10:33 AM, mridulm  g...@git.apache.org)> wrote:
> > > > > > >
> > > > > > > 

Re: [GitHub] incubator-spark pull request:

2014-02-07 Thread Reynold Xin
I concur wholeheartedly ...


On Fri, Feb 7, 2014 at 4:55 PM, Dean Wampler  wrote:

> This SPAM is not doing anyone any good. How about another mailing list for
> people who want to see this?
>
> Sent from my rotary phone.
>
>
> > On Feb 7, 2014, at 10:33 AM, mridulm  wrote:
> >
> > Github user mridulm commented on the pull request:
> >
> >
> https://github.com/apache/incubator-spark/pull/517#issuecomment-34484468
> >
> >I am hoping that the PR Prashant Sharma submitted would also include
> >ability to check these things once committed !
> >Thanks Kay
> >
> >
> >On Sat, Feb 8, 2014 at 12:01 AM, Kay Ousterhout <
> notificati...@github.com>wrote:
> >
> >> I don't know of any precommit scripts (I think there's been talk of
> adding
> >> a general style checker script but AFAIK it hasn't been done yet); I
> just
> >> add highlighting in my editor so it's obvious when I'm writing lines
> that
> >> are longer than 100 characters.
> >>
> >> --
> >> Reply to this email directly or view it on GitHub<
> https://github.com/apache/incubator-spark/pull/517#issuecomment-34484190>
> >> .
> >
>


Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Reynold Xin
We can try it on dev, but I personally find the JIRA notifications pretty
spammy ... It will clutter the dev list, and make it harder to search for
useful information here.


On Thu, Feb 6, 2014 at 6:27 PM, Matei Zaharia wrote:

> Henry (or anyone else), do you have any preference on sending these
> directly to "dev" versus creating another list for "issues"? I guess we can
> try "dev" for a while and let people decide if it gets too spammy. We'll
> just have to advertise it in advance.
>
> Matei
>
> On Feb 6, 2014, at 9:55 AM, Henry Saputra  wrote:
>
> > HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
> > setup for Spark it will happen automatically.
> >
> > - Henry
> >
> > On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia 
> wrote:
> >> Hey Henry, this makes sense. I'd like to add that one other vehicle for
> discussion has been JIRA at
> https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
> is not subscribed to JIRA, but we'd be happy to subscribe it anytime if
> that helps. We were hoping to do this only when JIRA has been moved to the
> ASF, since infra can set up the forwarding automatically. But most major
> discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
> https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
> think this is the model we want to have in the future -- most other projects
> I've participated in also used JIRA for their discussion, and mirrored to
> either the "dev" list or an "issues" list.
> >>
> >> Matei
> >>
> >> On Feb 5, 2014, at 2:49 PM, Henry Saputra 
> wrote:
> >>
> >>> Hi Guys,
> >>>
> >>> Just friendly reminder, some of you guys may work closely or
> >>> collaborate outside the dev@ list and sometimes it is easier.
> >>> But, as part of Apache Software Foundation project, any decision or
> >>> outcome that could or will be implemented in the Apache Spark need to
> >>> happen in the dev@ list as we are open and collaborative as community.
> >>>
> >>> If offline discussions happen please forward the history or potential
> >>> solution to the dev@ list before any action taken.
> >>>
> >>> Most of us work remote so email is the official channel of discussion
> >>> about stuff related to development in Spark.
> >>>
> >>> Github pull request is not the appropriate vehicle for technical
> >>> discussions. It is used primarily for review of proposed patch which
> >>> means initial problem most of the times had been identified and
> >>> discussed.
> >>>
> >>> Thanks for understanding.
> >>>
> >>> - Henry
> >>
>
>


Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Reynold Xin
You can do

sbt/sbt assemble-deps


and then just run

sbt/sbt package

each time.


You can even do

sbt/sbt ~package

for automatic incremental compilation.



On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu  wrote:

> Hi, all
>
> Is it always necessary to run sbt assembly when you want to test some code,
>
> Sometimes you just repeatedly change one or two lines for some failed test
> case, it is really time-consuming to sbt assembly every time
>
> any faster way?
>
> Best,
>
> --
> Nan Zhu
>
>


Re: Proposal for Spark Release Strategy

2014-02-06 Thread Reynold Xin
+1 for 1.0


The point of 1.0 is for us to self-enforce API compatibility in the context
of longer-term support. If we continue down the 0.xx road, we will always
have an excuse for breaking APIs. That said, a major focus of 0.9 and some of
the work happening for 1.0 (e.g. configuration, Java 8 closure support,
security) is to enable better API compatibility in the 1.x releases.

While not perfect, Spark as it stands is already more mature than many (ASF)
projects that are versioned 1.x, 2.x, or even 10.x. Software releases are
always a moving target. 1.0 doesn't mean it is "perfect" and "final". The
project will still evolve.




On Thu, Feb 6, 2014 at 11:54 AM, Evan Chan  wrote:

> +1 for 0.10.0.
>
> It would give more time to study things (such as the new SparkConf)
> and let the community decide if any breaking API changes are needed.
>
> Also, a +1 for minor revisions not breaking code compatibility,
> including Scala versions.   (I guess this would mean that 1.x would
> stay on Scala 2.10.x)
>
> On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza 
> wrote:
> > Bleh, hit send too early again.  My second paragraph was to argue for
> 1.0.0
> > instead of 0.10.0, not to hammer on the binary compatibility point.
> >
> >
> > On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza 
> wrote:
> >
> >> *Would it make sense to put in something that strongly discourages
> binary
> >> incompatible changes when possible?
> >>
> >>
> >> On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza  >wrote:
> >>
> >>> Not codifying binary compatibility as a hard rule sounds fine to me.
> >>>  Would it make sense to put something in that . I.e. avoid making
> needless
> >>> changes to class hierarchies.
> >>>
> >>> Whether Spark considers itself stable or not, users are beginning to
> >>> treat it so.  A responsible project will acknowledge this and provide
> the
> >>> stability needed by its user base.  I think some projects have made the
> >>> mistake of waiting too long to release a 1.0.0.  It allows them to put
> off
> >>> making the hard decisions, but users and downstream projects suffer.
> >>>
> >>> If Spark needs to go through dramatic changes, there's always the
> option
> >>> of a 2.0.0 that allows for this.
> >>>
> >>> -Sandy
> >>>
> >>>
> >>>
> >>> On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia <
> matei.zaha...@gmail.com>wrote:
> >>>
>  I think it's important to do 1.0 next. The project has been around
> for 4
>  years, and I'd be comfortable maintaining the current codebase for a
> long
>  time in an API and binary compatible way through 1.x releases. Over
> the
>  past 4 years we haven't actually had major changes to the user-facing
> API --
>  the only ones were changing the package to org.apache.spark, and
> upgrading
>  the Scala version. I'd be okay leaving 1.x to always use Scala 2.10
> for
>  example, or later cross-building it for Scala 2.11. Updating to 1.0
> says
>  two things: it tells users that they can be confident that version
> will be
>  maintained for a long time, which we absolutely want to do, and it
> lets
>  outsiders see that the project is now fairly mature (for many people,
>  pre-1.0 might still cause them not to try it). I think both are good
> for
>  the community.
> 
>  Regarding binary compatibility, I agree that it's what we should
> strive
>  for, but it just seems premature to codify now. Let's see how it works
>  between, say, 1.0 and 1.1, and then we can codify it.
> 
>  Matei
> 
>  On Feb 6, 2014, at 10:43 AM, Henry Saputra 
>  wrote:
> 
>  > Thanks Patrick for initiating the discussion about the next road map for
>  Apache Spark.
>  >
>  > I am +1 for 0.10.0 for next version.
>  >
>  > It will give us as community some time to digest the process and the
>  > vision and make adjustment accordingly.
>  >
>  > Release a 1.0.0 is a huge milestone and if we do need to break API
>  > somehow or modify internal behavior dramatically we could take
>  > advantage to release 1.0.0 as good step to go to.
>  >
>  >
>  > - Henry
>  >
>  >
>  >
>  > On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash 
>  wrote:
>  >> Agree on timeboxed releases as well.
>  >>
>  >> Is there a vision for where we want to be as a project before
>  declaring the
>  >> first 1.0 release?  While we're in the 0.x days per semver we can
>  break
>  >> backcompat at will (though we try to avoid it where possible), and
>  that
>  >> luxury goes away with 1.x  I just don't want to release a 1.0
> simply
>  >> because it seems to follow after 0.9 rather than making an
> intentional
>  >> decision that we're at the point where we can stand by the current
>  APIs and
>  >> binary compatibility for the next year or so of the major release.
>  >>
>  >> Until that decision is made as a group I'd rather we do an
> immediate
>  >> version bump to 0

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Reynold Xin
Is it safe if we interrupt the running thread during shutdown?
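
For what it's worth, here is a tiny standalone sketch (plain JVM behavior,
nothing Spark-specific) of the situation described below: the worker thread
keeps running while the shutdown hook executes, unless the hook interrupts
and joins it explicitly.

object ShutdownRaceDemo {
  def main(args: Array[String]): Unit = {
    val writer = new Thread(new Runnable {
      def run(): Unit = {
        while (!Thread.currentThread().isInterrupted) {
          println("writer: still spilling to disk...")
          Thread.sleep(100)
        }
      }
    })
    writer.setDaemon(true)
    writer.start()

    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      def run(): Unit = {
        // Runs concurrently with `writer`; whether it is safe to call
        // writer.interrupt() and writer.join() here is exactly the question.
        println("hook: deleting temp dirs while the writer may still be active")
      }
    }))

    Thread.sleep(500) // main returns, the hook fires, and the messages interleave
  }
}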




On Thu, Feb 6, 2014 at 3:27 AM, Andrew Ash  wrote:

> Per the book Java Concurrency in Practice the already-running threads
> continue running while the shutdown hooks run.  So I think the race between
> the writing thread and the deleting thread could be a very real possibility
> :/
>
> http://stackoverflow.com/a/3332925/120915
>
>
> On Thu, Feb 6, 2014 at 2:49 AM, Andrew Ash  wrote:
>
> > Got a repro locally on my MBP (the other was on a CentOS machine).
> >
> > Build spark, run a master and a worker with the sbin/start-all.sh script,
> > then run this in a shell:
> >
> > import org.apache.spark.storage.StorageLevel._
> > val s = sc.parallelize(1 to 10).persist(MEMORY_AND_DISK_SER);
> > s.count
> >
> > After about a minute, this line appears in the shell logging output:
> >
> > 14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing BlockManager
> > BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent
> heart
> > beats: 57510ms exceeds 45000ms
> >
> > Ctrl-C the shell.  In jps there is now a worker, a master, and a
> > CoarseGrainedExecutorBackend.
> >
> > Run jstack on the CGEBackend JVM, and I got the attached stacktraces.  I
> > waited around for 15min then kill -9'd the JVM and restarted the process.
> >
> > I wonder if what's happening here is that the threads that are spewing
> > data to disk (as that parallelize and persist would do) can write to disk
> > faster than the cleanup threads can delete from disk.
> >
> > What do you think of that theory?
> >
> >
> > Andrew
> >
> >
> >
> > On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan  >wrote:
> >
> >> shutdown hooks should not take 15 mins are you mentioned !
> >> On the other hand, how busy was your disk when this was happening ?
> >> (either due to spark or something else ?)
> >>
> >> It might just be that there was a lot of stuff to remove ?
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash 
> wrote:
> >> > Hi Spark devs,
> >> >
> >> > Occasionally when hitting Ctrl-C in the scala spark shell on 0.9.0 one
> >> of
> >> > my workers goes dead in the spark master UI.  I'm using the standalone
> >> > cluster and didn't ever see this while using 0.8.0 so I think it may
> be
> >> a
> >> > regression.
> >> >
> >> > When I prod on the hung CoarseGrainedExecutorBackend JVM with jstack
> and
> >> > jmap -heap, it doesn't respond unless I add the -F force flag.  The
> heap
> >> > isn't full, but there are some interesting bits in the jstack.  Poking
> >> > around a little, I think there may be some kind of deadlock in the
> >> shutdown
> >> > hooks.
> >> >
> >> > Below are the threads I think are most interesting:
> >> >
> >> > Thread 14308: (state = BLOCKED)
> >> >  - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
> >> >  - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted frame)
> >> >  - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
> >> >  -
> >> >
> >>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object,
> >> > scala.Function1) @bci=352, line=81 (Interpreted frame)
> >> >  - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25,
> >> line=498
> >> > (Interpreted frame)
> >> >  - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39,
> line=456
> >> > (Interpreted frame)
> >> >  - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24, line=237
> >> > (Interpreted frame)
> >> >  - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted frame)
> >> >  - akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec()
> >> > @bci=4, line=386 (Interpreted frame)
> >> >  - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10, line=260
> >> > (Compiled frame)
> >> >  -
> >> >
> >>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask)
> >> > @bci=10, line=1339 (Compiled frame)
> >> >  -
> >> >
> >>
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue)
> >> > @bci=11, line=1979 (Compiled frame)
> >> >  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14,
> >> line=107
> >> > (Interpreted frame)
> >> >
> >> > Thread 3865: (state = BLOCKED)
> >> >  - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
> >> >  - java.lang.Thread.join(long) @bci=38, line=1280 (Interpreted frame)
> >> >  - java.lang.Thread.join() @bci=2, line=1354 (Interpreted frame)
> >> >  - java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=106
> >> > (Interpreted frame)
> >> >  - java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=46
> >> (Interpreted
> >> > frame)
> >> >  - java.lang.Shutdown.runHooks() @bci=39, line=123 (Interpreted frame)
> >> >  - java.lang.Shutdown.sequence() @bci=26, line=167 (Interpreted frame)
> >> >  - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
> >> >  - java.lang.Terminator$1

Re: Not closing the merged PRs anymore from Spark github mirror?

2014-02-03 Thread Reynold Xin
It was a transient thing. There's a script that we are using to
automatically fetch diffs from a PR and apply the diff against the git
repo. Patrick changed the way it works last week, and a regression there
meant PRs were no longer closed automatically.

I believe he has fixed it. Patrick will also write an email about the
details of that script soon.





On Mon, Feb 3, 2014 at 11:24 AM, Henry Saputra wrote:

> Seems like some merged PRs by Reynold and Patrick did not close the PR
> automatically anymore?
>
> - Henry
>


Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)

2014-01-30 Thread Reynold Xin
Thanks. That is outside the scope of the Spark release itself. The EC2 script
starts instances and uses some scripts to set up this version. For that to
work, we need to have a release first.


On Thu, Jan 30, 2014 at 11:47 AM, bkrouse  wrote:

> I just tried the EC2 scripts as a part of this rc5, and it *looks* like it
> did not setup this version properly.  Is that in scope for this rc?
>
> Brian
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-0-9-0-incubating-rc5-tp318p421.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: Problems while moving from 0.8.0 to 0.8.1

2014-01-27 Thread Reynold Xin
Do you mind pasting the whole stack trace for the NPE?



On Mon, Jan 27, 2014 at 6:44 AM, Archit Thakur wrote:

> Hi,
>
> Implementation of aggregation logic has been changed with 0.8.1
> (Aggregator.scala)
>
> It is now using AppendOnlyMap as compared to java.util.HashMap in 0.8.0
> release.
>
> Aggregator.scala
> def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]) : Iterator[(K,
> C)] = {
> val combiners = new AppendOnlyMap[K, C]
> var kv: Product2[K, V] = null
> val update = (hadValue: Boolean, oldValue: C) => {
>   if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
> }
> while (iter.hasNext) {
>   kv = iter.next()
>   combiners.changeValue(kv._1, update)
> }
> combiners.iterator
>   }
>
> I am facing a problem: in the changeValue function of AppendOnlyMap, it
> computes
> val curKey = data(2 * pos)
> which comes back null and eventually gives an NPE.
>
> AppendOnlyMap.scala
> def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
> val k = key.asInstanceOf[AnyRef]
> if (k.eq(null)) {
>   if (!haveNullValue) {
> incrementSize()
>   }
>   nullValue = updateFunc(haveNullValue, nullValue)
>   haveNullValue = true
>   return nullValue
> }
> var pos = rehash(k.hashCode) & mask
> var i = 1
> while (true) {
>   val curKey = data(2 * pos)
>   if (k.eq(curKey) || k.equals(curKey)) {
> val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
> data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
> return newValue
>   } else if (curKey.eq(null)) {
> val newValue = updateFunc(false, null.asInstanceOf[V])
> data(2 * pos) = k
> data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
> incrementSize()
> return newValue
>   } else {
> val delta = i
> pos = (pos + delta) & mask
> i += 1
>   }
> }
> null.asInstanceOf[V] // Never reached but needed to keep compiler happy
>   }
>
>
> Other info:
> 1. My code works fine with 0.8.0.
> 2. I used groupByKey transformation.
> 3. I replaced Aggregator.scala with the older version (0.8.0), compiled
> it, restarted the Master and Worker, and it ran successfully.
>
> Thanks and Regards,
> Archit Thakur.
>


Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)

2014-01-26 Thread Reynold Xin
It is possible that you have generated the assembly jar using one version
of Hadoop, and then another assembly jar with another version. Those tests
that failed are all using a local cluster that sets up multiple processes,
which would require launching Spark worker processes using the assembly
jar. If that's indeed the problem, removing the extra assembly jars should
fix them.


On Sun, Jan 26, 2014 at 10:49 PM, Taka Shinagawa wrote:

> If I build Spark for Hadoop 1.0.4 (either "SPARK_HADOOP_VERSION=1.0.4
> sbt/sbt assembly"  or "sbt/sbt assembly") or use the binary distribution,
> 'sbt/sbt test' runs successfully.
>
> However, if I build Spark targeting any other Hadoop versions (e.g.
> "SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly", "SPARK_HADOOP_VERSION=2.2.0
> sbt/sbt assembly"), I'm getting the following errors with 'sbt/sbt test':
>
> 1) type mismatch errors with JavaPairDStream.scala
> 2) following test failures
> [error] Failed tests:
> [error] org.apache.spark.ShuffleNettySuite
> [error] org.apache.spark.ShuffleSuite
> [error] org.apache.spark.FileServerSuite
> [error] org.apache.spark.DistributedSuite
>
> I don't have Hadoop 1.0.4 installed on my test systems (but the test
> succeeds, and failed with the installed Hadoop versions). I'm seeing these
> sbt test errors with the previous 0.9.0 RCs and 0.8.1, too.
>
> I'm wondering if anyone else has seen this problem or I'm missing something
> to run the test correctly.
>
> Thanks,
> Taka
>
>
>
>
> On Sat, Jan 25, 2014 at 5:00 PM, Sean McNamara
> wrote:
>
> > +1
> >
> > On 1/25/14, 4:04 PM, "Mark Hamstra"  wrote:
> >
> > >+1
> > >
> > >
> > >On Sat, Jan 25, 2014 at 2:37 PM, Andy Konwinski
> > >wrote:
> > >
> > >> +1
> > >>
> > >>
> > >> On Sat, Jan 25, 2014 at 2:27 PM, Reynold Xin 
> > >>wrote:
> > >>
> > >> > +1
> > >> >
> > >> > > On Jan 25, 2014, at 12:07 PM, Hossein  wrote:
> > >> > >
> > >> > > +1
> > >> > >
> > >> > > Compiled and tested on Mavericks.
> > >> > >
> > >> > > --Hossein
> > >> > >
> > >> > >
> > >> > > On Sat, Jan 25, 2014 at 11:38 AM, Patrick Wendell
> > >> > >> > >wrote:
> > >> > >
> > >> > >> I'll kick off the voting with a +1.
> > >> > >>
> > >> > >> On Thu, Jan 23, 2014 at 11:33 PM, Patrick Wendell
> > >> > >> >
> > >> > >> wrote:
> > >> > >>> Please vote on releasing the following candidate as Apache Spark
> > >> > >>> (incubating) version 0.9.0.
> > >> > >>>
> > >> > >>> A draft of the release notes along with the changes file is
> > >>attached
> > >> > >>> to this e-mail.
> > >> > >>>
> > >> > >>> The tag to be voted on is v0.9.0-incubating (commit 95d28ff3):
> > >> > >>
> > >> >
> > >>
> > >>
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=
> > >>95d28ff3d0d20d9c583e184f9e2c5ae842d8a4d9
> > >> > >>>
> > >> > >>> The release files, including signatures, digests, etc can be
> found
> > >> at:
> > >> > >>> http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5
> > >> > >>>
> > >> > >>> Release artifacts are signed with the following key:
> > >> > >>> https://people.apache.org/keys/committer/pwendell.asc
> > >> > >>>
> > >> > >>> The staging repository for this release can be found at:
> > >> > >>>
> > >> >
> > >>
> https://repository.apache.org/content/repositories/orgapachespark-1006/
> > >> > >>>
> > >> > >>> The documentation corresponding to this release can be found at:
> > >> > >>>
> > >>http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5-docs/
> > >> > >>>
> > >> > >>> Please vote on releasing this package as Apache Spark
> > >> 0.9.0-incubating!
> > >> > >>>
> > >> > >>> The vote is open until Monday, January 27, at 07:30 UTC and
> passes
> > >> ifa
> > >> > >>> majority of at least 3 +1 PPMC votes are cast.
> > >> > >>>
> > >> > >>> [ ] +1 Release this package as Apache Spark 0.9.0-incubating
> > >> > >>> [ ] -1 Do not release this package because ...
> > >> > >>>
> > >> > >>> To learn more about Apache Spark, please see
> > >> > >>> http://spark.incubator.apache.org/
> > >> > >>
> > >> >
> > >> > --
> > >> > You received this message because you are subscribed to the Google
> > >>Groups
> > >> > "Unofficial Apache Spark Dev Mailing List Mirror" group.
> > >> > To unsubscribe from this group and stop receiving emails from it,
> > >>send an
> > >> > email to apache-spark-dev-mirror+unsubscr...@googlegroups.com.
> > >> > For more options, visit https://groups.google.com/groups/opt_out.
> > >> >
> > >>
> >
> >
>


Re: GroupByKey implementation.

2014-01-26 Thread Reynold Xin
While I echo Mark's sentiment, versioning has nothing to do with this
problem. It has been the case even in Spark 0.8.0.

Note that mapSideCombine is turned off for groupByKey, so there is no need
to merge any combiners.


On Sun, Jan 26, 2014 at 12:22 PM, Archit Thakur
wrote:

> Hi,
>
> Below is the implementation for GroupByKey. (v, 0.8.0)
>
>
> def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
> def createCombiner(v: V) = ArrayBuffer(v)
> def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
> val bufs = combineByKey[ArrayBuffer[V]](
>   createCombiner _, mergeValue _, null, partitioner,
> mapSideCombine=false)
> bufs.asInstanceOf[RDD[(K, Seq[V])]]
>   }
>
> and CombineValuesByKey (Aggregator.scala):
>
> def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]) : Iterator[(K,
> C)] = {
> val combiners = new JHashMap[K, C]
> for (kv <- iter) {
>   val oldC = combiners.get(kv._1)
>   if (oldC == null) {
> combiners.put(kv._1, createCombiner(kv._2))
>   } else {
> combiners.put(kv._1, mergeValue(oldC, kv._2))
>   }
> }
> combiners.iterator
>   }
>
> My question is why null is being passed for the mergeCombiners closure.
>
> If two different partitions have the same key, wouldn't there be a
> requirement to merge them afterwards?
>
> Thanks,
> Archit.
>


Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-25 Thread Reynold Xin
I'm not entirely sure, but two candidates are

the visit function in stageDependsOn

submitStage






On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson  wrote:

> I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
> like processEvent shouldn't have inherently recursive properties.
>
>
> On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin  wrote:
>
> > It seems to me fixing DAGScheduler to make it not recursive is the better
> > solution here, given the cost of checkpointing.
> >
> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan 
> > wrote:
> >
> > > Hi all
> > >
> > > The description of this bug, as submitted by Matei, is as follows
> > >
> > >
> > > The tipping point seems to be around 50. We should fix this by
> > > checkpointing the RDDs every 10-20 iterations to break the lineage
> chain,
> > > but checkpointing currently requires HDFS installed, which not all
> users
> > > will have.
> > >
> > > We might also be able to fix DAGScheduler to not be recursive.
> > >
> > >
> > > regards,
> > > Andrew
> > >
> > >
> >
>


Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-25 Thread Reynold Xin
It seems to me fixing DAGScheduler to make it not recursive is the better
solution here, given the cost of checkpointing.
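
For reference, the checkpointing workaround described below would look
roughly like this in user code (the paths, iteration counts, and the
stand-in update step are all illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeWithCheckpoints {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("als-like"))
    // On a real cluster this should be a reliable store such as HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    var factors: RDD[Double] = sc.parallelize(1 to 100000).map(_.toDouble).cache()
    for (i <- 1 to 100) {
      factors = factors.map(_ * 0.99).cache()  // stands in for one ALS update
      if (i % 10 == 0) {
        factors.checkpoint()  // mark the RDD for checkpointing
        factors.count()       // force an action so the lineage is actually truncated
      }
    }
    println(factors.count())
    sc.stop()
  }
}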

On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan  wrote:

> Hi all
>
> The description of this bug, as submitted by Matei, is as follows
>
>
> The tipping point seems to be around 50. We should fix this by
> checkpointing the RDDs every 10-20 iterations to break the lineage chain,
> but checkpointing currently requires HDFS installed, which not all users
> will have.
>
> We might also be able to fix DAGScheduler to not be recursive.
>
>
> regards,
> Andrew
>
>


Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)

2014-01-25 Thread Reynold Xin
+1

> On Jan 25, 2014, at 12:07 PM, Hossein  wrote:
>
> +1
>
> Compiled and tested on Mavericks.
>
> --Hossein
>
>
> On Sat, Jan 25, 2014 at 11:38 AM, Patrick Wendell wrote:
>
>> I'll kick off the voting with a +1.
>>
>> On Thu, Jan 23, 2014 at 11:33 PM, Patrick Wendell 
>> wrote:
>>> Please vote on releasing the following candidate as Apache Spark
>>> (incubating) version 0.9.0.
>>>
>>> A draft of the release notes along with the changes file is attached
>>> to this e-mail.
>>>
>>> The tag to be voted on is v0.9.0-incubating (commit 95d28ff3):
>> https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=95d28ff3d0d20d9c583e184f9e2c5ae842d8a4d9
>>>
>>> The release files, including signatures, digests, etc can be found at:
>>> http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1006/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 0.9.0-incubating!
>>>
>>> The vote is open until Monday, January 27, at 07:30 UTC and passes if a
>>> majority of at least 3 +1 PPMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 0.9.0-incubating
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.incubator.apache.org/
>>


Re: JavaRDD.collect()

2014-01-24 Thread Reynold Xin
The reason is likely because first() is entirely executed on the driver
node in the same process, while collect() needs to connect with worker
nodes.

Usually the first time you run an action, most of the JVM code is not
optimized, and the classloader also needs to load a lot of things on the
fly. Having to connect with other processes via RPC can slow the first
execution down in collect.

That said, if you run this a few times (in the same driver program) and it
is still much slower, you should look into other factors such as network
congestion, cpu/memory load on workers, etc.
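
For illustration, a rough timing sketch for the Spark shell (where sc is already defined; the file path is a placeholder). It is not a proper benchmark, only a way to see whether the gap shrinks once the JVM is warm:

// Run each action a few times so JIT warm-up and classloading are not
// attributed to collect() alone.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(label + " took " + (System.nanoTime() - start) / 1e6 + " ms")
  result
}

val lines = sc.textFile("...")  // placeholder: path to the small metadata file

for (i <- 1 to 3) {
  time("first   #" + i) { lines.first() }
  time("collect #" + i) { lines.collect() }
}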


BTW - this is a dev list for Spark development itself (not Spark
application development). Questions like this probably go better in the
user list in the future.


On Fri, Jan 24, 2014 at 6:34 PM, Chen Jin  wrote:

> Hi Tathagata,
>
> Thanks for the detailed explanation, I thought so too. However,
> currently I only have one text partition which contains two lines.
> Each line is like tens of characters in total. Why is there such a big
> difference between first() and collect()? Instead of 2x, I have around
> 30x difference.
>
>
> On Fri, Jan 24, 2014 at 6:25 PM, Tathagata Das
>  wrote:
> > RDD.first() doesnt have to scan the whole partition. It gets only the
> first
> > item and returns it.
> > RDD.collect() has to scan the whole partition, collect all of it and send
> > all of it back (serialization + deserialization costs, etc.)
> >
> > TD
> >
> >
> > On Fri, Jan 24, 2014 at 5:55 PM, Chen Jin  wrote:
> >
> >> Hi All,
> >>
> >> I have some metadata saved as a single partition on HDFS (a few
> >> hundred bytes) and when I want to get the content of the data:
> >>
> >> JavaRDD<String> blob = sc.textFile(...);
> >> List<String> lines = blob.collect();
> >>
> >> However, collect takes probably more than 3 seconds at least but
> >> first() only take 0.1 second,
> >>
> >> Could you advise on what's the best practice to read small files using
> >> spark.
> >>
> >> -chen
> >>
> >>
> >> On Fri, Jan 24, 2014 at 3:23 PM, Kapil Malik  wrote:
> >> > Hi Andrew,
> >> >
> >> >
> >> >
> >> > Here's the exception I get while trying to build an OSGi bundle using
> >> maven SCR plugin -
> >> >
> >> >
> >> >
> >> > [ERROR] Failed to execute goal
> >> org.apache.felix:maven-scr-plugin:1.9.0:scr
> (generate-scr-scrdescriptor) on
> >> project repo-spark: Execution generate-scr-scrdescriptor of goal
> >> org.apache.felix:maven-scr-plugin:1.9.0:scr failed: Invalid signature
> file
> >> digest for Manifest main attributes -> [Help 1]
> >> >
> >> > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> >> execute goal org.apache.felix:maven-scr-plugin:1.9.0:scr
> >> (generate-scr-scrdescriptor) on project repo-spark: Execution
> >> generate-scr-scrdescriptor of goal
> >> org.apache.felix:maven-scr-plugin:1.9.0:scr failed: Invalid signature
> file
> >> digest for Manifest main attributes
> >> >
> >> >   at
> >>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
> >> >
> >> > ...
> >> >
> >> > Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> >> generate-scr-scrdescriptor of goal
> >> org.apache.felix:maven-scr-plugin:1.9.0:scr failed: Invalid signature
> file
> >> digest for Manifest main attributes
> >> >
> >> >   at
> >>
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
> >> >
> >> >   at
> >>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
> >> >
> >> >   ... 19 more
> >> >
> >> > Caused by: java.lang.SecurityException: Invalid signature file digest
> >> for Manifest main attributes
> >> >
> >> >   at
> >>
> sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:240)
> >> >
> >> > ...
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Also, from eclipse, if I build a simple main program. Then, I can
> create
> >> an executable JAR in 3 ways -
> >> >
> >> > a.   Extract required libraries into generated JAR ( individual
> >> classes inside my JAR)
> >> >
> >> > On running main program on this JAR –
> >> >
> >> > Exception in thread "main"
> com.typesafe.config.ConfigException$Missing:
> >> No configuration setting found for key
> 'akka.remote.log-received-messages'
> >> >
> >> > at
> >> com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:126)
> >> >
> >> > at
> >> com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:146)
> >> >
> >> >
> >> >
> >> > b.  Package required libraries into generated JAR (all JARs inside
> >> my JAR)
> >> >
> >> > On running main program on this JAR –
> >> >
> >> > Exception in thread "main" java.lang.reflect.InvocationTargetException
> >> >
> >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >
> >> > at
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> >
> >> > at
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Re: [DISCUSS] Graduating as a TLP

2014-01-23 Thread Reynold Xin
+1 supporting Matei as the VP.


On Thu, Jan 23, 2014 at 4:11 PM, Chris Mattmann  wrote:

> +1 from me.
>
> I'll throw Matei's name into the hat for VP. He's done a great job
> and has stood out to me with his report filing and tenacity and
> would make an excellent chair.
>
> Being a chair entails:
>
> 1. being the eyes and ears of the board on the project.
> 2. filing a monthly (first 3 months, then quarterly) board report
> similar to the incubator report.
>
> Not too bad.
>
> +1 for graduation from me binding when the VOTE comes. We need our
> mentors and IPMC members to chime in and we should be in time for
> February 2014 board meeting.
>
> Cheers,
> Chris
>
>
> -Original Message-
> From: Matei Zaharia 
> Reply-To: "dev@spark.incubator.apache.org"  >
> Date: Thursday, January 23, 2014 2:45 PM
> To: "dev@spark.incubator.apache.org" 
> Subject: [DISCUSS] Graduating as a TLP
>
> >Hi folks,
> >
> >We've been working on the transition to Apache for a while, and our last
> >shepherd's report says the following:
> >
> >
> >Spark
> >
> >Alan Cabrera (acabrera):
> >
> >  Seems like a nice active project.  IMO, there's no need to wait for the import
> >  to JIRA to graduate. Seems like they can graduate now.
> >
> >
> >What do you think about graduating to a top-level project? As far as I
> >can tell, we've completed all the requirements for graduating:
> >
> >- Made 2 releases (working on a third now)
> >- Added new committers and PPMC members (4 of them)
> >- Did IP clearance
> >- Moved infrastructure over to Apache, except for the JIRA above, which
> >INFRA is working on and which shouldn't block us.
> >
> >If everything is okay, I'll call a VOTE on graduating in 48 hours. The
> >one final thing missing is that we'll need to nominate an initial VP for
> >the project.
> >
> >Matei
>
>
>


Re: [DISCUSS] Graduating as a TLP

2014-01-23 Thread Reynold Xin
+1


On Thu, Jan 23, 2014 at 2:45 PM, Matei Zaharia wrote:

> Hi folks,
>
> We’ve been working on the transition to Apache for a while, and our last
> shepherd’s report says the following:
>
> 
> Spark
>
> Alan Cabrera (acabrera):
>
>   Seems like a nice active project.  IMO, there's no need to wait for the import
>   to JIRA to graduate. Seems like they can graduate now.
> 
>
> What do you think about graduating to a top-level project? As far as I can
> tell, we’ve completed all the requirements for graduating:
>
> - Made 2 releases (working on a third now)
> - Added new committers and PPMC members (4 of them)
> - Did IP clearance
> - Moved infrastructure over to Apache, except for the JIRA above, which
> INFRA is working on and which shouldn’t block us.
>
> If everything is okay, I’ll call a VOTE on graduating in 48 hours. The one
> final thing missing is that we’ll need to nominate an initial VP for the
> project.
>
> Matei


Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc3)

2014-01-20 Thread Reynold Xin
That's a perm gen issue - you need to adjust the perm gen size. In sbt it
should've been set automatically, but I think for Maven, you need to set
the maven opts, which is documented in the build instructions.


On Sun, Jan 19, 2014 at 11:35 PM, Ewen Cheslack-Postava
wrote:

> I can't get the tests to run on a Mac, 10.7.5, java -version output:
>
> java version "1.6.0_65"
> Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
> Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
> For reference, Spark 0.8.* builds and tests fine on the same configuration.
> 0.9.0-rc3 fails *after* PrimitiveVectorSuite, I'm not sure what it's
> running at that time since all the tests in PrimitiveVectorSuite seem to
> have finished:
>
> [info] PrimitiveVectorSuite:
> [info] - primitive value (4 milliseconds)
> [info] - non-primitive value (5 milliseconds)
> [info] - ideal growth (4 milliseconds)
> [info] - ideal size (5 milliseconds)
> [info] - resizing (6 milliseconds)
> [ERROR] [01/19/2014 23:16:27.508] [spark-akka.actor.default-dispatcher-4]
> [ActorSystem(spark)] exception while executing timer task
> org.apache.spark.SparkException: Error sending message to
> BlockManagerMaster [message = HeartBeat(BlockManagerId(, localhost,
> 51634, 0))]
> at
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:176)
> at
> org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
> at org.apache.spark.storage.BlockManager.org
> $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
> at
> org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
> at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.run(Scheduler.scala:464)
> at
> akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:281)
> at
> akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:280)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at akka.actor.LightArrayRevolverScheduler.close(Scheduler.scala:279)
> at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:630)
> at
> akka.actor.ActorSystemImpl$$anonfun$_start$1.apply$mcV$sp(ActorSystem.scala:582)
> at
> akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
> at
> akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
> at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:596)
> at
> akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.runNext$1(ActorSystem.scala:750)
> at
> akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply$mcV$sp(ActorSystem.scala:753)
> at
> akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
> at
> akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
> at akka.util.ReentrantGuard.withGuard(LockUtil.scala:15)
> at
> akka.actor.ActorSystemImpl$TerminationCallbacks.run(ActorSystem.scala:746)
> at
> akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
> at
> akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.pattern.AskTimeoutException:
> Recipient[Actor[akka://spark/user/BlockManagerMaster#927284646]] had
> already been terminated.
> at akka.pattern.Askable

Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc2)

2014-01-19 Thread Reynold Xin
+1


On Sat, Jan 18, 2014 at 11:11 PM, Patrick Wendell wrote:

> I'll kick of the voting with a +1.
>
> On Sat, Jan 18, 2014 at 11:05 PM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark
> > (incubating) version 0.9.0.
> >
> > A draft of the release notes along with the changes file is attached
> > to this e-mail.
> >
> > The tag to be voted on is v0.9.0-incubating (commit 00c847a):
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=00c847af1d4be2fe5fad887a57857eead1e517dc
> >
> > The release files, including signatures, digests, etc can be found at:
> > http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1003/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2-docs/
> >
> > Please vote on releasing this package as Apache Spark 0.9.0-incubating!
> >
> > The vote is open until Wednesday, January 22, at 07:05 UTC
> > and passes if a majority of at least 3 +1 PPMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 0.9.0-incubating
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.incubator.apache.org/
>


Re: Config properties broken in master

2014-01-18 Thread Reynold Xin
I also just went over the config options to see how pervasive this is. In
addition to speculation, there is one more "conflict" of this kind:

spark.locality.wait
spark.locality.wait.node
spark.locality.wait.process
spark.locality.wait.rack


spark.speculation
spark.speculation.interval
spark.speculation.multiplier
spark.speculation.quantile
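
For illustration, a small sketch of the conflict, assuming Typesafe Config's documented parseProperties behavior (when a key is both a plain value and the parent of other keys, the plain value is dropped):

import java.util.Properties
import com.typesafe.config.ConfigFactory

object ConfigConflictSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("spark.speculation", "true")
    props.setProperty("spark.speculation.multiplier", "0.95")

    val conf = ConfigFactory.parseProperties(props)

    // "spark.speculation" now resolves to an object holding "multiplier",
    // so the boolean value set above is no longer retrievable.
    println(conf.getValue("spark.speculation").valueType())   // OBJECT
    println(conf.getDouble("spark.speculation.multiplier"))   // 0.95
  }
}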


On Sat, Jan 18, 2014 at 11:36 AM, Matei Zaharia wrote:

> This is definitely an important issue to fix. Instead of renaming
> properties, one solution would be to replace Typesafe Config with just
> reading Java system properties, and disable config files for this release.
> I kind of like that over renaming.
>
> Matei
>
> On Jan 18, 2014, at 11:30 AM, Mridul Muralidharan 
> wrote:
>
> > Hi,
> >
> >  Speculation was an example, there are others in spark which are
> > affected by this ...
> > Some of them have been around for a while, so will break existing
> code/scripts.
> >
> > Regards,
> > Mridul
> >
> > On Sun, Jan 19, 2014 at 12:51 AM, Nan Zhu 
> wrote:
> >> change spark.speculation to spark.speculation.switch?
> >>
> >> maybe we can restrict that all properties in Spark should be "three
> levels"
> >>
> >>
> >> On Sat, Jan 18, 2014 at 2:10 PM, Mridul Muralidharan  >wrote:
> >>
> >>> Hi,
> >>>
> >>>  Unless I am mistaken, the change to using typesafe ConfigFactory has
> >>> broken some of the system properties we use in spark.
> >>>
> >>> For example: if we have both
> >>> -Dspark.speculation=true -Dspark.speculation.multiplier=0.95
> >>> set, then the spark.speculation property is dropped.
> >>>
> >>> The rules of parseProperty actually document this clearly [1]
> >>>
> >>>
> >>> I am not sure what the right fix here would be (other than replacing
> >>> use of config that is).
> >>>
> >>> Any thoughts ?
> >>> I would vote -1 for 0.9 to be released before this is fixed.
> >>>
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>>
> >>> [1]
> >>>
> http://typesafehub.github.io/config/latest/api/com/typesafe/config/ConfigFactory.html#parseProperties%28java.util.Properties,%20com.typesafe.config.ConfigParseOptions%29
> >>>
>
>


Re: Is there any plan to develop an application level fair scheduler?

2014-01-17 Thread Reynold Xin
It does.

There are two scheduling levels here.

The first level is what the cluster manager does. The standalone cluster
manager for Spark only supports FIFO at the moment at the level of
applications.

Regarding Spark itself: within a single Spark application, both FIFO and
fair scheduling are supported, regardless of what your cluster manager is.
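
For illustration, a minimal sketch of turning on fair scheduling within one application, using the spark.scheduler.mode and spark.scheduler.pool properties described in the job scheduling docs (the master URL and pool name are placeholders):

import org.apache.spark.SparkContext

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    // Must be set before the SparkContext is created.
    System.setProperty("spark.scheduler.mode", "FAIR")

    val sc = new SparkContext("spark://master:7077", "FairSchedulingSketch")

    // Jobs submitted from this thread go to the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.parallelize(1 to 1000).count()

    sc.stop()
  }
}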



On Fri, Jan 17, 2014 at 12:17 PM, Evan Chan  wrote:

> What is the reason that standalone mode doesn't support the fair scheduler?
> Does that mean that Mesos coarse mode also doesn't support the fair
> scheduler?
>
>
> On Tue, Jan 14, 2014 at 8:10 PM, Matei Zaharia  >wrote:
>
> > This is true for now, we didn’t want to replicate those systems. But it
> > may change if we see demand for fair scheduling in our standalone cluster
> > manager.
> >
> > Matei
> >
> > On Jan 14, 2014, at 6:32 PM, Xia, Junluan  wrote:
> >
> > > Yes, Spark depends on Yarn or Mesos for application level scheduling.
> > >
> > > -Original Message-
> > > From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> > > Sent: Tuesday, January 14, 2014 9:43 PM
> > > To: dev@spark.incubator.apache.org
> > > Subject: Re: Is there any plan to develop an application level fair
> > scheduler?
> > >
> > > Hi, Junluan,
> > >
> > > Thank you for the reply
> > >
> > > but for the long-term plan, Spark will depend on Yarn and Mesos for
> > application level scheduling in the coming versions?
> > >
> > > Best,
> > >
> > > --
> > > Nan Zhu
> > >
> > >
> > > On Tuesday, January 14, 2014 at 12:56 AM, Xia, Junluan wrote:
> > >
> > >> Are you sure that you must deploy Spark in standalone mode? (It
> > currently only supports FIFO.)
> > >>
> > >> If you could setup Spark on Yarn or Mesos, then it has supported Fair
> > scheduler in application level.
> > >>
> > >> -Original Message-
> > >> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> > >> Sent: Tuesday, January 14, 2014 10:13 AM
> > >> To: dev@spark.incubator.apache.org (mailto:
> > dev@spark.incubator.apache.org)
> > >> Subject: Is there any plan to develop an application level fair
> > scheduler?
> > >>
> > >> Hi, All
> > >>
> > >> Is there any plan to develop an application level fair scheduler?
> > >>
> > >> I think it will have more value than a fair scheduler within the
> > application (actually I didn’t understand why we want to fairly share the
> > resources among jobs within the application; usually, users submit
> > different applications, not jobs)…
> > >>
> > >> Best,
> > >>
> > >> --
> > >> Nan Zhu
> > >>
> > >>
> > >
> > >
> >
> >
>
>
> --
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com  |
>
> 
> <
> http://www.twitter.com/ooyala>
>


Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc1)

2014-01-16 Thread Reynold Xin
+1


On Thu, Jan 16, 2014 at 3:23 PM, Matei Zaharia wrote:

> +1 for me as well.
>
> I built and tested this on Mac OS X, and looked through the new docs.
>
> Matei
>
> On Jan 15, 2014, at 5:48 PM, Patrick Wendell  wrote:
>
> > Please vote on releasing the following candidate as Apache Spark
> > (incubating) version 0.9.0.
> >
> > A draft of the release notes along with the changes file is attached
> > to this e-mail.
> >
> > The tag to be voted on is v0.9.0-incubating (commit 7348893):
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=7348893f0edd96dacce2f00970db1976266f7008
> >
> > The release files, including signatures, digests, etc can be found at:
> > http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1001/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 0.9.0-incubating!
> >
> > The vote is open until Sunday, January 19, at 02:00 UTC
> > and passes if a majority of at least 3 +1 PPMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 0.9.0-incubating
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.incubator.apache.org/
>
>


Re: Contribute SimRank algorightm to mllib

2014-01-10 Thread Reynold Xin
Hi Jerry,

Why don't you submit a pull request and then we can discuss there? If
SimRank is not common enough, we might take the matrix multiplication
method in and merge that. At the very least, even if SimRank doesn't get
merged into Spark, we can include a contrib package or a Wiki page that
links to examples of various algorithms community members have implemented.




On Thu, Jan 9, 2014 at 9:29 PM, Shao, Saisai  wrote:

> Hi All,
>
> We would like to contribute the SimRank algorithm to mllib. SimRank is
> used to calculate a similarity rank between two objects based on graph
> structure; details can be seen in (
> http://ilpubs.stanford.edu:8090/508/1/2001-41.pdf), here we implemented a
> matrix multiplication method based on basic algorithm, the description of
> matrix multiplication method can be seen in (
> http://www.cse.unsw.edu.au/~zhangw/files/wwwj.pdf) chapter 4.1.
>
> The implementation is abstracted and generalized from our customer's real
> case; we made some tradeoffs to improve the speed and reduce the shuffle
> size. We just wondered whether this algorithm would be suitable to put into mllib?
> What else should we take care of?
>
> Any suggestion would be really appreciated.
>
> Thanks
> Jerry
>


Re: spark code formatter?

2014-01-08 Thread Reynold Xin
Thanks for doing that, DB. Not sure about others, but I'm actually strongly
against blanket automatic code formatters, given that they can be
disruptive. Often humans would intentionally choose to style things in a
certain way for more clear semantics and better readability. Code
formatters don't capture these nuances. It is pretty dangerous to just auto
format everything.

Maybe it'd be ok if we restrict the code formatters to a very limited set
of things, such as indenting function parameters, etc.


On Wed, Jan 8, 2014 at 10:28 PM, DB Tsai  wrote:

> A pull request for scalariform.
> https://github.com/apache/incubator-spark/pull/365
>
> Sincerely,
>
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --
> Web: http://alpinenow.com/
>
>
> On Wed, Jan 8, 2014 at 10:09 PM, DB Tsai  wrote:
> > We use sbt-scalariform in our company, and it can automatically format
> > the coding style when runs `sbt compile`.
> >
> > https://github.com/sbt/sbt-scalariform
> >
> > We ask our developers to run `sbt compile` before commit, and it's
> > really nice to see everyone has the same spacing and indentation.
> >
> > Sincerely,
> >
> > DB Tsai
> > Machine Learning Engineer
> > Alpine Data Labs
> > --
> > Web: http://alpinenow.com/
> >
> >
> > On Wed, Jan 8, 2014 at 9:50 PM, Reynold Xin  wrote:
> >> We have a Scala style configuration file in Shark:
> >> https://github.com/amplab/shark/blob/master/scalastyle-config.xml
> >>
> >> However, the scalastyle project is still pretty primitive and doesn't
> cover
> >> most of the use cases. It is still great to include it to cover basic
> >> checks such as 100-char wide lines.
> >>
> >>
> >> On Wed, Jan 8, 2014 at 8:02 PM, Matei Zaharia  >wrote:
> >>
> >>> Not that I know of. This would be very useful to add, especially if we
> can
> >>> make SBT automatically check the code style (or we can somehow plug
> this
> >>> into Jenkins).
> >>>
> >>> Matei
> >>>
> >>> On Jan 8, 2014, at 11:00 AM, Michael Allman  wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > I've read the spark code style guide for contributors here:
> >>> >
> >>> >
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> >>> >
> >>> > For scala code, do you have a scalariform configuration that you use
> to
> >>> format your code to these specs?
> >>> >
> >>> > Cheers,
> >>> >
> >>> > Michael
> >>>
> >>>
>


Re: spark code formatter?

2014-01-08 Thread Reynold Xin
We have a Scala style configuration file in Shark:
https://github.com/amplab/shark/blob/master/scalastyle-config.xml

However, the scalastyle project is still pretty primitive and doesn't cover
most of the use cases. It is still great to include it to cover basic
checks such as 100-char wide lines.


On Wed, Jan 8, 2014 at 8:02 PM, Matei Zaharia wrote:

> Not that I know of. This would be very useful to add, especially if we can
> make SBT automatically check the code style (or we can somehow plug this
> into Jenkins).
>
> Matei
>
> On Jan 8, 2014, at 11:00 AM, Michael Allman  wrote:
>
> > Hi,
> >
> > I've read the spark code style guide for contributors here:
> >
> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> >
> > For scala code, do you have a scalariform configuration that you use to
> format your code to these specs?
> >
> > Cheers,
> >
> > Michael
>
>


Re: multinomial logistic regression

2014-01-06 Thread Reynold Xin
Thanks. Why don't you submit a pr and then we can work on it?

> On Jan 6, 2014, at 6:15 PM, Michael Kun Yang  wrote:
>
> Hi Hossein,
>
> I can still use LabeledPoint with little modification. Currently I convert
> the category into {0, 1} sequence, but I can do the conversion in the body
> of methods or functions.
>
> In order to make the code run faster, I try not to use DoubleMatrix
> abstraction to avoid memory allocation; another reason is that jblas has no
> data structure to handle symmetric matrix addition efficiently.
>
> My code is not very pretty because I handle matrix operations manually (by
> indexing).
>
> If you think it is ok, I will make a pull request.
>
>
>> On Mon, Jan 6, 2014 at 5:34 PM, Hossein  wrote:
>>
>> Hi Michael,
>>
>> This sounds great. Would you please send these as a pull request.
>> Especially if you can make your Newton method implementation generic such
>> that it can later be used by other algorithms, it would be very helpful.
>> For example, you could add it as another optimization method under
>> mllib/optimization.
>>
>> Was there a particular reason you chose not use LabeledPoint?
>>
>> We have some instructions for contributions here: <
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>>
>> Thanks,
>>
>> --Hossein
>>
>>
>> On Mon, Jan 6, 2014 at 11:33 AM, Michael Kun Yang >> wrote:
>>
>>> I actually have two versions:
>>> one is based on gradient descent like the logistic regression on mllib.
>>> the other is based on Newton iteration; it is not as fast as SGD, but we
>>> can get all the statistics from it like deviance, p-values and fisher
>> info.
>>>
>>> we can get confusion matrix in both versions
>>>
>>> the gradient descent version is just a modification of logistic
>> regression
>>> with my own implementation. I did not use LabeledPoints class.
>>>
>>>
>>> On Mon, Jan 6, 2014 at 11:13 AM, Evan Sparks 
>>> wrote:
>>>
 Hi Michael,

 What strategy are you using to train the multinomial classifier?
 One-vs-all? I've got an optimized version of that method that I've been
 meaning to clean up and commit for a while. In particular, rather than
 shipping a (potentially very big) model with each map task, I ship it
>>> once
 before each iteration with a broadcast variable. Perhaps we can compare
 versions and incorporate some of my optimizations into your code?

 Thanks,
 Evan

>> On Jan 6, 2014, at 10:57 AM, Michael Kun Yang 
> wrote:
>
> Hi Spark-ers,
>
> I implemented a SGD version of multinomial logistic regression based
>> on
> mllib's optimization package. If this classifier is in the future
>> plan
>>> of
> mllib, I will be happy to contribute my code.
>
> Cheers
>>
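
For illustration, a rough Spark shell sketch of the broadcast-per-iteration approach mentioned in the quoted thread: the weight vector is shipped once per iteration via a broadcast variable instead of being captured in every task closure. The gradient computation here is a dummy stand-in, not MLlib code.

val numFeatures = 10
val data = sc.parallelize(1 to 1000).map(i => Array.fill(numFeatures)(i.toDouble))

var weights = Array.fill(numFeatures)(0.0)

for (iteration <- 1 to 5) {
  val bcWeights = sc.broadcast(weights)          // shipped once per iteration
  val gradient = data
    .map(x => x.zip(bcWeights.value).map { case (xi, wi) => xi - wi })  // dummy gradient
    .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  weights = weights.zip(gradient).map { case (wi, gi) => wi + 0.001 * gi }
}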


Re: Build Changes for SBT Users

2014-01-05 Thread Reynold Xin
Why is it not possible? You can always update the script; you just can't update
scripts for released versions.




On Sat, Jan 4, 2014 at 9:07 PM, Patrick Wendell  wrote:

> I agree TD - I was just saying that Reynold's proposal that we could
> update the release post-hoc is unfortunately not possible.
>
> On Sat, Jan 4, 2014 at 7:13 PM, Tathagata Das
>  wrote:
> > Patrick, that is right. All we are trying to ensure is to make a
> > "best-effort" attempt to make it smooth for a new user. The script will
> try
> > its best to automatically install / download sbt for the user. The
> fallback
> > will be that the user will have to install sbt on their own. If the URL
> > happens to change and our script fails to automatically download, then we
> > are *no worse* than not providing the script at all.
> >
> > TD
> >
> >
> > On Sat, Jan 4, 2014 at 7:06 PM, Patrick Wendell 
> wrote:
> >
> >> Reynold the issue is releases are immutable and we expect them to be
> >> downloaded for several years after the release date.
> >>
> >> On Sat, Jan 4, 2014 at 5:57 PM, Xuefeng Wu  wrote:
> >> > Sounds reasonable.  But I think few people have sbt installed even though it is easy to
> >> install.  We could provide this script in the online documentation, so a
> user
> >> could download it to install sbt independently. Sounds like yet
> >> another brew install sbt?
> >> > :)
> >> >
> >> > Yours, Xuefeng Wu 吴雪峰 敬上
> >> >
> >> >> On Jan 5, 2014, at 2:56 AM, Patrick Wendell  wrote:
> >> >>
> >> >> We thought about this but elected not to do this for a few reasons.
> >> >>
> >> >> 1. Some people build from machines that do not have internet access
> >> >> for security reasons and retrieve dependency from internal nexus
> >> >> repositories. So having a build dependency that relies on internet
> >> >> downloads is not desirable.
> >> >>
> >> >> 2. It's a hard to ensure stability of a particular URL in perpetuity.
> >> >> This is why maven central and other mirror networks exist. Keep in
> >> >> mind that we can't change the release code ever once we release it,
> >> >> and if something changed about the particular URL it could break the
> >> >> build.
> >> >>
> >> >> - Patrick
> >> >>
> >> >>> On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash 
> >> wrote:
> >> >>> +1 on bundling a script similar to that one
> >> >>>
> >> >>>
> >>  On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau  >
> >> wrote:
> >> 
> >>  Could we ship a shell script which downloads the sbt jar if not
> >> present
> >>  (like for example
> https://github.com/holdenk/slashem/blob/master/sbt)?
> >> 
> >> 
> >>  On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell <
> pwend...@gmail.com>
> >>  wrote:
> >> 
> >> > Hey All,
> >> >
> >> > Due to an ASF requirement, we recently merged a patch which
> removes
> >> > the sbt jar from the build. This is necessary because we aren't
> >> > allowed to distributed binary artifacts with our source packages.
> >> >
> >> > This means that instead of building Spark with "sbt/sbt XXX",
> you'll
> >> > need to have sbt yourself and just run "sbt XXX" from within the
> >> Spark
> >> > directory. This is similar to the maven build, where we expect
> users
> >> > already have maven installed.
> >> >
> >> > You can download sbt at http://www.scala-sbt.org/. It's okay to
> just
> >> > download the most recent version of sbt, since sbt knows how to
> fetch
> >> > other versions of itself and will always use the one we specify in
> >> our
> >> > build file to compile spark.
> >> >
> >> > - Patrick
> >> 
> >> 
> >> 
> >>  --
> >>  Cell : 425-233-8271
> >> 
> >>
>


Re: Build Changes for SBT Users

2014-01-04 Thread Reynold Xin
Doesn't Apache do redirection from the incubator site to the normal website also?
 By the time that happens, we can also update the URL in the script?


On Sat, Jan 4, 2014 at 4:13 PM, Patrick Wendell  wrote:

> Hey Holden,
>
> That sounds reasonable to me. Where would we get a url we can control
> though? Right now the project has web space is at incubator.apache...
> but later this will change to a full apache domain. Is there somewhere
> in maven central these jars are hosted... that would be the nicest
> because things like repo1.maven.org basically never changes.
>
> - Patrick
>
> On Sat, Jan 4, 2014 at 1:20 PM, Holden Karau  wrote:
> > That makes sense, I think we could structure a script in such a way that
> it
> > would overcome these problems though and probably provide a fair a mount
> of
> > benefit for people who just want to get started quickly.
> >
> > The easiest would be to have it use the system sbt if present and then
> fall
> > back to downloading the sbt jar. As far as stability of the URL goes we
> > could solve this by either having it point at a domain we control, or
> just
> > with an clear error message indicating it failed to download sbt and the
> > user needs to install sbt.
> >
> > If a restructured script in that manner would be useful I could whip up a
> > pull request :)
> >
> >
> > On Sat, Jan 4, 2014 at 10:56 AM, Patrick Wendell 
> wrote:
> >
> >> We thought about this but elected not to do this for a few reasons.
> >>
> >> 1. Some people build from machines that do not have internet access
> >> for security reasons and retrieve dependency from internal nexus
> >> repositories. So having a build dependency that relies on internet
> >> downloads is not desirable.
> >>
> >> 2. It's a hard to ensure stability of a particular URL in perpetuity.
> >> This is why maven central and other mirror networks exist. Keep in
> >> mind that we can't change the release code ever once we release it,
> >> and if something changed about the particular URL it could break the
> >> build.
> >>
> >> - Patrick
> >>
> >> On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash 
> wrote:
> >> > +1 on bundling a script similar to that one
> >> >
> >> >
> >> > On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau 
> >> wrote:
> >> >
> >> >> Could we ship a shell script which downloads the sbt jar if not
> present
> >> >> (like for example https://github.com/holdenk/slashem/blob/master/sbt)?
> >> >>
> >> >>
> >> >> On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell  >
> >> >> wrote:
> >> >>
> >> >> > Hey All,
> >> >> >
> >> >> > Due to an ASF requirement, we recently merged a patch which removes
> >> >> > the sbt jar from the build. This is necessary because we aren't
> >> >> > allowed to distributed binary artifacts with our source packages.
> >> >> >
> >> >> > This means that instead of building Spark with "sbt/sbt XXX",
> you'll
> >> >> > need to have sbt yourself and just run "sbt XXX" from within the
> Spark
> >> >> > directory. This is similar to the maven build, where we expect
> users
> >> >> > already have maven installed.
> >> >> >
> >> >> > You can download sbt at http://www.scala-sbt.org/. It's okay to
> just
> >> >> > download the most recent version of sbt, since sbt knows how to
> fetch
> >> >> > other versions of itself and will always use the one we specify in
> our
> >> >> > build file to compile spark.
> >> >> >
> >> >> > - Patrick
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Cell : 425-233-8271
> >> >>
> >>
> >
> >
> >
> > --
> > Cell : 425-233-8271
>


Re: Terminology: "worker" vs "slave"

2014-01-02 Thread Reynold Xin
It is historic.

I think we are converging towards

worker: the "slave" daemon in the standalone cluster manager

executor: the jvm process that is launched by the worker that executes tasks



On Thu, Jan 2, 2014 at 10:39 PM, Andrew Ash  wrote:

> The terms worker and slave seem to be used interchangeably.  Are they the
> same?
>
> Worker is used more frequently in the codebase:
>
> aash@aash-mbp ~/git/spark$ git grep -i worker | wc -l
>  981
> aash@aash-mbp ~/git/spark$ git grep -i slave | wc -l
>  348
> aash@aash-mbp ~/git/spark$
>
> Does it make sense to unify on one or the other?
>


Re: Disallowing null mergeCombiners

2013-12-31 Thread Reynold Xin
I added the option that doesn't require the caller to specify the
mergeCombiner closure a while ago when I wanted to disable mapSideCombine.
In virtually all use cases I know of, it is fine & easy to specify a
mergeCombiner, so I'm all for this given it simplifies the codebase.
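
For illustration, a minimal Spark shell sketch of combineByKey with all three closures supplied; mergeCombiners is the one that merges the per-partition (or, with external aggregation, on-disk) partial results for the same key:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val sums = pairs.combineByKey(
  (v: Int) => v,                  // createCombiner: first value seen for a key
  (c: Int, v: Int) => c + v,      // mergeValue: fold a value into a combiner within a partition
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge combiners produced by different partitions
)

sums.collect()  // e.g. Array((a,4), (b,6))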


On Tue, Dec 31, 2013 at 5:05 PM, Patrick Wendell  wrote:

> Hey All,
>
> There is a small API change that we are considering for the external
> sort patch. Previously we allowed mergeCombiner to be null when map
> side aggregation was not enabled. This is because it wasn't necessary
> in that case since mappers didn't ship pre-aggregated values to
> reducers.
>
> Because the external sort capability also relies on the mergeCombiner
> function to merge partially-aggregated on-disk segments, we now need
> it all the time, even if map side aggregation is not enabled. This is a
> fairly esoteric thing that I'm not sure anyone other than Shark ever
> used, but I want to check in case anyone had feelings about this.
>
> The relevant code is here:
>
>
> https://github.com/apache/incubator-spark/pull/303/files#diff-f70e97c099b5eac05c75288cb215e080R72
>
> - Patrick
>


Re: Spark graduate project ideas

2013-12-31 Thread Reynold Xin
There is a recent discussion on academic projects on Spark.

Take a look at the replies to that email (unfortunately you have to dig
through the archive to find the replies):
http://mail-archives.apache.org/mod_mbox/spark-dev/201312.mbox/%3CCAHH8_ON-2y69fBfVtt6pngWtEPOZdsmvt4hZ=doe-dzsk6k...@mail.gmail.com%3E



On Wed, Dec 25, 2013 at 5:21 AM, Фёдор Короткий wrote:

> Hi,
>
> Currently I'm pursuing a masters degree in CS and I'm in search of my year
> project theme (in distributed systems field), and Spark seems very
> interesting to me.
>
> Can you suggest some problems or ideas to work on?
>
> By the way, what is the status of external sorting(
> https://spark-project.atlassian.net/browse/SPARK-983)?
>


Re: Systematically performance diagnose

2013-12-30 Thread Reynold Xin
The application web ui is pretty useful. We have been adding more and more
information to the web ui for easier performance analysis.

Look at Patrick Wendell's two talks at the Spark Summit for more
information: http://spark-summit.org/summit-2013/


On Sat, Dec 28, 2013 at 8:12 PM, Hao Lin  wrote:

> Hi folks,
>
> I am trying to test the performance on a couple of my Spark applications.
> For benchmarking purpose, I am wondering if there is a good performance
> analysis practice. The best way I can think of is to instrument log prints
> and analyze the timestamps in logs on each node.
>
> The major metrics I am interested in are computation ratios (computation
> time, data transferring time, basically a timeline of detailed events),
> memory usage, disk throughput. Could I have some suggestions on how Spark
> is benchmarked.
>
> Thanks,
>
> Max
>


Re: test suite results in OOME

2013-12-30 Thread Reynold Xin
Again, I usually use sbt ...

sbt/sbt "test-only *TaskResultGetterSuite*"


On Sat, Dec 28, 2013 at 2:04 PM, Ted Yu  wrote:

> According to the Build Tools slide from Matei's slides, a transition toward maven only is
> happening.
>
> That was why I used mvn.
>
> BTW I specified the following on the commandline:
> -Dtest=TaskResultGetterSuite
>
> Many other test suites were run.
>
> How can I run one suite ?
>
> Thanks
>
>
> On Sat, Dec 28, 2013 at 3:13 PM, Reynold Xin  wrote:
>
> > I usually use sbt. i.e. sbt/sbt test
> >
> >
> >
> >
> > On Sat, Dec 28, 2013 at 7:00 AM, Ted Yu  wrote:
> >
> > > Hi,
> > > I used the following setting to run test suite:
> > > export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=812M
> > > -XX:ReservedCodeCacheSize=512m"
> > >
> > > I got:
> > >
> > > [ERROR] [12/28/2013 08:34:03.747]
> > > [sparkWorker1-akka.actor.default-dispatcher-14]
> > [ActorSystem(sparkWorker1)]
> > > Uncaught fatal error from thread
> > > [sparkWorker1-akka.actor.default-dispatcher-14] shutting down
> ActorSystem
> > > [sparkWorker1]
> > > java.lang.OutOfMemoryError: PermGen space
> > >
> > > How do I run test suite on Mac ?
> > >
> > > Thanks
> > >
> >
>


Re: test suite results in OOME

2013-12-28 Thread Reynold Xin
I usually use sbt. i.e. sbt/sbt test




On Sat, Dec 28, 2013 at 7:00 AM, Ted Yu  wrote:

> Hi,
> I used the following setting to run test suite:
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=812M
> -XX:ReservedCodeCacheSize=512m"
>
> I got:
>
> [ERROR] [12/28/2013 08:34:03.747]
> [sparkWorker1-akka.actor.default-dispatcher-14] [ActorSystem(sparkWorker1)]
> Uncaught fatal error from thread
> [sparkWorker1-akka.actor.default-dispatcher-14] shutting down ActorSystem
> [sparkWorker1]
> java.lang.OutOfMemoryError: PermGen space
>
> How do I run test suite on Mac ?
>
> Thanks
>


Re: Option folding idiom

2013-12-26 Thread Reynold Xin
I'm not strongly against Option.fold, but I find the readability getting
worse for the use case you brought up.  For the use case of if/else, I find
Option.fold pretty confusing because it reverses the order of Some vs None.
Also, when code gets long, the lack of an obvious boundary (the only
boundary is "} {") with two closures is pretty confusing.
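
For reference, a small example of the ordering in question: Option.fold takes the ifEmpty (None) branch first and the Some branch second, which is the reverse of the usual Some-first pattern match.

val driver: Option[String] = None

val msg = driver.fold {
  "Could not find running driver"            // used when driver is None
} { d =>
  "Kill request for " + d + " submitted"     // used when driver is Some(d)
}
// msg == "Could not find running driver"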


On Thu, Dec 26, 2013 at 4:23 PM, Mark Hamstra wrote:

> On the contrary, it is the completely natural place for the initial value
> of the accumulator, and provides the expected result of folding over an
> empty collection.
>
> scala> val l: List[Int] = List()
>
> l: List[Int] = List()
>
>
> scala> l.fold(42)(_ + _)
>
> res0: Int = 42
>
>
> scala> val o: Option[Int] = None
>
> o: Option[Int] = None
>
>
> scala> o.fold(42)(_ + 1)
>
> res1: Int = 42
>
>
> On Thu, Dec 26, 2013 at 5:51 PM, Evan Chan  wrote:
>
> > +1 for using more functional idioms in general.
> >
> > That's a pretty clever use of `fold`, but putting the default condition
> > first there makes it not as intuitive.   What about the following, which
> > are more readable?
> >
> > option.map { a => someFuncMakesB() }
> >   .getOrElse(b)
> >
> > option.map { a => someFuncMakesB() }
> >   .orElse { a => otherDefaultB() }.get
> >
> >
> > On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra  > >wrote:
> >
> > > In code added to Spark over the past several months, I'm glad to see
> more
> > > use of `foreach`, `for`, `map` and `flatMap` over `Option` instead of
> > > pattern matching boilerplate.  There are opportunities to push `Option`
> > > idioms even further now that we are using Scala 2.10 in master, but I
> > want
> > > to discuss the issue here a little bit before committing code whose
> form
> > > may be a little unfamiliar to some Spark developers.
> > >
> > > In particular, I really like the use of `fold` with `Option` to cleanly
> > an
> > > concisely express the "do something if the Option is None; do something
> > > else with the thing contained in the Option if it is Some" code
> fragment.
> > >
> > > An example:
> > >
> > > Instead of...
> > >
> > > val driver = drivers.find(_.id == driverId)
> > > driver match {
> > >   case Some(d) =>
> > > if (waitingDrivers.contains(d)) { waitingDrivers -= d }
> > > else {
> > >   d.worker.foreach { w =>
> > > w.actor ! KillDriver(driverId)
> > >   }
> > > }
> > > val msg = s"Kill request for $driverId submitted"
> > > logInfo(msg)
> > > sender ! KillDriverResponse(true, msg)
> > >   case None =>
> > > val msg = s"Could not find running driver $driverId"
> > > logWarning(msg)
> > > sender ! KillDriverResponse(false, msg)
> > > }
> > >
> > > ...using fold we end up with...
> > >
> > > driver.fold
> > >   {
> > > val msg = s"Could not find running driver $driverId"
> > > logWarning(msg)
> > > sender ! KillDriverResponse(false, msg)
> > >   }
> > >   { d =>
> > > if (waitingDrivers.contains(d)) { waitingDrivers -= d }
> > > else {
> > >   d.worker.foreach { w =>
> > > w.actor ! KillDriver(driverId)
> > >   }
> > > }
> > > val msg = s"Kill request for $driverId submitted"
> > > logInfo(msg)
> > > sender ! KillDriverResponse(true, msg)
> > >   }
> > >
> > >
> > > So the basic pattern (and my proposed formatting standard) for folding
> > over
> > > an `Option[A]` from which you need to produce a B (which may be Unit if
> > > you're only interested in side effects) is:
> > >
> > > anOption.fold
> > >   {
> > > // something that evaluates to a B if anOption = None
> > >   }
> > >   { a =>
> > > // something that transforms `a` into a B if anOption = Some(a)
> > >   }
> > >
> > >
> > > Any thoughts?  Does anyone really, really hate this style of coding and
> > > oppose its use in Spark?
> > >
> >
> >
> >
> > --
> > --
> > Evan Chan
> > Staff Engineer
> > e...@ooyala.com  |
> >
> > 
> >  ><
> > http://www.twitter.com/ooyala>
> >
>


Re: Akka problem when using scala command to launch Spark applications in the current 0.9.0-SNAPSHOT

2013-12-24 Thread Reynold Xin
Yup - you are safe if you stick to the official documented method.

A lot of users also use scala for a variety of reasons (e.g. old script)
and that used to work also.


On Tue, Dec 24, 2013 at 10:50 AM, Evan Chan  wrote:

> Hi Reynold,
>
> The default, documented methods of starting Spark all use the assembly jar,
> and thus java, right?
>
> -Evan
>
>
>
> On Fri, Dec 20, 2013 at 11:36 PM, Reynold Xin  wrote:
>
> > It took me hours to debug a problem yesterday on the latest master branch
> > (0.9.0-SNAPSHOT), and I would like to share with the dev list in case
> > anybody runs into this Akka problem.
> >
> > A little background for those of you who haven't followed closely the
> > development of Spark and YARN 2.2: YARN 2.2 uses protobuf 2.5, and Akka
> > uses an older version of protobuf that is not binary compatible. In order
> > to have a single build that is compatible with both YARN 2.2 and pre-2.2
> > YARN/Hadoop, we published a special version of Akka that builds with
> > protobuf shaded (i.e. using a different package name for the protobuf
> > stuff).
> >
> > However, it turned out Scala 2.10 includes a version of Akka jar in its
> > default classpath (look at the lib folder in Scala 2.10 binary
> > distribution). If you use the scala command to launch any Spark
> application
> > on the current master branch, there is a pretty high chance that you
> > wouldn't be able to create the SparkContext (stack trace at the end of
> the
> > email). The problem is that the Akka packaged with Scala 2.10 takes
> > precedence in the classloader over the special Akka version Spark
> includes.
> >
> > Before we have a good solution for this, the workaround is to use java to
> > launch the application instead of scala. All you need to do is to include
> > the right Scala jars (scala-library and scala-compiler) in the classpath.
> > Note that the scala command is really just a simple script that calls
> java
> > with the right classpath.
> >
> >
> > Stack trace:
> >
> > java.lang.NoSuchMethodException:
> > akka.remote.RemoteActorRefProvider.<init>(java.lang.String,
> > akka.actor.ActorSystem$Settings, akka.event.EventStream,
> > akka.actor.Scheduler, akka.actor.DynamicAccess)
> > at java.lang.Class.getConstructor0(Class.java:2763)
> > at java.lang.Class.getDeclaredConstructor(Class.java:2021)
> > at
> >
> >
> akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:77)
> > at scala.util.Try$.apply(Try.scala:161)
> > at
> >
> >
> akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:74)
> > at
> >
> >
> akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
> > at
> >
> >
> akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
> > at scala.util.Success.flatMap(Try.scala:200)
> > at
> >
> >
> akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:85)
> > at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:546)
> > at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
> > at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
> > at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:79)
> > at
> > org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:120)
> > at org.apache.spark.SparkContext.<init>(SparkContext.scala:106)
> >
>
>
>
> --
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com  |
>
> <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><
> http://www.twitter.com/ooyala>
>


Akka problem when using scala command to launch Spark applications in the current 0.9.0-SNAPSHOT

2013-12-20 Thread Reynold Xin
It took me hours to debug a problem yesterday on the latest master branch
(0.9.0-SNAPSHOT), and I would like to share with the dev list in case
anybody runs into this Akka problem.

A little background for those of you who haven't followed closely the
development of Spark and YARN 2.2: YARN 2.2 uses protobuf 2.5, and Akka
uses an older version of protobuf that is not binary compatible. In order
to have a single build that is compatible with both YARN 2.2 and pre-2.2
YARN/Hadoop, we published a special version of Akka that builds with
protobuf shaded (i.e. using a different package name for the protobuf
stuff).

However, it turned out Scala 2.10 includes a version of Akka jar in its
default classpath (look at the lib folder in Scala 2.10 binary
distribution). If you use the scala command to launch any Spark application
on the current master branch, there is a pretty high chance that you
wouldn't be able to create the SparkContext (stack trace at the end of the
email). The problem is that the Akka packaged with Scala 2.10 takes
precedence in the classloader over the special Akka version Spark includes.

Before we have a good solution for this, the workaround is to use java to
launch the application instead of scala. All you need to do is to include
the right Scala jars (scala-library and scala-compiler) in the classpath.
Note that the scala command is really just a simple script that calls java
with the right classpath.


Stack trace:

java.lang.NoSuchMethodException:
akka.remote.RemoteActorRefProvider.<init>(java.lang.String,
akka.actor.ActorSystem$Settings, akka.event.EventStream,
akka.actor.Scheduler, akka.actor.DynamicAccess)
at java.lang.Class.getConstructor0(Class.java:2763)
at java.lang.Class.getDeclaredConstructor(Class.java:2021)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:77)
at scala.util.Try$.apply(Try.scala:161)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:74)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
at scala.util.Success.flatMap(Try.scala:200)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:85)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:546)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:79)
at org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:120)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:106)


Re: spark.task.maxFailures

2013-12-16 Thread Reynold Xin
I just merged your pull request
https://github.com/apache/incubator-spark/pull/245


On Mon, Dec 16, 2013 at 2:12 PM, Grega Kešpret  wrote:

> Any news regarding this setting? Is this expected behaviour? Is there some
> other way I can have Spark fail-fast?
>
> Thanks!
>
> On Mon, Dec 9, 2013 at 4:35 PM, Grega Kešpret  wrote:
>
> > Hi!
> >
> > I tried this (by setting spark.task.maxFailures to 1) and it still does
> > not fail-fast. I started a job and after some time, I killed all JVMs
> > running on one of the two workers. I was expecting Spark job to fail,
> > however it re-fetched tasks to one of the two workers that was still
> alive
> > and the job succeeded.
> >
> > Grega
> >
>


Re: Sorry about business lately and general unavailability

2013-12-04 Thread Reynold Xin
Thanks for the update Chris.

We do need to graduate soon. People have been asking me does "incubating"
means the project is very immature. :(

One thing we need to do is to import the JIRA tickets from AMPLab's JIRA.
That INFRA ticket hasn't moved much along. Can you help push that?



On Wed, Dec 4, 2013 at 11:32 AM, Chris Mattmann  wrote:

> Hey Guys,
>
> Just wanted to apologize for the general lack of my availability
> lately. I thought moving from Rancho Cucamonga, to Pasadena, CA
> (over 50+ miles) wouldn't affect my productivity, and with that
> and the holidays, and all the house work and moving stuff I've had
> to do, coupled with $dayjob it's been tough.
>
> I'm slowly catching up and coming out of the fog though so just
> wanted to let you all know I'm going to be around and get back
> to helping out as a mentor. Amazing thing though is that you guys
> have really been kicking ass largely without me like you always do
> and operating like a great ASF project.
>
> I'd say you are headed for an early graduation and I will closely
> monitor things like adding more PPMC members and committers (saw
> you guys have been doing this), and also things like releases
> (that too), and just keep doing what you're doing and you'll be
> an ASF TLP shortly!
>
> Cheers mates and rock on.
>
> -Chris "Champion in absentia but back now" Mattmann
>
>
>
>
>
>


Re: PySpark / scikit-learn integration sprint at Cloudera - Strata Conference Friday 14th Feb 2014

2013-12-02 Thread Reynold Xin
Definitely some people will get confused. It's up to you. If we post it, we
can mark it in the title that this is a hackathon.


On Mon, Dec 2, 2013 at 1:43 PM, Olivier Grisel wrote:

> 2013/12/2 Reynold Xin :
> > Including the link to the meetup group:
> http://www.meetup.com/spark-users/
>
> I am not opposed to it but I am wondering if people will not confuse
> it with a traditional meetup if we do so.
>
> --
> Olivier
>


Re: PySpark / scikit-learn integration sprint at Cloudera - Strata Conference Friday 14th Feb 2014

2013-12-02 Thread Reynold Xin
Including the link to the meetup group: http://www.meetup.com/spark-users/


On Mon, Dec 2, 2013 at 1:22 PM, Reynold Xin  wrote:

> Olivier,
>
> Do you want us to create a Spark user meetup event for this hackathon?
>
> On Mon, Dec 2, 2013 at 1:12 PM, Olivier Grisel 
> wrote:
>
>> Hi all,
>>
>> Just a quick reply to say that I would be glad to meet some of you to
>> hack on some prototype scikit-learn / PySpark integration.
>>
>> Cloudera just confirmed that we have a room for us at their San
>> Fransisco offices on Friday Feb 14 (right after Strata).
>>
>> Hope to see you there or at Strata,
>>
>> --
>> Olivier
>>
>
>


Re: PySpark / scikit-learn integration sprint at Cloudera - Strata Conference Friday 14th Feb 2014

2013-12-02 Thread Reynold Xin
Olivier,

Do you want us to create a Spark user meetup event for this hackathon?

On Mon, Dec 2, 2013 at 1:12 PM, Olivier Grisel wrote:

> Hi all,
>
> Just a quick reply to say that I would be glad to meet some of you to
> hack on some prototype scikit-learn / PySpark integration.
>
> Cloudera just confirmed that we have a room for us at their San
> Fransisco offices on Friday Feb 14 (right after Strata).
>
> Hope to see you there or at Strata,
>
> --
> Olivier
>


Re: spark.task.maxFailures

2013-11-29 Thread Reynold Xin
Looks like a bug to me. Can you submit a pull request?



On Fri, Nov 29, 2013 at 2:02 AM, Grega Kešpret  wrote:

> Looking at
> http://spark.incubator.apache.org/docs/latest/configuration.html
> docs says:
> Number of individual task failures before giving up on the job. Should be
> greater than or equal to 1. Number of allowed retries = this value - 1.
>
> However, looking at the code
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L532
>
> if I set spark.task.maxFailures to 1, this means that job will fail after
> task fails for the second time. Shouldn't this line be corrected to if (
> numFailures(index) >= MAX_TASK_FAILURES) {
> ?
>
> I can open a pull request if this is the case.
>
> Thanks,
> Grega
> --
> *Grega Kešpret*
> Analytics engineer
>
> Celtra — Rich Media Mobile Advertising
> celtra.com  | 
> @celtramobile
>
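
For reference, a minimal Scala sketch of the off-by-one Grega is describing
(illustrative names only, not the actual ClusterTaskSetManager code), suitable
for pasting into a Scala REPL. With the current strict comparison,
spark.task.maxFailures = 1 still allows one retry, while the proposed >= check
matches the documented "allowed retries = this value - 1":

    val maxTaskFailures = 1  // spark.task.maxFailures
    // Current check: abort only once the failure count exceeds the limit.
    def abortsNow(failures: Int)   = failures >  maxTaskFailures
    // Proposed check: abort as soon as the limit is reached.
    def abortsFixed(failures: Int) = failures >= maxTaskFailures

    assert(!abortsNow(1) && abortsNow(2))  // today: the job survives the first failure
    assert(abortsFixed(1))                 // fixed: first failure aborts, i.e. zero retries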


Re: Problem with tests

2013-11-24 Thread Reynold Xin
Take a look at this pull request and see if it fixes your problem:
https://github.com/apache/incubator-spark/pull/201

I changed the semantics of the index from the output partition index back
to the rdd partition index.



On Sat, Nov 23, 2013 at 10:01 PM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> Though I think it's a more general problem...
>
> Take the following:
>
> val data = sc.parallelize(Range(0, 8), 2)
> val data2 = data.mapPartitionsWithIndex((index, i) => i.map(x => (x,
> index)))
>
> data2.collect
>   res0: Array[(Int, Int)] = Array((0,0), (1,0), (2,0), (3,0), (4,1), (5,1),
> (6,1), (7,1))
>
> new org.apache.spark.rdd.PartitionPruningRDD(data2, n => 1 == n).collect
>   res1: Array[(Int, Int)] = Array((4,0), (5,0), (6,0), (7,0))
>
> So, in this case, pruning the RDD has changed the data within it.  This
> seems to be what is causing my errors.
>
>
>
> On Sat, Nov 23, 2013 at 8:00 AM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
> > https://github.com/apache/incubator-spark/pull/18
> >
> >
> > On Fri, Nov 22, 2013 at 6:35 PM, Reynold Xin  wrote:
> >
> >> Can you provide a link to your pull request?
> >>
> >>
> >> On Sat, Nov 23, 2013 at 5:02 AM, Nathan Kronenfeld <
> >> nkronenf...@oculusinfo.com> wrote:
> >>
> >> > Actually, looking into recent commits, it looks like my hunch may be
> >> > exactly correct:
> >> >
> >> >
> >>
> https://github.com/apache/incubator-spark/commit/f639b65eabcc8666b74af8f13a37c5fdf7e0185f
> >> > "PartitionPruningRDD is using index from parent"
> >> >
> >> > Is there anyone who can explain why this new behavior is preferable?
> >>  And,
> >> > if it's staying, can suggest a way to fix my tests for this case?
> >> >
> >> > Thanks again,
> >> >  Nathan
> >> >
> >> >
> >> > On Fri, Nov 22, 2013 at 3:56 PM, Nathan Kronenfeld <
> >> > nkronenf...@oculusinfo.com> wrote:
> >> >
> >> > > Hi there.
> >> > >
> >> > > I have a problem with the unit tests on a pull request I'm trying to
> >> tie
> >> > > up.  The changes deal with partition-related functions.
> >> > >
> >> > > In particular, the tests I have that test an append-to-partition
> >> function
> >> > > work fine on my own machine, but fail on the build machine (
> >> > >
> >> >
> >>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2152/console
> >> > > ).
> >> > >
> >> > > The failure seems to stem from pulling a single partition out of the
> >> set.
> >> > > In either case, when I work on the full dataset:
> >> > >
> >> > > UnionRDD[11] at apply at FunSuite.scala:1265 (4 partitions)
> >> > >   UnionRDD[9] at apply at FunSuite.scala:1265 (3 partitions)
> >> > > ParallelCollectionRDD[8] at apply at FunSuite.scala:1265 (1
> >> > partitions)
> >> > > MapPartitionsWithContextRDD[7] at apply at FunSuite.scala:1265
> (2
> >> > partitions)
> >> > >   ParallelCollectionRDD[4] at apply at FunSuite.scala:1265 (2
> >> > partitions)
> >> > >   ParallelCollectionRDD[10] at apply at FunSuite.scala:1265 (1
> >> > partitions)
> >> > >
> >> > >
> >> > > It seems to work.  When I pull one partition out of this, by
> wrapping
> >> a
> >> > PartitionPruningRDD around it (pruning out everything but partition
> 2):
> >> > >
> >> > > PartitionPruningRDD[12] at apply at FunSuite.scala:1265 (1
> partitions)
> >> > >   UnionRDD[11] at apply at FunSuite.scala:1265 (4 partitions)
> >> > > UnionRDD[9] at apply at FunSuite.scala:1265 (3 partitions)
> >> > >   ParallelCollectionRDD[8] at apply at FunSuite.scala:1265 (1
> >> > partitions)
> >> > >   MapPartitionsWithContextRDD[7] at apply at FunSuite.scala:1265
> >> (2
> >> > partitions)
> >> > > ParallelCollectionRDD[4] at apply at FunSuite.scala:1265 (2
> >> > partitions)
> >> > > ParallelCollectionRDD[10] at apply at FunSuite.scala:1265 (1
> >> > partitions)
> >> > >
> >> > >
> >> > > In this case, my l

Re: Problem with tests

2013-11-22 Thread Reynold Xin
Can you provide a link to your pull request?


On Sat, Nov 23, 2013 at 5:02 AM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> Actually, looking into recent commits, it looks like my hunch may be
> exactly correct:
>
> https://github.com/apache/incubator-spark/commit/f639b65eabcc8666b74af8f13a37c5fdf7e0185f
> "PartitionPruningRDD is using index from parent"
>
> Is there anyone who can explain why this new behavior is preferable?  And,
> if it's staying, can suggest a way to fix my tests for this case?
>
> Thanks again,
>  Nathan
>
>
> On Fri, Nov 22, 2013 at 3:56 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
> > Hi there.
> >
> > I have a problem with the unit tests on a pull request I'm trying to tie
> > up.  The changes deal with partition-related functions.
> >
> > In particular, the tests I have that test an append-to-partition function
> > work fine on my own machine, but fail on the build machine (
> >
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2152/console
> > ).
> >
> > The failure seems to stem from pulling a single partition out of the set.
> > In either case, when I work on the full dataset:
> >
> > UnionRDD[11] at apply at FunSuite.scala:1265 (4 partitions)
> >   UnionRDD[9] at apply at FunSuite.scala:1265 (3 partitions)
> > ParallelCollectionRDD[8] at apply at FunSuite.scala:1265 (1
> partitions)
> > MapPartitionsWithContextRDD[7] at apply at FunSuite.scala:1265 (2
> partitions)
> >   ParallelCollectionRDD[4] at apply at FunSuite.scala:1265 (2
> partitions)
> >   ParallelCollectionRDD[10] at apply at FunSuite.scala:1265 (1
> partitions)
> >
> >
> > It seems to work.  When I pull one partition out of this, by wrapping a
> PartitionPruningRDD around it (pruning out everything but partition 2):
> >
> > PartitionPruningRDD[12] at apply at FunSuite.scala:1265 (1 partitions)
> >   UnionRDD[11] at apply at FunSuite.scala:1265 (4 partitions)
> > UnionRDD[9] at apply at FunSuite.scala:1265 (3 partitions)
> >   ParallelCollectionRDD[8] at apply at FunSuite.scala:1265 (1
> partitions)
> >   MapPartitionsWithContextRDD[7] at apply at FunSuite.scala:1265 (2
> partitions)
> > ParallelCollectionRDD[4] at apply at FunSuite.scala:1265 (2
> partitions)
> > ParallelCollectionRDD[10] at apply at FunSuite.scala:1265 (1
> partitions)
> >
> >
> > In this case, my local machine and the build machine seem to act
> > differently.
> >
> > On my local machine, what is in the inner ParallelCollection partition #2
> > shows up in the MapPartitionsWithContextRDD as partition #2 still.  On
> the
> > build machine, this same partition shows up in the later RDD as partition
> > #0 - presumably because everything else is pruned out, but that pruning
> > should happen at an outer level, shouldn't it?
> >
> > Does anyone know why the build machine would act different from locally
> > here?
> >
> > Also, sadly, this worked fine two days ago.
> >
> > My only thought is that perhaps the PullRequestBuilder does a merge with
> > current code, and someone broke this in the last day or two?  Past that,
> > I'm at a bit of a loss.
> >
> > Thanks,
> > -Nathan
> >
> >
> > --
> >
> > Nathan Kronenfeld
> > Senior Visualization Developer
> > Oculus Info Inc
> > 2 Berkeley Street, Suite 600,
> > Toronto, Ontario M5A 4J5
> > Phone:  +1-416-203-3003 x 238
> > Email:  nkronenf...@oculusinfo.com
> >
>
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
>
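
To make the index semantics concrete, here is a minimal sketch of the scenario
above that can be pasted into a Spark shell (it assumes a SparkContext named
sc). The expected outputs are taken from Nathan's report and from the behaviour
pull request 201 restores, not from a fresh run:

    import org.apache.spark.rdd.PartitionPruningRDD

    val data  = sc.parallelize(0 until 8, 2)
    // Tag each element with the index of the parent partition that produced it.
    val data2 = data.mapPartitionsWithIndex((index, it) => it.map(x => (x, index)))

    data2.collect()
    // Array((0,0), (1,0), (2,0), (3,0), (4,1), (5,1), (6,1), (7,1))

    // Keep only the parent partition with index 1.
    val pruned = new PartitionPruningRDD(data2, idx => idx == 1)

    // Before the fix, recomputing data2 through the pruned RDD passed the
    // renumbered output index (0), giving (4,0), (5,0), (6,0), (7,0).
    // With the index restored to the parent RDD's partition index, the same
    // collect should give (4,1), (5,1), (6,1), (7,1).
    pruned.collect()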


Re: issue regarding akka, protobuf and Hadoop version

2013-11-06 Thread Reynold Xin
That is correct. However, there is no guarantee right now that Akka 2.3
will work correctly for us. We haven't tested it enough yet (or rather, we
haven't tested it at all). E.g. see:
https://github.com/apache/incubator-spark/pull/131

We want to make Spark 0.9.0 based on Scala 2.10, but we have also been
discussing ideas to make a Scala 2.10 version of Spark 0.8.x so that users
can move to Scala 2.10 earlier if they want.


On Wed, Nov 6, 2013 at 12:29 AM, Sandy Ryza  wrote:

> For my own understanding, is this summary correct?
> Spark will move to scala 2.10, which means it can support akka 2.3-M1,
> which supports protobuf 2.5, which will allow Spark to run on Hadoop 2.2.
>
> What will be the first Spark version with these changes?  Are the Akka
> features that Spark relies on stable in 2.3-M1?
>
> thanks,
> Sandy
>
>
>
> On Tue, Nov 5, 2013 at 12:12 AM, Liu, Raymond 
> wrote:
>
> > Just pushed a pull request which based on scala 2.10 branch for hadoop
> > 2.2.0.
> > Yarn-standalone mode workable, but need a few more fine tune works.
> > Not really for pull, but as a placeholder, and for someone who want to
> > take a look.
> >
> > Best Regards,
> > Raymond Liu
> >
> >
> > -Original Message-
> > From: Reynold Xin [mailto:r...@apache.org]
> > Sent: Tuesday, November 05, 2013 10:07 AM
> > To: dev@spark.incubator.apache.org
> > Subject: Re: issue regarding akka, protobuf and Hadoop version
> >
> > I think we are near the end of Scala 2.9.3 development, and will merge
> the
> > Scala 2.10 branch into master and make it the future very soon (maybe
> next
> > week).  This problem will go away.
> >
> > Meantime, we are relying on periodically merging the master into the
> Scala
> > 2.10 branch.
> >
> >
> > On Mon, Nov 4, 2013 at 5:53 PM, Liu, Raymond 
> > wrote:
> >
> > > I plan to do the work on scala-2.10 branch, which already move to akka
> > > 2.2.3, hope that to move to akka 2.3-M1 (which support protobuf 2.5.x)
> > > will not cause many problem and make it a test to see is there further
> > > issues, then wait for the formal release of akka 2.3.x
> > >
> > > While the issue is that I can see many commits on master branch is not
> > > merged into scala-2.10 branch yet. The latest merge seems to happen on
> > > OCT.11, while as I mentioned in the dev branch merge/sync thread,
> > > seems that many earlier commit is not included and which will surely
> > > bring extra works on future code merging/rebase. So again, what's the
> > > code sync strategy and what's the plan of merge back into master?
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > >
> > > -Original Message-
> > > From: Reynold Xin [mailto:r...@apache.org]
> > > Sent: Tuesday, November 05, 2013 8:34 AM
> > > To: dev@spark.incubator.apache.org
> > > Subject: Re: issue regarding akka, protobuf and Hadoop version
> > >
> > > I chatted with Matt Massie about this, and here are some options:
> > >
> > > 1. Use dependency injection in google-guice to make Akka use one
> > > version of protobuf, and YARN use the other version.
> > >
> > > 2. Look into OSGi to accomplish the same goal.
> > >
> > > 3. Rewrite the messaging part of Spark to use a simple, custom RPC
> > > library instead of Akka. We are really only using a very simple subset
> > > of Akka features, and we can probably implement a simple RPC library
> > > tailored for Spark quickly. We should only do this as the last resort.
> > >
> > > 4. Talk to Akka guys and hope they can make a maintenance release of
> > > Akka that supports protobuf 2.5.
> > >
> > >
> > > None of these are ideal, but we'd have to pick one. It would be great
> > > if you have other suggestions.
> > >
> > >
> > > On Sun, Nov 3, 2013 at 11:46 PM, Liu, Raymond 
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I am working on porting spark onto Hadoop 2.2.0, With some
> > > > renaming and call into new YARN API works done. I can run up the
> > > > spark master. While I encounter the issue that Executor Actor could
> > > > not connecting to Driver actor.
> > > >
> > > > After some investigation, I found the root cause is that the
> > > > akka-remote do not support protobuf 2.5.0 before 2.3. And hadoop
> > > > move to protobuf 2.5.0 from 2.1-beta.
> > > >
> > > > The issue is that if I exclude the akka dependency from
> > > > hadoop and force protobuf dependency to 2.4.1, the compile/packing
> > > > will fail since hadoop common jar require a new interface from
> > protobuf 2.5.0.
> > > >
> > > >  So any suggestion on this?
> > > >
> > > > Best Regards,
> > > > Raymond Liu
> > > >
> > >
> >
>
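
For illustration, a minimal sbt sketch of the kind of exclude-and-pin attempt
Raymond describes (coordinates here are assumptions, and, as he notes, this
fails at compile/packaging time because hadoop-common 2.2.0 requires APIs that
only exist in protobuf 2.5.0):

    // In SparkBuild.scala or build.sbt: drop Hadoop's protobuf and pin the
    // version that akka-remote (pre-2.3) can live with.
    libraryDependencies ++= Seq(
      ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
        .exclude("com.google.protobuf", "protobuf-java"),
      ("com.google.protobuf" % "protobuf-java" % "2.4.1").force()
    )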


Re: appId is no longer in the command line args for StandaloneExecutor

2013-11-05 Thread Reynold Xin
+aaron on this one since he changed the executor runner. (I think it is
probably an oversight but Aaron should confirm.)




On Tue, Nov 5, 2013 at 10:44 AM, Imran Rashid  wrote:

> Hi,
>
> a while back, ExecutorRunner was changed so the command line args included
> the appId.
>
> https://github.com/mesos/spark/pull/467
>
> Those changes seem to be gone from the latest code.  Was that intentional,
> or just an oversight?  I'll add it back in if it was removed accidentally,
> but wanted to check in case there is some reason it shouldn't be there.
>
> thanks,
> Imran
>


Re: issue regarding akka, protobuf and Hadoop version

2013-11-04 Thread Reynold Xin
I think we are near the end of Scala 2.9.3 development, and will merge the
Scala 2.10 branch into master and make it the future very soon (maybe next
week).  This problem will go away.

Meantime, we are relying on periodically merging the master into the Scala
2.10 branch.


On Mon, Nov 4, 2013 at 5:53 PM, Liu, Raymond  wrote:

> I plan to do the work on scala-2.10 branch, which already move to akka
> 2.2.3, hope that to move to akka 2.3-M1 (which support protobuf 2.5.x) will
> not cause many problem and make it a test to see is there further issues,
> then wait for the formal release of akka 2.3.x
>
> While the issue is that I can see many commits on master branch is not
> merged into scala-2.10 branch yet. The latest merge seems to happen on
> OCT.11, while as I mentioned in the dev branch merge/sync thread, seems
> that many earlier commit is not included and which will surely bring extra
> works on future code merging/rebase. So again, what's the code sync
> strategy and what's the plan of merge back into master?
>
> Best Regards,
> Raymond Liu
>
>
> -Original Message-
> From: Reynold Xin [mailto:r...@apache.org]
> Sent: Tuesday, November 05, 2013 8:34 AM
> To: dev@spark.incubator.apache.org
> Subject: Re: issue regarding akka, protobuf and Hadoop version
>
> I chatted with Matt Massie about this, and here are some options:
>
> 1. Use dependency injection in google-guice to make Akka use one version
> of protobuf, and YARN use the other version.
>
> 2. Look into OSGi to accomplish the same goal.
>
> 3. Rewrite the messaging part of Spark to use a simple, custom RPC library
> instead of Akka. We are really only using a very simple subset of Akka
> features, and we can probably implement a simple RPC library tailored for
> Spark quickly. We should only do this as the last resort.
>
> 4. Talk to Akka guys and hope they can make a maintenance release of Akka
> that supports protobuf 2.5.
>
>
> None of these are ideal, but we'd have to pick one. It would be great if
> you have other suggestions.
>
>
> On Sun, Nov 3, 2013 at 11:46 PM, Liu, Raymond 
> wrote:
>
> > Hi
> >
> > I am working on porting spark onto Hadoop 2.2.0, With some
> > renaming and call into new YARN API works done. I can run up the spark
> > master. While I encounter the issue that Executor Actor could not
> > connecting to Driver actor.
> >
> > After some investigation, I found the root cause is that the
> > akka-remote do not support protobuf 2.5.0 before 2.3. And hadoop move
> > to protobuf 2.5.0 from 2.1-beta.
> >
> > The issue is that if I exclude the akka dependency from hadoop
> > and force protobuf dependency to 2.4.1, the compile/packing will fail
> > since hadoop common jar require a new interface from protobuf 2.5.0.
> >
> >  So any suggestion on this?
> >
> > Best Regards,
> > Raymond Liu
> >
>


Re: issue regarding akka, protobuf and Hadoop version

2013-11-04 Thread Reynold Xin
Adding in a few guys so they can chime in.


On Mon, Nov 4, 2013 at 4:33 PM, Reynold Xin  wrote:

> I chatted with Matt Massie about this, and here are some options:
>
> 1. Use dependency injection in google-guice to make Akka use one version
> of protobuf, and YARN use the other version.
>
> 2. Look into OSGi to accomplish the same goal.
>
> 3. Rewrite the messaging part of Spark to use a simple, custom RPC library
> instead of Akka. We are really only using a very simple subset of Akka
> features, and we can probably implement a simple RPC library tailored for
> Spark quickly. We should only do this as the last resort.
>
> 4. Talk to Akka guys and hope they can make a maintenance release of Akka
> that supports protobuf 2.5.
>
>
> None of these are ideal, but we'd have to pick one. It would be great if
> you have other suggestions.
>
>
> On Sun, Nov 3, 2013 at 11:46 PM, Liu, Raymond wrote:
>
>> Hi
>>
>> I am working on porting spark onto Hadoop 2.2.0, With some
>> renaming and call into new YARN API works done. I can run up the spark
>> master. While I encounter the issue that Executor Actor could not
>> connecting to Driver actor.
>>
>> After some investigation, I found the root cause is that the
>> akka-remote do not support protobuf 2.5.0 before 2.3. And hadoop move to
>> protobuf 2.5.0 from 2.1-beta.
>>
>> The issue is that if I exclude the akka dependency from hadoop
>> and force protobuf dependency to 2.4.1, the compile/packing will fail since
>> hadoop common jar require a new interface from protobuf 2.5.0.
>>
>>  So any suggestion on this?
>>
>> Best Regards,
>> Raymond Liu
>>
>
>


Re: issue regarding akka, protobuf and Hadoop version

2013-11-04 Thread Reynold Xin
I chatted with Matt Massie about this, and here are some options:

1. Use dependency injection in google-guice to make Akka use one version of
protobuf, and YARN use the other version.

2. Look into OSGi to accomplish the same goal.

3. Rewrite the messaging part of Spark to use a simple, custom RPC library
instead of Akka. We are really only using a very simple subset of Akka
features, and we can probably implement a simple RPC library tailored for
Spark quickly. We should only do this as the last resort.

4. Talk to Akka guys and hope they can make a maintenance release of Akka
that supports protobuf 2.5.


None of these are ideal, but we'd have to pick one. It would be great if
you have other suggestions.


On Sun, Nov 3, 2013 at 11:46 PM, Liu, Raymond  wrote:

> Hi
>
> I am working on porting spark onto Hadoop 2.2.0, With some
> renaming and call into new YARN API works done. I can run up the spark
> master. While I encounter the issue that Executor Actor could not
> connecting to Driver actor.
>
> After some investigation, I found the root cause is that the
> akka-remote do not support protobuf 2.5.0 before 2.3. And hadoop move to
> protobuf 2.5.0 from 2.1-beta.
>
> The issue is that if I exclude the akka dependency from hadoop and
> force protobuf dependency to 2.4.1, the compile/packing will fail since
> hadoop common jar require a new interface from protobuf 2.5.0.
>
>  So any suggestion on this?
>
> Best Regards,
> Raymond Liu
>


Re: SPARK-942

2013-11-03 Thread Reynold Xin
It's not a very elegant solution, but one possibility is for the
CacheManager to check whether it will have enough space. If it is running
out of space, it can skip buffering the output of the iterator and write that
output directly to disk (if the storage level allows that).

But it is still tricky to know whether we will run out of space before we
even start running the iterator. One possibility is to use sizing data from
previous partitions to estimate the size of the current partition (i.e.,
estimated in-memory size = input size x the average ratio of in-memory size to
input size observed so far).

Do you have any ideas on this one, Kyle?


On Sat, Oct 26, 2013 at 10:53 AM, Kyle Ellrott wrote:

> I was wondering if anybody had any thoughts on the best way to tackle
> SPARK-942 ( https://spark-project.atlassian.net/browse/SPARK-942 ).
> Basically, Spark takes an iterator from a flatmap call and because I tell
> it that it needs to persist Spark proceeds to push it all into an array
> before deciding that it doesn't have enough memory and trying to serialize
> it to disk, and somewhere along the line it runs out of memory. For my
> particular operation, the function return an iterator that reads data out
> of a file, and the size of the files passed to that function can vary
> greatly (from a few kilobytes to a few gigabytes). The funny thing is that
> if I do a strait 'map' operation after the flat map, everything works,
> because Spark just passes the iterator forward and never tries to expand
> the whole thing into memory. But I need do a reduceByKey across all the
> records, so I'd like to persist to disk first, and that is where I hit this
> snag.
> I've already setup a unit test to replicate the problem, and I know the
> area of the code that would need to be fixed.
> I'm just hoping for some tips on the best way to fix the problem.
>
> Kyle
>
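
As a rough illustration of the estimation idea (purely a sketch with made-up
names, not the actual CacheManager code), one could keep a running ratio of
in-memory size to input size over the partitions seen so far and use it to
decide whether to buffer or stream straight to disk:

    class PartitionSizeEstimator {
      private var totalInputBytes    = 0L
      private var totalInMemoryBytes = 0L

      /** Record the observed sizes of a partition that has been materialized. */
      def record(inputBytes: Long, inMemoryBytes: Long): Unit = {
        totalInputBytes    += inputBytes
        totalInMemoryBytes += inMemoryBytes
      }

      /** Estimate the in-memory size of a partition from its input size. */
      def estimate(inputBytes: Long): Long =
        if (totalInputBytes == 0) inputBytes  // no history yet: assume 1:1
        else (inputBytes.toDouble * totalInMemoryBytes / totalInputBytes).toLong
    }

    // Hypothetical use inside the cache path: if estimate(inputSize) exceeds
    // the memory the block manager can spare, skip buffering the iterator and
    // write it straight to disk (when the storage level allows it).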


Re: Are we moving too fast or too far on 0.8.1-SNAPSHOT?

2013-10-28 Thread Reynold Xin
Hi Mark,

I can't comment much on the Spark part right now (because I have to run in
3 mins), but we will make Shark 0.8.1 work with Spark 0.8.1 for sure. Some
of the changes will get cherry picked into branch-0.8 of Shark.


On Mon, Oct 28, 2013 at 6:22 PM, Mark Hamstra wrote:

> Or more to the point: What is our commitment to backward compatibility in
> point releases?
>
> Many Java developers will come to a library or platform versioned as x.y.z
> with the expectation that if their own code worked well using x.y.(z-1) as
> a dependency, then moving up to x.y.z will be painless and trivial.  That
> is not looking like it will be the case for Spark 0.8.0 and 0.8.1.
>
> We only need to look at Shark as an example of code built with a dependency
> on Spark to see the problem.  Shark 0.8.0 works with Spark 0.8.0.  Shark
> 0.8.0 does not build with Spark 0.8.1-SNAPSHOT.  Presumably that lack of
> backwards compatibility will continue into the eventual release of Spark
> 0.8.1, and that makes life hard on developers using Spark and Shark.  For
> example, a developer using the released version of Shark but wanting to
> pick up the bug fixes in Spark doesn't have a good option anymore since
> 0.8.1-SNAPSHOT (or the eventual 0.8.1 release) doesn't work, and moving to
> the wild and woolly development on the master branches of Spark and Shark
> is not a good idea for someone trying to develop production code.  In other
> words, all of the bug fixes in Spark 0.8.1 are not accessible to this
> developer until such time as there are available 0.8.1-compatible versions
> of Shark and anything else built on Spark that this developer is using.
>
> The only other option is trying to cherry-pick commits from, e.g., Shark
> 0.9.0-SNAPSHOT into Shark 0.8.0 until Shark 0.8.0 has been brought up to a
> point where it works with Spark 0.8.1.  But an application developer
> shouldn't need to do that just to get the bug fixes in Spark 0.8.1, and it
> is not immediately obvious just which Shark commits are necessary and
> sufficient to produce a correct, Spark-0.8.1-compatible version of Shark
> (indeed, there is no guarantee that such a thing is even possible.)  Right
> now, I believe that 67626ae3eb6a23efc504edf5aedc417197f072cf,
> 488930f5187264d094810f06f33b5b5a2fde230a and
> bae19222b3b221946ff870e0cee4dba0371dea04 are necessary to get Shark to work
> with Spark 0.8.1-SNAPSHOT, but that those commits are not sufficient (Shark
> builds against Spark 0.8.1-SNAPSHOT with those cherry-picks, but I'm still
> seeing runtime errors.)
>
> In short, this is not a good situation, and we probably need a real 0.8
> maintenance branch that maintains backward compatibility with 0.8.0,
> because (at least to me) the current branch-0.8 of Spark looks more like
> another active development branch (in addition to the master and scala-2.10
> branches) than it does a maintenance branch.
>


Re: help me with setting up IntelliJ Idea development IDE for Spark

2013-10-27 Thread Reynold Xin
Just generate the IntelliJ project file using

sbt/sbt gen-idea

And then open the folder in IntelliJ (no need to import anything).



On Sun, Oct 27, 2013 at 8:31 PM, dachuan  wrote:

> Hi, all,
>
> Could anybody help me set up the dev IDE for spark in IntelliJ idea IDE?
>
> I have already installed the scala plugin, and imported the
> incubator-spark/ project. The syntax highlight works for now.
>
> The problem is: It can not resolve symbol such as SparkEnv, which is
> internal object for spark.
>
> I count on this for jumping, otherwise I can simply use Vim.
>
> And I am pretty new to maven, embarrassing to say.
>
> thanks,
> dachuan.
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>


Re: Documentation of Java API and PySpark internals

2013-10-23 Thread Reynold Xin
Thanks, Josh. These are very useful for people to understand the APIs and
to write new language bindings.


On Wed, Oct 23, 2013 at 8:57 PM, Josh Rosen  wrote:

> I've created two new pages on the Spark wiki to document the internals of
> the Java and Python APIs:
>
> https://cwiki.apache.org/confluence/display/SPARK/Java+API+Internals
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>
> These are only rough drafts; please let me know if there's anything that
> you'd like me to document (or feel free to add it yourself!).
>
> - Josh
>


Re: Is there any MLlib SVM Reference Paper

2013-10-21 Thread Reynold Xin
It is fairly simple and just runs mini-batch SGD. You can actually just
look at the code.

https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/SVM.scala



On Mon, Oct 21, 2013 at 10:37 PM, Sarath P R wrote:

> Hi All,
>
> I Would like to know if there is any reference paper for SVM, which is
> implemented in MLlib.
>
> Please help. Thanks in Advance
>
> --
> Thank You
> Sarath P R
> Technical Lead
> Amrita Center for Cyber Security | Amrita Vishwa Vidyapeetham | Amritapuri
> Campus
> Contact +91 99 95 02 4287 | Twitter  |
> Blog
>
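
For readers who do not want to dig through the file right away, here is a
minimal, self-contained Scala sketch of the same idea -- mini-batch SGD on the
hinge loss with simple L2 shrinkage -- using local arrays and illustrative
names rather than MLlib's actual API (labels are assumed to be -1 or +1):

    import scala.util.Random

    // Subgradient of max(0, 1 - y * w.x) with respect to w.
    def hingeGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
      val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
      if (margin >= 1.0) Array.fill(w.length)(0.0) else x.map(xi => -y * xi)
    }

    def trainSVM(data: Seq[(Double, Array[Double])], numFeatures: Int,
                 numIterations: Int = 100, stepSize: Double = 1.0,
                 miniBatchFraction: Double = 0.1, regParam: Double = 0.01): Array[Double] = {
      val rand = new Random(42)
      var w = Array.fill(numFeatures)(0.0)
      for (iter <- 1 to numIterations) {
        // Sample a mini-batch of (label, features) pairs.
        val batch = data.filter(_ => rand.nextDouble() < miniBatchFraction)
        if (batch.nonEmpty) {
          // Average hinge subgradient over the sampled mini-batch.
          val grad = batch.map { case (y, x) => hingeGradient(w, x, y) }
            .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
            .map(_ / batch.size)
          val step = stepSize / math.sqrt(iter)
          // Gradient step plus a simple L2 regularization term.
          w = w.zip(grad).map { case (wi, gi) => wi - step * (gi + regParam * wi) }
        }
      }
      w
    }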


Re: SPARK-883

2013-10-17 Thread Reynold Xin
Thanks. I just closed the issue.


On Thu, Oct 17, 2013 at 12:43 AM, karthik tunga wrote:

> Hi,
>
> Is SPARK-883  still
> open ? I already see lift-json dependency in pom.xml and didn't find any
> reference to "scala.util.parsing.json".
>
> Cheers,
> Karthik
>


Re: Experimental Scala-2.10.3 branch based on master

2013-10-04 Thread Reynold Xin
Hi Martin,

Thanks for updating us. Prashant has also been updating the scala 2.10
branch at https://github.com/mesos/spark/tree/scala-2.10

Did you take a look at his work?


On Fri, Oct 4, 2013 at 8:01 AM, Martin Weindel wrote:

> Here you can find an experimental branch of Spark for Scala 2.10.
>
> https://github.com/MartinWeindel/incubator-spark/tree/0.9_Scala-2.10.3
>
> I have also updated Akka to version 2.1.4.
>
> The branch compiles with both sbt and mvn, but there are a few tests which
> are failing and, even worse, producing deadlocks.
>
> Also there are a lot of warnings, most related to usage of ClassManifest,
> which should be replaced with ClassTag.
> But I don't think it is a good idea to fix these warnings at the moment,
> as this would make merging with the master branch harder.
>
> I would like to know about the official road map for supporting Scala 2.10.
> Does it make sense to investigate the test problems in more details on my
> experimental branch?
>
> Best regards,
> Martin
>
>
> P.S.: Below are the failing tests (probably not complete because of the
> deadlocks)
>
>
> DriverSuite:
> - driver should exit after finishing *** FAILED ***
>   TestFailedDueToTimeoutException was thrown during property
> evaluation. (DriverSuite.scala:36)
>   Message: The code passed to failAfter did not complete within 30 seconds.
>   Location: (DriverSuite.scala:37)
>   Occurred at table row 0 (zero based, not counting headings), which had
> values (
> master = local
>   )
>
> UISuite:
> - jetty port increases under contention *** FAILED ***
>   java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
>   at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
>   at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:286)
>   ...
>
> AccumulatorSuite:
> - add value to collection accumulators *** FAILED ***
>   org.apache.spark.SparkException: Job failed: Task not serializable:
> java.io.NotSerializableException: org.scalatest.Engine
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:762)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:760)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:555)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:502)
>   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:360)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:440)
>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:148)
>   ...
> - localValue readable in tasks *** FAILED ***
>   org.apache.spark.SparkException: Job failed: Task not serializable:
> java.io.NotSerializableException: org.scalatest.Engine
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:762)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:760)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:555)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala
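
One likely (though unverified here) explanation for the "Task not serializable
... org.scalatest.Engine" failures is the usual closure-capture pattern: a
closure passed to an RDD operation references a field of the test suite, so
Spark tries to serialize the whole suite, which holds ScalaTest's Engine. A
minimal sketch of that pattern and the usual workaround, with a hypothetical
suite name (not the actual AccumulatorSuite code):

    class CaptureSuite extends org.scalatest.FunSuite {
      val factor = 2  // referencing this field from a closure captures `this`

      test("avoid capturing the suite") {
        // rdd.map(_ * factor)        // would serialize CaptureSuite -> NotSerializableException
        val localFactor = factor      // copy the field into a local val first
        // rdd.map(_ * localFactor)   // the closure now only captures an Int
      }
    }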

Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-26 Thread Reynold Xin
Is there a way we can track the number of downloads with the apache mirrors?

On Thursday, September 26, 2013, Chris Mattmann wrote:

> Hey Matei yep they have the signatures on them too.
>
> Cheers,
> Chris
>
>
> -Original Message-
> From: Matei Zaharia >
> Reply-To: "dev@spark.incubator.apache.org " <
> dev@spark.incubator.apache.org >
> Date: Thursday, September 26, 2013 8:11 PM
> To: "dev@spark.incubator.apache.org" 
> Subject: Re: Spark 0.8.0: bits need to come from ASF infrastructure
>
> >Maybe we can replace the link to "official Apache download site" in the
> >release notes to point to the mirrors? Do the mirrors all have signatures
> >on them too?
> >
> >Matei
> >
> >On Sep 26, 2013, at 10:59 PM, Andy Konwinski 
> >wrote:
> >
> >> Thanks Roman and Chris,
> >>
> >> I see here http://www.apache.org/dev/release.html#mirroring that
> >>"Project
> >> download pages must link to the mirrors" but I don't see anything about
> >> ordering.
> >>
> >> I'm definitely +1 for including a link to the apache mirrors as required
> >> and providing the Cloudfront link first since this seems to satisfy the
> >> apache requirements and provide a better experience for users.
> >>
> >> Patrick. Thanks again for all your hard work on this release and for
> >> pushing back on parts of the Apache process as you go. That's how
> >> do-ocracies stay healthy and evolve.
> >> On Sep 26, 2013 7:23 PM, "Mattmann, Chris A (398J)" <
> >> chris.a.mattm...@jpl.nasa.gov> wrote:
> >>
> >>> Hi Patrick will reply in more detail later but please know that
> >>>linking to
> >>> the apache download page is not a request it's a requirement. I will
> >>> explain more in a bit.
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Sep 26, 2013, at 8:09 PM, "Patrick Wendell" 
> >>>wrote:
> >>>
> >>>> Chris et al,
> >>>>
> >>>> I'm -1 on this because it has many negative consequences for our
> >>> existing users:
> >>>>
> >>>> 1. Users who do automated downloads based on our posted URL's (of
> >>>> which we get many thousands each release) will no longer work. Now if
> >>>> they do "wget XXX" with our posted link, it will fail in a weird way
> >>>> to due to the redirect page. Is there a version of the closer.cgi
> >>>> script which just performs 302 redirects instead of asking me to click
> >>>> on a link?
> >>>>
> >>>> 2. All other users have to click through an additional page to
> >>>> download the software.
> >>>>
> >>>> 3. Amazon Cloudfront is, as a whole, much more reliable and higher
> >>>> bandwidth than the mirror network.
> >>>>
> >>>> These are my concerns, that basically we're causing our users to have
> >>>> a much worse experience. I've identified these concerns with moving to
> >>>> the apache mirror, but perhaps I've overlooked some benefits that
> >>>> would counteract these. Are there benefits?
> >>>>
> >>>> I completely agree that we need to send users to the signatures and
> >>>> hashes at the Apache release site (to verify the release). So I did
> >>>> add the link to this directly adjacent to the download.
> >>>>
> >>>> - Patrick
> >>>>
> >>>> On Thu, Sep 26, 2013 at 3:50 PM, Chris Mattmann 
> >>> wrote:
> >>>>> Hey Guys,
> >>>>>
> >>>>> Yep the link should by the dyn/closer.cgi link on the website and +1
> >>>>> to Roman's comment about auditing spark-project.org links to be
> >>> replaced
> >>>>> with ASF counterparts.
> >>>>>
> >>>>> Cheers,
> >>>>> Chris
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Original Message



-- 

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


Re: Propose to Re-organize the scripts and configurations

2013-09-21 Thread Reynold Xin
Thanks, Shane. Can you also link to this mailing list discussion from the
JIRA ticket?


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Sat, Sep 21, 2013 at 9:01 PM, Shane Huang wrote:

> I summarized the opinions about Config in this post and added a comment on
> SPARK-544.
> Also post here below:
>
> 1) Define a Configuration class which contains all the options available
> for Spark application. A Configuration instance can be de-/serialized
> from/to a formatted file. Most of us tend to agree that Typesafe Config
> library is a good choice for the Configuration class.
> 2) Each application (SparkContext) has one Configuration instance and it is
> initialized by the application which creates it (either coded in app (apps
> could explicitly read from io stream or command line arguments), or system
> properties, or env vars).
> 3) For an application the overriding rule should be code > system
> properties > env vars. Over time we will deprecate the env vars and maybe
> even system properties.
> 4) When launching an Executor on a slave node, the Configuration is firstly
> initialized using the node-local configuration file as default (instead of
> the env vars at present), and then the Configuration passed from
> application driver context will override specific options specified in
> default. Certain options in app's Configuration will always override those
> in node-local, because these options need to be consistent across all
> the slave nodes, e.g. spark.serializer. In this case if any such options is
> not set in app's Config, a value will be provided by the system. On the
> other hand, some options in app's Config will never override those in
> node-local, as they're not meant to be set in the app, e.g. spark.local.dir
>
>
> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia  >wrote:
>
> > Hi Shane,
> >
> > I agree with all these points. Improving the configuration system is one
> > of the main things I'd like to have in the next release.
> >
> > > 1) Usually the application developers/users and platform administrators
> > > belongs to two teams. So it's better to separate the scripts used by
> > > administrators and application users, e.g. put them in sbin and bin
> > folders
> > > respectively
> >
> > Yup, right now we don't have any attempt to install on standard system
> > paths.
> >
> > > 3) If there are multiple ways to specify an option, an overriding rule
> > > should be present and should not be error-prone.
> >
> > Yes, I think this should always be Configuration class in code > system
> > properties > env vars. Over time we will deprecate the env vars and maybe
> > even system properties.
> >
> > > 4) Currently the options are set and get using System property. It's
> hard
> > > to manage and inconvenient for users. It's good to gather the options
> > into
> > > one file using format like xml or json.
> >
> > I think this is the main thing to do first -- pick one configuration
> class
> > and change the code to use this.
> >
> > > Our rough proposal:
> > >
> > >   - Scripts
> > >
> > >   1. make an "sbin" folder containing all the scripts for
> administrators,
> > >   specifically,
> > >  - all service administration scripts, i.e. start-*, stop-*,
> > >  slaves.sh, *-daemons, *-daemon scripts
> > >  - low-level or internally used utility scripts, i.e.
> > >  compute-classpath, spark-config, spark-class, spark-executor
> > >   2. make a "bin" folder containing all the scripts for application
> > >   developers/users, specifically,
> > >  - user level app  running scripts, i.e. pyspark, spark-shell, and
> we
> > >  propose to add a script "spark" for users to run applications
> (very
> > much
> > >  like spark-class but may add some more control or convenient
> > utilities)
> > >  - scripts for status checking, e.g. spark and hadoop version
> > >  checking, running applications checking, etc. We can make this a
> > separate
> > >  script or add functionality to "spark" script.
> > >   3. No wandering scripts outside the sbin and bin folders
> >
> > Makes sense.
> >
> > >   -  Configurations/Options and overriding rule
> > >
> > >   1. Define a Configuration class which contains all the options
> > available
> > >   for Spark application. A Configuration instance can be de-/serialized
> > >   from
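
To make the proposed overriding rule concrete, here is a minimal sketch using
the Typesafe Config library; the file and key names are illustrative
assumptions, not a settled Spark layout. Values set in application code win,
then JVM system properties, then the node-local defaults file:

    import com.typesafe.config.{Config, ConfigFactory}

    object ConfigSketch {
      /** Resolve a configuration with precedence: code > system properties > defaults. */
      def load(appOverrides: Config): Config =
        appOverrides
          .withFallback(ConfigFactory.systemProperties())
          .withFallback(ConfigFactory.parseResources("spark-defaults.conf"))
          .resolve()
    }

    // e.g. ConfigSketch.load(ConfigFactory.parseString(
    //   "spark.serializer = org.apache.spark.serializer.KryoSerializer"))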

Fwd: JVMs on single cores but parallel JVMs.

2013-09-21 Thread Reynold Xin
FYI

-- Forwarded message --
From: Kevin Burton 
Date: Sat, Sep 21, 2013 at 9:30 AM
Subject: Re: JVMs on single cores but parallel JVMs.
To: mechanical-sympa...@googlegroups.com


ok... so I'll rephrase this a bit.

You're essentially saying that GC and background threads will need to run
to prevent foreground threads from stalling. GC , network IO, background
filesystem log flushes, etc.

... and if you're only running on ONE core this will preempt your active
threads and will increase latency.

And I guess the theory is that if you have another core free, why not just
let that other core step in and help split the load so you can have a
smaller "stop the world" interval.

I guess that makes some sense and probably applies to a lot of workloads.

Some points:

 - in our load, we are usually about 100% CPU on the current thread, and
100% on the other CPU... so if we trigger GC in the core, the secondary
core isn't going to necessarily execute faster.  In fact it might execute
slower due to memory locality (depending on the configuration).  I think in
most situations, applications are over-provisioned to account for load
spikes so this setup might actually warrant deployment as it would work in
practice.

- This idea is partially a distributed computing fallacy.  This GC doesn't
scale to hundreds of cores...If you're on a 64 core machine splitting out
your VMs so they are smaller, with the entire working set local to that
CPU, and segmenting GC to that core, seems to make the most sense.  You
would have GC pauses but they would be 1/Nth (where N = num of cores) of
your entire application GC hit.

- You can still use a CMS approach here where you GC in the background,
it's just done on one core with another thread.

- GC isn't infinitely parallel... You aren't going to send part of your
heap over the network and do a map/reduce style GC across 1024 servers
within a cluster.  Data locality is important.  Keeping the JVMs small and
local to the core and having lots of them seems to make a lot of sense.

- the fewer JVMs you have the more JDK lock contention you can have.
 Things like SSL are still contended (yuk) ... though JDK 1.7 has
definitely improved the situation.

... one issue of course is that OpenJDK doesn't share the permanent
generation classes.  So you see like a 128MB hit per JVM.  This works out
to about $2 per month per JVM for us so not really the end of the world.

Kevin


On Saturday, September 21, 2013 8:39:33 AM UTC-7, Gil Tene wrote:
>
> Back to the original topic, running enough JVMs such that there is only 1
> core per JVM is not a good idea unless you can accept regular
> multi-tens-of-msec pauses even when no GC is occurring in your JVM. I'd
> recommend sizing for *at least* 2-3 cores per JVM unless you find those
> sort of glitches acceptable.
>
> The reasoning is:
>
> [assuming you are not isolating JVMs to dedicated cores that have nothing
> else running in them, whic has its own obvious problems]
>
> GC:
> even if you limit GC to using one thread, that one GC thread can be
> running concurrently with your actual application threads for long periods
> of time (e.g. during marking and sweeping in CMS, or during G1 marking). If
> there was only one core per JVM, then when any one JVM is active in GC at
> least one other JVM's application threads will have entire scheduling
> quantums stolen from it.
>
> Before people start thinking "this will be rare", let me point out that
> with many JVMs some GC is more likely to be active at any given time. E.g.
> If you ran 12 JVMs on a 12 vcore machine, and each JVM had a very
> reasonable 2% duty cycle (not necessarily pause time, but time in GC cycle
> compared to time when no GC is active) then there would  be some sort of
> quantum-stealing-from-**application-threads GC activity going on roughly
> 25% of the wall clock time even if GCs were perfectly interleaved (which
> they won't be), and if they weren't perfectly interleaved there would be
> multiple of those going on. Under full load, such a setup will translate
> into a 98%-99%'ile that is at least as large as an OS scheduling quantum,
> and under lower loads those quantum-level hiccups will only move slightly
> higher I percentiles (e.g. even at only  5%-10% load your 99.9% will still
> be around a 10msec).
>
> Other JVM stuff:
> The JVM has other, non-GC work that it uses JVM threads for. E.g. JIT
> compilation will cause storms of compiler activity that runs concurrently
> with the app. While GC does tend to dominate over time, limiting GC threads
> to 1 does not cap the number of concurrent, non-application thread work
> that the JVM does.
>
> Application threads:
> Unless your Java application is purely single threaded, there will be
> bursts of time where one JVM has multiple runnable application threads
> active. Whenever those occur when there is only one-core-per-JVZm sizing,
> application threads across JVM will be contending for cores and
> scheduling-quantum

Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6)

2013-09-18 Thread Reynold Xin
+1


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Wed, Sep 18, 2013 at 11:06 AM, Konstantin Boudnik  wrote:

> Maven package could be run with -DskipTests that will simply build... well,
> the package.
>
> +1 on the RC. The nits are indeed minor.
>
>   Cos
>
> On Tue, Sep 17, 2013 at 07:20PM, Matei Zaharia wrote:
> > In Maven, mvn package should also create the assembly, but the
> non-obvious
> > thing is that it needs to happen for all projects before mvn test for
> core
> > works. Unfortunately I don't know any easy way around that.
> >
> > Matei
> >
> > On Sep 17, 2013, at 1:46 PM, Patrick Wendell  wrote:
> >
> > > Hey Mark,
> > >
> > > Good catches here. Ya the driver suite thing is sorta annoying - we
> > > should try to fix that in master. The audit script I wrote first does
> > > an sbt/sbt assembly to avoid this. I agree though these shouldn't
> > > block the release (if a blocker does come up we can revisit these
> > > potentially when cutting a release).
> > >
> > > - Patrick
> > >
> > > On Tue, Sep 17, 2013 at 1:26 PM, Mark Hamstra 
> wrote:
> > >> There are a few nits left to pick: 'sbt/sbt publish-local' isn't
> generating
> > >> correct POM files because of the way the exclusions are defined in
> > >> SparkBuild.scala using wildcards; looks like there may be some broken
> doc
> > >> links generated in that task, as well; DriverSuite doesn't like to
> run from
> > >> the maven build, complaining that 'sbt/sbt assembly' needs to be run
> first.
> > >>
> > >> None of these is enough for me to give RC6 a -1.
> > >>
> > >>
> > >> On Tue, Sep 17, 2013 at 11:28 AM, Matei Zaharia <
> matei.zaha...@gmail.com>wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Tried new staging repo to make sure the issue with RC5 is fixed.
> > >>>
> > >>> Matei
> > >>>
> > >>> On Sep 17, 2013, at 2:03 AM, Patrick Wendell 
> wrote:
> > >>>
> > >>>> Please vote on releasing the following candidate as Apache Spark
> > >>>> (incubating) version 0.8.0. This will be the first incubator
> release for
> > >>>> Spark in Apache.
> > >>>>
> > >>>> The tag to be voted on is v0.8.0-incubating (commit 3b85a85):
> > >>>>
> https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating
> > >>>>
> > >>>> The release files, including signatures, digests, etc can be found
> at:
> > >>>>
> http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/files/
> > >>>>
> > >>>> Release artifacts are signed with the following key:
> > >>>> https://people.apache.org/keys/committer/pwendell.asc
> > >>>>
> > >>>> The staging repository for this release can be found at:
> > >>>>
> https://repository.apache.org/content/repositories/orgapachespark-059/
> > >>>>
> > >>>> The documentation corresponding to this release can be found at:
> > >>>> http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/docs/
> > >>>>
> > >>>> Please vote on releasing this package as Apache Spark
> 0.8.0-incubating!
> > >>>>
> > >>>> The vote is open until Friday, September 20th at 09:00 UTC and
> passes if
> > >>>> a majority of at least 3 +1 IPMC votes are cast.
> > >>>>
> > >>>> [ ] +1 Release this package as Apache Spark 0.8.0-incubating
> > >>>> [ ] -1 Do not release this package because ...
> > >>>>
> > >>>> To learn more about Apache Spark, please see
> > >>>> http://spark.incubator.apache.org/
> > >>>
> > >>>
> >
>


Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC5)

2013-09-16 Thread Reynold Xin
+1


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Sun, Sep 15, 2013 at 11:09 PM, Patrick Wendell wrote:

> I also wrote an audit script [1] to verify various aspects of the
> release binaries and ran it on this RC. People are welcome to run this
> themselves, but I haven't tested it on other machines yet, and some of
> the Spark tests are very sensitive to the test environment :) Output
> is pasted below:
>
> [1] https://github.com/pwendell/spark-utils/blob/master/release_auditor.py
>
> -
>  Verifying download integrity for artifact:
> spark-0.8.0-incubating-bin-cdh4-rc5.tgz 
> [PASSED] Artifact signature verified.
> [PASSED] Artifact MD5 verified.
> [PASSED] Artifact SHA verified.
> [PASSED] Tarball contains CHANGES.txt file
> [PASSED] Tarball contains NOTICE file
> [PASSED] Tarball contains LICENSE file
> [PASSED] README file contains disclaimer
>  Verifying download integrity for artifact:
> spark-0.8.0-incubating-bin-hadoop1-rc5.tgz 
> [PASSED] Artifact signature verified.
> [PASSED] Artifact MD5 verified.
> [PASSED] Artifact SHA verified.
> [PASSED] Tarball contains CHANGES.txt file
> [PASSED] Tarball contains NOTICE file
> [PASSED] Tarball contains LICENSE file
> [PASSED] README file contains disclaimer
>  Verifying download integrity for artifact:
> spark-0.8.0-incubating-rc5.tgz 
> [PASSED] Artifact signature verified.
> [PASSED] Artifact MD5 verified.
> [PASSED] Artifact SHA verified.
> [PASSED] Tarball contains CHANGES.txt file
> [PASSED] Tarball contains NOTICE file
> [PASSED] Tarball contains LICENSE file
> [PASSED] README file contains disclaimer
>  Verifying build and tests for artifact:
> spark-0.8.0-incubating-bin-cdh4-rc5.tgz 
> ==> Running build
> [PASSED] sbt build successful
> [PASSED] Maven build successful
> ==> Performing unit tests
> [PASSED] Tests successful
>  Verifying build and tests for artifact:
> spark-0.8.0-incubating-bin-hadoop1-rc5.tgz 
> ==> Running build
> [PASSED] sbt build successful
> [PASSED] Maven build successful
> ==> Performing unit tests
> [PASSED] Tests successful
>  Verifying build and tests for artifact:
> spark-0.8.0-incubating-rc5.tgz 
> ==> Running build
> [PASSED] sbt build successful
> [PASSED] Maven build successful
> ==> Performing unit tests
> [PASSED] Tests successful
>
> - Patrick
>
> On Sun, Sep 15, 2013 at 9:48 PM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark
> > (incubating) version 0.8.0. This will be the first incubator release for
> > Spark in Apache.
> >
> > The tag to be voted on is v0.8.0-incubating (commit d9e80d5):
> > https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating
> >
> > The release files, including signatures, digests, etc can be found at:
> > http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/files/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-051/org/apache/spark/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/docs/
> >
> > Please vote on releasing this package as Apache Spark 0.8.0-incubating!
> > The vote is open until Thursday, September 19th at 05:00 UTC and passes
> if
> > a majority of at least 3 +1 IPMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 0.8.0-incubating
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.incubator.apache.org/
>


Re: how to debug spark core code?

2013-09-10 Thread Reynold Xin
Among the folks in Berkeley, most of us use IntelliJ / Vim / Sublime Text.

You can generate the IntelliJ project for Spark using

sbt/sbt gen-idea


On Tue, Sep 10, 2013 at 2:23 PM, Mingxi Wu  wrote:

> thanks Cesar.
>
>
> On Mon, Sep 9, 2013 at 11:08 PM, Cesar Arevalo  >wrote:
>
> > Hi Mingxi,
> >
> > I think it usually comes down to what IDE you're most comfortable with.
> > Here is an article describing how to set up spark on eclipse:
> >
> > http://syndeticlogic.net/?p=311
> >
> > Best,
> > -Cesar
> >
> >
> >
> > On Mon, Sep 9, 2013 at 10:53 PM, Mingxi Wu 
> > wrote:
> >
> > > Hi,
> > >
> > > I wonder if there is a convenient way to debug source code of spark
> > > from a repl test case?
> > >
> > > Was the spark core code developed under an IDE or using println()?
> > >
> > > thanks,
> > >
> > > Mingxi
> > >
> >
>


Re: Needs a matrix library

2013-09-06 Thread Reynold Xin
They are asking about dedicated matrix libraries.

Neither GraphX nor Giraph is a matrix library. They are systems that
handle large-scale graph processing, which could possibly be modeled as
matrix computations.  Hama looks like a BSP framework, so I am not sure it
has anything to do with a matrix library either.

For very small matrices (3x3, 4x4), the cost of going through JNI to do
native matrix operations will likely dominate the computation itself, so
you are probably better off with a simple unrolled for loop in Java.

I haven't looked into this myself, but I heard mahout-math is a decent
library.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Sat, Sep 7, 2013 at 6:13 AM, Dmitriy Lyubimov  wrote:

> keep forgetting this: what is graphx release roadmap?
>
> On Fri, Sep 6, 2013 at 3:04 PM, Konstantin Boudnik  wrote:
> > Would it be more logical to use GraphX ?
> >   https://amplab.cs.berkeley.edu/publication/graphx-grades/
> >
> > Cos
> >
> > On Fri, Sep 06, 2013 at 09:13PM, Mattmann, Chris A (398J) wrote:
> >> Thanks Roman, I was thinking Giraph too (knew it supported graphs but
> >> wasn't sure it supported matrices). If Giraph supports matrices, big +1.
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.a.mattm...@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++
> >>
> >>
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Roman Shaposhnik 
> >> Date: Friday, September 6, 2013 2:00 PM
> >> To: 
> >> Cc: "d...@sis.apache.org" 
> >> Subject: Re: Needs a matrix library
> >>
> >> >On Fri, Sep 6, 2013 at 1:33 PM, Mattmann, Chris A (398J)
> >> > wrote:
> >> >> Hey Martin,
> >> >>
> >> >> We may seriously consider using either Apache Hama here (which will
> >> >> bring in Hadoop):
> >> >
> >> >On that note I'd highly recommend taking a look at Apache Giraph
> >> >as well: http://giraph.apache.org/
> >> >
> >> >Thanks,
> >> >Roman.
> >> >
> >>
>


Re: Apache account

2013-09-06 Thread Reynold Xin
Copying Chris on this one.


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Fri, Sep 6, 2013 at 2:17 PM, Nick Pentreath wrote:

> Hi
>
> I submitted my license agreement and account name request a while back, but
> still haven't received any correspondence. Just wondering what I need to do
> in order to follow this up?
>
> Thanks
> Nick
>


Re: [Licensing check] Spark 0.8.0-incubating RC1

2013-09-03 Thread Reynold Xin
That seems like substantially more overhead than generating GitHub pull
requests. Is there any particular reason we want to do that?


On Wed, Sep 4, 2013 at 11:01 AM, Henry Saputra wrote:

> Thanks Guys.
>
> In other ASF projects I also allow people to attach the git diff to the
> JIRA itself (once we have one) and apply the patch and merge manually.
> I believe we could later configure ASF Jenkins to run when a patch is
> attached to JIRA (like in HBase and Hadoop).
>
> Do we want to also describe/allow this alternative way to contribute
> patches?
>
>
> - Henry
>
>
> On Tue, Sep 3, 2013 at 7:08 PM, Matei Zaharia  >wrote:
>
> > As far as I understood, we will have to manually merge those PRs into the
> > Apache repo. However, GitHub will notice that they're "merged" as soon as
> > it sees those commits in the repo, and will automatically close them. At
> > least this is my experience merging other peoples' code (sometimes I just
> > check out their branch from their repo and merge it manually).
> >
> > Matei
> >
> > On Sep 3, 2013, at 6:52 PM, Michael Joyce  wrote:
> >
> > > Henry,
> > >
> > > I'm fairly certain that we'll have to manually resolve the pull requests.
> > As
> > > far as I know, the Github mirror is simply a read-only mirror of the
> > > project's repository (be it svn or git). Hopefully someone will chime
> in
> > > and correct me if I'm wrong.
> > >
> > >
> > >
> > >
> > > -- Joyce
> > >
> > >
> > > On Tue, Sep 3, 2013 at 6:18 PM, Henry Saputra  > >wrote:
> > >
> > >> So looks like we need to manually resolve the Github pull requests.
> > >>
> > >> Or, does github automatically know that a particular merge to ASF git
> > repo
> > >> is associated to a GitHub pull request?
> > >>
> > >> - Henry
> > >>
> > >>
> > >> On Tue, Sep 3, 2013 at 1:38 PM, Matei Zaharia <
> matei.zaha...@gmail.com
> > >>> wrote:
> > >>
> > >>> Yup, the plan is as follows:
> > >>>
> > >>> - Make pull request against the mirror
> > >>> - Code review on GitHub as usual
> > >>> - Whoever merges it will simply merge it into the main Apache repo;
> > when
> > >>> this propagates, the PR will be marked as merged
> > >>>
> > >>> I found at least one other Apache project that did this:
> > >>> http://wiki.apache.org/cordova/ContributorWorkflow.
> > >>>
> > >>> Matei
> > >>>
> > >>> On Sep 3, 2013, at 10:39 AM, Mark Hamstra 
> > >> wrote:
> > >>>
> >  What is going to be the process for making pull requests?  Can they
> be
> > >>> made
> >  against the github mirror (
> https://github.com/apache/incubator-spark
> > ),
> > >>> or
> >  must we use some other way?
> > 
> > 
> >  On Tue, Sep 3, 2013 at 10:28 AM, Matei Zaharia <
> > >> matei.zaha...@gmail.com
> >  wrote:
> > 
> > > Hi guys,
> > >
> > >> So are you planning to release 0.8 from the master branch (which
> is
> > >> at
> > >> a106ed8... now) or from branch-0.8?
> > >
> > > Right now the branches are the same in terms of content (though I
> > >> might
> > > not have merged the latest changes into 0.8). If we add stuff into
> > >>> master
> > > that we won't want in 0.8 we'll break that.
> > >
> > >> My recommendation is that we start to use the Incubator release
> > > doc/guide:
> > >>
> > >> http://incubator.apache.org/guides/releasemanagement.html
> > >
> > > Cool, thanks for the pointer. I'll try to follow the steps there
> > about
> > > signing.
> > >
> > >> Are we "locking" pull requests to github repo by tomorrow?
> > >> Meaning no more push to GitHub repo for Spark.
> > >>
> > >> From your email seems like there will be more potential pull
> > requests
> > >>> for
> > >> github repo to be merged back to ASF Git repo.
> > >
> > > We'll probably use the GitHub repo for the last few changes in this
> > > release and then switch. The reason is that there's a bit of work
> > > involved in doing pull requests against the Apache one.
> > >
> > > Matei
> > >>>
> > >>>
> > >>
> >
> >
>


Re: off-heap RDDs

2013-08-27 Thread Reynold Xin
On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid  wrote:

>
> Reynold Xin wrote:
> > This is especially attractive if the application can read directly from
> a byte
> > buffer without generic serialization (like Shark).
>
> interesting -- can you explain how this works in Shark?  do you have
> some general way of storing data in byte buffers that avoids
> serialization?  Or do you mean that if the user is effectively
> creating an RDD of ints, that you create a an RDD[ByteBuffer], and
> then you read / write ints into the byte buffer yourself?
> Sorry, I'm familiar with the basic idea of shark but not the code at
> all -- even a pointer to the code would be helpful.
>
>
Yes - the user application (in this case) can create a bunch of byte
buffers and do primitive operations on them directly.
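
To make that concrete, here is a toy sketch of the idea (plain Scala, not
Shark's actual code): pack a column of ints into a direct buffer once, then
operate on the primitives in place with no per-element serialization or
object creation:

    import java.nio.ByteBuffer

    val numInts = 1000
    val buf = ByteBuffer.allocateDirect(numInts * 4)   // off-heap backing store
    var i = 0
    while (i < numInts) { buf.putInt(i * 4, i); i += 1 }

    // A scan (e.g. a sum) reads the raw bytes directly.
    var sum = 0L
    i = 0
    while (i < numInts) { sum += buf.getInt(i * 4); i += 1 }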


Re: off-heap RDDs

2013-08-25 Thread Reynold Xin
Mark - you don't necessarily need to construct a separate storage level.
One simple way to accomplish this is for the user application to pass Spark
a DirectByteBuffer.




On Sun, Aug 25, 2013 at 6:06 PM, Mark Hamstra wrote:

> I'd need to see a clear and significant advantage to using off-heap RDDs
> directly within Spark vs. leveraging Tachyon.  What worries me is the
> combinatoric explosion of different caching and persistence mechanisms.
>  With too many of these, not only will users potentially be baffled
> (@user-list: "What are the performance trade-offs in
> using MEMORY_ONLY_SER_2 vs. MEMORY_ONLY vs. off-heap RDDs?  Or should I
> store some of my RDDs in Tachyon?  Which ones?", etc. ad infinitum), but
> we've got to make sure that all of the combinations work correctly.  At
> some point we end up needing to do some sort of caching/persistence manager
> to automate some of the choices and wrangle the permutations.
>
> That's not to say that off-heap RDDs are a bad idea or are necessarily the
> combinatoric last straw, but I'm concerned about adding significant
> complexity for only marginal gains in limited cases over a more general
> solution via Tachyon.  I'm willing to be shown that those concerns are
> misplaced.
>
>
>
> On Sun, Aug 25, 2013 at 5:06 PM, Haoyuan Li  wrote:
>
> > Hi Imran,
> >
> > One possible solution is that you can use
> > Tachyon.
> > When data is in Tachyon, Spark jobs will read it from off-heap memory.
> > Internally, it uses direct byte buffers to store memory-serialized RDDs
> as
> > you mentioned. Also, different Spark jobs can share the same data in
> > Tachyon's memory. Here is a presentation we did in May (slides:
> > https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf
> > ).
> >
> > Haoyuan
> >
> >
> > On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid 
> > wrote:
> >
> > > Hi,
> > >
> > > I was wondering if anyone has thought about putting cached data in an
> > > RDD into off-heap memory, eg. w/ direct byte buffers.  For really
> > > long-lived RDDs that use a lot of memory, this seems like a huge
> > > improvement, since all the memory is now totally ignored during GC.
> > > (and reading data from direct byte buffers is potentially faster as
> > > well, but that's just a nice bonus).
> > >
> > > The easiest thing to do is to store memory-serialized RDDs in direct
> > > byte buffers, but I guess we could also store the serialized RDD on
> > > disk and use a memory mapped file.  Serializing into off-heap buffers
> > > is a really simple patch, I just changed a few lines (I haven't done
> > > any real tests w/ it yet, though).  But I don't really have a ton of
> > > experience w/ off-heap memory, so I thought I would ask what others
> > > think of the idea, if it makes sense or if there are any gotchas I
> > > should be aware of, etc.
> > >
> > > thanks,
> > > Imran
> > >
> >
>


Re: off-heap RDDs

2013-08-25 Thread Reynold Xin
This can be a good idea, especially for large heaps, and the changes to
Spark are potentially fairly small (we'd need to make the BlockManager aware
of off-heap size and direct byte buffers in its size accounting). This is
especially attractive if the application can read directly from a byte
buffer without generic serialization (like Shark).

One caveat with off-heap storage in the JVM is that the OS might not be very
good at dealing with tons of small allocations, but this is not really a
big problem since RDD partitions are supposed to be large in Spark.
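
For reference, here is a minimal sketch of the memory-mapped-file variant
Imran floats in the message quoted below; the file path is purely illustrative
and the file is assumed to already hold a serialized partition:

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // Map an on-disk serialized partition into off-heap memory; the
    // MappedByteBuffer reads like any other ByteBuffer but never sits on
    // the Java heap, so it is invisible to GC.
    val raf = new RandomAccessFile("/tmp/partition-0.bin", "r")  // hypothetical path
    try {
      val mapped = raf.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, raf.length)
      println("mapped " + mapped.capacity() + " bytes off-heap")
    } finally {
      raf.close()
    }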




On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid  wrote:

> Hi,
>
> I was wondering if anyone has thought about putting cached data in an
> RDD into off-heap memory, eg. w/ direct byte buffers.  For really
> long-lived RDDs that use a lot of memory, this seems like a huge
> improvement, since all the memory is now totally ignored during GC.
> (and reading data from direct byte buffers is potentially faster as
> well, but that's just a nice bonus).
>
> The easiest thing to do is to store memory-serialized RDDs in direct
> byte buffers, but I guess we could also store the serialized RDD on
> disk and use a memory mapped file.  Serializing into off-heap buffers
> is a really simple patch, I just changed a few lines (I haven't done
> any real tests w/ it yet, though).  But I don't really have a ton of
> experience w/ off-heap memory, so I thought I would ask what others
> think of the idea, if it makes sense or if there are any gotchas I
> should be aware of, etc.
>
> thanks,
> Imran
>


Re: RDDs with no partitions

2013-08-23 Thread Reynold Xin
But is there any reason to handle those anywhere beyond runJob?


On Fri, Aug 23, 2013 at 11:04 AM, Charles Reiss
wrote:

> On 8/22/13 22:57 , Reynold Xin wrote:
> > I actually don't think there is any reason to have 0 partition stages,
> be it
> > either result stage or shufflemap.
> >
> > It looks like Charles added those. Charles, any comments?
>
> One can get 0-partition RDDs (and thus 0-partition stages of either type)
> pretty easily with PartitionPruningRDD. Given that, e.g., Shark uses this
> with
> partition statistics, I imagine that real programs can hit the 0-partition
> stage case this way.
>
> One can also get a 0-partition RDD from sc.textFile() on an empty
> directory,
> and presumably some uses of hadoopFile/etc., though I won't claim that
> these
> are important to support.
>
> - Charles
>
> >
> >
> >
> > On Thu, Aug 22, 2013 at 10:51 PM, Mark Hamstra wrote:
> >
> > We already do a quick, no-op return from DAGScheduler.runJob when
> there are
> > no partitions submitted with the job, so running a job with no
> partitions
> > in the usual way isn't a problem.  That still leaves at least the
> "zero
> > split job" in the DAGSchedulerSuite and the possibility of shuffleMap
> > stages with no partitions.  Is "zero split job" testing anything
> > meaningful, or is its only purpose to cause me headaches?  Can
> shuffleMap
> > stages actually have no partitions, or is this (also) a distraction
> posing
> > as a legitimate problem?
> >
> > In short, when are RDDs with no partitions real things that we
> actually
> > have to deal with?
> >
> >
> >
> > > On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin wrote:
> >
> > > Being the guy that added the empty partition rdd, I second your
> idea that
> > > we should just short-circuit those in DAGScheduler.runJob.
> > >
> > >
> > >
> > >
> > > > On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra wrote:
> > >
> > > > So how do these get created, and are we really handling them
> correctly?
> > > >  What is prompting my questions is that I'm looking at making
> sure that
> > > the
> > > > various data structures in the DAGScheduler shrink when
> appropriate
> > > instead
> > > > of growing without bounds.  Jobs with no partitions and the
> "zero split
> > > > job" test in the DAGSchedulerSuite really throw a wrench into
> the works.
> > > >  That's because in the DAGScheduler we go part way along in
> handling this
> > > > weird case as though it were a normal job submission, we start
> > > initializing
> > > > or adding to various data structures, etc.; then we pretty much
> bail out
> > > in
> > > > submitMissingTasks when we find out that there actually are no
> tasks to
> > > be
> > > > done.  We remove the stage from the set of running stages, but
> we don't
> > > > ever clean up pendingTasks, activeJobs, stageIdToStage,
> stageToInfos, and
> > > > others because no tasks are ever submitted for the stage, so
> there are
> > > > never any completion events, nor is the stage aborted -- i.e.
> the normal
> > > > paths to cleanup are never taken.  The end result is that
> shuffleMap
> > > stages
> > > > with no partitions (can these even occur?) never complete, and
> jobs with
> > > > no partitions would seem also to persist forever.
> > > >
> > > > In short, RDDs with no partitions do really weird things to the
> > > > DAGScheduler.
> > > >
> > > > So, if there is no way to effectively prevent the creation of
> RDDs with
> > > no
> > > > partitions, is there any reason why we can't short-circuit their
> handling
> > > > within the DAGScheduler so that data structures are never built
> or
> > > > populated for these weird things, or must we add a bunch of
> special-case
> > > > cleanup code to submitMissingStages?
> > > >
> > >
> >
> >
>
>


Re: RDDs with no partitions

2013-08-22 Thread Reynold Xin
I actually don't think there is any reason to have 0 partition stages, be
it either result stage or shufflemap.

It looks like Charles added those. Charles, any comments?



On Thu, Aug 22, 2013 at 10:51 PM, Mark Hamstra wrote:

> We already do a quick, no-op return from DAGScheduler.runJob when there are
> no partitions submitted with the job, so running a job with no partitions
> in the usual way isn't a problem.  That still leaves at least the "zero
> split job" in the DAGSchedulerSuite and the possibility of shuffleMap
> stages with no partitions.  Is "zero split job" testing anything
> meaningful, or is its only purpose to cause me headaches?  Can shuffleMap
> stages actually have no partitions, or is this (also) a distraction posing
> as a legitimate problem?
>
> In short, when are RDDs with no partitions real things that we actually
> have to deal with?
>
>
>
> On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin  wrote:
>
> > Being the guy that added the empty partition rdd, I second your idea that
> > we should just short-circuit those in DAGScheduler.runJob.
> >
> >
> >
> >
> > On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra  > >wrote:
> >
> > > So how do these get created, and are we really handling them correctly?
> > >  What is prompting my questions is that I'm looking at making sure that
> > the
> > > various data structures in the DAGScheduler shrink when appropriate
> > instead
> > > of growing without bounds.  Jobs with no partitions and the "zero split
> > > job" test in the DAGSchedulerSuite really throw a wrench into the
> works.
> > >  That's because in the DAGScheduler we go part way along in handling
> this
> > > weird case as though it were a normal job submission, we start
> > initializing
> > > or adding to various data structures, etc.; then we pretty much bail
> out
> > in
> > > submitMissingTasks when we find out that there actually are no tasks to
> > be
> > > done.  We remove the stage from the set of running stages, but we don't
> > > ever clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos,
> and
> > > others because no tasks are ever submitted for the stage, so there are
> > > never any completion events, nor is the stage aborted -- i.e. the
> normal
> > > paths to cleanup are never taken.  The end result is that shuffleMap
> > stages
> > > with no partitions (can these even occur?) never complete, and jobs
> with
> > > no partitions would seem also to persist forever.
> > >
> > > In short, RDDs with no partitions do really weird things to the
> > > DAGScheduler.
> > >
> > > So, if there is no way to effectively prevent the creation of RDDs with
> > no
> > > partitions, is there any reason why we can't short-circuit their
> handling
> > > within the DAGScheduler so that data structures are never built or
> > > populated for these weird things, or must we add a bunch of
> special-case
> > > cleanup code to submitMissingStages?
> > >
> >
>


Re: RDDs with no partitions

2013-08-22 Thread Reynold Xin
Being the guy that added the empty partition rdd, I second your idea that
we should just short-circuit those in DAGScheduler.runJob.
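
To illustrate, here is a stripped-down sketch with a deliberately simplified
signature (not the real DAGScheduler code): the check happens before any job
or stage bookkeeping is created, so nothing needs cleaning up afterwards.

    // Simplified sketch: bail out when there is nothing to run.
    def runJob[U](partitions: Seq[Int])(submitStagesAndWait: Seq[Int] => Seq[U]): Seq[U] = {
      if (partitions.isEmpty) {
        return Seq.empty[U]   // no stages registered, no pending tasks, nothing to clean up
      }
      submitStagesAndWait(partitions)
    }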




On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra wrote:

> So how do these get created, and are we really handling them correctly?
>  What is prompting my questions is that I'm looking at making sure that the
> various data structures in the DAGScheduler shrink when appropriate instead
> of growing without bounds.  Jobs with no partitions and the "zero split
> job" test in the DAGSchedulerSuite really throw a wrench into the works.
>  That's because in the DAGScheduler we go part way along in handling this
> weird case as though it were a normal job submission, we start initializing
> or adding to various data structures, etc.; then we pretty much bail out in
> submitMissingTasks when we find out that there actually are no tasks to be
> done.  We remove the stage from the set of running stages, but we don't
> ever clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos, and
> others because no tasks are ever submitted for the stage, so there are
> never any completion events, nor is the stage aborted -- i.e. the normal
> paths to cleanup are never taken.  The end result is that shuffleMap stages
> with no partitions (can these even occur?) never complete, and jobs with
> no partitions would seem also to persist forever.
>
> In short, RDDs with no partitions do really weird things to the
> DAGScheduler.
>
> So, if there is no way to effectively prevent the creation of RDDs with no
> partitions, is there any reason why we can't short-circuit their handling
> within the DAGScheduler so that data structures are never built or
> populated for these weird things, or must we add a bunch of special-case
> cleanup code to submitMissingStages?
>


Re: Bagel and partitioning

2013-08-16 Thread Reynold Xin
Hi Denis,

Thanks for the email. I haven't looked at the paper yet, so I don't fully
understand your use case. But here are some answers:

1. Do you plan to continue development of Bagel?

Bagel will be subsumed by GraphX when GraphX comes out. We will try to
provide a Bagel API on top of GraphX so existing interfaces don't have to
change.

2. Would you be interested in incorporating our graph processing interface
into Spark if we implement it?

Depending on what it is :) Can the interface be implemented on top of Bagel
or GraphX? Does it benefit a large base of use cases/algorithms or is it
very specific to some algorithms?

If the interface doesn't require changing Spark itself, you can always put
it out on GitHub and let others try the new interface, but creating a new
Spark module would be a pretty major effort. We need to understand what it
does and what it looks like before answering this question. It is a balance
between how big/disruptive the change is vs. how much benefit it brings.


3. Is there any point in contributing BLP partitioner to Spark project in
some way, e.g. as a Bagel partitioner?

See 2. If the BLP partitioner is small and can benefit some common graph
use cases, definitely!
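
To give a sense of scale, a partitioner like that can be expressed against
the plain Partitioner interface. The sketch below is hypothetical only: it
assumes the org.apache.spark.Partitioner API and takes the vertex-to-partition
table computed offline by the label-propagation pass as a plain Map.

    import org.apache.spark.Partitioner

    // Hypothetical sketch: route vertex IDs using a precomputed BLP
    // assignment, falling back to hashing for keys the offline pass never saw.
    class BlpPartitioner(parts: Int, vertexToPart: Map[Long, Int]) extends Partitioner {
      override def numPartitions: Int = parts

      override def getPartition(key: Any): Int = key match {
        case vertexId: Long => vertexToPart.getOrElse(vertexId, fallback(vertexId.hashCode))
        case other          => fallback(other.hashCode)
      }

      private def fallback(hash: Int): Int = {
        val m = hash % parts
        if (m < 0) m + parts else m
      }
    }

Something like this could then be handed to partitionBy on the keyed vertex
RDD before a Bagel run.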




On Fri, Aug 16, 2013 at 8:10 AM, Denis Turdakov  wrote:

> Hello everyone,
>
> In ISPRAS (http://ispras.ru) we are working on several problems of social
> network analysis. In our work we are using Bagel implementation in Spark.
>
> As one way to improve graph analysis performance, we implemented a graph
> partitioning algorithm based on the balanced label propagation (BLP) algorithm
> developed at Facebook
> (http://dl.acm.org/citation.cfm?id=2433461).
> One problem is that it can't be directly integrated into Bagel, since Bagel
> does not know anything about graph edges. As a result, the interface for
> passing a graph to the partitioner is different from the interface for
> passing a graph to Bagel. Because of this, using the partitioner requires
> more code modifications than it should.
>
> So we thought about implementing another interface for processing graphs
> that would be aware of edges. Another reason for a different interface is
> that it may be very similar to GraphX so that switching between
> edge-partitioning and vertex-partitioning approaches in an application
> would be easier.
>
>
> Could you please clarify following things for us:
> 1. Do you plan to continue development of Bagel?
> 2. Would you be interested in incorporating our graph processing interface
> into Spark if we implement it?
> 3. Is there any point in contributing BLP partitioner to Spark project in
> some way, e.g. as a Bagel partitioner?
>
> Best regards,
> Denis Turdakov
>
>


Re: Machine Learning on Spark [long rambling discussion email]

2013-07-24 Thread Reynold Xin
On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath wrote:

>
> I also found Breeze to be very nice to work with and like the DSL - hence
> my question about why not use that? (Especially now that Breeze is actually
> just breeze-math and breeze-viz).
>


Matei addressed this from a higher level. I want to provide a little bit
more context. A common property of a lot of high-level Scala DSL
libraries is that simple operators tend to have high virtual-function
overhead and also create a lot of temporary objects. And because the level
of abstraction is so high, it is fairly hard to debug / optimize
performance.
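
As a toy example of the kind of hidden cost I mean (plain Scala, nothing to do
with Breeze's internals): the generic version below allocates an intermediate
array plus a tuple per element and goes through collection machinery, while
the hand-written loop touches only primitive arrays.

    def addHighLevel(a: Array[Double], b: Array[Double]): Array[Double] =
      a.zip(b).map { case (x, y) => x + y }

    def addLowLevel(a: Array[Double], b: Array[Double]): Array[Double] = {
      val out = new Array[Double](a.length)
      var i = 0
      while (i < a.length) {
        out(i) = a(i) + b(i)
        i += 1
      }
      out
    }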




--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


Re: Mailing list transition (was Re: Apache Spark podling: Created!)

2013-06-28 Thread Reynold Xin
/incubator
> >
> >
> >Did you request a list yet? Should i?
> >
> >(BTW sorry about the delay in responding was at a DARPA meeting all week
> >in
> >DC an am just back in California now catching up on everything).
> >
> >
> >>
> >>Also, we should discuss a strategy, and timeline for migrating the
> >>mailing
> >>lists over to the new ones.
> >>
> >>As far as a strategy, here are the steps I can think of that will help
> >>make
> >>for a smooth transition:
> >>
> >>   1. Request users list on apache infra (done)
> >>   2. Pick a day/time for the switch (how about July 1, assuming
> >>   users@spark.i.a.o is set up by then)
> >>   3. At Switch time:
> >>  1. Make announcements on the dev and users mailing lists with links
> >>  to the new lists, instructions on how to subscribe, and a note
> >>saying all
> >>  conversations are moving over to that list.
> >>  2. Update the website with links to the new lists
> >>  3. Enable an auto responders on those lists with pointers to the
> >>new
> >>  apache lists
> >
> >Perfect! That's correct Andy.
> >
> >Cheers,
> >Chris
> >
> >++
> >Chris Mattmann, Ph.D.
> >Senior Computer Scientist
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 171-266B, Mailstop: 171-246
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Assistant Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
> >>
> >>
> >>On Fri, Jun 21, 2013 at 5:03 PM, Mattmann, Chris A (398J) <
> >>chris.a.mattm...@jpl.nasa.gov> wrote:
> >>
> >>> CC'ing dev@spark.i.a.o: our first email to the dev list! :)
> >>>
> >>> ++
> >>> Chris Mattmann, Ph.D.
> >>> Senior Computer Scientist
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 171-266B, Mailstop: 171-246
> >>> Email: chris.a.mattm...@nasa.gov
> >>> WWW:  http://sunset.usc.edu/~mattmann/
> >>> ++
> >>> Adjunct Assistant Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -Original Message-
> >>> From: Henry Saputra 
> >>> Date: Friday, June 21, 2013 4:51 PM
> >>> To: jpluser 
> >>> Cc: Matt Massie , Reynold Xin
> >>> , Matei Zaharia , Ankur Dave
> >>> , Tathagata Das , Haoyuan
> >>>Li
> >>> , Josh Rosen ,
> >>> Shivaram Venkataraman , Mosharaf Chowdhury
> >>> , Charles Reiss ,
> >>> Andy Konwinski , Patrick Wendell
> >>> , Imran Rashid ,
> Ryan
> >>> LeCompte , Ravi Pandya
> >>>,
> >>> Ram Sriharsha , Robert Evans
> >>> , Mridul Muralidharan ,
> >>>Thomas
> >>> Dudziak , Mark Hamstra
> >>> , Stephen Haberman
> >>>,
> >>> Jason Dai , Shane Huang  >,
> >>> Andrew xia , Nick Pentreath
> >>> , Sean McNamara
> >>>,
> >>> "Ramirez, Paul M (398J)" , Roman
> >>>Shaposhnik
> >>> , Suresh Marru , "Hart, Andrew F
> >>> (398J)" 
> >>> Subject: Re: Apache Spark podling: Created!
> >>>
> >>> >Thanks for driving this forward Chris, awesome as usual! =)
> >>> >
> >>> >
> >>> >The mailing lists are ready:
> >>> >dev@spark.incubator.apache.org
> >>> >comm...@spark.incubator.apache.org
> >>> >
> >>> >priv...@spark.incubator.apache.org
> >>> >
> >>> >
> >>> >
> >>> >You can subscribe by sending email to:
> >>> >dev-subscr...@spark.incubator.apache.org
> >>> >commits-subscr...@spark.i

Re: Mailing list transition (was Re: Apache Spark podling: Created!)

2013-06-28 Thread Reynold Xin
I think we should avoid migrating the list too many times, especially the
user list.

Also - are there any rules regarding maintaining a separate, non-Apache
mailing list by 3rd party? Google Groups has been very convenient for
users, both in terms of the UX and the way to quickly and easily search for
archived messages.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Fri, Jun 28, 2013 at 2:04 PM, Andy Konwinski wrote:

> + spark-develop...@googlegroups.com to loop in those who haven't
> subscribed to dev@spark.i.a.o yet, (also because my emails are getting
> bounced by Apache's spam filters).
>
> I wanted to respond here in the conversation about the mailing list
> migration that was happening on the email thread called "Re: A wiki for
> Spark (on Apache infra)"...
>
> Assuming that Apache requires us to migrate from Google Groups to lists
> on Apache infra, we might consider waiting to migrate the users list to
> Apache infra until after we graduate to a TLP, so that we only have to
> migrate it once.
>
> Here's why. I assume with each list migration that requires subscribers to
> do work, we will lose some subscribers. If we ask them to migrate to an
> incubator user list now and then again to yet a different list when we
> graduate to a TLP (which we hope to do fairly quickly), it seems like we
> will irritate and lose strictly more subscribers.
>
> This requirement to migrate infra twice as part of moving to Apache seems
> a bit hard on communities. It also seems like a requirement that will go
> away if the changes to the incubation process you are pushing for (i.e.
> podling TLPs) ever actually happen.
>
> Anyway, I see our options as:
>
> 1. Migrate only the dev list now (since this is a smaller core group that
> is more likely to migrate with us) and wait to create an Apache users list
> until we graduate, migrating off the users Google group at that point. Con:
> it's confusing to have user and dev lists on different infra.
> 2. Move the users list now, in which case we go with the migration plan I
> proposed earlier. Con: migrating users list twice = more irritating to
> users.
>
> Andy
>
>
>
> On Fri, Jun 28, 2013 at 12:20 PM, Mattmann, Chris A (398J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hi Andy,
>>
>> -Original Message-
>>
>> From: Andy Konwinski 
>> Reply-To: "dev@spark.incubator.apache.org" <
>> dev@spark.incubator.apache.org>
>> Date: Tuesday, June 25, 2013 10:18 AM
>> To: "dev@spark.incubator.apache.org" 
>> Subject: Re: Apache Spark podling: Created!
>>
>> >This is great.
>> >
>> >Quick question about mailing lists: Spark also has a
>> >spark-users<https://groups.google.com/forum/#!forum/spark-users>
>> >google
>> >group. Can we also get a users@spark.i.a.o mailing list to have
>> somewhere
>> >to migrate that group? Do I need to create an infra issue for this?
>>
>> OK, cool yeah I think I requested commits and dev as lists earlier, but
>> didn't request a user one. To request a new list, you go here:
>>
>> https://infra.apache.org/officers/mlreq/incubator
>>
>>
>> Did you request a list yet? Should i?
>>
>> (BTW sorry about the delay in responding was at a DARPA meeting all week
>> in
>> DC an am just back in California now catching up on everything).
>>
>>
>> >
>> >Also, we should discuss a strategy, and timeline for migrating the
>> mailing
>> >lists over to the new ones.
>> >
>> >As far as a strategy, here are the steps I can think of that will help
>> >make
>> >for a smooth transition:
>> >
>> >   1. Request users list on apache infra (done)
>> >   2. Pick a day/time for the switch (how about July 1, assuming
>> >   users@spark.i.a.o is set up by then)
>> >   3. At Switch time:
>> >  1. Make announcements on the dev and users mailing lists with links
>> >  to the new lists, instructions on how to subscribe, and a note
>> >saying all
>> >  conversations are moving over to that list.
>> >  2. Update the website with links to the new lists
>> >  3. Enable an auto responders on those lists with pointers to the
>> new
>> >  apache lists
>>
>> Perfect! That's correct Andy.
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>