Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nick Pentreath
Hi

What KeyClass and ValueClass are you trying to save as the keys/values of
your dataset?
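
For reference, a minimal sketch of a typical invocation (the RDD, the classes and
the output path below are illustrative, not taken from your code). A
java.lang.InstantiationException usually means the class passed as the output
format cannot be instantiated reflectively, for example because it is abstract
or an interface:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

// rdd is assumed to be an RDD[(Text, IntWritable)] of Writable keys/values
rdd.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/output",                                   // output path (illustrative)
  classOf[Text],                                          // key class
  classOf[IntWritable],                                   // value class
  classOf[SequenceFileOutputFormat[Text, IntWritable]])   // concrete new-API OutputFormat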



On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu  wrote:

> Hi, all
>
> I found a weird thing with saveAsNewAPIHadoopFile in
> PairRDDFunctions.scala when working on another issue:
>
> saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time
>
> I checked the commit history of the file, and it seems the API has existed for
> a long time. Has no one else hit this? (That's the reason I'm confused.)
>
> Best,
>
> --
> Nan Zhu
>
>


Re: Spark 0.8.1 on Amazon Elastic MapReduce

2014-02-14 Thread Nick Pentreath
Thanks Parviz, this looks great, and it's good to see it getting updated. Looking
forward to 0.9.0!

A perhaps stupid question - where does the KinesisWordCount example live?
Is it an Amazon example? I don't see it under the streaming
examples included in the Spark project. If it's a third-party example, is it
possible to get the code?

Thanks
Nick


On Fri, Feb 14, 2014 at 6:53 PM, Deyhim, Parviz  wrote:

>   Spark community,
>
>  Wanted to let you know that the version of Spark and Shark on Amazon
> Elastic MapReduce has been updated to 0.8.1. This new version provides a
> much better experience in terms of stability and performance, and also
> supports the following features:
>
>- Integration with Amazon Cloudwatch
>- Integration of Spark Streaming with Amazon Kinesis.
>- Automatic log shipping to S3
>
> For complete details of the features Spark on EMR provides, please see
> the following article: http://aws.amazon.com/articles/4926593393724923
>
>  And yes I'm working hard to push another update to support 0.9.0 :)
>
>  What would be great is to hear from the community on what other features
> you'd like to see in Spark on EMR. For example, how useful is autoscaling for
> Spark? Any other features you'd like to see?
>
>  Thanks,
>
>*Parviz Deyhim*
>
> Solutions Architect
>
> *Amazon Web Services *
>
> E: parv...@amazon.com
>
> M:  408.315.2305
>
>
>
>


Re: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread Nick Pentreath
@fommil @mengxr I think it's always worth a shot at a license change. Scikit 
learn devs have been successful before in getting such things over the line.


Assuming we can make that happen, what do folks think about MTJ vs Breeze vs 
JBLAS + commons-math since these seem like the viable alternatives?
—
Sent from Mailbox for iPhone

On Fri, Feb 14, 2014 at 1:21 AM, mengxr  wrote:

> Github user mengxr commented on the pull request:
> https://github.com/apache/incubator-spark/pull/575#issuecomment-35038739
>   
> @fommil I don't quite understand what "roll their own" means exactly 
> here. I didn't propose to re-implement a linear algebra library (or half of one) in 
> the PR. For the license issue, it would be great if the original author of 
> MTJ agrees to change the license to Apache. With the LGPL license, there is 
> not much we can do. 

Re: [VOTE] Graduation of Apache Spark from the Incubator

2014-02-11 Thread Nick Pentreath
+1


On Tue, Feb 11, 2014 at 9:17 AM, Matt Massie  wrote:

> +1
>
> --
> Matt Massie
> UC, Berkeley AMPLab
> Twitter: @matt_massie <https://twitter.com/matt_massie>,
> @amplab<https://twitter.com/amplab>
> https://amplab.cs.berkeley.edu/
>
>
> On Mon, Feb 10, 2014 at 11:12 PM, Zongheng Yang  >wrote:
>
> > +1
> >
> > On Mon, Feb 10, 2014 at 10:21 PM, Reynold Xin 
> wrote:
> > > Actually I made a mistake by saying binding.
> > >
> > > Just +1 here.
> > >
> > >
> > > On Mon, Feb 10, 2014 at 10:20 PM, Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > >> Hi Nathan, anybody is welcome to VOTE. Thank you.
> > >> Only VOTEs from the Incubator PMC are what is considered "binding",
> but
> > >> I welcome and will tally all VOTEs provided.
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >>
> > >>
> > >>
> > >> -Original Message-
> > >> From: Nathan Kronenfeld 
> > >> Reply-To: "dev@spark.incubator.apache.org" <
> > dev@spark.incubator.apache.org
> > >> >
> > >> Date: Monday, February 10, 2014 9:44 PM
> > >> To: "dev@spark.incubator.apache.org" 
> > >> Subject: Re: [VOTE] Graduation of Apache Spark from the Incubator
> > >>
> > >> >Who is allowed to vote on stuff like this?
> > >> >
> > >> >
> > >> >On Mon, Feb 10, 2014 at 11:27 PM, Chris Mattmann
> > >> >wrote:
> > >> >
> > >> >> Hi Everyone,
> > >> >>
> > >> >> This is a new VOTE to decide if Apache Spark should graduate
> > >> >> from the Incubator. Please VOTE on the resolution pasted below
> > >> >> the ballot. I'll leave this VOTE open for at least 72 hours.
> > >> >>
> > >> >> Thanks!
> > >> >>
> > >> >> [ ] +1 Graduate Apache Spark from the Incubator.
> > >> >> [ ] +0 Don't care.
> > >> >> [ ] -1 Don't graduate Apache Spark from the Incubator because..
> > >> >>
> > >> >> Here is my +1 binding for graduation.
> > >> >>
> > >> >> Cheers,
> > >> >> Chris
> > >> >>
> > >> >>  snip
> > >> >>
> > >> >> WHEREAS, the Board of Directors deems it to be in the best
> > >> >> interests of the Foundation and consistent with the
> > >> >> Foundation's purpose to establish a Project Management
> > >> >> Committee charged with the creation and maintenance of
> > >> >> open-source software, for distribution at no charge to the
> > >> >> public, related to fast and flexible large-scale data analysis
> > >> >> on clusters.
> > >> >>
> > >> >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> > >> >> Committee (PMC), to be known as the "Apache Spark Project", be
> > >> >> and hereby is established pursuant to Bylaws of the Foundation;
> > >> >> and be it further
> > >> >>
> > >> >> RESOLVED, that the Apache Spark Project be and hereby is
> > >> >> responsible for the creation and maintenance of software
> > >> >> related to fast and flexible large-scale data analysis
> > >> >> on clusters; and be it further RESOLVED, that the office
> > >> >> of "Vice President, Apache Spark" be and hereby is created,
> > >> >> the person holding such office to serve at the direction of
> > >> >> the Board of Directors as the chair of the Apache Spark
> > >> >> Project, and to have primary responsibility for management
> > >> >> of the projects within the scope of responsibility
> > >> >> of the Apache Spark Project; and be it further
> > >> >> RESOLVED, that the persons listed immediately below be and
> > >> >> hereby are appointed to serve as the initial members of the
> > >> >> Apache Spark Project:
> > >> >>
> > >> >> * Mosharaf Chowdhury 
> > >> >> * Jason Dai 
> > >> >> * Tathagata Das 
> > >> >> * Ankur Dave 
> > &

Fwd: Represent your project at ApacheCon

2014-01-27 Thread Nick Pentreath
Is Spark active in submitting anything for this?


-- Forwarded message --
From: Rich Bowen 
Date: Mon, Jan 27, 2014 at 4:20 PM
Subject: Represent your project at ApacheCon
To: committ...@apache.org


Folks,

5 days from the end of the CFP, we have only 50 talks submitted. We need
three times that just to fill the space, and preferably a lot more so that
we have some variety to choose from to put together a schedule.

I know that we usually have over half the content submitted in the last 48
hours, so I'm not panicking yet, but it's worrying. More worrying, however,
is that 2/3 of those submissions are from the Usual Suspects (i.e., httpd and
Tomcat), and YOUR project isn't represented.

We would love to have a whole day of Lucene, and of OpenOffice, and of
Cordova, and of Felix and Celix and Helix and Nelix. Or a half day.

We need your talk submissions. We need you to come tell the world why your
project matters, why you spend your time working on it, and what exciting
new thing you hacked into it during the snow storms. (Or heat wave, as the
case may be.)

Please help us get the word out to your developer and user communities that
we're looking for quality talks about their favorite Apache project, about
related technologies, about ways that it's being used, and plans for its
future. Help us make this ApacheCon amazing.

--rcb

-- 
Rich Bowen - rbo...@redhat.com
OpenStack Community Liaison
http://openstack.redhat.com/


Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-26 Thread Nick Pentreath
If you want to spend the time running 50 iterations, you're better off 
re-running 5x10 iterations with different random starts to get a better local 
minimum...—
Sent from Mailbox for iPhone
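
A rough sketch of that "several short runs, keep the best" idea (the ratings
RDD, the rank, the lambda, and the training-RMSE selection criterion below are
illustrative assumptions, not something prescribed by MLlib itself):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

def trainingRmse(model: MatrixFactorizationModel, ratings: RDD[Rating]): Double = {
  val predictions = model.predict(ratings.map(r => (r.user, r.product)))
                         .map(p => ((p.user, p.product), p.rating))
  val actuals = ratings.map(r => ((r.user, r.product), r.rating))
  math.sqrt(actuals.join(predictions).values
    .map { case (a, p) => (a - p) * (a - p) }.mean())
}

// ALS factors are randomly initialised, so each run is a different random start;
// train a handful of 10-iteration models and keep the one with the lowest RMSE.
val best = (1 to 5)
  .map(_ => ALS.train(ratings, 10, 10, 0.01))
  .minBy(model => trainingRmse(model, ratings))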

On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia 
wrote:

> I looked into this after I opened that JIRA and it’s actually a bit harder to 
> fix. While changing these visit() calls to use a stack manually instead of 
> being recursive helps avoid a StackOverflowError there, you still get a 
> StackOverflowError when you send the task to a worker node because Java 
> serialization uses recursion. The only real fix therefore with the current 
> codebase is to increase your JVM stack size. Longer-term, I’d like us to 
> automatically call checkpoint() to break lineage graphs when they exceed a 
> certain size, which would avoid the problems in both DAGScheduler and Java 
> serialization. We could also manually add this to ALS now without having a 
> solution for other programs. That would be a great change to make to fix this 
> JIRA.
> Matei
> On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava  wrote:
>> The three obvious ones in DAGScheduler.scala are in:
>> 
>> getParentStages
>> getMissingParentStages
>> stageDependsOn
>> 
>> They all follow the same pattern though (def visit(), followed by 
>> visit(root)), so they should be easy to replace with a Scala stack in place 
>> of the call stack.
>> 
>>> Shao, Saisai January 25, 2014 at 10:52 PM
>>> In my test I found this phenomenon might be caused by RDD's long dependency 
>>> chain, this dependency chain is serialized into task and sent to each 
>>> executor, while deserializing this task will cause stack overflow.
>>> 
>>> Especially in iterative job, like:
>>> var rdd = ..
>>> 
>>> for (i <- 0 to 100)
>>> rdd = rdd.map(x=>x)
>>> 
>>> rdd = rdd.cache
>>> 
>>> Here rdd's dependency will be chained, at some point stack overflow will 
>>> occur.
>>> 
>>> You can check 
>>> (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
>>>  and 
>>> (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
>>>  for details. The current workaround is to cut the dependency chain by 
>>> checkpointing the RDD; maybe a better way is to clean the dependency chain 
>>> after the materialized stage is executed.
>>> 
>>> Thanks
>>> Jerry
>>> 
>>> -Original Message-
>>> From: Reynold Xin [mailto:r...@databricks.com] 
>>> Sent: Sunday, January 26, 2014 2:04 PM
>>> To: dev@spark.incubator.apache.org
>>> Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow 
>>> with too many iterations"?
>>> 
>>> I'm not entirely sure, but two candidates are
>>> 
>>> the visit function in stageDependsOn
>>> 
>>> submitStage
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Reynold Xin January 25, 2014 at 10:03 PM
>>> I'm not entirely sure, but two candidates are
>>> 
>>> the visit function in stageDependsOn
>>> 
>>> submitStage
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Aaron Davidson  January 25, 2014 at 10:01 PM
>>> I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
>>> like processEvent shouldn't have inherently recursive properties.
>>> 
>>> 
>>> 
>>> Reynold Xin January 25, 2014 at 9:57 PM
>>> It seems to me fixing DAGScheduler to make it not recursive is the better
>>> solution here, given the cost of checkpointing.
>>> 
>>> 
>>> Xia, Junluan January 25, 2014 at 9:49 PM
>>> Hi all
>>> 
>>> The description about this Bug submitted by Matei is as following
>>> 
>>> 
>>> The tipping point seems to be around 50. We should fix this by 
>>> checkpointing the RDDs every 10-20 iterations to break the lineage chain, 
>>> but checkpointing currently requires HDFS installed, which not all users 
>>> will have.
>>> 
>>> We might also be able to fix DAGScheduler to not be recursive.
>>> 
>>> 
>>> regards,
>>> Andrew
>>> 
>>> 

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-26 Thread Nick Pentreath
Agree that it should be fixed if possible. But why run ALS for 50 iterations? 
It tends to pretty much converge (to within 0.001 or so RMSE) after 5-10 iterations, and 
even 20 is probably overkill.—
Sent from Mailbox for iPhone

On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia 
wrote:

> I looked into this after I opened that JIRA and it’s actually a bit harder to 
> fix. While changing these visit() calls to use a stack manually instead of 
> being recursive helps avoid a StackOverflowError there, you still get a 
> StackOverflowError when you send the task to a worker node because Java 
> serialization uses recursion. The only real fix therefore with the current 
> codebase is to increase your JVM stack size. Longer-term, I’d like us to 
> automatically call checkpoint() to break lineage graphs when they exceed a 
> certain size, which would avoid the problems in both DAGScheduler and Java 
> serialization. We could also manually add this to ALS now without having a 
> solution for other programs. That would be a great change to make to fix this 
> JIRA.
> Matei
> On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava  wrote:
>> The three obvious ones in DAGScheduler.scala are in:
>> 
>> getParentStages
>> getMissingParentStages
>> stageDependsOn
>> 
>> They all follow the same pattern though (def visit(), followed by 
>> visit(root)), so they should be easy to replace with a Scala stack in place 
>> of the call stack.
>> 
>>> Shao, Saisai January 25, 2014 at 10:52 PM
>>> In my test I found this phenomenon might be caused by RDD's long dependency 
>>> chain, this dependency chain is serialized into task and sent to each 
>>> executor, while deserializing this task will cause stack overflow.
>>> 
>>> Especially in iterative job, like:
>>> var rdd = ..
>>> 
>>> for (i <- 0 to 100)
>>> rdd = rdd.map(x=>x)
>>> 
>>> rdd = rdd.cache
>>> 
>>> Here rdd's dependency will be chained, at some point stack overflow will 
>>> occur.
>>> 
>>> You can check 
>>> (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
>>>  and 
>>> (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
>>>  for details. The current workaround is to cut the dependency chain by 
>>> checkpointing the RDD; maybe a better way is to clean the dependency chain 
>>> after the materialized stage is executed.
>>> 
>>> Thanks
>>> Jerry
>>> 
>>> -Original Message-
>>> From: Reynold Xin [mailto:r...@databricks.com] 
>>> Sent: Sunday, January 26, 2014 2:04 PM
>>> To: dev@spark.incubator.apache.org
>>> Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow 
>>> with too many iterations"?
>>> 
>>> I'm not entirely sure, but two candidates are
>>> 
>>> the visit function in stageDependsOn
>>> 
>>> submitStage
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Reynold Xin January 25, 2014 at 10:03 PM
>>> I'm not entirely sure, but two candidates are
>>> 
>>> the visit function in stageDependsOn
>>> 
>>> submitStage
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Aaron Davidson  January 25, 2014 at 10:01 PM
>>> I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
>>> like processEvent shouldn't have inherently recursive properties.
>>> 
>>> 
>>> 
>>> Reynold Xin January 25, 2014 at 9:57 PM
>>> It seems to me fixing DAGScheduler to make it not recursive is the better
>>> solution here, given the cost of checkpointing.
>>> 
>>> 
>>> Xia, Junluan January 25, 2014 at 9:49 PM
>>> Hi all
>>> 
>>> The description about this Bug submitted by Matei is as following
>>> 
>>> 
>>> The tipping point seems to be around 50. We should fix this by 
>>> checkpointing the RDDs every 10-20 iterations to break the lineage chain, 
>>> but checkpointing currently requires HDFS installed, which not all users 
>>> will have.
>>> 
>>> We might also be able to fix DAGScheduler to not be recursive.
>>> 
>>> 
>>> regards,
>>> Andrew
>>> 
>>> 
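
To make the checkpointing workaround discussed above concrete, here is a minimal
sketch of cutting the lineage every few iterations; the SparkContext `sc`, the
checkpoint directory, and the interval are illustrative assumptions, and (as
noted above) checkpointing needs an HDFS-style directory:

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 100) {
  rdd = rdd.map(x => x + 1.0)
  if (i % 10 == 0) {
    rdd.cache()        // keep the data around so checkpointing does not recompute it
    rdd.checkpoint()   // lineage is truncated once the RDD is materialised
    rdd.count()        // force materialisation so the checkpoint actually happens
  }
}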

Re: [DISCUSS] Graduating as a TLP

2014-01-23 Thread Nick Pentreath
+1 fantastic news —
Sent from Mailbox for iPhone

On Fri, Jan 24, 2014 at 6:43 AM, Mridul Muralidharan 
wrote:

> Great news !
> +1
> Regards,
> Mridul
> On Fri, Jan 24, 2014 at 4:15 AM, Matei Zaharia  
> wrote:
>> Hi folks,
>>
>> We’ve been working on the transition to Apache for a while, and our last 
>> shepherd’s report says the following:
>>
>> 
>> Spark
>>
>> Alan Cabrera (acabrera):
>>
>>   Seems like a nice active project.  IMO, there's no need to wait for the import
>>   to JIRA to graduate. Seems like they can graduate now.
>> 
>>
>> What do you think about graduating to a top-level project? As far as I can 
>> tell, we’ve completed all the requirements for graduating:
>>
>> - Made 2 releases (working on a third now)
>> - Added new committers and PPMC members (4 of them)
>> - Did IP clearance
>> - Moved infrastructure over to Apache, except for the JIRA above, which 
>> INFRA is working on and which shouldn’t block us.
>>
>> If everything is okay, I’ll call a VOTE on graduating in 48 hours. The one 
>> final thing missing is that we’ll need to nominate an initial VP for the 
>> project.
>>
>> Matei

Re: Option folding idiom

2013-12-26 Thread Nick Pentreath
+1 for getOrElse


When I was new to Scala I tended to use match almost like if/else statements 
with Option. These days I try to use map/flatMap instead and use getOrElse 
extensively and I for one find it very intuitive.




I also agree that the fold syntax seems way less intuitive and I certainly 
prefer readable Scala code to that which might be more "idiomatic" but which I 
honestly tend to find very inscrutable and hard to grok quickly.
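
For concreteness, a minimal side-by-side of the two idioms being discussed
(the Option value here is a made-up example):

val maybeCount: Option[Int] = Some(3)

// fold: the default ("None") case comes first, the Some case second
val a = maybeCount.fold("empty")(n => s"got $n")

// map + getOrElse: the Some case first, then the default
val b = maybeCount.map(n => s"got $n").getOrElse("empty")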
—
Sent from Mailbox for iPhone

On Fri, Dec 27, 2013 at 9:06 AM, Matei Zaharia 
wrote:

> I agree about using getOrElse instead. In choosing which code style and 
> idioms to use, my goal has always been to maximize the ease of *other 
> developers* understanding the code, and most developers today still don’t 
> know Scala. It’s fine to use a maps or matches, because their meaning is 
> obvious, but fold on Option is not obvious (even foreach is kind of weird for 
> new people). In this case the benefit is so small that it doesn’t seem worth 
> it.
> Note that if you use getOrElse, you can even throw exceptions in the “else” 
> part if you’d like. (This is because Nothing is a subtype of every type in 
> Scala.) So for example you can do val stuff = option.getOrElse(throw new 
> Exception(“It wasn’t set”)). It looks a little weird, but note how the 
> meaning is obvious even if you don’t know anything about the type system.
> Matei
> On Dec 27, 2013, at 12:12 AM, Kay Ousterhout  wrote:
>> I agree with what Reynold said -- there's not a big benefit in terms of
>> lines of code (esp. compared to using getOrElse) and I think it hurts code
>> readability.  One of the great things about the current Spark codebase is
>> that it's very accessible for newcomers -- something that would be less
>> true with this use of "fold".
>> 
>> 
>> On Thu, Dec 26, 2013 at 8:11 PM, Holden Karau  wrote:
>> 
>>> I personally agree with Evan in that I prefer map with getOrElse over fold with
>>> options (but that's just my personal preference) :)
>>> 
>>> 
>>> On Thu, Dec 26, 2013 at 7:58 PM, Reynold Xin  wrote:
>>> 
 I'm not strongly against Option.fold, but I find the readability getting
 worse for the use case you brought up.  For the use case of if/else, I
>>> find
 Option.fold pretty confusing because it reverses the order of Some vs
>>> None.
 Also, when code gets long, the lack of an obvious boundary (the only
 boundary is "} {") with two closures is pretty confusing.
 
 
 On Thu, Dec 26, 2013 at 4:23 PM, Mark Hamstra  wrote:
 
> On the contrary, it is the completely natural place for the initial
>>> value
> of the accumulator, and provides the expected result of folding over an
> empty collection.
> 
> scala> val l: List[Int] = List()
> 
> l: List[Int] = List()
> 
> 
> scala> l.fold(42)(_ + _)
> 
> res0: Int = 42
> 
> 
> scala> val o: Option[Int] = None
> 
> o: Option[Int] = None
> 
> 
> scala> o.fold(42)(_ + 1)
> 
> res1: Int = 42
> 
> 
> On Thu, Dec 26, 2013 at 5:51 PM, Evan Chan  wrote:
> 
>> +1 for using more functional idioms in general.
>> 
>> That's a pretty clever use of `fold`, but putting the default
>>> condition
>> first there makes it not as intuitive.   What about the following,
 which
>> are more readable?
>> 
>>option.map { a => someFuncMakesB() }
>>  .getOrElse(b)
>> 
>>option.map { a => someFuncMakesB() }
>>  .orElse { a => otherDefaultB() }.get
>> 
>> 
>> On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra <
 m...@clearstorydata.com
>>> wrote:
>> 
>>> In code added to Spark over the past several months, I'm glad to
>>> see
> more
>>> use of `foreach`, `for`, `map` and `flatMap` over `Option` instead
>>> of
>>> pattern matching boilerplate.  There are opportunities to push
 `Option`
>>> idioms even further now that we are using Scala 2.10 in master,
>>> but I
>> want
>>> to discuss the issue here a little bit before committing code whose
> form
>>> may be a little unfamiliar to some Spark developers.
>>> 
>>> In particular, I really like the use of `fold` with `Option` to
 cleanly
>> and
>>> concisely express the "do something if the Option is None; do
 something
>>> else with the thing contained in the Option if it is Some" code
> fragment.
>>> 
>>> An example:
>>> 
>>> Instead of...
>>> 
>>> val driver = drivers.find(_.id == driverId)
>>> driver match {
>>>  case Some(d) =>
>>>if (waitingDrivers.contains(d)) { waitingDrivers -= d }
>>>else {
>>>  d.worker.foreach { w =>
>>>w.actor ! KillDriver(driverId)
>>>  }
>>>}
>>>val msg = s"Kill request for $driverId submitted"
>>>logInfo(msg)
>>>sender ! KillDriverResponse(true, msg)
>>>  case None =>
>>>

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Some other options would be:
1. Add another recommendation model based on mrec's SGD-based model: 
https://github.com/mendeley/mrec
2. Look at the streaming K-means from Mahout and see if that might be 
integrated or adapted into MLlib
3. Work on adding to or refactoring the existing linear model framework, for 
example adaptive learning rate schedules, adaptive norm stuff from John 
Langford et al
4. Adding sparse vector/matrix support to MLlib?

Sent from my iPad

> On 20 Dec 2013, at 3:46 AM, Tathagata Das  wrote:
> 
> +1 to that (assuming by 'online' Andrew meant MLLib algorithm from Spark
> Streaming)
> 
> Something you can look into is implementing a streaming KMeans. Maybe you
> can re-use a lot of the offline KMeans code in MLLib.
> 
> TD
> 
> 
>> On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash  wrote:
>> 
>> Sounds like a great choice.  It would be particularly impressive if you
>> could add the first online learning algorithm (all the current ones are
>> offline I believe) to pave the way for future contributions.
>> 
>> 
>> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah 
>> wrote:
>> 
>>> Thanks a lot everyone! I'm looking into adding an algorithm to MLib for
>> the
>>> project. Nice and self-contained.
>>> 
>>> -Matt Cheah
>>> 
>>> 
>>> On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen 
>>> wrote:
>>> 
>>>> +1 to most of Andrew's suggestions here, and while we're in that
>>>> neighborhood, how about generalizing something like "wtf-spark" (from
>> the
>>>> Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of
>> high
>>>> academic interest, but it's something people would use many times a
>>>> debugging day.
>>>> 
>>>> Or am I behind and something like that is already there in 0.8?
>>>> 
>>>> --
>>>> Christopher T. Nguyen
>>>> Co-founder & CEO, Adatao <http://adatao.com>
>>>> linkedin.com/in/ctnguyen
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash 
>>>> wrote:
>>>> 
>>>>> I think there are also some improvements that could be made to
>>>>> deployability in an enterprise setting.  From my experience:
>>>>> 
>>>>> 1. Most places I deploy Spark in don't have internet access.  So I
>>> can't
>>>>> build from source, compile against a different version of Hadoop, etc
>>>>> without doing it locally and then getting that onto my servers
>>> manually.
>>>>> This is less a problem with Spark now that there are binary
>>>> distributions,
>>>>> but it's still a problem for using Mesos with Spark.
>>>>> 2. Configuration of Spark is confusing -- you can make configuration
>> in
>>>>> Java system properties, environment variables, command line
>> parameters,
>>>> and
>>>>> for the standalone cluster deployment mode you need to worry about
>>>> whether
>>>>> these need to be set on the master, the worker, the executor, or the
>>>>> application/driver program.  Also because spark-shell automatically
>>>>> instantiates a SparkContext you have to set up any system properties
>> in
>>>> the
>>>>> init scripts or on the command line with
>>>>> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs
>> to
>>>> be
>>>>> done, but it feels that there are gains to be made in configuration
>>>> options
>>>>> here.  Ideally, I would have one configuration file that can be used
>> in
>>>> all
>>>>> 4 places and that's the only place to make configuration changes.
>>>>> 3. Standalone cluster mode could use improved resiliency for
>> starting,
>>>>> stopping, and keeping alive a service -- there are custom init
>> scripts
>>>> that
>>>>> call each other in a mess of ways: spark-shell, spark-daemon.sh,
>>>>> spark-daemons.sh, spark-config.sh, spark-env.sh,
>> compute-classpath.sh,
>>>>> spark-executor, spark-class, run-example, and several others in the
>>> bin/
>>>>> directory.  I would love it if Spark used the Tanuki Service Wrapper,
>>>> which
>>>>> is widely-used for Java service daemons, supports retries,
>> 

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Nick Pentreath
One third-party option that works nicely for the Hadoop project and its 
related projects is http://search-hadoop.com, managed by Sematext. Perhaps we 
can plead with Otis to add the Spark lists to search-spark.com, or to the existing 
site?

Just throwing it out there as a potential solution to at least searching and 
navigating the Apache lists

Sent from my iPad

> On 20 Dec 2013, at 6:46 AM, Aaron Davidson  wrote:
> 
> I'd be fine with one-way mirrors here (Apache threads being reflected in
> Google groups) -- I have no idea how one is supposed to navigate the Apache
> list to look for historic threads.
> 
> 
>> On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:
>> 
>> Thanks very much for the prompt and comprehensive reply!  I appreciate the
>> overarching desire to integrate with apache: I'm very happy to hear that
>> there's a move to use the existing groups as mirrors: that will overcome
>> all of my objections: particularly if it's bidirectional! :)
>> 
>> 
>>> On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
>>> 
>>> Hey Mike,
>>> 
>>> As you probably noticed when you CC'd spark-de...@googlegroups.com, that
>>> list has already be reconfigured so that it no longer allows posting (and
>>> bounces emails sent to it).
>>> 
>>> We will be doing the same thing to the spark...@googlegroups.com list
>>> too (we'll announce a date for that soon).
>>> 
>>> That may sound very frustrating, and you are *not* alone feeling that
>>> way. We've had a long conversation with our mentors about this, and I've
>>> felt very similar to you, so I'd like to give you background.
>>> 
>>> As I'm coming to see it, part of becoming an Apache project is moving the
>>> community *fully* over to Apache infrastructure, and more generally the
>>> Apache way of organizing the community.
>>> 
>>> This applies in both the nuts-and-bolts sense of being on apache infra,
>>> but possibly more importantly, it is also a guiding principle and way of
>>> thinking.
>>> 
>>> In various ways, moving to apache Infra can be a painful process, and IMO
>>> the loss of all the great mailing list functionality that comes with using
>>> Google Groups is perhaps the most painful step. But basically, the de facto
>>> mailing lists need to be the Apache ones, and not Google Groups. The
>>> underlying reason is that Apache needs to take full accountability for
>>> recording and publishing the mailing lists, it has to be able to
>>> institutionally guarantee this. This is because discussion on mailing lists
>>> is one of the core things that defines an Apache community. So at a minimum
>>> this means Apache owning the master copy of the bits.
>>> 
>>> All that said, we are discussing the possibility of having a google group
>>> that subscribes to each list that would provide an easier to use and
>>> prettier archive for each list (so far we haven't gotten that to work).
>>> 
>>> I hope this was helpful. It has taken me a few years now, and a lot of
>>> conversations with experienced (and patient!) Apache mentors, to
>>> internalize some of the nuance about "the Apache way". That's why I wanted
>>> to share.
>>> 
>>> Andy
>>> 
 On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
 
 I notice that there are still a lot of active topics in this group: and
 also activity on the apache mailing list (which is a really horrible
 experience!).  Is it a firm policy on apache's front to disallow external
 groups?  I'm going to be ramping up on spark, and I really hate the idea of
 having to rely on the apache archives and my mail client.  Also: having to
 search for topics/keywords both in old threads (here) as well as new
 threads in apache's (clunky) archive, is going to be a pain!  I almost feel
 like I must be missing something because the current solution seems
 unfeasibly awkward!
 
 --
 You received this message because you are subscribed to the Google
 Groups "Spark Users" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to spark-users...@googlegroups.com.
 
 For more options, visit https://groups.google.com/groups/opt_out.
>>> 
>>> 


Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Some good things to look at, though hopefully #2 will be largely addressed by: 
https://github.com/apache/incubator-spark/pull/230—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash  wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
> 1. Most places I deploy Spark in don't have internet access.  So I can't
> build from source, compile against a different version of Hadoop, etc
> without doing it locally and then getting that onto my servers manually.
>  This is less a problem with Spark now that there are binary distributions,
> but it's still a problem for using Mesos with Spark.
> 2. Configuration of Spark is confusing -- you can make configuration in
> Java system properties, environment variables, command line parameters, and
> for the standalone cluster deployment mode you need to worry about whether
> these need to be set on the master, the worker, the executor, or the
> application/driver program.  Also because spark-shell automatically
> instantiates a SparkContext you have to set up any system properties in the
> init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to be
> done, but it feels that there are gains to be made in configuration options
> here.  Ideally, I would have one configuration file that can be used in all
> 4 places and that's the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping alive a service -- there are custom init scripts that
> call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper, which
> is widely-used for Java service daemons, supports retries, installation as
> init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how do
> I keep a service running?" problem when it's been done so well by Tanuki --
> we use it at my day job for all our services, plus it's used by
> Elasticsearch.  This would help solve the problem where a quick bounce of
> the master causes all the workers to self-destruct.
> 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
> entirely an Akka bug based on previous mailing list discussion with Matei,
> but it'd be awesome if you could use either the hostname or the FQDN or the
> IP address in the Spark URL and not have Akka barf at you.
> I've been telling myself I'd look into these at some point but just haven't
> gotten around to them myself yet.  Some day!  I would prioritize these
> requests from most- to least-important as 3, 2, 4, 1.
> Andrew
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath 
> wrote:
>> Or, if you're extremely ambitious, work on implementing Spark Streaming in
>> Python—
>> Sent from Mailbox for iPhone
>>
>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
>> wrote:
>>
>> > Hi Matt,
>> > If you want to get started looking at Spark, I recommend the following
>> resources:
>> > - Our issue tracker at http://spark-project.atlassian.net contains some
>> issues marked “Starter” that are good places to jump into. You might be
>> able to take one of those and extend it into a bigger project.
>> > - The “contributing to Spark” wiki page covers how to send patches and
>> set up development:
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > - This talk has an intro to Spark internals (video and slides are in the
>> comments): http://www.meetup.com/spark-users/events/94101942/
>> > For a longer project, here are some possible ones:
>> > - Create a tool that automatically checks which Scala API methods are
>> missing in Python. We had a similar one for Java that was very useful. Even
>> better would be to automatically create wrappers for the Scala ones.
>> > - Extend the Spark monitoring UI with profiling information (to sample
>> the workers and say where they’re spending time, or what data structures
>> consume the most memory).
>> > - Pick and implement a new machine learning algorithm for MLlib.
>> > Matei
>> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah 
>> wrote:
>> >> Hi everyone,
>> >>
>> >> During my most recent internship, I worked extensively with Apache
>> Spark,
>> >> integrating it into a company's data analytics plat

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-12-19 Thread Nick Pentreath
Hi


I managed to find the time to put together a PR on this: 
https://github.com/apache/incubator-spark/pull/263




Josh has had a look over it - if anyone else with an interest could give some 
feedback that would be great.




As mentioned in the PR, it's more of an RFC and certainly still needs a bit of 
clean-up work, and I need to add the concept of "wrapper functions" to 
deserialize classes that MsgPack can't handle out of the box.
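
Roughly, the idea behind a "wrapper function" is a small per-type converter
applied on the Scala side before handing keys/values over to Python. The names
and shape below are illustrative only, not the PR's actual API:

import org.apache.hadoop.io.{IntWritable, Text}

type Wrapper[T] = T => String

val textWrapper: Wrapper[Text] = _.toString
val intWrapper: Wrapper[IntWritable] = w => w.get.toString

// convert a Writable key/value pair into the String form that gets pickled for Python
def convert[K, V](kv: (K, V), wk: Wrapper[K], wv: Wrapper[V]): (String, String) =
  (wk(kv._1), wv(kv._2))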




N
—
Sent from Mailbox for iPhone

On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath 
wrote:

> Wow Josh, that looks great. I've been a bit swamped this week but as soon
> as I get a chance I'll test out the PR in more detail and port over the
> InputFormat stuff to use the new framework (including the changes you
> suggested).
> I can then look deeper into the MsgPack functionality to see if it can be
> made to work in a generic enough manner without requiring huge amounts of
> custom Templates to be written by users.
> Will feed back asap.
> N
> On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen  wrote:
>> I opened a pull request to add custom serializer support to PySpark:
>> https://github.com/apache/incubator-spark/pull/146
>>
>> My pull request adds the plumbing for transferring data from Java to Python
>> using formats other than Pickle.  For example, look at how textFile() uses
>> MUTF8Deserializer to read strings from Java.  Hopefully this provides all
>> of the functionality needed to support MsgPack.
>>
>> - Josh
>>
>>
>> On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen  wrote:
>>
>> > Hi Nick,
>> >
>> > This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
>> > and newHadoopFileAsText() methods inside PythonRDD instead of adding them
>> > to JavaSparkContext, since I think these methods are unlikely to be used
>> > directly by Java users (you can add these methods to the PythonRDD
>> > companion object, which is how readRDDFromPickleFile is implemented:
>> >
>> https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
>> > )
>> >
>> > For MsgPack, the UnpicklingError is because the Python worker expects to
>> > receive its input in a pickled format.  In my prototype of custom
>> > serializers, I modified the PySpark worker to receive its
>> > serialization/deserialization function as input (
>> >
>> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
>> )
>> > and added logic to pass the appropriate serializers based on each stage's
>> > input and output formats (
>> >
>> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
>> > ).
>> >
>> > At some point, I'd like to port my custom serializers code to PySpark; if
>> > anyone's interested in helping, I'd be glad to write up some additional
>> > notes on how this should work.
>> >
>> > - Josh
>> >
>> > On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath <
>> nick.pentre...@gmail.com>wrote:
>> >
>> >> Thanks Josh, Patrick for the feedback.
>> >>
>> >> Based on Josh's pointers I have something working for JavaPairRDD ->
>> >> PySpark RDD[(String, String)]. This just calls the toString method on
>> each
>> >> key and value as before, but without the need for a delimiter. For
>> >> SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
>> >> toString to convert to Text for keys and values. We then call toString
>> >> (again) ourselves to get Strings to feed to writeAsPickle.
>> >>
>> >> Details here: https://gist.github.com/MLnick/7230588
>> >>
>> >> This also illustrates where the "wrapper function" api would fit in. All
>> >> that is required is to define a T => String for key and value.
>> >>
>> >> I started playing around with MsgPack and can sort of get things to work
>> >> in
>> >> Scala, but am struggling with getting the raw bytes to be written
>> properly
>> >> in PythonRDD (I think it is treating them as pickled byte arrays when
>> they
>> >> are not, but when I removed the 'stripPickle' calls and amended the
>> length
>> >> (-6) I got "UnpicklingError: invalid load key, ' '. ").
>> >>
>> >> Another issue is that MsgPack does well at

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Or, if you're extremely ambitious, work on implementing Spark Streaming in Python—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
wrote:

> Hi Matt,
> If you want to get started looking at Spark, I recommend the following 
> resources:
> - Our issue tracker at http://spark-project.atlassian.net contains some 
> issues marked “Starter” that are good places to jump into. You might be able 
> to take one of those and extend it into a bigger project.
> - The “contributing to Spark” wiki page covers how to send patches and set up 
> development: 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> - This talk has an intro to Spark internals (video and slides are in the 
> comments): http://www.meetup.com/spark-users/events/94101942/
> For a longer project, here are some possible ones:
> - Create a tool that automatically checks which Scala API methods are missing 
> in Python. We had a similar one for Java that was very useful. Even better 
> would be to automatically create wrappers for the Scala ones.
> - Extend the Spark monitoring UI with profiling information (to sample the 
> workers and say where they’re spending time, or what data structures consume 
> the most memory).
> - Pick and implement a new machine learning algorithm for MLlib.
> Matei
> On Dec 17, 2013, at 10:43 AM, Matthew Cheah  wrote:
>> Hi everyone,
>> 
>> During my most recent internship, I worked extensively with Apache Spark,
>> integrating it into a company's data analytics platform. I've now become
>> interested in contributing to Apache Spark.
>> 
>> I'm returning to undergraduate studies in January and there is an academic
>> course which is simply a standalone software engineering project. I was
>> thinking that some contribution to Apache Spark would satisfy my curiosity,
>> help continue support the company I interned at, and give me academic
>> credits required to graduate, all at the same time. It seems like too good
>> an opportunity to pass up.
>> 
>> With that in mind, I have the following questions:
>> 
>>   1. At this point, is there any self-contained project that I could work
>>   on within Spark? Ideally, I would work on it independently, in about a
>>   three month time frame. This time also needs to accommodate ramping up on
>>   the Spark codebase and adjusting to the Scala programming language and
>>   paradigms. The company I worked at primarily used the Java APIs. The output
>>   needs to be a technical report describing the project requirements, and the
>>   design process I took to engineer the solution for the requirements. In
>>   particular, it cannot just be a series of haphazard patches.
>>   2. How can I get started with contributing to Spark?
>>   3. Is there a high-level UML or some other design specification for the
>>   Spark architecture?
>> 
>> Thanks! I hope to be of some help =)
>> 
>> -Matt Cheah

Re: Intellij IDEA build issues

2013-12-16 Thread Nick Pentreath
Thanks Evan, I tried it and the new SBT direct import seems to work well,
though I did run into issues with some yarn imports on Spark.

n


On Thu, Dec 12, 2013 at 7:03 PM, Evan Chan  wrote:

> Nick, have you tried using the latest Scala plug-in, which features native
> SBT project imports?   ie you no longer need to run gen-idea.
>
>
> On Sat, Dec 7, 2013 at 4:15 AM, Nick Pentreath  >wrote:
>
> > Hi Spark Devs,
> >
> > Hoping someone can help me out. No matter what I do, I cannot get
> Intellij
> > to build Spark from source. I am using IDEA 13. I run sbt gen-idea and
> > everything seems to work fine.
> >
> > When I try to build using IDEA, everything compiles but I get the error
> > below.
> >
> > Have any of you come across the same?
> >
> > ==
> >
> > Internal error: (java.lang.AssertionError)
> > java/nio/channels/FileChannel$MapMode already declared as
> > ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
> > java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already
> > declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
> > at
> >
> >
> ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
> > at
> >
> >
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> > at scala.collection.immutable.List.foreach(List.scala:76)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
> > at
> >
> >
> scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
> > at
> >
> >
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
> > at
> >
> >
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> > at
> scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> > at
> >
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
> > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
> > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
> > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
> > at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
> > at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
> > at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
> > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:601)
> > at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
> > at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
> > at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
> > at
> >
> >
> sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
> > at
> >
> >
> sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
> > at
> >
> >
> sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
> > at
> >
> >
> sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
> > at
> >
> >
> sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$

Re: Scala 2.10 Merge

2013-12-14 Thread Nick Pentreath
Whoohoo!

Great job everyone especially Prashant!

—
Sent from Mailbox for iPhone

On Sat, Dec 14, 2013 at 10:59 AM, Patrick Wendell 
wrote:

> Alright I just merged this in - so Spark is officially "Scala 2.10"
> from here forward.
> For reference I cut a new branch called scala-2.9 with the commit
> immediately prior to the merge:
> https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=shortlog;h=refs/heads/scala-2.9
> - Patrick
> On Thu, Dec 12, 2013 at 8:26 PM, Patrick Wendell  wrote:
>> Hey Reymond,
>>
>> Let's move this discussion out of this thread and into the associated JIRA.
>> I'll write up our current approach over there.
>>
>> https://spark-project.atlassian.net/browse/SPARK-995
>>
>> - Patrick
>>
>>
>> On Thu, Dec 12, 2013 at 5:56 PM, Liu, Raymond  wrote:
>>>
>>> Hi Patrick
>>>
>>> So what's the plan for supporting YARN 2.2 in 0.9? As far as I can
>>> see, if you want to support both 2.2 and 2.0, then due to the protobuf version
>>> incompatibility issue you need two versions of Akka anyway.
>>>
>>> Akka 2.3-M1 looks like it has a few API changes; we
>>> could probably isolate the code like we did for the YARN part of the API. I
>>> remember it was mentioned that using reflection for the different APIs is
>>> preferred. So the purpose of using reflection is to use one release bin jar to
>>> support both versions of Hadoop/YARN at runtime, instead of building different
>>> bin jars at compile time?
>>>
>>>  Then all code related to Hadoop will also be built in separate
>>> modules for loading on demand? This sounds to me like it involves a lot of work. And
>>> you still need to have a shim layer and separate code for the different API
>>> versions, and depend on different Akka versions etc. That sounds like an even stricter
>>> requirement than our current approach on master, with a dynamic class
>>> loader in addition. And the problems we are facing now are still there?
>>>
>>> Best Regards,
>>> Raymond Liu
>>>
>>> -Original Message-
>>> From: Patrick Wendell [mailto:pwend...@gmail.com]
>>> Sent: Thursday, December 12, 2013 5:13 PM
>>> To: dev@spark.incubator.apache.org
>>> Subject: Re: Scala 2.10 Merge
>>>
>>> Also - the code is still there because of a recent merge that took in some
>>> newer changes... we'll be removing it for the final merge.
>>>
>>>
>>> On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell 
>>> wrote:
>>>
>>> > Hey Raymond,
>>> >
>>> > This won't work because AFAIK akka 2.3-M1 is not binary compatible
>>> > with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need
>>> > to still use the older protobuf library, so we'd need to support both.
>>> >
>>> > I'd also be concerned about having a reference to a non-released
>>> > version of akka. Akka is the source of our hardest-to-find bugs and
>>> > simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting.
>>> > Of course, if you are building off of master you can maintain a fork
>>> > that uses this.
>>> >
>>> > - Patrick
>>> >
>>> >
>>> > On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond
>>> > wrote:
>>> >
>>> >> Hi Patrick
>>> >>
>>> >> What does that mean for dropping YARN 2.2? The code seems to still be
>>> >> there. You mean that if we build against 2.2 it will break and won't work,
>>> >> right?
>>> >> Since the home-made Akka build on Scala 2.10 isn't there, can we in
>>> >> that case just use Akka 2.3-M1, which runs on protobuf 2.5, as a
>>> >> replacement?
>>> >>
>>> >> Best Regards,
>>> >> Raymond Liu
>>> >>
>>> >>
>>> >> -Original Message-
>>> >> From: Patrick Wendell [mailto:pwend...@gmail.com]
>>> >> Sent: Thursday, December 12, 2013 4:21 PM
>>> >> To: dev@spark.incubator.apache.org
>>> >> Subject: Scala 2.10 Merge
>>> >>
>>> >> Hi Developers,
>>> >>
>>> >> In the next few days we are planning to merge Scala 2.10 support into
>>> >> Spark. For those that haven't been following this, Prashant Sharma
>>> >> has been maintaining the scala-2.10 branch of Spark for several
>>> >> months. This branch is current with master and has been reviewed for
>>> >> merging:
>>> >>
>>> >> https://github.com/apache/incubator-spark/tree/scala-2.10
>>> >>
>>> >> Scala 2.10 support is one of the most requested features for Spark -
>>> >> it will be great to get this into Spark 0.9! Please note that *Scala
>>> >> 2.10 is not binary compatible with Scala 2.9*. With that in mind, I
>>> >> wanted to give a few heads-up/requests to developers:
>>> >>
>>> >> If you are developing applications on top of Spark's master branch,
>>> >> those will need to migrate to Scala 2.10. You may want to download
>>> >> and test the current scala-2.10 branch in order to make sure you will
>>> >> be okay as Spark developments move forward. Of course, you can always
>>> >> stick with the current master commit and be fine (I'll cut a tag when
>>> >> we do the merge in order to delineate where the version changes).
>>> >> Please open new threads on the dev list to report and discuss any
>>> >> issues.
>>> >>
>>

Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)

2013-12-11 Thread Nick Pentreath
   - Successfully built via sbt/sbt assembly/assembly on Mac OS X, as well
   as on a dev Ubuntu EC2 box
   - Successfully tested via sbt/sbt test locally
   - Successfully built and tested using mvn package locally
   - I've tested my own Spark jobs (built against 0.8.0-incubating) on this
   RC and all works fine, as well as tested with my job server (also built
   against 0.8.0-incubating)
   - Ran a few spark examples and the shell and PySpark shell
   - For my part, tested the MLlib implicit code I added, and checked docs


I'm +1


On Wed, Dec 11, 2013 at 11:04 AM, Prashant Sharma wrote:

> I hope this PR https://github.com/apache/incubator-spark/pull/252 can
> help.
> Again this is not a blocker for the release from my side either.
>
>
> On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra  >wrote:
>
> > Interesting, and confirmed: On my machine where `./sbt/sbt assembly`
> takes
> > a long, long, long time to complete (a MBP, in my case), building
> three
> > separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt
> > examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less
> time.
> >
> >
> >
> > On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma  > >wrote:
> >
> > > forgot to mention, after running sbt/sbt assembly/assembly running
> > sbt/sbt
> > > examples/assembly takes just 37s. Not to mention my hardware is not
> > really
> > > great.
> > >
> > >
> > > On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma  > > >wrote:
> > >
> > > > Hi Patrick and Matei,
> > > >
> > > > I was trying this out and followed the quick start guide, which says to do
> > > > sbt/sbt assembly; like a few others I was also stuck for a few minutes on
> > > > Linux. On the other hand, if I use sbt/sbt assembly/assembly it is
> much
> > > > faster.
> > > >
> > > > Should we change the documentation to reflect this? It will not be
> > great
> > > > for first-time users to get stuck there.
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia <
> > matei.zaha...@gmail.com
> > > >wrote:
> > > >
> > > >> +1
> > > >>
> > > >> Built and tested it on Mac OS X.
> > > >>
> > > >> Matei
> > > >>
> > > >>
> > > >> On Dec 10, 2013, at 4:49 PM, Patrick Wendell 
> > > wrote:
> > > >>
> > > >> > Please vote on releasing the following candidate as Apache Spark
> > > >> > (incubating) version 0.8.1.
> > > >> >
> > > >> > The tag to be voted on is v0.8.1-incubating (commit b87d31d):
> > > >> >
> > > >>
> > >
> >
> https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e
> > > >> >
> > > >> > The release files, including signatures, digests, etc can be found
> > at:
> > > >> > http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/
> > > >> >
> > > >> > Release artifacts are signed with the following key:
> > > >> > https://people.apache.org/keys/committer/pwendell.asc
> > > >> >
> > > >> > The staging repository for this release can be found at:
> > > >> >
> > > https://repository.apache.org/content/repositories/orgapachespark-040/
> > > >> >
> > > >> > The documentation corresponding to this release can be found at:
> > > >> >
> http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/
> > > >> >
> > > >> > For information about the contents of this release see:
> > > >> >
> > > >>
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8
> > > >> >
> > > >> > Please vote on releasing this package as Apache Spark
> > > 0.8.1-incubating!
> > > >> >
> > > >> > The vote is open until Saturday, December 14th at 01:00 UTC and
> > > >> > passes if a majority of at least 3 +1 PPMC votes are cast.
> > > >> >
> > > >> > [ ] +1 Release this package as Apache Spark 0.8.1-incubating
> > > >> > [ ] -1 Do not release this package because ...
> > > >> >
> > > >> > To learn more about Apache Spark, please see
> > > >> > http://spark.incubator.apache.org/
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > > s
> > > >
> > >
> > >
> > >
> > > --
> > > s
> > >
> >
>
>
>
> --
> s
>


Intellij IDEA build issues

2013-12-07 Thread Nick Pentreath
Hi Spark Devs,

Hoping someone can help me out. No matter what I do, I cannot get Intellij
to build Spark from source. I am using IDEA 13. I run sbt gen-idea and
everything seems to work fine.

When I try to build using IDEA, everything compiles but I get the error
below.

Have any of you come across the same?

==

Internal error: (java.lang.AssertionError)
java/nio/channels/FileChannel$MapMode already declared as
ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already
declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
at
ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
at
scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
at
scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at
scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at
scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at
scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:105)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at
sbt.compiler.AggressiveCompile$$anonfun$6.compileScala$1(AggressiveCompile.scala:102)
at
sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:151)
at
sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:89)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:39)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:37)
at sbt.inc.Incremental$.cycle(Incremental.scala:75)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:34)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:33)
at sbt.inc.Incremental$.manageClassfiles(Incremental.scala:42)
at sbt.inc.Incremental$.compile(Incremental.scala:33)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:27)
at sbt.compiler.AggressiveCompile.compile2(AggressiveCompile.scala:164)
at sbt.compiler.AggressiveCompile.compile1(AggressiveCompile.scala:73)
at
org.jetbrains.jps.incremental.scala.local.CompilerImpl.compile(CompilerImpl.scala:61)
at
org.jetbrains.j

PySpark - Dill serialization

2013-12-05 Thread Nick Pentreath
Hi devs

I came across Dill (
http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python
serialization. I was wondering if it might be a replacement for the cloudpickle
stuff (and remove that piece of code that needs to be maintained within
PySpark)?
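
As a rough illustration of the appeal (a minimal sketch, not something I've
wired into PySpark): dill can serialize the lambdas and closures that the
standard pickle module rejects, which is essentially the job cloudpickle does
for us today:

import pickle
import dill

def make_adder(n):
    # a closure over n - the kind of thing we ship to workers
    return lambda x: x + n

add_five = make_adder(5)

try:
    pickle.dumps(add_five)      # standard pickle rejects lambdas/closures
except Exception as e:
    print("pickle failed:", e)

payload = dill.dumps(add_five)  # dill handles it
restored = dill.loads(payload)
print(restored(10))             # -> 15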

Josh have you looked into Dill? Any thoughts?

N


PySpark / scikit-learn integration sprint at Cloudera - Strata Conference Friday 14th Feb 2014

2013-12-02 Thread Nick Pentreath
Hi Spark Devs

An idea developed recently out of a scikit-learn mailing list discussion (
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5HGKYH9Myp7imrJ-nU%3DpJgeGqcCn3JC0m4MmGWZi35Hw%40mail.gmail.com&forum_name=scikit-learn-general)
to have a coding sprint around Strata in Feb, focused on integration
between scikit-learn and PySpark for large-scale machine learning tasks.

Cloudera has kindly agreed to host the sprint, most likely in San
Francisco. Ideally it would be focused and capped at around 10 people. The
idea is not a teaching workshop for newcomers but rather a prototyping
session, so it would be great to have developers and users with deep
knowledge of PySpark (Josh especially :) and/or scikit-learn attend.

Hopefully we can get some people from the Spark community involved, and
Olivier will drum up support from the scikit-learn community.

All the best and hope to see you there (though likely I will only be able
to join remotely).
Nick


Re: [Scikit-learn-general] Spark-backed implementations of scikit-learn estimators

2013-11-26 Thread Nick Pentreath
CC'ing Spark Dev list

I have been thinking about this for quite a while and would really love to
see this happen.

Most of my pipeline ends up in Scala/Spark these days - which I love, though
partly it is because I am reliant on custom Hadoop input formats that are
just way easier to use from Scala/Java - but I still use Python a lot for
data analysis and interactive work. There is some good stuff happening with
Breeze in Scala and MLlib in Spark (and IScala), but the breadth just
doesn't compare as yet - not to mention IPython and plotting!

There is a PR that was just merged into PySpark to allow arbitrary
serialization protocols between the Java and Python layers. I hope to try
to use this to allow PySpark users to pull data from arbitrary Hadoop
InputFormats with minimum fuss. This I believe will open the way for many
(including me!) to use PySpark directly for virtually all distributed data
processing without "needing" to use Java (
https://github.com/apache/incubator-spark/pull/146) (
http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201311.mbox/browser
).

Linked to this, I believe there is huge potential to add distributed
PySpark versions of many algorithms in scikit-learn (and elsewhere). The
idea, as intimated above, would be to have estimator classes with
sklearn-compatible APIs. They may in turn use sklearn algorithms themselves
(e.g. this shows how easy it would be for linear models:
https://gist.github.com/MLnick/4707012).
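
To make that concrete, here is a very rough sketch (untested, and the class
name is purely illustrative) of what such an estimator might look like - an
sklearn-compatible fit/predict class that trains scikit-learn SGD models on
the partitions of a PySpark RDD and simply averages the coefficients:

import numpy as np
from sklearn.linear_model import SGDClassifier

class SparkSGDClassifier(object):
    """Sketch of an sklearn-style estimator backed by a PySpark RDD."""

    def __init__(self, classes, **sgd_params):
        self.classes = np.asarray(classes)
        self.sgd_params = sgd_params

    def fit(self, rdd):
        # rdd: RDD of (feature_vector, label) pairs with numpy feature vectors;
        # assumes no empty partitions, just to keep the sketch short
        classes, params = self.classes, self.sgd_params

        def train_partition(iterator):
            X, y = zip(*iterator)
            clf = SGDClassifier(**params)
            clf.partial_fit(np.array(X), np.array(y), classes=classes)
            yield clf

        models = rdd.mapPartitions(train_partition).collect()
        # naive parameter mixing: average coefficients across partition models
        self.model_ = models[0]
        self.model_.coef_ = np.mean([m.coef_ for m in models], axis=0)
        self.model_.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
        return self

    def predict(self, X):
        return self.model_.predict(X)

Parameter averaging is of course the crudest possible strategy, but it shows
how little glue is needed to put an sklearn-style API over an RDD, and the
same shell would work for smarter approaches.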

I'd be very happy to try to find some time to work on such a library (I had
started one in Scala that was going to contain a Python library also, but
I've just not had the time available and with Spark MLlib appearing and the
Hadoop stuff I had what I needed for my systems).

The main benefit I see is that sklearn already has:
- many algorithms to work with
- great, simple API
- very useful stuff like preprocessing, vectorizing and feature hashing
(very important for large scale linear models)
- obviously the nice Python ecosystem stuff like plotting, IPython
notebook, pandas, scikit-statsmodels and so on.

The easiest place to start in my opinion is to take a few of the basic
models in the PySpark examples and turn them into production-ready code
that utilises sklearn or other good libraries as much as possible.

(I think this library would live outside of both Spark and sklearn, at
least until it is clear where it should live).

I would be happy to help and provide Spark-related advice even if I cannot
find enough time to work on many algorithms. Though I do hope to find more
time toward the end of the year and early next year.

N


On Wed, Nov 27, 2013 at 12:42 AM, Uri Laserson wrote:

> Hi all,
>
> I was wondering whether there has been any organized effort to create
> scikit-learn estimators that are backed by Spark clusters.  Rather than
> using the PySpark API to call sklearn functions, you would instantiate
> sklearn estimators that end up calling PySpark functionality in their
> .fit() methods.
>
> Uri
>
> ..
> Uri Laserson
> +1 617 910 0447
> uri.laser...@gmail.com
>
>
> ___
> Scikit-learn-general mailing list
> scikit-learn-gene...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-11-08 Thread Nick Pentreath
Wow Josh, that looks great. I've been a bit swamped this week but as soon
as I get a chance I'll test out the PR in more detail and port over the
InputFormat stuff to use the new framework (including the changes you
suggested).

I can then look deeper into the MsgPack functionality to see if it can be
made to work in a generic enough manner without requiring huge amounts of
custom Templates to be written by users.

Will feed back asap.
N


On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen  wrote:

> I opened a pull request to add custom serializer support to PySpark:
> https://github.com/apache/incubator-spark/pull/146
>
> My pull request adds the plumbing for transferring data from Java to Python
> using formats other than Pickle.  For example, look at how textFile() uses
> MUTF8Deserializer to read strings from Java.  Hopefully this provides all
> of the functionality needed to support MsgPack.
>
> - Josh
>
>
> On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen  wrote:
>
> > Hi Nick,
> >
> > This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
> > and newHadoopFileAsText() methods inside PythonRDD instead of adding them
> > to JavaSparkContext, since I think these methods are unlikely to be used
> > directly by Java users (you can add these methods to the PythonRDD
> > companion object, which is how readRDDFromPickleFile is implemented:
> >
> https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
> > )
> >
> > For MsgPack, the UnpicklingError is because the Python worker expects to
> > receive its input in a pickled format.  In my prototype of custom
> > serializers, I modified the PySpark worker to receive its
> > serialization/deserialization function as input (
> >
> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
> )
> > and added logic to pass the appropriate serializers based on each stage's
> > input and output formats (
> >
> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
> > ).
> >
> > At some point, I'd like to port my custom serializers code to PySpark; if
> > anyone's interested in helping, I'd be glad to write up some additional
> > notes on how this should work.
> >
> > - Josh
> >
> > On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath <
> nick.pentre...@gmail.com>wrote:
> >
> >> Thanks Josh, Patrick for the feedback.
> >>
> >> Based on Josh's pointers I have something working for JavaPairRDD ->
> >> PySpark RDD[(String, String)]. This just calls the toString method on
> each
> >> key and value as before, but without the need for a delimiter. For
> >> SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
> >> toString to convert to Text for keys and values. We then call toString
> >> (again) ourselves to get Strings to feed to writeAsPickle.
> >>
> >> Details here: https://gist.github.com/MLnick/7230588
> >>
> >> This also illustrates where the "wrapper function" api would fit in. All
> >> that is required is to define a T => String for key and value.
> >>
> >> I started playing around with MsgPack and can sort of get things to work
> >> in
> >> Scala, but am struggling with getting the raw bytes to be written
> properly
> >> in PythonRDD (I think it is treating them as pickled byte arrays when
> they
> >> are not, but when I removed the 'stripPickle' calls and amended the
> length
> >> (-6) I got "UnpicklingError: invalid load key, ' '. ").
> >>
> >> Another issue is that MsgPack does well at writing "structures" - like
> >> Java
> >> classes with public fields that are fairly simple - but for example the
> >> Writables have private fields so you end up with nothing being written.
> >> This looks like it would require custom "Templates" (serialization
> >> functions effectively) for many classes, which means a lot of custom
> code
> >> for a user to write to use it. Fortunately for most of the common
> >> Writables
> >> a toString does the job. Will keep looking into it though.
> >>
> >> Anyway, Josh if you have ideas or examples on the "Wrapper API from
> >> Python"
> >> that you mentioned, I'd be interested to hear them.
> >>
> >> If you think this is worth working up as a Pull Request covering

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-10-30 Thread Nick Pentreath
Thanks Josh, Patrick for the feedback.

Based on Josh's pointers I have something working for JavaPairRDD ->
PySpark RDD[(String, String)]. This just calls the toString method on each
key and value as before, but without the need for a delimiter. For
SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
toString to convert to Text for keys and values. We then call toString
(again) ourselves to get Strings to feed to writeAsPickle.

Details here: https://gist.github.com/MLnick/7230588

This also illustrates where the "wrapper function" api would fit in. All
that is required is to define a T => String for key and value.

I started playing around with MsgPack and can sort of get things to work in
Scala, but am struggling with getting the raw bytes to be written properly
in PythonRDD (I think it is treating them as pickled byte arrays when they
are not, but when I removed the 'stripPickle' calls and amended the length
(-6) I got "UnpicklingError: invalid load key, ' '. ").

Another issue is that MsgPack does well at writing "structures" - like Java
classes with public fields that are fairly simple - but for example the
Writables have private fields so you end up with nothing being written.
This looks like it would require custom "Templates" (serialization
functions effectively) for many classes, which means a lot of custom code
for a user to write to use it. Fortunately for most of the common Writables
a toString does the job. Will keep looking into it though.

Anyway, Josh if you have ideas or examples on the "Wrapper API from Python"
that you mentioned, I'd be interested to hear them.

If you think this is worth working up as a Pull Request covering
SequenceFiles and custom InputFormats with default toString conversions and
the ability to specify Wrapper functions, I can clean things up more, add
some functionality and tests, and also test to see if common things like
the "normal" Writables and reading from things like HBase and Cassandra can
be made to work nicely (any other common use cases that you think make
sense?).

Thoughts, comments etc welcome.

Nick



On Fri, Oct 25, 2013 at 11:03 PM, Patrick Wendell wrote:

> As a starting point, a version where people just write their own "wrapper"
> functions to convert various HadoopFiles into String  files could go
> a long way. We could even have a few built-in versions, such as dealing
> with Sequence files that are . Basically, the user needs to
> write a translator in Java/Scala that produces textual records from
> whatever format that want. Then, they make sure this is included in the
> classpath when running PySpark.
>
> As Josh is saying, I'm pretty sure this is already possible, but we may
> want to document it for users. In many organizations they might have 1-2
> people who can write the Java/Scala to do this but then many more people
> who are comfortable using python once it's setup.
>
> - Patrick
>
> On Fri, Oct 25, 2013 at 11:00 AM, Josh Rosen  wrote:
>
> > Hi Nick,
> >
> > I've seen several requests for SequenceFile support in PySpark, so
> there's
> > definitely demand for this feature.
> >
> > I like the idea of passing MsgPack'ed data (or some other structured
> > format) from Java to the Python workers.  My early prototype of custom
> > serializers (described at
> >
> >
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals#PySparkInternals-customserializers
> > )
> > might be useful for implementing this.  Proper custom serializer support
> > would handle the bookkeeping for tracking each stage's input and output
> > formats and supplying the appropriate deserialization functions to the
> > Python worker, so the Python worker would be able to directly read the
> > MsgPack'd data that's sent to it.
> >
> > Regarding a wrapper API, it's actually possible to initially transform
> data
> > using Scala/Java and perform the remainder of the processing in PySpark.
> >  This involves adding the appropriate compiled to the Java classpath and
> a
> > bit of work in Py4J to create the Java/Scala RDD and wrap it for use by
> > PySpark.  I can hack together a rough example of this if anyone's
> > interested, but it would need some work to be developed into a
> > user-friendly API.
> >
> > If you wanted to extend your proof-of-concept to handle the cases where
> > keys and values have parseable toString() values, I think you could
> remove
> > the need for a delimiter by creating a PythonRDD from the newHadoopFile
> > JavaPairRDD and adding a new method to writeAsPickle (
> >
> >
> https://github.com/apache/incubator-

Julia bindings

2013-10-24 Thread Nick Pentreath
Hi Spark Devs

If you could pick one language binding to add to Spark what would it be?
Probably Clojure or JRuby if JVM is of interest.

I'm quite excited about Julia as a language for scientific computing (
http://julialang.org). The Julia community have been very focused on things
like interop with R, Matlab, and probably mostly Python (see
https://github.com/stevengj/PyCall.jl and
https://github.com/stevengj/PyPlot.jl for example).

Anyway, this is a bit of a thought experiment but I'd imagine a Julia API
would be similar in principle to the Python API. On the Spark Java side, it
would likely be almost the same. On the Julia side I'd imagine the major
sticking point would be serialisation (eg PyCloud equivalent code).

I actually played around with PyCall and was able to call PySpark from the
Julia console. You're able to run arbitrary Python PySpark code (though the
syntax is a bit ugly) and it seemed to mostly work.

However, when I tried to pass in a Julia function or closure, it failed at
the serialization step.

So one option would be to figure out how to serialize the required things
on the Julia side and to use PyCall for interop. This could add a fair bit
of overhead (Julia <-> Python <-> Java), so it is perhaps not worth it, but
the idea of being able to use Spark for the distributed computing part while
mixing and matching Python and Julia code/libraries for things like
stats/machine learning is very appealing!

Thoughts?

Nick


[PySpark]: reading arbitrary Hadoop InputFormats

2013-10-24 Thread Nick Pentreath
Hi Spark Devs

I was wondering what appetite there may be to add the ability for PySpark
users to create RDDs from (somewhat) arbitrary Hadoop InputFormats.

In my data pipeline for example, I'm currently just using Scala (partly
because I love it but also because I am heavily reliant on quite custom
Hadoop InputFormats for reading data). However, many users may prefer to
use PySpark as much as possible (if not for everything). Reasons might
include the need to use some Python library. While I don't do it yet, I can
certainly see an attractive use case for using say scikit-learn / numpy to
do data analysis & machine learning in Python. Added to this, my cofounder
knows Python well but not Scala, so it can be very beneficial to do a lot of
stuff in Python.

For text-based data this is fine, but reading data in from more complex
Hadoop formats is an issue.

The current approach would of course be to write an ETL-style Java/Scala
job and then process in Python. Nothing wrong with this, but I was thinking
about ways to allow Python to access arbitrary Hadoop InputFormats.

Here is a quick proof of concept: https://gist.github.com/MLnick/7150058

This works for simple stuff like SequenceFile with simple Writable
key/values.

To work with more complex files, perhaps an approach is to manipulate
Hadoop JobConf via Python and pass that in. The one downside is of course
that the InputFormat (well actually the Key/Value classes) must have a
toString that makes sense so very custom stuff might not work.

I wonder if it would be possible to take the objects that are yielded via
the InputFormat and convert them into some representation like ProtoBuf,
MsgPack, Avro, JSON, that can be read relatively more easily from Python?

Another approach could be to allow a simple "wrapper API" such that one can
write a wrapper function T => String and pass that into an
InputFormatWrapper that takes an arbitrary InputFormat and yields Strings
for the keys and values. Then all that is required is to compile that
function and add it to the SPARK_CLASSPATH and away you go!
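
To illustrate the kind of user experience I'm after from the Python side
(purely a hypothetical sketch - none of this API exists yet, and the method
names are just made up for illustration):

from pyspark import SparkContext

sc = SparkContext("local", "InputFormat test")

# hypothetical: read a SequenceFile, keys/values converted via toString
rdd = sc.sequenceFileAsText("hdfs://path/to/seqfile")
print(rdd.take(5))   # an RDD of (str, str) pairs

# hypothetical: arbitrary InputFormat, optionally with a user-supplied
# T => String wrapper compiled and added to SPARK_CLASSPATH
other = sc.newHadoopFileAsText("hdfs://path/to/data",
                               "com.example.MyCustomInputFormat")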

Thoughts?

Nick


Re: Propose to Re-organize the scripts and configurations

2013-09-16 Thread Nick Pentreath
There was another discussion on the old dev list about this:
https://groups.google.com/forum/#!msg/spark-developers/GL2_DwAeh5s/9rwQ3iDa2t4J

I tend to agree with having configuration sitting in JSON (or properties
files) and using the Typesafe Config library which can parse both.

Something I've used in my apps is along these lines:
https://gist.github.com/MLnick/6578146

It's then easy to have the default config overridden from the CLI, for example:
val conf = cliConf.withFallback(defaultConf)

I'd be happy to be involved in working on this if there is a consensus on the
best approach.

N





On Mon, Sep 16, 2013 at 9:29 AM, Mike  wrote:

> Shane Huang wrote:
> > we found the current organization of the scripts and configuration a
> > bit confusing and inconvenient
>
> ditto
>
> > - Scripts
>
> I wonder why the work of these scripts wasn't mostly done in Scala.
> Seems roundabout to use Bash (or Python, in spark-perf) to calculate
> shell environment variables that are then read back into Scala code.
>
> > 1. Define a Configuration class which contains all the options
> > available for Spark application. A Configuration instance can be
> > de-/serialized from/to a json formatted file.
> > 2. Each application (SparkContext) has one Configuration instance and
> > it is initialized by the application which creates it (either read
> > from file or passed from command line options or env SPARK_JAVA_OPTS).
>
> Reminiscent of what Hibernate's been doing for the past decade.  Would
> be nice if the Configuration was also exposed through an MBean or such
> so that one can check its values with certainty.
>


Re: MLI dependency exception

2013-09-11 Thread Nick Pentreath
Is MLI available? Where is the repo located?

—
Sent from Mailbox for iPhone

On Tue, Sep 10, 2013 at 10:45 PM, Gowtham N 
wrote:

> It worked.
> I was using an old master of Spark, which I forked many days ago.
> On Tue, Sep 10, 2013 at 1:25 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>> For some more notes on how to debug this: After you do publish-local in
>> Spark, you should have a file in ~/.ivy2 that you can check for using
>> `ls
>> ~/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar`
>>
>> Or `sbt/sbt publish-local` also prints something like this on the console
>>
>>  [info]  published spark-core_2.9.3 to
>> /home/shivaram/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar
>>
>> After that MLI's build should be able to pick this jar up.
>>
>> Thanks
>> Shivaram
>>
>>
>>
>>
>> On Tue, Sep 10, 2013 at 1:14 PM, Gowtham N wrote:
>>
>>> I did it as publish-local.
>>> I forked mesos/spark to gowthamnatarajan/spark. And I am using that. I
>>> forked a few days ago, but did a upstream update today.
>>>
>>> For safety, I will directly clone from mesos now.
>>>
>>>
>>>
>>> On Tue, Sep 10, 2013 at 1:10 PM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Did you check out spark from the master branch of github.com/mesos/spark?
 The package names changed recently so you might need to pull. Also just
 checking that you did publish-local in Spark (not public-local as
 specified
 in the email) ?

 Thanks
 Shivaram


 On Tue, Sep 10, 2013 at 1:01 PM, Gowtham N 
 wrote:

 > still getting the same error.
 >
 > I have spark and MLI folder within a folder called git
 >
 > I did clean, package and public-local for spark.
 > Then for mli did clean, and then package.
 > I am still getting the error.
 >
 > [warn] ::
 > [warn] ::  UNRESOLVED DEPENDENCIES ::
 > [warn] ::
 > [warn] :: org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
 > [warn] :: org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
 > [warn] ::
 > [error] {file:/Users/gowthamn/git/MLI/}default-0b9403/*:update:
 > sbt.ResolveException: unresolved dependency:
 > org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
 > [error] unresolved dependency:
 > org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
 >
 > should I modify the contents of build.sbt?
 > Currently its
 >
 > libraryDependencies ++= Seq(
 >   "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
 >   "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
 >   "org.scalatest" %% "scalatest" % "1.9.1" % "test"
 > )
 >
 > resolvers ++= Seq(
 >   "Typesafe" at "http://repo.typesafe.com/typesafe/releases";,
 >   "Scala Tools Snapshots" at "http://scala-tools.org/repo-snapshots/";,
 >   "ScalaNLP Maven2" at "http://repo.scalanlp.org/repo";,
 >   "Spray" at "http://repo.spray.cc";
 > )
 >
 >
 >
 >
 >
 >
 > On Tue, Sep 10, 2013 at 11:58 AM, Evan R. Sparks <
 evan.spa...@gmail.com
 > >wrote:
 >
 > > Hi Gowtham,
 > >
 > > You'll need to do "sbt/sbt publish-local" in the spark directory
 > > before trying to build MLI.
 > >
 > > - Evan
 > >
 > > On Tue, Sep 10, 2013 at 11:37 AM, Gowtham N <
 gowtham.n.m...@gmail.com>
 > > wrote:
 > > > I cloned MLI, but am unable to compile it.
 > > >
 > > > I get the following dependency exception with other projects.
 > > >
 > > > org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
 > > > org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
 > > >
 > > > Why am I getting this error?
 > > >
 > > > I did not change anything from build.sbt
 > > >
 > > > libraryDependencies ++= Seq(
 > > >   "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
 > > >   "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
 > > >   "org.scalatest" %% "scalatest" % "1.9.1" % "test"
 > > > )
 > >
 >
 >
 >
 > --
 > Gowtham Natarajan
 >

>>>
>>>
>>>
>>> --
>>> Gowtham Natarajan
>>>
>>
>>
> -- 
> Gowtham Natarajan

Re: Adding support for implicit feedback to ALS

2013-09-09 Thread Nick Pentreath
In 3, are you saying that some cross-validation support for picking the best
lambda and alpha should be in there? Or that the preference "weightings" of
different event types should also be learnt? (Maybe both.)

I agree that there should be support for this, by optimising for the best
RMSE, MAP or whatever. I'm just not sure whether this functionality should
live in MLlib or MLI. Until MLI is released it's sort of hard to know.

For 4, my frame of reference has been vs Mahout and my own port of Mahout's
ALS to Spark, and vs those this blocked approach is far superior. Though I'm
sure there can be more efficiencies gained in this approach and other
alternatives.

It would certainly be great to further improve the approach as you mention
in 5. I'm not sure precisely what you mean by task reformulation - how would
you propose to do so?

Nick

—
Sent from Mailbox for iPhone

On Mon, Sep 9, 2013 at 8:28 PM, Dmitriy Lyubimov 
wrote:

> Sorry, not directly aimed at the PR but at implementation in whole.
> See if the following is useful from my experience:
> 1. implicit feedback is just a corner case of more general problem:
> Given preference matrix P where P_i,j in R^{0,1} and weight
> (confidence)  matrix C, C_i,j \in R, and reg rate \lambda, compute
> L2-regularized ALS fit.
> 2. since default confidence is never zero (in paper it is assumed 1,
> and i will denote this real quantity as c_0), have C = C_0 + C' where
> C_0_i,j = c_0. Hence, rewrite input  in terms (P, C', c_0) since C'
> becomes severely sparse matrix in this case in real life.
> 3. It is nice when input C is known. But there are a lot of cases
> where individual confidence is derived from a final set of
> hyperparameters corresponding to a particular event type (search,
> click, transaction etc.). Hence, convex optimization for a small set
> of hyperparameters is desired (this might be outside of scope ALS
> itself, but weighing and lamda per se  aren't). Still though,
> crossvalidation largely relies on the fact that we want to take stuff
> that follows existing entries in C' so crossvalidation helpers would
> be naturally coupled with this method and should be provided.
> 4. i actually used pregel to avoid shuffle and sort programming model.
> Matrix operations do not require guarantees produced by reducers; only
> a full group guarantee. I did not benchmark this approach for really
> substantial datasets though; there are known Bagel limitations IMO
> which may create a problem for sufficiently large /skewed datasets. I
> guess I am interested in GraphX release to replace reliance on Bagel.
> 5. if the task reformulation is accepted, there are further
> optimizations that could be applied to blocking -- but this
> implementation gets the gist of it what i did in that regard.
> On Sun, Sep 8, 2013 at 10:58 AM, Nick Pentreath
>  wrote:
>> Hi
>>
>> I know everyone's pretty busy with getting 0.8.0 out, but as and when folks
>> have time it would be great to get your feedback on this PR adding support
>> for the 'implicit feedback' model variant to ALS:
>> https://github.com/apache/incubator-spark/pull/4
>>
>> In particular any potential efficiency improvements, issues, and testing it
>> out locally and on a cluster and on some datasets!
>>
>> Comments & feedback welcome.
>>
>> Many thanks
>> Nick

Adding support for implicit feedback to ALS

2013-09-08 Thread Nick Pentreath
Hi

I know everyone's pretty busy with getting 0.8.0 out, but as and when folks
have time it would be great to get your feedback on this PR adding support
for the 'implicit feedback' model variant to ALS:
https://github.com/apache/incubator-spark/pull/4
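
For anyone not familiar with the variant: it follows the Hu, Koren & Volinsky
formulation, fitting binary preferences p_ui weighted by a confidence derived
from the observed interaction strength r_ui (alpha is the confidence scaling
parameter), i.e. roughly

\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
    + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right),
\quad c_{ui} = 1 + \alpha \, r_{ui}, \quad
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}

so "lambda" and "alpha" in the discussion refer to the regularization rate and
the confidence weight in this objective.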

In particular any potential efficiency improvements, issues, and testing it
out locally and on a cluster and on some datasets!

Comments & feedback welcome.

Many thanks
Nick


Apache account

2013-09-05 Thread Nick Pentreath
Hi

I submitted my license agreement and account name request a while back, but
still haven't received any correspondence. Just wondering what I need to do
in order to follow this up?

Thanks
Nick


Scikit-learn API paper

2013-08-16 Thread Nick Pentreath
Quite interesting, and timely given current thinking around MLlib and MLI

http://orbi.ulg.ac.be/bitstream/2268/154357/1/paper.pdf

I do really like the way they have approached their API - and so far MLlib
seems to be following a (roughly) similar approach.

In particular, it's interesting that they go for mutable models instead of
the Estimator / Predictor interface MLlib currently has. Not sure which of
these is "best" really - they each have their pros & cons.

N


Re: Machine Learning on Spark [long rambling discussion email]

2013-07-25 Thread Nick Pentreath
Cool, I totally understand the constraints you're under, and it's not really a
criticism at all - the AMPLab projects are all awesome!


If I can find ways to help then all the better
—
Sent from Mailbox for iPhone

On Thu, Jul 25, 2013 at 10:04 PM, Matei Zaharia 
wrote:

> I fully agree that we need to be clearer with the timelines in AMP Lab. One 
> thing is that many of these are still research projects, so it's hard to 
> predict when they will be ready for prime-time. Usually with all the things 
> we officially announce (e.g. MLlib, GraphX), and especially the things we put 
> in the Spark codebase, the team behind them really wants to make them widely 
> available and has committed to spend the engineering to make them usable in 
> real applications (as opposed to prototyping and moving on). But even then it 
> can take some time to get the first release out. Hopefully we'll improve our 
> communication about this through more careful tracking in JIRA.
> Matei
> On Jul 25, 2013, at 11:41 AM, Ameet Talwalkar  wrote:
>> Hi Nick,
>> 
>> I can understand your 'frustration' -- my hope is that having discussions
>> (like the one we're having now) via this mailing list will help mitigate
>> duplicate work moving forward.
>> 
>> Regarding your detailed comments, we are aiming to include various
>> components that you mentioned in our release (basic evaluation for
>> collaborative filtering, linear model additions, and basic support for
>> sparse vectors/features).  One particularly interesting avenue that is not
>> on our immediate roadmap is adding implicit feedback for matrix
>> factorization.  Algorithms like SVD++ are often used in practice, and it
>> would be great to add them to the MLI library (and perhaps also MLlib).
>> 
>> -Ameet
>> 
>> 
>> On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath 
>> wrote:
>> 
>>> Hi
>>> 
>>> Ok, that all makes sense. I can see the benefit of good standard libraries
>>> definitely, and I guess the pieces that felt "missing" to me were what you
>>> are describing as MLI and MLOptimizer.
>>> 
>>> It seems like the aims of MLI are very much in line with what I have/had in
>>> mind for a ML library/framework. It seems the goals overlap quite a lot.
>>> 
>>> I guess one "frustration" I have had is that there are all these great BDAS
>>> projects, but we never really know when they will be released and what they
>>> will look like until they are. In this particular case I couldn't wait for
>>> MLlib so ended up doing some work myself to port Mahout's ALS and of course
>>> have ended up duplicating effort (which is not a problem as it was
>>> necessary at the time and has been a great learning experience).
>>> 
>>> Similarly for GraphX, I would like to develop a project for a Spark-based
>>> version of Faunus (https://github.com/thinkaurelius/faunus) for batch
>>> processing of data in our Titan graph DB. For now I am working with
>>> Bagel-based primitives and Spark RDDs directly, but would love to use
>>> GraphX, but have no idea when it will be released and have little
>>> involvement until it is.
>>> 
>>> (I use "frustration" in the nicest way here - I love the BDAS concepts and
>>> all the projects coming out, I just want them all to be released NOW!! :)
>>> 
>>> So yes I would love to be involved in MLlib and MLI work to the extent I
>>> can assist and the work is aligned with what I need currently in my
>>> projects (this is just from a time allocation viewpoint - I'm sure much of
>>> it will be complementary).
>>> 
>>> Anyway, it seems to me the best course of action is as follows:
>>> 
>>>   - I'll get involved in MLlib and see how I can contribute there. Some
>>>   things that jump out:
>>> 
>>> 
>>>   - implicit preference capability for ALS model since as far as I can see
>>>  currently it handles explicit prefs only? (Implicit prefs here:
>>>  http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf which is
>>>  typically better if we don't have actual rating data but instead
>>> "view",
>>>  "click", "play" or whatever)
>> 
>>  - RMSE and other evaluation metrics for ALS as well as test/train
>>>  split / cross-val stuff?
>> 
>>  - linear model additions, like new loss functions for hinge loss,
>>>  least squares etc for SGD, as well as learnin

Re: Machine Learning on Spark [long rambling discussion email]

2013-07-25 Thread Nick Pentreath
against the MLI.  The MLI is currently written against
> Spark, but is designed to be platform independent, so that code written
> against MLI could be run on different engines (e.g., Hadoop, GraphX, etc.).
>
>
> 3) ML Optimizer: This piece automates the task of model selection.  The
> optimizer can be viewed as a search problem over feature extraction /
> algorithms included in the MLI library, and is in part based on efficient
> cross validation. This work is under active development but is in an
> earlier stage of development than MLlib and MLI.
>
> (note: MLlib will be included with the Spark codebase, while the MLI and ML
> Optimizer will live in separate repositories.)
>
> As far as I can tell (though please correct me if I've misunderstood) your
> main goals include:
>
> i) "consistency in the API"
> ii) "some level of abstraction but to keep things as simple as possible"
> iii) "execute models on Spark ... while providing workflows for pipelining
> transformations, feature extraction, testing and cross-validation, and data
> viz."
>
> The MLI (and to some extent the ML Optimizer) is very much in line with
> these goals, and it would be great if you were interested in contributing
> to it.  MLI is a private repository right now, but we'll make it public
> soon though, and Evan Sparks or I will let you know when we do so.
>
> Thanks again for getting in touch with us!
>
> -Ameet
>
>
> On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin 
> wrote:
>
> > On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <
> nick.pentre...@gmail.com
> > >wrote:
> >
> > >
> > > I also found Breeze to be very nice to work with and like the DSL -
> hence
> > > my question about why not use that? (Especially now that Breeze is
> > actually
> > > just breeze-math and breeze-viz).
> > >
> >
> >
> > Matei addressed this from a higher level. I want to provide a little bit
> > more context. A common properties of a lot of high level Scala DSL
> > libraries is that simple operators tend to have high virtual function
> > overheads and also create a lot of temporary objects. And because the
> level
> > of abstraction is so high, it is fairly hard to debug / optimize
> > performance.
> >
> >
> >
> >
> > --
> > Reynold Xin, AMPLab, UC Berkeley
> > http://rxin.org
> >
>


Machine Learning on Spark [long rambling discussion email]

2013-07-24 Thread Nick Pentreath
Hi dev team

(Apologies for a long email!)

Firstly great news about the inclusion of MLlib into the Spark project!

I've been working on a concept and some code for a machine learning library
on Spark, and so of course there is a lot of overlap between MLlib and what
I've been doing.

I wanted to throw this out there and (a) ask a couple of design and roadmap
questions about MLlib, and (b) talk about how to work together / integrate
my ideas (if at all :)

*Some questions*
1. What is the general design idea behind MLlib - is it aimed at being a
collection of algorithms, i.e. a library? Or is it aimed at being a "Mahout
for Spark", i.e. something that can be used as a library as well as a set
of tools for things like running jobs, feature extraction, text processing
etc.?
2. How married are we to keeping it within the Spark project? While I
understand the reasoning behind it I am not convinced it's best. But I
guess we can wait and see how it develops
3. Some of the original test code I saw around the Block ALS did use Breeze
(https://github.com/dlwh/breeze) for some of the linear algebra. Now I see
everything is using JBLAS directly and Array[Double]. Is there a specific
reason for this? Is it aimed at creating a separation whereby the linear
algebra backend could be switched out? Scala 2.10 issues?
4. Since Spark is meant to be nicely compatible with Hadoop, do we care
about compatibility/integration with Mahout? This may also encourage Mahout
developers to switch over and contribute their expertise (see for example
Dmitry's work at:
https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala,
where he is doing a Scala/Spark DSL around mahout-math matrices and
distributed operations). Potentially even using mahout-math for linear
algebra routines?
5. Is there a roadmap? (I've checked the JIRA which does have a few
intended models etc). Who are the devs most involved in this project?
6. What are thoughts around API design for models?

*Some thoughts*
So, over the past couple of months I have been working on a machine
learning library. Initially it was for my own use but I've added a few
things and was starting to think about releasing it (though it's not nearly
ready). The model that I really needed first was ALS for doing
recommendations. So I have ported the ALS code from Mahout to Spark. Well,
"ported" in some sense - mostly I copied the algorithm and data
distribution design, using Spark's primitives and Breeze for all the linear
algebra.

I found it pretty straightforward to port over. So far I have done local
testing only on the MovieLens datasets. I have found my RMSE results to
match Mahout's. Interestingly, the overall wall-clock performance is not as
different as I would have expected. But I would like to now do some
larger-scale tests on a cluster to really do a good comparison.

Obviously with Spark's block ALS model, my version is now somewhat
superfluous, since I expect (and have so far seen in my simple local
experiments) that the block model will significantly outperform it. I will
probably be porting my use case over to this in due time once I've done
further testing.

I also found Breeze to be very nice to work with and like the DSL - hence
my question about why not use that? (Especially now that Breeze is actually
just breeze-math and breeze-viz).

Anyway, I then added KMeans (basically just the Spark example with some
Breeze tweaks), and started working on a Linear Model framework. I've also
added a simple framework for arg parsing and config (using Twitter
Algebird's Args and Typesafe Config), and have started on feature
extraction stuff - of particular interest will be text feature extraction
and feature hashing.

This is roughly the idea for a machine learning library on Spark that I
have - call it a design or manifesto or whatever:

- Library available and consistent across Scala, Java and Python (as much
as possible in any event)
- A core library and also a set of stuff for easily running models based on
standard input formats etc
- Standardised model API (even across languages) to the extent possible.
I've based mine so far on Python's scikit-learn (.fit(), .predict() etc).
Why? I believe it's a major strength of scikit-learn, that its API is so
clean, simple and consistent. Plus, for the Python version of the lib,
scikit-learn will no doubt be used wherever possible to avoid re-creating
code
- Models to be included initially:
  - ALS
  - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
  - Clustering (K-Means and others potentially)
  - Linear Models - the idea here is to have something very close to Vowpal
Wabbit, ie a generic SGD engine with various Loss Functions, learning rate
paradigms etc. Furthermore this would allow other models similar to VW such
as online versions of matrix factorisation, neural nets and learning
reductions
  - Possibly Decision Trees / Random Forests
- Some utilities for feature extraction (hashing in p