Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nick Pentreath
Hi

What KeyClass and ValueClass are you trying to save as the keys/values of
your dataset?
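For reference, here is a minimal sketch of the call pattern (the classes and
path are purely illustrative); a java.lang.InstantiationException usually
means the output format class passed in could not be created via its no-arg
constructor, e.g. because an abstract class was supplied:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// assuming sc is an existing SparkContext
val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

counts.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/counts",                          // output path
  classOf[Text],                                 // key class
  classOf[IntWritable],                          // value class
  classOf[TextOutputFormat[Text, IntWritable]])  // concrete new-API OutputFormat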



On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 I found a weird thing with saveAsNewAPIHadoopFile in
 PairRDDFunctions.scala while working on another issue:

 saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time.

 I checked the commit history of the file, and it seems the API has existed
 for a long time. Has no one else hit this? (That's the reason I'm confused.)

 Best,

 --
 Nan Zhu




Re: Spark 0.8.1 on Amazon Elastic MapReduce

2014-02-14 Thread Nick Pentreath
Thanks Parviz, this looks great and good to see it getting updated. Look
forward to 0.9.0!

A perhaps stupid question - where does the KinesisWordCount example live?
Is that an Amazon example, since I don't see it under the streaming
examples included in the Spark project. If it's a third party example is it
possible to get the code?

Thanks
Nick


On Fri, Feb 14, 2014 at 6:53 PM, Deyhim, Parviz parv...@amazon.com wrote:

   Spark community,

  Wanted to let you know that the version of Spark and Shark on Amazon
 Elastic MapReduce has been updated to 0.8.1. This new version not only provides
 a much better experience in terms of stability and performance, but also
 supports the following features:

- Integration with Amazon Cloudwatch
- Integration of Spark Streaming with Amazon Kinesis.
- Automatic log shipping to S3

 For complete details of the features Spark on EMR provides, please see
 the following article: http://aws.amazon.com/articles/4926593393724923

  And yes I'm working hard to push another update to support 0.9.0 :)

  What would be great is to hear from the community on what other features
 you'd like to see in Spark on EMR. For example, how useful is autoscaling for
 Spark? Any other features you'd like to see?

  Thanks,

*Parviz Deyhim*

 Solutions Architect

 *Amazon Web Services http://aws.amazon.com/*

 E: parv...@amazon.com

 M:  408.315.2305






Re: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread Nick Pentreath
@fommil @mengxr I think it's always worth a shot at a license change. Scikit 
learn devs have been successful before in getting such things over the line.


Assuming we can make that happen, what do folks think about MTJ vs Breeze vs 
JBLAS + commons-math since these seem like the viable alternatives?
—
Sent from Mailbox for iPhone

On Fri, Feb 14, 2014 at 1:21 AM, mengxr g...@git.apache.org wrote:

 Github user mengxr commented on the pull request:
 https://github.com/apache/incubator-spark/pull/575#issuecomment-35038739
   
 @fommil I don't quite understand what "roll their own" means exactly 
 here. I didn't propose to re-implement a linear algebra library (or half of one) in 
 the PR. For the license issue, it would be great if the original author of 
 MTJ agrees to change the license to Apache. With the LGPL license, there is 
 not much we can do. 

Re: [VOTE] Graduation of Apache Spark from the Incubator

2014-02-11 Thread Nick Pentreath
+1


On Tue, Feb 11, 2014 at 9:17 AM, Matt Massie mas...@berkeley.edu wrote:

 +1

 --
 Matt Massie
 UC, Berkeley AMPLab
 Twitter: @matt_massie https://twitter.com/matt_massie,
 @amplabhttps://twitter.com/amplab
 https://amplab.cs.berkeley.edu/


 On Mon, Feb 10, 2014 at 11:12 PM, Zongheng Yang zonghen...@gmail.com
 wrote:

  +1
 
  On Mon, Feb 10, 2014 at 10:21 PM, Reynold Xin r...@databricks.com
 wrote:
   Actually I made a mistake by saying binding.
  
   Just +1 here.
  
  
   On Mon, Feb 10, 2014 at 10:20 PM, Mattmann, Chris A (3980) 
   chris.a.mattm...@jpl.nasa.gov wrote:
  
    Hi Nathan, anybody is welcome to VOTE. Thank you.
    Only VOTEs from the Incubator PMC are considered binding, but
    I welcome and will tally all VOTEs provided.
  
   Cheers,
   Chris
  
  
  
  
   -Original Message-
   From: Nathan Kronenfeld nkronenf...@oculusinfo.com
   Reply-To: dev@spark.incubator.apache.org 
  dev@spark.incubator.apache.org
   
   Date: Monday, February 10, 2014 9:44 PM
   To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org
   Subject: Re: [VOTE] Graduation of Apache Spark from the Incubator
  
   Who is allowed to vote on stuff like this?
   
   
   On Mon, Feb 10, 2014 at 11:27 PM, Chris Mattmann
   mattm...@apache.orgwrote:
   
Hi Everyone,
   
This is a new VOTE to decide if Apache Spark should graduate
from the Incubator. Please VOTE on the resolution pasted below
the ballot. I'll leave this VOTE open for at least 72 hours.
   
Thanks!
   
[ ] +1 Graduate Apache Spark from the Incubator.
[ ] +0 Don't care.
[ ] -1 Don't graduate Apache Spark from the Incubator because..
   
Here is my +1 binding for graduation.
   
Cheers,
Chris
   
 snip
   
WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software, for distribution at no charge to the
public, related to fast and flexible large-scale data analysis
on clusters.
   
NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Spark Project, be
and hereby is established pursuant to Bylaws of the Foundation;
and be it further
   
RESOLVED, that the Apache Spark Project be and hereby is
responsible for the creation and maintenance of software
related to fast and flexible large-scale data analysis
on clusters; and be it further RESOLVED, that the office
of Vice President, Apache Spark be and hereby is created,
the person holding such office to serve at the direction of
the Board of Directors as the chair of the Apache Spark
Project, and to have primary responsibility for management
of the projects within the scope of responsibility
of the Apache Spark Project; and be it further
RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Spark Project:
   
* Mosharaf Chowdhury mosha...@apache.org
* Jason Dai jason...@apache.org
* Tathagata Das t...@apache.org
* Ankur Dave ankurd...@apache.org
* Aaron Davidson a...@apache.org
* Thomas Dudziak to...@apache.org
* Robert Evans bo...@apache.org
* Thomas Graves tgra...@apache.org
* Andy Konwinski and...@apache.org
* Stephen Haberman steph...@apache.org
* Mark Hamstra markhams...@apache.org
* Shane Huang shane_hu...@apache.org
* Ryan LeCompte ryanlecom...@apache.org
* Haoyuan Li haoy...@apache.org
* Sean McNamara mcnam...@apache.org
* Mridul Muralidharam mridul...@apache.org
* Kay Ousterhout kayousterh...@apache.org
* Nick Pentreath mln...@apache.org
* Imran Rashid iras...@apache.org
* Charles Reiss wog...@apache.org
* Josh Rosen joshro...@apache.org
* Prashant Sharma prash...@apache.org
* Ram Sriharsha har...@apache.org
* Shivaram Venkataraman shiva...@apache.org
* Patrick Wendell pwend...@apache.org
* Andrew Xia xiajunl...@apache.org
* Reynold Xin r...@apache.org
* Matei Zaharia ma...@apache.org
   
NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matei Zaharia be
appointed to the office of Vice President, Apache Spark, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification, or
until a successor is appointed; and be it further
   
RESOLVED, that the Apache Spark Project be and hereby is
tasked with the migration and rationalization of the Apache
Incubator Spark podling; and be it further
   
RESOLVED, that all responsibilities pertaining to the Apache
Incubator Spark podling encumbered upon the Apache Incubator
Project are hereafter discharged

Fwd: Represent your project at ApacheCon

2014-01-27 Thread Nick Pentreath
Is Spark active in submitting anything for this?


-- Forwarded message --
From: Rich Bowen rbo...@redhat.com
Date: Mon, Jan 27, 2014 at 4:20 PM
Subject: Represent your project at ApacheCon
To: committ...@apache.org


Folks,

5 days from the end of the CFP, we have only 50 talks submitted. We need
three times that just to fill the space, and preferably a lot more so that
we have some variety to choose from to put together a schedule.

I know that we usually have over half the content submitted in the last 48
hours, so I'm not panicking yet, but it's worrying. More worrying, however,
is that 2/3 of those submissions are from the Usual Suspects (i.e., httpd and
Tomcat), and YOUR project isn't represented.

We would love to have a whole day of Lucene, and of OpenOffice, and of
Cordova, and of Felix and Celix and Helix and Nelix. Or a half day.

We need your talk submissions. We need you to come tell the world why your
project matters, why you spend your time working on it, and what exciting
new thing you hacked into it during the snow storms. (Or heat wave, as the
case may be.)

Please help us get the word out to your developer and user communities that
we're looking for quality talks about their favorite Apache project, about
related technologies, about ways that it's being used, and plans for its
future. Help us make this ApacheCon amazing.

--rcb

-- 
Rich Bowen - rbo...@redhat.com
OpenStack Community Liaison
http://openstack.redhat.com/


Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations?

2014-01-26 Thread Nick Pentreath
If you want to spend the time running 50 iterations, you're better off 
re-running 5x10 iterations with different random starts to get a better local 
minimum...
—
Sent from Mailbox for iPhone

On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I looked into this after I opened that JIRA and it’s actually a bit harder to 
 fix. While changing these visit() calls to use a stack manually instead of 
 being recursive helps avoid a StackOverflowError there, you still get a 
 StackOverflowError when you send the task to a worker node because Java 
 serialization uses recursion. The only real fix therefore with the current 
 codebase is to increase your JVM stack size. Longer-term, I’d like us to 
 automatically call checkpoint() to break lineage graphs when they exceed a 
 certain size, which would avoid the problems in both DAGScheduler and Java 
 serialization. We could also manually add this to ALS now without having a 
 solution for other programs. That would be a great change to make to fix this 
 JIRA.
 Matei
 On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava m...@ewencp.org wrote:
 The three obvious ones in DAGScheduler.scala are in:
 
 getParentStages
 getMissingParentStages
 stageDependsOn
 
 They all follow the same pattern though (def visit(), followed by 
 visit(root)), so they should be easy to replace with a Scala stack in place 
 of the call stack.
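 
 As a generic illustration of that transformation (on a toy node type, not
 the actual DAGScheduler structures), the recursive visit and its
 stack-based equivalent look roughly like this:
 
 import scala.collection.mutable
 
 case class Node(id: Int, deps: Seq[Node])  // stand-in for a stage/RDD node
 
 // Recursive form: deep graphs blow the JVM call stack.
 def visitRec(root: Node): Set[Int] = {
   val seen = mutable.HashSet[Int]()
   def visit(n: Node): Unit = {
     if (!seen(n.id)) {
       seen += n.id
       n.deps.foreach(visit)
     }
   }
   visit(root)
   seen.toSet
 }
 
 // Iterative form: an explicit stack replaces the call stack.
 def visitIter(root: Node): Set[Int] = {
   val seen = mutable.HashSet[Int]()
   val stack = mutable.Stack[Node](root)
   while (stack.nonEmpty) {
     val n = stack.pop()
     if (!seen(n.id)) {
       seen += n.id
       n.deps.foreach(stack.push)
     }
   }
   seen.toSet
 }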
 
 Shao, Saisai January 25, 2014 at 10:52 PM
 In my test I found this phenomenon might be caused by the RDD's long dependency 
 chain: the dependency chain is serialized into the task and sent to each 
 executor, and deserializing this task causes the stack overflow.
 
 Especially in an iterative job, like:
 var rdd = ..
 
 for (i <- 0 to 100)
   rdd = rdd.map(x => x)
 
 rdd = rdd.cache
 
 Here rdd's dependencies will be chained, and at some point a stack overflow will 
 occur.
 
 You can check 
 (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
  and 
 (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
  for details. The current workaround is to cut the dependency chain by 
 checkpointing the RDD; maybe a better way would be to clean the dependency chain 
 after the materialized stage is executed.
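 
 A minimal sketch of that workaround (assuming sc is the SparkContext and a 
 reliable checkpoint directory such as HDFS is configured) is to checkpoint 
 and materialize the RDD every N iterations so the serialized lineage stays 
 short:
 
 sc.setCheckpointDir("hdfs:///tmp/checkpoints")
 
 var rdd = sc.parallelize(1 to 1000000)
 for (i <- 1 to 100) {
   rdd = rdd.map(x => x + 1)
   if (i % 10 == 0) {
     rdd.cache()        // avoid recomputing the whole chain for the checkpoint job
     rdd.checkpoint()   // mark for checkpointing
     rdd.count()        // run a job so the checkpoint happens and the lineage is cut
   }
 }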
 
 Thanks
 Jerry
 
 -Original Message-
 From: Reynold Xin [mailto:r...@databricks.com] 
 Sent: Sunday, January 26, 2014 2:04 PM
 To: dev@spark.incubator.apache.org
 Subject: Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow 
 with too many iterations?
 
 I'm not entirely sure, but two candidates are
 
 the visit function in stageDependsOn
 
 submitStage
 
 
 
 
 
 
 Reynold Xin January 25, 2014 at 10:03 PM
 I'm not entirely sure, but two candidates are
 
 the visit function in stageDependsOn
 
 submitStage
 
 
 
 
 
 
 
 Aaron Davidson  January 25, 2014 at 10:01 PM
 I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
 like processEvent shouldn't have inherently recursive properties.
 
 
 
 Reynold Xin January 25, 2014 at 9:57 PM
 It seems to me fixing DAGScheduler to make it not recursive is the better
 solution here, given the cost of checkpointing.
 
 
 Xia, Junluan January 25, 2014 at 9:49 PM
 Hi all
 
 The description of this bug, as submitted by Matei, is as follows:
 
 
 "The tipping point seems to be around 50. We should fix this by 
 checkpointing the RDDs every 10-20 iterations to break the lineage chain, 
 but checkpointing currently requires HDFS installed, which not all users 
 will have.
 
 We might also be able to fix DAGScheduler to not be recursive."
 
 
 regards,
 Andrew
 
 

Re: Option folding idiom

2013-12-26 Thread Nick Pentreath
+1 for getOrElse


When I was new to Scala I tended to use match almost like if/else statements 
with Option. These days I try to use map/flatMap instead and use getOrElse 
extensively and I for one find it very intuitive.




I also agree that the fold syntax seems way less intuitive, and I certainly 
prefer readable Scala code to code that might be more idiomatic but that I 
honestly tend to find very inscrutable and hard to grok quickly.
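
For a concrete side-by-side, the same default-or-transform on a toy Option
reads like this in the two styles (values are purely illustrative):

val maybeCount: Option[Int] = Some(3)

// map + getOrElse: transform first, default last
val a = maybeCount.map(_ * 2).getOrElse(0)   // 6 here, 0 if maybeCount is None

// fold: the default (None case) comes first, then the Some case
val b = maybeCount.fold(0)(_ * 2)            // same result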
—
Sent from Mailbox for iPhone

On Fri, Dec 27, 2013 at 9:06 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I agree about using getOrElse instead. In choosing which code style and 
 idioms to use, my goal has always been to maximize the ease of *other 
 developers* understanding the code, and most developers today still don’t 
 know Scala. It’s fine to use maps or matches, because their meaning is 
 obvious, but fold on Option is not obvious (even foreach is kind of weird for 
 new people). In this case the benefit is so small that it doesn’t seem worth 
 it.
 Note that if you use getOrElse, you can even throw exceptions in the “else” 
 part if you’d like. (This is because Nothing is a subtype of every type in 
 Scala.) So for example you can do val stuff = option.getOrElse(throw new 
 Exception(“It wasn’t set”)). It looks a little weird, but note how the 
 meaning is obvious even if you don’t know anything about the type system.
 Matei
 On Dec 27, 2013, at 12:12 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote:
 I agree with what Reynold said -- there's not a big benefit in terms of
 lines of code (esp. compared to using getOrElse) and I think it hurts code
 readability.  One of the great things about the current Spark codebase is
 that it's very accessible for newcomers -- something that would be less
 true with this use of fold.
 
 
 On Thu, Dec 26, 2013 at 8:11 PM, Holden Karau hol...@pigscanfly.ca wrote:
 
 I personally agree with Evan in that I prefer map with getOrElse over fold with
 options (but that's just my personal preference) :)
 
 
 On Thu, Dec 26, 2013 at 7:58 PM, Reynold Xin r...@databricks.com wrote:
 
 I'm not strongly against Option.fold, but I find the readability getting
 worse for the use case you brought up.  For the use case of if/else, I
 find
 Option.fold pretty confusing because it reverses the order of Some vs
 None.
 Also, when code gets long, the lack of an obvious boundary (the only
 boundary is } {) with two closures is pretty confusing.
 
 
 On Thu, Dec 26, 2013 at 4:23 PM, Mark Hamstra m...@clearstorydata.com
 wrote:
 
 On the contrary, it is the completely natural place for the initial
 value
 of the accumulator, and provides the expected result of folding over an
 empty collection.
 
 scala> val l: List[Int] = List()
 
 l: List[Int] = List()
 
 
 scala> l.fold(42)(_ + _)
 
 res0: Int = 42
 
 
 scala> val o: Option[Int] = None
 
 o: Option[Int] = None
 
 
 scala> o.fold(42)(_ + 1)
 
 res1: Int = 42
 
 
 On Thu, Dec 26, 2013 at 5:51 PM, Evan Chan e...@ooyala.com wrote:
 
 +1 for using more functional idioms in general.
 
 That's a pretty clever use of `fold`, but putting the default
 condition
 first there makes it not as intuitive.   What about the following,
 which
 are more readable?
 
   option.map { a => someFuncMakesB() }
         .getOrElse(b)
 
   option.map { a => someFuncMakesB() }
         .orElse { a => otherDefaultB() }.get
 
 
 On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra 
 m...@clearstorydata.com
 wrote:
 
 In code added to Spark over the past several months, I'm glad to
 see
 more
 use of `foreach`, `for`, `map` and `flatMap` over `Option` instead
 of
 pattern matching boilerplate.  There are opportunities to push
 `Option`
 idioms even further now that we are using Scala 2.10 in master,
 but I
 want
 to discuss the issue here a little bit before committing code whose
 form
 may be a little unfamiliar to some Spark developers.
 
 In particular, I really like the use of `fold` with `Option` to cleanly
 and concisely express the "do something if the Option is None; do
 something else with the thing contained in the Option if it is Some" code
 fragment.
 
 An example:
 
 Instead of...
 
 val driver = drivers.find(_.id == driverId)
 driver match {
   case Some(d) =>
     if (waitingDrivers.contains(d)) { waitingDrivers -= d }
     else {
       d.worker.foreach { w =>
         w.actor ! KillDriver(driverId)
       }
     }
     val msg = s"Kill request for $driverId submitted"
     logInfo(msg)
     sender ! KillDriverResponse(true, msg)
   case None =>
     val msg = s"Could not find running driver $driverId"
     logWarning(msg)
     sender ! KillDriverResponse(false, msg)
 }
 
 ...using fold we end up with...
 
 driver.fold
   {
     val msg = s"Could not find running driver $driverId"
     logWarning(msg)
     sender ! KillDriverResponse(false, msg)
   }
   { d =>
     if (waitingDrivers.contains(d)) { waitingDrivers -= d }
     else {
       d.worker.foreach { w =>
         w.actor ! KillDriver(driverId)
       }
     }
     val msg = s"Kill request 

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Or, if you're extremely ambitious, work on implementing Spark Streaming in Python.
—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Matt,
 If you want to get started looking at Spark, I recommend the following 
 resources:
 - Our issue tracker at http://spark-project.atlassian.net contains some 
 issues marked “Starter” that are good places to jump into. You might be able 
 to take one of those and extend it into a bigger project.
 - The “contributing to Spark” wiki page covers how to send patches and set up 
 development: 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
 - This talk has an intro to Spark internals (video and slides are in the 
 comments): http://www.meetup.com/spark-users/events/94101942/
 For a longer project, here are some possible ones:
 - Create a tool that automatically checks which Scala API methods are missing 
 in Python. We had a similar one for Java that was very useful. Even better 
 would be to automatically create wrappers for the Scala ones.
 - Extend the Spark monitoring UI with profiling information (to sample the 
 workers and say where they’re spending time, or what data structures consume 
 the most memory).
 - Pick and implement a new machine learning algorithm for MLlib.
 Matei
 On Dec 17, 2013, at 10:43 AM, Matthew Cheah mcch...@uwaterloo.ca wrote:
 Hi everyone,
 
 During my most recent internship, I worked extensively with Apache Spark,
 integrating it into a company's data analytics platform. I've now become
 interested in contributing to Apache Spark.
 
 I'm returning to undergraduate studies in January and there is an academic
 course which is simply a standalone software engineering project. I was
 thinking that some contribution to Apache Spark would satisfy my curiosity,
 help continue support the company I interned at, and give me academic
 credits required to graduate, all at the same time. It seems like too good
 an opportunity to pass up.
 
 With that in mind, I have the following questions:
 
   1. At this point, is there any self-contained project that I could work
   on within Spark? Ideally, I would work on it independently, in about a
   three month time frame. This time also needs to accommodate ramping up on
   the Spark codebase and adjusting to the Scala programming language and
   paradigms. The company I worked at primarily used the Java APIs. The output
   needs to be a technical report describing the project requirements, and the
   design process I took to engineer the solution for the requirements. In
   particular, it cannot just be a series of haphazard patches.
   2. How can I get started with contributing to Spark?
   3. Is there a high-level UML or some other design specification for the
   Spark architecture?
 
 Thanks! I hope to be of some help =)
 
 -Matt Cheah

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Some good things to look at, though hopefully #2 will be largely addressed by: 
https://github.com/apache/incubator-spark/pull/230
—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash and...@andrewash.com wrote:

 I think there are also some improvements that could be made to
 deployability in an enterprise setting.  From my experience:
 1. Most places I deploy Spark in don't have internet access.  So I can't
 build from source, compile against a different version of Hadoop, etc
 without doing it locally and then getting that onto my servers manually.
  This is less a problem with Spark now that there are binary distributions,
 but it's still a problem for using Mesos with Spark.
 2. Configuration of Spark is confusing -- you can make configuration in
 Java system properties, environment variables, command line parameters, and
 for the standalone cluster deployment mode you need to worry about whether
 these need to be set on the master, the worker, the executor, or the
 application/driver program.  Also, because spark-shell automatically
 instantiates a SparkContext you have to set up any system properties in the
 init scripts or on the command line with
 JAVA_OPTS="-Dspark.executor.memory=8g" etc. (see the short sketch after this
 list).  I'm not sure what needs to be
 done, but it feels like there are gains to be made in configuration options
 here.  Ideally, I would have one configuration file that can be used in all
 4 places and that's the only place to make configuration changes.
 3. Standalone cluster mode could use improved resiliency for starting,
 stopping, and keeping alive a service -- there are custom init scripts that
 call each other in a mess of ways: spark-shell, spark-daemon.sh,
 spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
 spark-executor, spark-class, run-example, and several others in the bin/
 directory.  I would love it if Spark used the Tanuki Service Wrapper, which
 is widely-used for Java service daemons, supports retries, installation as
 init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how do
 I keep a service running?" problem when it's been done so well by Tanuki --
 we use it at my day job for all our services, plus it's used by
 Elasticsearch.  This would help solve the problem where a quick bounce of
 the master causes all the workers to self-destruct.
 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
 entirely an Akka bug based on previous mailing list discussion with Matei,
 but it'd be awesome if you could use either the hostname or the FQDN or the
 IP address in the Spark URL and not have Akka barf at you.
 I've been telling myself I'd look into these at some point but just haven't
 gotten around to them myself yet.  Some day!  I would prioritize these
 requests from most- to least-important as 3, 2, 4, 1.
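 
 (To make point 2 concrete, here is roughly the 0.8-era pattern I mean; the
 master URL and values are just illustrative:)
 
 import org.apache.spark.SparkContext
 
 // either on the command line, e.g.
 //   JAVA_OPTS="-Dspark.executor.memory=8g" ./spark-shell
 // or programmatically, before the SparkContext is created:
 System.setProperty("spark.executor.memory", "8g")
 System.setProperty("spark.cores.max", "16")
 val sc = new SparkContext("spark://master:7077", "MyApp")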
 Andrew
 On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath 
  nick.pentre...@gmail.com wrote:
 Or if you're extremely ambitious work in implementing Spark Streaming in
 Python—
 Sent from Mailbox for iPhone

 On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Hi Matt,
  If you want to get started looking at Spark, I recommend the following
 resources:
  - Our issue tracker at http://spark-project.atlassian.net contains some
 issues marked “Starter” that are good places to jump into. You might be
 able to take one of those and extend it into a bigger project.
  - The “contributing to Spark” wiki page covers how to send patches and
 set up development:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  - This talk has an intro to Spark internals (video and slides are in the
 comments): http://www.meetup.com/spark-users/events/94101942/
  For a longer project, here are some possible ones:
  - Create a tool that automatically checks which Scala API methods are
 missing in Python. We had a similar one for Java that was very useful. Even
 better would be to automatically create wrappers for the Scala ones.
  - Extend the Spark monitoring UI with profiling information (to sample
 the workers and say where they’re spending time, or what data structures
 consume the most memory).
  - Pick and implement a new machine learning algorithm for MLlib.
  Matei
  On Dec 17, 2013, at 10:43 AM, Matthew Cheah mcch...@uwaterloo.ca
 wrote:
  Hi everyone,
 
  During my most recent internship, I worked extensively with Apache
 Spark,
  integrating it into a company's data analytics platform. I've now become
  interested in contributing to Apache Spark.
 
  I'm returning to undergraduate studies in January and there is an
 academic
  course which is simply a standalone software engineering project. I was
  thinking that some contribution to Apache Spark would satisfy my
 curiosity,
  help continue support the company I interned at, and give me academic
  credits required to graduate, all at the same time. It seems like too
 good
  an opportunity to pass up.
 
  With that in mind, I have

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Nick Pentreath
One third-party option that works nicely for the Hadoop project and its 
related projects is http://search-hadoop.com - managed by Sematext. Perhaps we 
can plead with Otis to add the Spark lists to search-spark.com, or to the existing 
site?

Just throwing it out there as a potential solution to at least searching and 
navigating the Apache lists

Sent from my iPad

 On 20 Dec 2013, at 6:46 AM, Aaron Davidson ilike...@gmail.com wrote:
 
 I'd be fine with one-way mirrors here (Apache threads being reflected in
 Google groups) -- I have no idea how one is supposed to navigate the Apache
 list to look for historic threads.
 
 
 On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts maspo...@gmail.com wrote:
 
 Thanks very much for the prompt and comprehensive reply!  I appreciate the
 overarching desire to integrate with apache: I'm very happy to hear that
 there's a move to use the existing groups as mirrors: that will overcome
 all of my objections: particularly if it's bidirectional! :)
 
 
 On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
 
 Hey Mike,
 
 As you probably noticed when you CC'd spark-de...@googlegroups.com, that
 list has already been reconfigured so that it no longer allows posting (and
 bounces emails sent to it).
 
 We will be doing the same thing to the spark...@googlegroups.com list
 too (we'll announce a date for that soon).
 
 That may sound very frustrating, and you are *not* alone in feeling that
 way. We've had a long conversation with our mentors about this, and I've
 felt very similar to you, so I'd like to give you some background.
 
 As I'm coming to see it, part of becoming an Apache project is moving the
 community *fully* over to Apache infrastructure, and more generally the
 Apache way of organizing the community.
 
 This applies in both the nuts-and-bolts sense of being on apache infra,
 but possibly more importantly, it is also a guiding principle and way of
 thinking.
 
 In various ways, moving to apache Infra can be a painful process, and IMO
 the loss of all the great mailing list functionality that comes with using
 Google Groups is perhaps the most painful step. But basically, the de facto
 mailing lists need to be the Apache ones, and not Google Groups. The
 underlying reason is that Apache needs to take full accountability for
 recording and publishing the mailing lists; it has to be able to
 institutionally guarantee this. This is because discussion on mailing lists
 is one of the core things that defines an Apache community. So at a minimum
 this means Apache owning the master copy of the bits.
 
 All that said, we are discussing the possibility of having a google group
 that subscribes to each list that would provide an easier to use and
 prettier archive for each list (so far we haven't gotten that to work).
 
 I hope this was helpful. It has taken me a few years now, and a lot of
 conversations with experienced (and patient!) Apache mentors, to
 internalize some of the nuance about the Apache way. That's why I wanted
 to share.
 
 Andy
 
 On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts masp...@gmail.com wrote:
 
 I notice that there are still a lot of active topics in this group: and
 also activity on the apache mailing list (which is a really horrible
 experience!).  Is it a firm policy on apache's front to disallow external
 groups?  I'm going to be ramping up on spark, and I really hate the idea of
 having to rely on the apache archives and my mail client.  Also: having to
 search for topics/keywords both in old threads (here) as well as new
 threads in apache's (clunky) archive, is going to be a pain!  I almost feel
 like I must be missing something because the current solution seems
 unfeasibly awkward!
 
 --
 You received this message because you are subscribed to the Google
 Groups Spark Users group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to spark-users...@googlegroups.com.
 
 For more options, visit https://groups.google.com/groups/opt_out.
 
 


Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Another option would be:
1. Add another recommendation model based on mrec's SGD-based model: 
https://github.com/mendeley/mrec
2. Look at the streaming K-means from Mahout and see if that might be 
integrated or adapted into MLlib
3. Work on adding to or refactoring the existing linear model framework, for 
example adaptive learning rate schedules, adaptive norm stuff from John 
Langford et al
4. Adding sparse vector/matrix support to MLlib?

Sent from my iPad

 On 20 Dec 2013, at 3:46 AM, Tathagata Das tathagata.das1...@gmail.com wrote:
 
 +1 to that (assuming by 'online' Andrew meant MLLib algorithm from Spark
 Streaming)
 
 Something you can look into is implementing a streaming KMeans. Maybe you
 can re-use a lot of the offline KMeans code in MLLib.
 
 TD
 
 
 On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash and...@andrewash.com wrote:
 
 Sounds like a great choice.  It would be particularly impressive if you
 could add the first online learning algorithm (all the current ones are
 offline I believe) to pave the way for future contributions.
 
 
 On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah mcch...@uwaterloo.ca
 wrote:
 
 Thanks a lot everyone! I'm looking into adding an algorithm to MLib for
 the
 project. Nice and self-contained.
 
 -Matt Cheah
 
 
 On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen c...@adatao.com
 wrote:
 
 +1 to most of Andrew's suggestions here, and while we're in that
 neighborhood, how about generalizing something like "wtf-spark" (from
 the
 Bizo team: http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of
 high
 academic interest, but it's something people would use many times a
 debugging day.
 
 Or am I behind and something like that is already there in 0.8?
 
 --
 Christopher T. Nguyen
 Co-founder  CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen
 
 
 
 On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash and...@andrewash.com
 wrote:
 
 I think there are also some improvements that could be made to
 deployability in an enterprise setting.  From my experience:
 
 1. Most places I deploy Spark in don't have internet access.  So I
 can't
 build from source, compile against a different version of Hadoop, etc
 without doing it locally and then getting that onto my servers
 manually.
 This is less a problem with Spark now that there are binary
 distributions,
 but it's still a problem for using Mesos with Spark.
 2. Configuration of Spark is confusing -- you can make configuration
 in
 Java system properties, environment variables, command line
 parameters,
 and
 for the standalone cluster deployment mode you need to worry about
 whether
 these need to be set on the master, the worker, the executor, or the
 application/driver program.  Also because spark-shell automatically
 instantiates a SparkContext you have to set up any system properties
 in
 the
 init scripts or on the command line with
 JAVA_OPTS=-Dspark.executor.memory=8g etc.  I'm not sure what needs
 to
 be
 done, but it feels that there are gains to be made in configuration
 options
 here.  Ideally, I would have one configuration file that can be used
 in
 all
 4 places and that's the only place to make configuration changes.
 3. Standalone cluster mode could use improved resiliency for
 starting,
 stopping, and keeping alive a service -- there are custom init
 scripts
 that
 call each other in a mess of ways: spark-shell, spark-daemon.sh,
 spark-daemons.sh, spark-config.sh, spark-env.sh,
 compute-classpath.sh,
 spark-executor, spark-class, run-example, and several others in the
 bin/
 directory.  I would love it if Spark used the Tanuki Service Wrapper,
 which
 is widely-used for Java service daemons, supports retries,
 installation
 as
 init scripts that can be chkconfig'd, etc.  Let's not re-solve the
 how
 do
 I keep a service running? problem when it's been done so well by
 Tanuki
 --
 we use it at my day job for all our services, plus it's used by
 Elasticsearch.  This would help solve the problem where a quick
 bounce
 of
 the master causes all the workers to self-destruct.
 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this
 is
 entirely an Akka bug based on previous mailing list discussion with
 Matei,
 but it'd be awesome if you could use either the hostname or the FQDN
 or
 the
 IP address in the Spark URL and not have Akka barf at you.
 
 I've been telling myself I'd look into these at some point but just
 haven't
 gotten around to them myself yet.  Some day!  I would prioritize
 these
 requests from most- to least-important as 3, 2, 4, 1.
 
 Andrew
 
 
 On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath 
 nick.pentre...@gmail.com
 wrote:
 
 Or if you're extremely ambitious work in implementing Spark
 Streaming
 in
 Python—
 Sent from Mailbox for iPhone
 
 On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
 matei.zaha...@gmail.com
 wrote:
 
 Hi Matt,
 If you want to get started looking at Spark, I recommend the
 following
 resources:
 - Our issue tracker at http://spark

Re: Intellij IDEA build issues

2013-12-16 Thread Nick Pentreath
Thanks Evan, I tried it and the new SBT direct import seems to work well,
though I did run into issues with some yarn imports on Spark.

n


On Thu, Dec 12, 2013 at 7:03 PM, Evan Chan e...@ooyala.com wrote:

 Nick, have you tried using the latest Scala plug-in, which features native
 SBT project imports?   ie you no longer need to run gen-idea.


 On Sat, Dec 7, 2013 at 4:15 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Hi Spark Devs,
 
   Hoping someone can help me out. No matter what I do, I cannot get
 Intellij
  to build Spark from source. I am using IDEA 13. I run sbt gen-idea and
  everything seems to work fine.
 
  When I try to build using IDEA, everything compiles but I get the error
  below.
 
  Have any of you come across the same?
 
  ==
 
  Internal error: (java.lang.AssertionError)
  java/nio/channels/FileChannel$MapMode already declared as
  ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
  java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already
  declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
  at
 
 
 ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
  at
 
 
 scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
  at scala.collection.immutable.List.foreach(List.scala:76)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
  at
 
 
 scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
  at
 
 
 scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
  at
 
 
 scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
  at scala.collection.Iterator$class.foreach(Iterator.scala:772)
  at
 scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
  at
 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
  at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
  at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
  at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
  at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
  at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
  at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
  at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
  at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:601)
  at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
  at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
  at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
  at
 
 
 sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:105)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:102)
  at scala.Option.foreach(Option.scala:236)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
  at scala.Option.foreach(Option.scala:236)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6.compileScala$1(AggressiveCompile.scala:102)
  at
 
 
 sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:151)
  at
 
 sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:89)
  at
 sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala

Re: Scala 2.10 Merge

2013-12-14 Thread Nick Pentreath
Whoohoo!

Great job everyone, especially Prashant!

—
Sent from Mailbox for iPhone

On Sat, Dec 14, 2013 at 10:59 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Alright I just merged this in - so Spark is officially Scala 2.10
 from here forward.
 For reference I cut a new branch called scala-2.9 with the commit
 immediately prior to the merge:
 https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=shortlog;h=refs/heads/scala-2.9
 - Patrick
 On Thu, Dec 12, 2013 at 8:26 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Reymond,

 Let's move this discussion out of this thread and into the associated JIRA.
 I'll write up our current approach over there.

 https://spark-project.atlassian.net/browse/SPARK-995

 - Patrick


 On Thu, Dec 12, 2013 at 5:56 PM, Liu, Raymond raymond@intel.com wrote:

 Hi Patrick

  So what's the plan for supporting YARN 2.2 in 0.9? As far as I can
 see, if you want to support both 2.2 and 2.0, due to the protobuf version
 incompatibility issue, you need two versions of Akka anyway.

 Akka 2.3-M1 looks like it has a few small API changes; we
 could probably isolate the code like what we did for the YARN API. I
 remember it was mentioned that using reflection for the different APIs is
 preferred. So the purpose of using reflection is to use one release bin jar to
 support both versions of Hadoop/YARN at runtime, instead of building different
 bin jars at compile time?

  Then all code related to Hadoop will also be built in separate
 modules for loading on demand? This sounds to me like it involves a lot of work. And
 you still need to have a shim layer and separate code for the different version
 APIs, and to depend on different Akka versions, etc. That sounds like even stricter
 demands than our current approach on master, with a dynamic class
 loader in addition, and the problems we are facing now are still there?

 Best Regards,
 Raymond Liu

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Thursday, December 12, 2013 5:13 PM
 To: dev@spark.incubator.apache.org
 Subject: Re: Scala 2.10 Merge

 Also - the code is still there because of a recent merge that took in some
 newer changes... we'll be removing it for the final merge.


 On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Raymond,
 
  This won't work because AFAIK akka 2.3-M1 is not binary compatible
  with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need
  to still use the older protobuf library, so we'd need to support both.
 
  I'd also be concerned about having a reference to a non-released
  version of akka. Akka is the source of our hardest-to-find bugs and
  simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting.
  Of course, if you are building off of master you can maintain a fork
  that uses this.
 
  - Patrick
 
 
  On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond
  raymond@intel.comwrote:
 
  Hi Patrick
 
   What does that mean for dropping YARN 2.2? It seems the code is still
   there. You mean that if it is built against 2.2 it will break and won't work,
   right?
   Since the home-made Akka build on Scala 2.10 isn't there. In that case,
   can we just use Akka 2.3-M1, which runs on protobuf 2.5,
   as a replacement?
 
  Best Regards,
  Raymond Liu
 
 
  -Original Message-
  From: Patrick Wendell [mailto:pwend...@gmail.com]
  Sent: Thursday, December 12, 2013 4:21 PM
  To: dev@spark.incubator.apache.org
  Subject: Scala 2.10 Merge
 
  Hi Developers,
 
  In the next few days we are planning to merge Scala 2.10 support into
  Spark. For those that haven't been following this, Prashant Sharma
  has been maintaining the scala-2.10 branch of Spark for several
  months. This branch is current with master and has been reviewed for
  merging:
 
  https://github.com/apache/incubator-spark/tree/scala-2.10
 
  Scala 2.10 support is one of the most requested features for Spark -
  it will be great to get this into Spark 0.9! Please note that *Scala
  2.10 is not binary compatible with Scala 2.9*. With that in mind, I
  wanted to give a few heads-up/requests to developers:
 
  If you are developing applications on top of Spark's master branch,
  those will need to migrate to Scala 2.10. You may want to download
  and test the current scala-2.10 branch in order to make sure you will
  be okay as Spark developments move forward. Of course, you can always
  stick with the current master commit and be fine (I'll cut a tag when
  we do the merge in order to delineate where the version changes).
  Please open new threads on the dev list to report and discuss any
  issues.
 
  This merge will temporarily drop support for YARN 2.2 on the master
  branch.
  This is because the workaround we used was only compiled for Scala 2.9.
  We are going to come up with a more robust solution to YARN 2.2
  support before releasing 0.9.
 
  Going forward, we will continue to make maintenance releases on
  branch-0.8 

Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)

2013-12-11 Thread Nick Pentreath
   - Successfully built via sbt/sbt assembly/assembly on Mac OS X, as well
   as on a dev Ubuntu EC2 box
   - Successfully tested via sbt/sbt test locally
   - Successfully built and tested using mvn package locally
   - I've tested my own Spark jobs (built against 0.8.0-incubating) on this
   RC and all works fine, as well as tested with my job server (also built
   against 0.8.0-incubating)
   - Ran a few spark examples and the shell and PySpark shell
   - For my part, tested the MLlib implicit code I added, and checked docs


I'm +1


On Wed, Dec 11, 2013 at 11:04 AM, Prashant Sharma scrapco...@gmail.com wrote:

 I hope this PR https://github.com/apache/incubator-spark/pull/252 can
 help.
 Again this is not a blocker for the release from my side either.


 On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

  Interesting, and confirmed: On my machine where `./sbt/sbt assembly`
 takes
  a long, long, long time to complete (a MBP, in my case), building
 three
  separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt
  examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less
 time.
 
 
 
  On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com
  wrote:
 
   forgot to mention, after running sbt/sbt assembly/assembly running
  sbt/sbt
   examples/assembly takes just 37s. Not to mention my hardware is not
  really
   great.
  
  
   On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com
   wrote:
  
Hi Patrick and Matei,
   
 I was trying this out and followed the quick start guide, which says to do
 sbt/sbt assembly; like a few others I was also stuck for a few minutes on
 Linux. On the other hand, if I use sbt/sbt assembly/assembly it is much
 faster.
    
 Should we change the documentation to reflect this? It will not be great
 for first-time users to get stuck there.
   
   
On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia 
  matei.zaha...@gmail.com
   wrote:
   
+1
   
Built and tested it on Mac OS X.
   
Matei
   
   
On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com
   wrote:
   
 Please vote on releasing the following candidate as Apache Spark
 (incubating) version 0.8.1.

 The tag to be voted on is v0.8.1-incubating (commit b87d31d):

   
  
 
 https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e

 The release files, including signatures, digests, etc can be found
  at:
 http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:

   https://repository.apache.org/content/repositories/orgapachespark-040/

 The documentation corresponding to this release can be found at:

 http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/

 For information about the contents of this release see:

   
  
 
 https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8

 Please vote on releasing this package as Apache Spark
   0.8.1-incubating!

 The vote is open until Saturday, December 14th at 01:00 UTC and
 passes if a majority of at least 3 +1 PPMC votes are cast.

 [ ] +1 Release this package as Apache Spark 0.8.1-incubating
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.incubator.apache.org/
   
   
   
   
--
s
   
  
  
  
   --
   s
  
 



 --
 s



Intellij IDEA build issues

2013-12-07 Thread Nick Pentreath
Hi Spark Devs,

Hoping someone can help me out. No matter what I do, I cannot get Intellij
to build Spark from source. I am using IDEA 13. I run sbt gen-idea and
everything seems to work fine.

When I try to build using IDEA, everything compiles but I get the error
below.

Have any of you come across the same?

==

Internal error: (java.lang.AssertionError)
java/nio/channels/FileChannel$MapMode already declared as
ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already
declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
at
ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
at
scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
at
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
at
scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at
scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at
scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at
scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at
sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:105)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at
sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at
sbt.compiler.AggressiveCompile$$anonfun$6.compileScala$1(AggressiveCompile.scala:102)
at
sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:151)
at
sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:89)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:39)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:37)
at sbt.inc.Incremental$.cycle(Incremental.scala:75)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:34)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:33)
at sbt.inc.Incremental$.manageClassfiles(Incremental.scala:42)
at sbt.inc.Incremental$.compile(Incremental.scala:33)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:27)
at sbt.compiler.AggressiveCompile.compile2(AggressiveCompile.scala:164)
at sbt.compiler.AggressiveCompile.compile1(AggressiveCompile.scala:73)
at
org.jetbrains.jps.incremental.scala.local.CompilerImpl.compile(CompilerImpl.scala:61)
at

PySpark - Dill serialization

2013-12-05 Thread Nick Pentreath
Hi devs

I came across Dill (
http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python
serialization. I was wondering if it might be a replacement for the cloudpickle
stuff (and remove that piece of code that needs to be maintained within
PySpark)?

Josh have you looked into Dill? Any thoughts?

N


Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-10-30 Thread Nick Pentreath
Thanks Josh, Patrick for the feedback.

Based on Josh's pointers I have something working for JavaPairRDD ->
PySpark RDD[(String, String)]. This just calls the toString method on each
key and value as before, but without the need for a delimiter. For
SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
toString to convert to Text for keys and values. We then call toString
(again) ourselves to get Strings to feed to writeAsPickle.

Details here: https://gist.github.com/MLnick/7230588

This also illustrates where the wrapper function API would fit in. All
that is required is to define a T => String for key and value.
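
Roughly, the Scala side boils down to something like the following (a
simplified sketch rather than the gist verbatim):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// keyConv/valueConv are the "T => String" wrapper functions mentioned above
def sequenceFileAsStrings(
    sc: SparkContext,
    path: String,
    keyConv: Text => String = _.toString,
    valueConv: Text => String = _.toString): RDD[(String, String)] = {
  sc.hadoopFile(path,
      classOf[SequenceFileAsTextInputFormat],
      classOf[Text],
      classOf[Text])
    .map { case (k, v) => (keyConv(k), valueConv(v)) }
}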

I started playing around with MsgPack and can sort of get things to work in
Scala, but am struggling with getting the raw bytes to be written properly
in PythonRDD (I think it is treating them as pickled byte arrays when they
are not, but when I removed the 'stripPickle' calls and amended the length
(-6) I got UnpicklingError: invalid load key, ' '. ).

Another issue is that MsgPack does well at writing structures - like Java
classes with public fields that are fairly simple - but for example the
Writables have private fields so you end up with nothing being written.
This looks like it would require custom Templates (serialization
functions effectively) for many classes, which means a lot of custom code
for a user to write to use it. Fortunately for most of the common Writables
a toString does the job. Will keep looking into it though.

Anyway, Josh if you have ideas or examples on the Wrapper API from Python
that you mentioned, I'd be interested to hear them.

If you think this is worth working up as a Pull Request covering
SequenceFiles and custom InputFormats with default toString conversions and
the ability to specify Wrapper functions, I can clean things up more, add
some functionality and tests, and also test to see if common things like
the normal Writables and reading from things like HBase and Cassandra can
be made to work nicely (any other common use cases that you think make
sense?).

Thoughts, comments etc welcome.

Nick



On Fri, Oct 25, 2013 at 11:03 PM, Patrick Wendell pwend...@gmail.com wrote:

 As a starting point, a version where people just write their own wrapper
 functions to convert various HadoopFiles into String K, V files could go
 a long way. We could even have a few built-in versions, such as dealing
 with Sequence files that are String, String. Basically, the user needs to
 write a translator in Java/Scala that produces textual records from
 whatever format they want. Then, they make sure this is included in the
 classpath when running PySpark.

 As Josh is saying, I'm pretty sure this is already possible, but we may
 want to document it for users. In many organizations they might have 1-2
 people who can write the Java/Scala to do this but then many more people
 who are comfortable using python once it's setup.

 - Patrick

 On Fri, Oct 25, 2013 at 11:00 AM, Josh Rosen rosenvi...@gmail.com wrote:

  Hi Nick,
 
   I've seen several requests for SequenceFile support in PySpark, so there's
   definitely demand for this feature.
 
  I like the idea of passing MsgPack'ed data (or some other structured
  format) from Java to the Python workers.  My early prototype of custom
   serializers (described at
   https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals#PySparkInternals-customserializers)
  might be useful for implementing this.  Proper custom serializer support
  would handle the bookkeeping for tracking each stage's input and output
  formats and supplying the appropriate deserialization functions to the
  Python worker, so the Python worker would be able to directly read the
  MsgPack'd data that's sent to it.
 
   Regarding a wrapper API, it's actually possible to initially transform data
   using Scala/Java and perform the remainder of the processing in PySpark.
   This involves adding the appropriate compiled classes to the Java classpath
   and a bit of work in Py4J to create the Java/Scala RDD and wrap it for use by
   PySpark.  I can hack together a rough example of this if anyone's
   interested, but it would need some work to be developed into a
   user-friendly API.
 
   If you wanted to extend your proof-of-concept to handle the cases where
   keys and values have parseable toString() values, I think you could remove
   the need for a delimiter by creating a PythonRDD from the newHadoopFile
   JavaPairRDD and adding a new method to writeAsPickle (
   https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L224
   ) to dump its contents as a pickled pair of strings.  (Aside: most of
   writeAsPickle() would probably need to be eliminated or refactored when
   adding general custom serializer support.)
 
  - Josh
 
  On Thu, Oct 24, 2013 at 11:18 PM, Nick Pentreath
   nick.pentre...@gmail.com wrote:
 
   Hi Spark Devs
  
   I was wondering what appetite there may be to add

Re: MLI dependency exception

2013-09-11 Thread Nick Pentreath
Is MLI available? Where is the repo located?

—
Sent from Mailbox for iPhone

On Tue, Sep 10, 2013 at 10:45 PM, Gowtham N gowtham.n.m...@gmail.com
wrote:

 It worked.
 I was using an old master of Spark, which I forked many days ago.
 On Tue, Sep 10, 2013 at 1:25 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:
 For some more notes on how to debug this: After you do publish-local in
 Spark, you should have a file in ~/.ivy2 that you can check for using
 `ls
 ~/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar`

 Or `sbt/sbt publish-local` also prints something like this on the console

  [info]  published spark-core_2.9.3 to
 /home/shivaram/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar

 After that MLI's build should be able to pick this jar up.

 Thanks
 Shivaram




 On Tue, Sep 10, 2013 at 1:14 PM, Gowtham N gowtham.n.m...@gmail.com wrote:

 I did it as publish-local.
 I forked mesos/spark to gowthamnatarajan/spark, and I am using that. I
 forked it a few days ago, but did an upstream update today.

 For safety, I will directly clone from mesos now.



 On Tue, Sep 10, 2013 at 1:10 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Did you check out spark from the master branch of github.com/mesos/spark?
 The package names changed recently so you might need to pull. Also just
 checking that you did publish-local in Spark (not public-local as specified
 in the email)?

 Thanks
 Shivaram


 On Tue, Sep 10, 2013 at 1:01 PM, Gowtham N gowtham.n.m...@gmail.com
 wrote:

   Still getting the same error.
  
   I have the spark and MLI folders within a folder called git.
  
   I did clean, package and public-local for spark.
   Then for MLI I did clean, and then package.
   I am still getting the error.
 
  [warn] ::
  [warn] ::  UNRESOLVED DEPENDENCIES ::
  [warn] ::
  [warn] :: org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
  [warn] :: org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
  [warn] ::
  [error] {file:/Users/gowthamn/git/MLI/}default-0b9403/*:update:
  sbt.ResolveException: unresolved dependency:
  org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
  [error] unresolved dependency:
  org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
 
   Should I modify the contents of build.sbt?
   Currently it's:
 
   libraryDependencies ++= Seq(
     "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
     "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
     "org.scalatest" %% "scalatest" % "1.9.1" % "test"
   )
  
   resolvers ++= Seq(
     "Typesafe" at "http://repo.typesafe.com/typesafe/releases",
     "Scala Tools Snapshots" at "http://scala-tools.org/repo-snapshots/",
     "ScalaNLP Maven2" at "http://repo.scalanlp.org/repo",
     "Spray" at "http://repo.spray.cc"
   )
 
 
 
 
 
 
  On Tue, Sep 10, 2013 at 11:58 AM, Evan R. Sparks 
 evan.spa...@gmail.com
  wrote:
 
   Hi Gowtham,
  
   You'll need to do sbt/sbt publish-local in the spark directory
   before trying to build MLI.
  
   - Evan
  
   On Tue, Sep 10, 2013 at 11:37 AM, Gowtham N 
 gowtham.n.m...@gmail.com
   wrote:
I cloned MLI, but am unable to compile it.
   
I get the following dependency exception with other projects.
   
org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
   
Why am I getting this error?
   
I did not change anything from build.sbt
   
 libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
   "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
   "org.scalatest" %% "scalatest" % "1.9.1" % "test"
 )
  
 
 
 
  --
  Gowtham Natarajan
 




 --
 Gowtham Natarajan



 -- 
 Gowtham Natarajan

Re: Adding support for implicit feedback to ALS

2013-09-09 Thread Nick Pentreath
In (3), are you saying that some cross-validation support for picking the best
lambda and alpha should be in there? Or that the preference weightings of
different event types should also be learnt? (Maybe both.)


  


I agree that there should be support for this, by optimising for the best
RMSE, MAP or whatever. I'm just not sure whether this functionality should live
in MLlib or MLI. Until MLI is released it's sort of hard to know.


  


For (4), my frame of reference has been Mahout and my own port of Mahout's ALS
to Spark, and against those this blocked approach is far superior. Though I'm
sure there are more efficiencies to be gained, both in this approach and in
other alternatives.


  


It would certainly be great to further improve the approach as you mention 
in 5. I'm not sure precisely what you mean by task reformulation - how would 
you propose to do so?


  


Nick

—
Sent from Mailbox for iPhone

On Mon, Sep 9, 2013 at 8:28 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

 Sorry, this is not aimed directly at the PR but at the implementation as a whole.
 See if the following, from my experience, is useful:
 1. Implicit feedback is just a corner case of a more general problem:
 given a preference matrix P with P_{i,j} \in \{0,1\}, a weight
 (confidence) matrix C with C_{i,j} \in R, and a regularization rate \lambda,
 compute the L2-regularized ALS fit.
 2. Since the default confidence is never zero (in the paper it is assumed to
 be 1, and I will denote this quantity c_0), write C = C_0 + C' where
 (C_0)_{i,j} = c_0. Hence, rewrite the input in terms of (P, C', c_0), since in
 real life C' becomes a severely sparse matrix in this case.
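 (Concretely, the weighted objective here -- the one from the implicit-feedback
 paper, written with this decomposition -- is
 \min_{X,Y} \sum_{i,j} c_{ij} (p_{ij} - x_i^T y_j)^2
   + \lambda (\sum_i ||x_i||^2 + \sum_j ||y_j||^2),
 with c_{ij} = c_0 + c'_{ij}, where x_i and y_j are the user and item factor
 vectors.)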
 3. It is nice when the input C is known. But there are a lot of cases
 where individual confidences are derived from a small, fixed set of
 hyperparameters corresponding to particular event types (search,
 click, transaction, etc.). Hence, convex optimization over a small set
 of hyperparameters is desired (this might be outside the scope of ALS
 itself, but the weighting and lambda per se aren't). Still, though,
 cross-validation largely relies on holding out data that follows the existing
 entries in C', so cross-validation helpers would naturally be coupled with
 this method and should be provided.
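 (Just as an illustration of what I mean: with per-event-type weights w_e as
 the hyperparameters and n^{(e)}_{ij} the count of events of type e for user i
 and item j, one could take c'_{ij} = \sum_e w_e n^{(e)}_{ij} and tune the w_e,
 together with \lambda and c_0, by cross-validation.)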
 4. I actually used Pregel to avoid the shuffle-and-sort programming model.
 Matrix operations do not require the sorting guarantees produced by reducers,
 only a full-group guarantee. I did not benchmark this approach on really
 substantial datasets though; there are known Bagel limitations, IMO, which
 may create problems for sufficiently large or skewed datasets. I guess I am
 interested in the GraphX release to replace the reliance on Bagel.
 5. If the task reformulation is accepted, there are further optimizations
 that could be applied to the blocking -- but this implementation gets the
 gist of what I did in that regard.
 On Sun, Sep 8, 2013 at 10:58 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
 Hi

 I know everyone's pretty busy with getting 0.8.0 out, but as and when folks
 have time it would be great to get your feedback on this PR adding support
 for the 'implicit feedback' model variant to ALS:
 https://github.com/apache/incubator-spark/pull/4

 In particular, any potential efficiency improvements or issues, and testing
 it out locally, on a cluster, and on some datasets!

 Comments & feedback welcome.

 Many thanks
 Nick

Adding support for implicit feedback to ALS

2013-09-08 Thread Nick Pentreath
Hi

I know everyone's pretty busy with getting 0.8.0 out, but as and when folks
have time it would be great to get your feedback on this PR adding support
for the 'implicit feedback' model variant to ALS:
https://github.com/apache/incubator-spark/pull/4

In particular, any potential efficiency improvements or issues, and testing
it out locally, on a cluster, and on some datasets!
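
For anyone who wants to kick the tyres, usage from the shell should end up
looking roughly like the sketch below (the entry point, parameters and input
format here are illustrative and may still change before the PR is merged):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assumes a SparkContext `sc` from the shell and a CSV of (user, item, count)
// rows -- counts of implicit events, not explicit ratings. The path is a
// placeholder.
val ratings = sc.textFile("hdfs:///path/to/implicit-events.csv")
  .map(_.split(','))
  .map { case Array(user, item, count) =>
    Rating(user.toInt, item.toInt, count.toDouble)
  }

// rank = 20, iterations = 10, lambda = 0.01, alpha = 40.0 (confidence scaling)
val model = ALS.trainImplicit(ratings, 20, 10, 0.01, 40.0)

// Predicted preference for a (user, item) pair -- a relative score, not a rating
val score = model.predict(42, 17)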

Comments & feedback welcome.

Many thanks
Nick