Re: proposal: replace lift-json with spray-json

2014-02-14 Thread Evan Chan
Pascal,

Ah, I stand corrected, thanks.

On Mon, Feb 10, 2014 at 11:49 PM, Pascal Voitot Dev
pascal.voitot@gmail.com wrote:
 Evan,

 Excuse me but that's WRONG that play-json pulls all play deps!

 PLAY/JSON has NO HEAVY DEP ON PLAY!

 I personally worked to make it an independent module in play!
 So play/json has just one big dep which is Jackson!

 I agree that Jackson is the right way to go to begin with.
 But for Scala developers, a thin higher-level layer like play/json is useful
 to bring typesafety...

 Pascal

 On Tue, Feb 11, 2014 at 1:31 AM, Evan Chan e...@ooyala.com wrote:

 By the way, I did a benchmark on JSON parsing performance recently.
  Based on that, spray-json was about 10x slower than the Jackson-based
 parsers.  I recommend json4s-jackson, because Jackson is almost
 certainly already a dependency of Spark (many other Java libraries
 use it), so the dependencies are very lightweight.   I didn't
 benchmark Argonaut or play-json, partly because play-json pulled in
 all the Play dependencies, though as someone else commented in this
 thread, they plan to split it out.
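
 For reference, a minimal json4s-jackson round trip looks roughly like the
 sketch below; TaskSummary is just an illustrative case class, not a Spark type.

 import org.json4s._
 import org.json4s.jackson.JsonMethods._
 import org.json4s.jackson.Serialization

 implicit val formats = DefaultFormats

 // Illustrative payload type only -- not an actual Spark class.
 case class TaskSummary(id: Int, status: String)

 val json: String = Serialization.write(TaskSummary(1, "SUCCESS"))  // case class -> JSON string
 val back: TaskSummary = parse(json).extract[TaskSummary]           // JSON string -> case class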

 On Mon, Feb 10, 2014 at 12:22 AM, Pascal Voitot Dev
 pascal.voitot@gmail.com wrote:
  Hi again,
 
  If Spark just needs JSON with serialization/deserialization of basic
  structures and some potentially simple validations for the web UI, let's
  remember that play/json (without any other dependency than Jackson) is
  about 200-300 lines of code... the only dependency is Jackson, which is
  the best JSON parser that I know. The rest of the code is about typesafety
  & composition... If Spark needs some very performant JSON
  (de)serialization, we will have to look at things like pickling and
  potentially some streaming parsers (I think this is a domain under work
  right now...)
 
  Pascal
 
 
  On Mon, Feb 10, 2014 at 1:50 AM, Will Benton wi...@redhat.com wrote:
 
  Matei, sorry if I was unclear:  I'm referring to downstream operating
  system distributions (like Fedora or Debian) that  have policies
 requiring
  that all packages are built from source (using only tools already
 packaged
  in the distribution).  So end-users (and distributions with different
  policies) don't have to build Lift to get the lift-json artifact, but
 it is
  a concern for many open-source communities.
 
  best,
  wb
 
 
  - Original Message -
   From: Matei Zaharia matei.zaha...@gmail.com
   To: dev@spark.incubator.apache.org
   Sent: Sunday, February 9, 2014 4:29:20 PM
   Subject: Re: proposal:  replace lift-json with spray-json
  
   Will, why are you saying that downstream distributes need to build
 all of
   Lift to package lift-json? Spark just downloads it from Maven Central,
  where
   it's a JAR with no external dependencies. We don't have any
 dependency on
   the rest of lift.
  
   Matei
  
   On Feb 9, 2014, at 11:28 AM, Will Benton wi...@redhat.com wrote:
  
lift-json is a nice library, but Lift is a pretty heavyweight
  dependency to
track just for its JSON support.  (lift-json is relatively
  self-contained
as a dependency from an end-user's perspective, but downstream
distributors need to build all of Lift in order to package the JSON
support.)  I understand that this has come up before (cf. SPARK-883)
  and
that the uncertain future of JSON support in the Scala standard
  library is
the motivator for relying on an external library.
   
I'm proposing replacing lift-json in Spark with something more
  lightweight.
I've evaluated apparent project liveness and dependency scope for
 most
  of
the current Scala JSON libraries and believe the best candidate is
spray-json (https://github.com/spray/spray-json), the JSON library
  used by
the Spray HTTP toolkit. spray-json is Apache-licensed, actively
  developed,
and builds and works independently of Spray with only one external
dependency.
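
For a concrete sense of what using it looks like, here is a minimal spray-json
round trip; the BlockStatus case class below is only illustrative, not an
actual Spark type.

import spray.json._
import DefaultJsonProtocol._

// Illustrative payload type only -- not an actual Spark class.
case class BlockStatus(id: String, memSize: Long)
implicit val blockStatusFormat = jsonFormat2(BlockStatus)

val js: JsValue = BlockStatus("rdd_0_1", 1024L).toJson                      // serialize
val back: BlockStatus = JsonParser(js.compactPrint).convertTo[BlockStatus]  // parse + deserialize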
   
It looks to me like a pretty straightforward change (although
JsonProtocol.scala would be a little more verbose since it couldn't
 use
the Lift JSON DSL), and I'd like to do it.  I'm writing now to ask
 for
some community feedback before making the change (and submitting a
 JIRA
and PR).  If no one has any serious objections (to the effort in
  general
or to the choice of spray-json in particular), I'll go ahead and
 do
  it,
but if anyone has concerns, I'd be happy to discuss and address them
before getting started.
   
   
thanks,
wb
  
  
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Fast Serialization

2014-02-13 Thread Evan Chan
Any interest in adding Fast Serialization (or possibly replacing the
default of Java Serialization)?
https://code.google.com/p/fast-serialization/
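
For context, the serializer is already pluggable via the spark.serializer
setting; an FST-backed implementation of org.apache.spark.serializer.Serializer
would plug in the same way Kryo does today. A rough sketch using the current
property-style configuration -- the FST class name below is hypothetical:

import org.apache.spark.SparkContext

// Kryo is the existing non-default serializer; an FST-backed one would be
// selected the same way.
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// System.setProperty("spark.serializer", "org.apache.spark.serializer.FSTSerializer")  // hypothetical class
val sc = new SparkContext("local", "serializer-demo")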

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Proposal: Clarifying minor points of Scala style

2014-02-10 Thread Evan Chan
+1 to the proposal.

On Mon, Feb 10, 2014 at 2:56 PM, Michael Armbrust
mich...@databricks.com wrote:
 +1 to Shivaram's proposal.  I think we should try to avoid functions with
 many args as much as possible so having a high vertical cost here isn't the
 worst thing.  I also like the visual consistency.

 FWIW, (based on a cursory inspection) in the scala compiler they don't seem
 to ever orphan the return type from the closing parenthesis.  It seems there
 are two main accepted styles:

 def mkSlowPathDef(clazz: Symbol, lzyVal: Symbol, cond: Tree, syncBody: List[Tree],
     stats: List[Tree], retVal: Tree): Tree = {

 and

 def tryToSetIfExists(
   cmd: String,
   args: List[String],
   setter: (Setting) => (List[String] => Option[List[String]])
 ): Option[List[String]] =


 On Mon, Feb 10, 2014 at 2:36 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Yeah that was my proposal - Essentially we can just have two styles:
 The entire function + parameterList + return type fits in one line or
 when it doesn't we wrap parameters into lines.
 I agree that it makes the code a bit more verbose, but it'll make code
 style more consistent.

 Shivaram

 On Mon, Feb 10, 2014 at 2:13 PM, Aaron Davidson ilike...@gmail.com
 wrote:
  Shivaram, is your recommendation to wrap the parameter list even if it
 fits,
  but just the return type doesn't? Personally, I think the cost of moving
  from a single-line parameter list to an n-line list is pretty high, as it
  takes up a lot more space. I am even in favor of allowing a parameter
 list
  to overflow into a second line (but not a third) instead of spreading
 them
  out, if it's a private helper method (where the parameters are probably
 not
  as important as the implementation, unlike a public API).
 
 
  On Mon, Feb 10, 2014 at 1:42 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
 
  For the 1st case wouldn't it be better to just wrap the parameters to
  the next line as we do in other cases ? For example
 
  def longMethodName(
  param1,
  param2, ...) : Long = {
  }
 
  Are there a lot functions which use the old format ? Can we just stick
  to the above for new functions ?
 
  Thanks
  Shivaram
 
  On Mon, Feb 10, 2014 at 11:33 AM, Reynold Xin r...@databricks.com
 wrote:
   +1 on both
  
  
   On Mon, Feb 10, 2014 at 1:34 AM, Aaron Davidson ilike...@gmail.com
   wrote:
  
   There are a few bits of the Scala style that are underspecified by
   both the Scala
   style guide http://docs.scala-lang.org/style/ and our own
   supplemental
   notes
  
  
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide.
   Often, this leads to inconsistent formatting within the codebase, so
   I'd
   like to propose some general guidelines which we can add to the wiki
   and
   use in the future:
  
   1) Line-wrapped method return type is indented with two spaces:
    def longMethodName(... long param list ...)
      : Long = {
    }
  
   *Justification: *I think this is the most commonly used style in
 Spark
   today. It's also similar to the extends style used in classes, with
   the
   same justification: it is visually distinguished from the 4-indented
   parameter list.
  
   2) URLs and code examples in comments should not be line-wrapped.
   Here
  
  
 https://github.com/apache/incubator-spark/pull/557/files#diff-c338f10f3567d4c1d7fec4bf9e2677e1L29
   is
   an example of the latter.
  
   *Justification*: Line-wrapping can cause confusion when trying to
   copy-paste a URL or command. Can additionally cause IDE issues or,
   avoidably, Javadoc issues.
  
   Any thoughts on these, or additional style issues not explicitly
   covered in
   either the Scala style guide or Spark wiki?
  
 
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Proposal for Spark Release Strategy

2014-02-06 Thread Evan Chan
 as a multi-module
  project.
 
  Each Spark release will be versioned:
  [MAJOR].[MINOR].[MAINTENANCE]
 
  All releases with the same major version number will have API
  compatibility, defined as [1]. Major version numbers will remain
  stable over long periods of time. For instance, 1.X.Y may last 1
 year
  or more.
 
  Minor releases will typically contain new features and improvements.
  The target frequency for minor releases is every 3-4 months. One
  change we'd like to make is to announce fixed release dates and
 merge
  windows for each release, to facilitate coordination. Each minor
  release will have a merge window where new patches can be merged, a
 QA
  window when only fixes can be merged, then a final period where
 voting
  occurs on release candidates. These windows will be announced
  immediately after the previous minor release to give people plenty
 of
  time, and over time, we might make the whole release process more
  regular (similar to Ubuntu). At the bottom of this document is an
  example window for the 1.0.0 release.
 
  Maintenance releases will occur more frequently and depend on
 specific
  patches introduced (e.g. bug fixes) and their urgency. In general
  these releases are designed to patch bugs. However, higher level
  libraries may introduce small features, such as a new algorithm,
  provided they are entirely additive and isolated from existing code
  paths. Spark core may not introduce any features.
 
  When new components are added to Spark, they may initially be marked
  as alpha. Alpha components do not have to abide by the above
  guidelines, however, to the maximum extent possible, they should try
  to. Once they are marked stable they have to follow these
  guidelines. At present, GraphX is the only alpha component of Spark.
 
  [1] API compatibility:
 
  An API is any public class or interface exposed in Spark that is not
  marked as semi-private or experimental. Release A is API compatible
  with release B if code compiled against release A *compiles cleanly*
  against B. This does not guarantee that a compiled application that
 is
  linked against version A will link cleanly against version B without
  re-compiling. Link-level compatibility is something we'll try to
  guarantee as well, and we might make it a requirement in the
  future, but challenges with things like Scala versions have made
 this
  difficult to guarantee in the past.
 
  == Merging Pull Requests ==
  To merge pull requests, committers are encouraged to use this tool
 [2]
  to collapse the request into one commit rather than manually
  performing git merges. It will also format the commit message nicely
  in a way that can be easily parsed later when writing credits.
  Currently it is maintained in a public utility repository, but we'll
  merge it into mainline Spark soon.
 
  [2]
 
 https://github.com/pwendell/spark-utils/blob/master/apache_pr_merge.py
 
  == Tentative Release Window for 1.0.0 ==
  Feb 1st - April 1st: General development
  April 1st: Code freeze for new features
  April 15th: RC1
 
  == Deviations ==
  For now, the proposal is to consider these tentative guidelines. We
  can vote to formalize these as project rules at a later time after
  some experience working with them. Once formalized, any deviation to
  these guidelines will be subject to a lazy majority vote.
 
  - Patrick
 







-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Proposal for Spark Release Strategy

2014-02-06 Thread Evan Chan
The other reason for waiting is stability.

It would be great to have as a goal for 1.0.0 that under most heavy
use scenarios, workers and executors don't just die, which is not true
today.
Also, there should be minimal silent failures which are difficult to debug.

On Thu, Feb 6, 2014 at 11:54 AM, Evan Chan e...@ooyala.com wrote:
 +1 for 0.10.0.

 It would give more time to study things (such as the new SparkConf)
 and let the community decide if any breaking API changes are needed.

 Also, a +1 for minor revisions not breaking code compatibility,
 including Scala versions.   (I guess this would mean that 1.x would
 stay on Scala 2.10.x)

 On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Bleh, hit send too early again.  My second paragraph was to argue for 1.0.0
 instead of 0.10.0, not to hammer on the binary compatibility point.


 On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 *Would it make sense to put in something that strongly discourages binary
 incompatible changes when possible?


 On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Not codifying binary compatibility as a hard rule sounds fine to me.
  Would it make sense to put something in that . I.e. avoid making needless
 changes to class hierarchies.

 Whether Spark considers itself stable or not, users are beginning to
 treat it so.  A responsible project will acknowledge this and provide the
 stability needed by its user base.  I think some projects have made the
 mistake of waiting too long to release a 1.0.0.  It allows them to put off
 making the hard decisions, but users and downstream projects suffer.

 If Spark needs to go through dramatic changes, there's always the option
 of a 2.0.0 that allows for this.

 -Sandy



 On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 I think it's important to do 1.0 next. The project has been around for 4
 years, and I'd be comfortable maintaining the current codebase for a long
 time in an API and binary compatible way through 1.x releases. Over the
 past 4 years we haven't actually had major changes to the user-facing API 
 --
 the only ones were changing the package to org.apache.spark, and upgrading
 the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
 example, or later cross-building it for Scala 2.11. Updating to 1.0 says
 two things: it tells users that they can be confident that version will be
 maintained for a long time, which we absolutely want to do, and it lets
 outsiders see that the project is now fairly mature (for many people,
 pre-1.0 might still cause them not to try it). I think both are good for
 the community.

 Regarding binary compatibility, I agree that it's what we should strive
 for, but it just seems premature to codify now. Let's see how it works
 between, say, 1.0 and 1.1, and then we can codify it.

 Matei

 On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Thanks Patrick for initiating the discussion about the next road map for
 Apache Spark.
 
  I am +1 for 0.10.0 for next version.
 
  It will give us as community some time to digest the process and the
  vision and make adjustment accordingly.
 
  Releasing a 1.0.0 is a huge milestone, and if we do need to break APIs
  somehow or modify internal behavior dramatically, we could take
  advantage of the step up to 1.0.0 to do that.
 
 
  - Henry
 
 
 
  On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
 wrote:
  Agree on timeboxed releases as well.
 
  Is there a vision for where we want to be as a project before
 declaring the
  first 1.0 release?  While we're in the 0.x days per semver we can
 break
  backcompat at will (though we try to avoid it where possible), and
 that
  luxury goes away with 1.x.  I just don't want to release a 1.0 simply
  because it seems to follow after 0.9 rather than making an intentional
  decision that we're at the point where we can stand by the current
 APIs and
  binary compatibility for the next year or so of the major release.
 
  Until that decision is made as a group I'd rather we do an immediate
  version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
 later,
  replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to
 1.0
  but not the other way around.
 
  https://github.com/apache/incubator-spark/pull/542
 
  Cheers!
  Andrew
 
 
  On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
 wrote:
 
  +1 on time boxed releases and compatibility guidelines
 
 
  Am 06.02.2014 um 01:20 schrieb Patrick Wendell pwend...@gmail.com
 :
 
  Hi Everyone,
 
  In an effort to coordinate development amongst the growing list of
  Spark contributors, I've taken some time to write up a proposal to
  formalize various pieces of the development process. The next
 release
  of Spark will likely be Spark 1.0.0, so this message is intended in
  part to coordinate the release plan for 1.0.0

ApacheCon

2014-01-30 Thread Evan Chan
I might have missed it earlier, but is anybody planning to present at
ApacheCon?  I think it's in Denver this year, April 7-9.

Thinking of submitting a talk about how we use Spark and Cassandra.

-Evan


-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations?

2014-01-28 Thread Evan Chan
By the way, is there any plan to make a pluggable backend for
checkpointing?   We might be interested in writing a, for example,
Cassandra backend.

On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan junluan@intel.com wrote:
 Hi all

 The description about this Bug submitted by Matei is as following


 The tipping point seems to be around 50. We should fix this by checkpointing 
 the RDDs every 10-20 iterations to break the lineage chain, but checkpointing 
 currently requires HDFS installed, which not all users will have.

 We might also be able to fix DAGScheduler to not be recursive.


 regards,
 Andrew
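
 For reference, a minimal sketch of the checkpoint-every-N-iterations
 workaround described above. The RDD and the "iteration" step are trivial
 stand-ins for the real ALS factor RDD and update, and the checkpoint path is
 illustrative (checkpointing currently needs HDFS on a real cluster):

 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._   // PairRDDFunctions for mapValues

 val sc = new SparkContext("local[4]", "als-checkpoint-demo")
 sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")               // illustrative path

 var factors = sc.parallelize(1 to 1000).map(i => (i, i.toDouble))  // stand-in for the factor RDD
 for (i <- 1 to 50) {                                               // enough iterations to stress lineage
   factors = factors.mapValues(_ * 0.9)                             // stand-in for one ALS update step
   if (i % 10 == 0) {
     factors.cache()
     factors.checkpoint()    // truncates the lineage chain
     factors.count()         // forces the checkpoint to materialize
   }
 }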




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [DISCUSS] Graduating as a TLP

2014-01-24 Thread Evan Chan
+1 to both!

On Thu, Jan 23, 2014 at 11:32 PM, Austen McRae austen.mc...@gmail.com wrote:
 +1 to both (awesome job all!)


 On Thu, Jan 23, 2014 at 10:25 PM, Haoyuan Li haoyuan...@gmail.com wrote:

 +1 to both.


 On Thu, Jan 23, 2014 at 8:33 PM, Mosharaf Chowdhury 
 mosharafka...@gmail.com
  wrote:

   +1 & +1
 
  On Thu, Jan 23, 2014 at 8:26 PM, Josh Rosen rosenvi...@gmail.com
 wrote:
 
   +1 to both!
  
  
   On Thu, Jan 23, 2014 at 8:22 PM, prabeesh k prabsma...@gmail.com
  wrote:
  
+1
   
   
On Fri, Jan 24, 2014 at 6:55 AM, Hossein fal...@gmail.com wrote:
   
 +1 to both.

 --Hossein


 On Thu, Jan 23, 2014 at 4:40 PM, Mark Hamstra 
  m...@clearstorydata.com
 wrote:

  Is there any other choice that makes any kind of sense?  Matei
 for
   VP:
 +1.
 
 
  On Thu, Jan 23, 2014 at 4:32 PM, Andy Konwinski 
andykonwin...@gmail.com
  wrote:
 
   +2 (1 for graduating + 1 for matei as VP)!
  
  
   On Thu, Jan 23, 2014 at 4:11 PM, Chris Mattmann 
   mattm...@apache.org

   wrote:
  
+1 from me.
   
I'll throw Matei's name into the hat for VP. He's done a
 great
   job
and has stood out to me with his report filing and tenacity
 and
would make an excellent chair.
   
Being a chair entails:
   
1. being the eyes and ears of the board on the project.
2. filing a monthly (first 3 months, then quarterly) board
  report
similar to the incubator report.
   
Not too bad.
   
+1 for graduation from me binding when the VOTE comes. We
 need
   our
mentors and IPMC members to chime in and we should be in time
  for
February 2014 board meeting.
   
Cheers,
Chris
   
   
-Original Message-
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: dev@spark.incubator.apache.org 
   dev@spark.incubator.apache.org

Date: Thursday, January 23, 2014 2:45 PM
To: dev@spark.incubator.apache.org 
dev@spark.incubator.apache.org
 
Subject: [DISCUSS] Graduating as a TLP
   
Hi folks,

 We've been working on the transition to Apache for a while, and our last
 shepherd's report says the following:


Spark

Alan Cabrera (acabrera):

   Seems like a nice active project.  IMO, there's no need to wait for the
   import to JIRA to graduate. Seems like they can graduate now.


 What do you think about graduating to a top-level project?  As far as I
 can tell, we've completed all the requirements for graduating:

- Made 2 releases (working on a third now)
- Added new committers and PPMC members (4 of them)
- Did IP clearance
 - Moved infrastructure over to Apache, except for the JIRA above, which
   INFRA is working on and which shouldn't block us.

 If everything is okay, I'll call a VOTE on graduating in 48 hours.  The
 one final thing missing is that we'll need to nominate an initial VP for
 the project.

Matei
   
   
   
  
 

   
  
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Is there any plan to develop an application level fair scheduler?

2014-01-17 Thread Evan Chan
What is the reason that standalone mode doesn't support the fair scheduler?
Does that mean that Mesos coarse mode also doesn't support the fair
scheduler?
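
For what it's worth, fair scheduling of jobs *within* a single application is
independent of the cluster manager; it is fair sharing *across* applications
that standalone mode leaves to FIFO. A minimal sketch of the within-application
mode, assuming the 0.8-era property names (they may differ in later releases):

import org.apache.spark.SparkContext

System.setProperty("spark.scheduler.mode", "FAIR")
// Optionally point at an XML file defining pools and weights:
// System.setProperty("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext("local[4]", "fair-pools-demo")

// Jobs submitted from this thread go into the named pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 1000, 4).count()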


On Tue, Jan 14, 2014 at 8:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 This is true for now, we didn’t want to replicate those systems. But it
 may change if we see demand for fair scheduling in our standalone cluster
 manager.

 Matei

 On Jan 14, 2014, at 6:32 PM, Xia, Junluan junluan@intel.com wrote:

  Yes, Spark depends on Yarn or Mesos for application level scheduling.
 
  -Original Message-
  From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
  Sent: Tuesday, January 14, 2014 9:43 PM
  To: dev@spark.incubator.apache.org
  Subject: Re: Is there any plan to develop an application level fair
 scheduler?
 
  Hi, Junluan,
 
  Thank you for the reply
 
  but for the long-term plan, Spark will depend on Yarn and Mesos for
 application level scheduling in the coming versions?
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, January 14, 2014 at 12:56 AM, Xia, Junluan wrote:
 
  Are you sure that you must deploy Spark in standalone mode? (It
 currently only supports FIFO.)
 
  If you can set up Spark on YARN or Mesos, then the fair scheduler is
 supported at the application level.
 
  -Original Message-
  From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
  Sent: Tuesday, January 14, 2014 10:13 AM
  To: dev@spark.incubator.apache.org (mailto:
 dev@spark.incubator.apache.org)
  Subject: Is there any plan to develop an application level fair
 scheduler?
 
  Hi, All
 
  Is there any plan to develop an application level fair scheduler?
 
  I think it will have more value than a fair scheduler within the
 application (actually I didn’t understand why we want to fairly share the
 resource among jobs within the application; usually, users submit
 different applications, not jobs)…
 
  Best,
 
  --
  Nan Zhu
 
 
 
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: spark code formatter?

2014-01-17 Thread Evan Chan
BTW, we also run Scalariform, but we don't turn it on automatically.  We
find that for the most part it is good, but there are a few places where it
reformats things and doesn't look good, and requires cleanup.  I think
Scalariform requires some more rules to make it more generally useful.
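
For example, a rough sketch of how a build can limit the rule set; the plugin
coordinates, version, and setting keys are illustrative and vary across
sbt-scalariform releases:

// project/plugins.sbt -- coordinates and version are illustrative.
addSbtPlugin("com.typesafe.sbt" % "sbt-scalariform" % "1.0.1")

// Build definition -- add the plugin's default settings, then restrict the
// formatter to a small set of preferences.
import com.typesafe.sbt.SbtScalariform._
import scalariform.formatter.preferences._

scalariformSettings

ScalariformKeys.preferences := FormattingPreferences()
  .setPreference(AlignParameters, false)
  .setPreference(DoubleIndentClassDeclaration, true)
  .setPreference(PreserveSpaceBeforeArguments, true)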

-Evan



On Thu, Jan 9, 2014 at 12:23 PM, DB Tsai dbt...@alpinenow.com wrote:

 Initially, we also had the same concern, so we started from limited
 set of rules. Gradually, we found that it increases the productivity
 and readability of our codebase.

 PS, Scalariform is compatible with the Scala Style Guide in the sense
 that, given the right preference settings, source code that is
  initially compliant with the Style Guide will not become uncompliant
 after formatting. In a number of cases, running the formatter will
 make uncompliant source more compliant.

 I added the configuration option in the latest PR to limit the set of
 rules. The options are https://github.com/mdr/scalariform

  When developers want to choose their own style for whatever reason,
  they can use source directives to turn it off via `// format: OFF`.

  Just quickly ran the formatter, and I found that Spark is in general
 in good shape; most of the changes are extra space after semicolon.

  -  def run[K: Manifest, V <: Vertex : Manifest, M <: Message[K] :
  Manifest, C: Manifest](

  +  def run[K: Manifest, V <: Vertex: Manifest, M <: Message[K]:
  Manifest, C: Manifest](

 -  def addFile(file: File) : String = {
 + def addFile(file: File): String = {

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/


 On Thu, Jan 9, 2014 at 12:32 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  I'm also very wary of using a code formatter for the reasons already
  mentioned by Reynold.
 
   Does scalariform have a mode where it just provides style checks rather
  than reformat the code? This is something we really need for, e.g.,
  reviewing the many submissions to the project.
 
  - Patrick
 
  On Wed, Jan 8, 2014 at 11:51 PM, Reynold Xin r...@databricks.com
 wrote:
  Thanks for doing that, DB. Not sure about others, but I'm actually
 strongly
  against blanket automatic code formatters, given that they can be
  disruptive. Often humans would intentionally choose to style things in a
  certain way for more clear semantics and better readability. Code
  formatters don't capture these nuances. It is pretty dangerous to just
 auto
  format everything.
 
  Maybe it'd be ok if we restrict the code formatters to a very limited
 set
  of things, such as indenting function parameters, etc.
 
 
  On Wed, Jan 8, 2014 at 10:28 PM, DB Tsai dbt...@alpinenow.com wrote:
 
  A pull request for scalariform.
  https://github.com/apache/incubator-spark/pull/365
 
  Sincerely,
 
  DB Tsai
  Machine Learning Engineer
  Alpine Data Labs
  --
  Web: http://alpinenow.com/
 
 
  On Wed, Jan 8, 2014 at 10:09 PM, DB Tsai dbt...@alpinenow.com wrote:
   We use sbt-scalariform in our company, and it can automatically
 format
   the coding style when runs `sbt compile`.
  
   https://github.com/sbt/sbt-scalariform
  
   We ask our developers to run `sbt compile` before commit, and it's
   really nice to see everyone has the same spacing and indentation.
  
   Sincerely,
  
   DB Tsai
   Machine Learning Engineer
   Alpine Data Labs
   --
   Web: http://alpinenow.com/
  
  
   On Wed, Jan 8, 2014 at 9:50 PM, Reynold Xin r...@databricks.com
 wrote:
   We have a Scala style configuration file in Shark:
   https://github.com/amplab/shark/blob/master/scalastyle-config.xml
  
   However, the scalastyle project is still pretty primitive and
 doesn't
  cover
   most of the use cases. It is still great to include it to cover
 basic
   checks such as 100-char wide lines.
  
  
   On Wed, Jan 8, 2014 at 8:02 PM, Matei Zaharia 
 matei.zaha...@gmail.com
  wrote:
  
   Not that I know of. This would be very useful to add, especially
 if we
  can
   make SBT automatically check the code style (or we can somehow plug
  this
   into Jenkins).
  
   Matei
  
   On Jan 8, 2014, at 11:00 AM, Michael Allman m...@allman.ms wrote:
  
Hi,
   
I've read the spark code style guide for contributors here:
   
   
 
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
   
For scala code, do you have a scalariform configuration that you
 use
  to
   format your code to these specs?
   
Cheers,
   
Michael
  
  
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


scala-graph

2013-12-27 Thread Evan Chan
http://www.scala-graph.org/

Have you guys seen the above site?  I wonder if this will ever be
merged into the Scala standard library, but might be interesting to
see if this fits into GraphX at all, or to add a Spark backend to it.

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Option folding idiom

2013-12-26 Thread Evan Chan
+1 for using more functional idioms in general.

That's a pretty clever use of `fold`, but putting the default condition
first there makes it not as intuitive.   What about the following, which
are more readable?

option.map { a => someFuncMakesB() }
  .getOrElse(b)

option.map { a => someFuncMakesB() }
  .orElse(Option(otherDefaultB())).get


On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra m...@clearstorydata.com wrote:

 In code added to Spark over the past several months, I'm glad to see more
 use of `foreach`, `for`, `map` and `flatMap` over `Option` instead of
 pattern matching boilerplate.  There are opportunities to push `Option`
 idioms even further now that we are using Scala 2.10 in master, but I want
 to discuss the issue here a little bit before committing code whose form
 may be a little unfamiliar to some Spark developers.

 In particular, I really like the use of `fold` with `Option` to cleanly and
 concisely express the "do something if the Option is None; do something
 else with the thing contained in the Option if it is Some" code fragment.

 An example:

 Instead of...

 val driver = drivers.find(_.id == driverId)
 driver match {
   case Some(d) =>
     if (waitingDrivers.contains(d)) { waitingDrivers -= d }
     else {
       d.worker.foreach { w =>
         w.actor ! KillDriver(driverId)
       }
     }
     val msg = s"Kill request for $driverId submitted"
     logInfo(msg)
     sender ! KillDriverResponse(true, msg)
   case None =>
     val msg = s"Could not find running driver $driverId"
     logWarning(msg)
     sender ! KillDriverResponse(false, msg)
 }

 ...using fold we end up with...

 driver.fold
   {
     val msg = s"Could not find running driver $driverId"
     logWarning(msg)
     sender ! KillDriverResponse(false, msg)
   }
   { d =>
     if (waitingDrivers.contains(d)) { waitingDrivers -= d }
     else {
       d.worker.foreach { w =>
         w.actor ! KillDriver(driverId)
       }
     }
     val msg = s"Kill request for $driverId submitted"
     logInfo(msg)
     sender ! KillDriverResponse(true, msg)
   }


 So the basic pattern (and my proposed formatting standard) for folding over
 an `Option[A]` from which you need to produce a B (which may be Unit if
 you're only interested in side effects) is:

 anOption.fold
   {
     // something that evaluates to a B if anOption = None
   }
   { a =>
     // something that transforms `a` into a B if anOption = Some(a)
   }


 Any thoughts?  Does anyone really, really hate this style of coding and
 oppose its use in Spark?




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-24 Thread Evan Chan
 to
  institutionally guarantee this. This is because discussion on mailing
 lists
  is one of the core things that defines an Apache community. So at a
 minimum
  this means Apache owning the master copy of the bits.
 
  All that said, we are discussing the possibility of having a google
  group that subscribes to each list that would provide an easier to
 use and
  prettier archive for each list (so far we haven't gotten that to
 work).
 
  I hope this was helpful. It has taken me a few years now, and a lot of
  conversations with experienced (and patient!) Apache mentors, to
  internalize some of the nuance about the Apache way. That's why I
 wanted
  to share.
 
  Andy
 
  On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts masp...@gmail.com
 wrote:
 
  I notice that there are still a lot of active topics in this group:
  and also activity on the apache mailing list (which is a really
 horrible
  experience!).  Is it a firm policy on apache's front to disallow
 external
  groups?  I'm going to be ramping up on spark, and I really hate the
 idea of
  having to rely on the apache archives and my mail client.  Also:
 having to
  search for topics/keywords both in old threads (here) as well as new
  threads in apache's (clunky) archive, is going to be a pain!  I
 almost feel
  like I must be missing something because the current solution seems
  unfeasibly awkward!
 
 
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: Akka problem when using scala command to launch Spark applications in the current 0.9.0-SNAPSHOT

2013-12-24 Thread Evan Chan
Hi Reynold,

The default, documented methods of starting Spark all use the assembly jar,
and thus java, right?

-Evan



On Fri, Dec 20, 2013 at 11:36 PM, Reynold Xin r...@databricks.com wrote:

 It took me hours to debug a problem yesterday on the latest master branch
 (0.9.0-SNAPSHOT), and I would like to share with the dev list in case
 anybody runs into this Akka problem.

 A little background for those of you who haven't followed closely the
 development of Spark and YARN 2.2: YARN 2.2 uses protobuf 2.5, and Akka
 uses an older version of protobuf that is not binary compatible. In order
 to have a single build that is compatible for both YARN 2.2 and pre-2.2
 YARN/Hadoop, we published a special version of Akka that builds with
 protobuf shaded (i.e. using a different package name for the protobuf
 stuff).

 However, it turned out Scala 2.10 includes a version of Akka jar in its
 default classpath (look at the lib folder in Scala 2.10 binary
 distribution). If you use the scala command to launch any Spark application
 on the current master branch, there is a pretty high chance that you
 wouldn't be able to create the SparkContext (stack trace at the end of the
 email). The problem is that the Akka packaged with Scala 2.10 takes
 precedence in the classloader over the special Akka version Spark includes.

 Before we have a good solution for this, the workaround is to use java to
 launch the application instead of scala. All you need to do is to include
 the right Scala jars (scala-library and scala-compiler) in the classpath.
 Note that the scala command is really just a simple script that calls java
 with the right classpath.


 Stack trace:

 java.lang.NoSuchMethodException:
 akka.remote.RemoteActorRefProvider.<init>(java.lang.String,
 akka.actor.ActorSystem$Settings, akka.event.EventStream,
 akka.actor.Scheduler, akka.actor.DynamicAccess)
 at java.lang.Class.getConstructor0(Class.java:2763)
 at java.lang.Class.getDeclaredConstructor(Class.java:2021)
 at

 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:77)
 at scala.util.Try$.apply(Try.scala:161)
 at

 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:74)
 at

 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
 at

 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
 at scala.util.Success.flatMap(Try.scala:200)
 at

 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:85)
 at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:546)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
 at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:79)
 at
 org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:120)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:106)




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: Re : Scala 2.10 Merge

2013-12-16 Thread Evan Chan
 changes).
  Please open new threads on the dev list to report and discuss any
  issues.
 
  This merge will temporarily drop support for YARN 2.2 on the master
  branch.
  This is because the workaround we used was only compiled for Scala
 2.9.
  We are going to come up with a more robust solution to YARN 2.2
  support before releasing 0.9.
 
  Going forward, we will continue to make maintenance releases on
  branch-0.8 which will remain compatible with Scala 2.9.
 
  For those interested, the primary code changes in this merge are
  upgrading the akka version, changing the use of Scala 2.9's
  ClassManifest construct to Scala 2.10's ClassTag, and updating the
  spark shell to work with Scala 2.10's repl.
 
  - Patrick
 
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: Spark API - support for asynchronous calls - Reactive style [I]

2013-12-12 Thread Evan Chan
Mark,

Thanks.  The FutureAction API looks awesome.


On Mon, Dec 9, 2013 at 9:31 AM, Mark Hamstra m...@clearstorydata.com wrote:

 Spark has already supported async jobs for a while now --
 https://github.com/apache/incubator-spark/pull/29, and they even work
 correctly after https://github.com/apache/incubator-spark/pull/232

 There are now implicit conversions from RDD to
 AsyncRDDActions
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala
 ,
 where async actions like countAsync are defined.
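
 For anyone curious, a minimal usage sketch based on the files linked above
 (the SparkContext setup is only for illustration):

 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._                 // brings in the rddToAsyncRDDActions implicit
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.util.{Failure, Success}

 val sc = new SparkContext("local[4]", "async-demo")
 val nums = sc.parallelize(1 to 1000000, 8)

 val f = nums.countAsync()                              // FutureAction[Long], a scala.concurrent.Future
 f.onComplete {
   case Success(n)  => println(s"count = $n")
   case Failure(ex) => println(s"job failed: $ex")
 }
 // f.cancel()                                          // a FutureAction can also cancel the running job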


 On Mon, Dec 9, 2013 at 5:46 AM, Deenar Toraskar deenar.toras...@db.com
 wrote:

  Classification: For internal use only
  Hi developers
 
  Are there any plans to have Spark (and Shark) APIs that are asynchronous
  and non blocking? APIs that return Futures and Iteratee/Enumerators would
  be very useful to users building scalable apps using Spark, specially
 when
  combined with a fully asynchronous/non-blocking framework like Play!.
 
  Something along the lines of ReactiveMongo
 
 
 http://stephane.godbillon.com/2012/08/30/reactivemongo-for-scala-unleashing-mongodb-streaming-capabilities-for-realtime-web
 
 
  Deenar
 
  ---
  This e-mail may contain confidential and/or privileged information. If
 you
  are not the intended recipient (or have received this e-mail in error)
  please notify the sender immediately and delete this e-mail. Any
  unauthorized copying, disclosure or distribution of the material in this
  e-mail is strictly forbidden.
 
  Please refer to http://www.db.com/en/content/eu_disclosures.htm for
  additional EU corporate and regulatory disclosures.
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)

2013-12-12 Thread Evan Chan
I'd be personally fine with a standard workflow of assemble-deps +
packaging just the Spark files as separate packages, if it speeds up
everyone's development time.


On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote:

 I don't know how to make sense of the numbers, but here's what I've got
 from a very small sample size.

 For both v0.8.0-incubating and v0.8.1-incubating, building separate
 assemblies is faster than `./sbt/sbt assembly` and the times for building
 separate assemblies for 0.8.0 and 0.8.1 are about the same.

 For v0.8.0-incubating, `./sbt/sbt assembly` takes about 2.5x as long as the
 sum of the separate assemblies.
 For v0.8.1-incubating, `./sbt/sbt assembly` takes almost 8x as long as the
 sum of the separate assemblies.

 Weird.


 On Wed, Dec 11, 2013 at 11:49 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'll +1 myself also.
 
  For anyone who has the slow build problem: does this issue happen when
  building v0.8.0-incubating also? Trying to figure out whether it's
  related to something we added in 0.8.1 or if it's a long standing
  issue.
 
  - Patrick
 
  On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com
 
  wrote:
   Woah, weird, but definitely good to know.
  
   If you’re doing Spark development, there’s also a more convenient
 option
  added by Shivaram in the master branch. You can do sbt assemble-deps to
  package *just* the dependencies of each project in a special assembly
 JAR,
  and then use sbt compile to update the code. This will use the classes
  directly out of the target/scala-2.9.3/classes directories. You have to
  redo assemble-deps only if your external dependencies change.
  
   Matei
  
   On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com
  wrote:
  
   I hope this PR https://github.com/apache/incubator-spark/pull/252 can
  help.
   Again this is not a blocker for the release from my side either.
  
  
   On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra 
 m...@clearstorydata.com
  wrote:
  
   Interesting, and confirmed: On my machine where `./sbt/sbt assembly`
  takes
   a long, long, long time to complete (a MBP, in my case), building
  three
   separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt
   examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less
  time.
  
  
  
   On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma 
  scrapco...@gmail.com
   wrote:
  
   forgot to mention, after running sbt/sbt assembly/assembly running
   sbt/sbt
   examples/assembly takes just 37s. Not to mention my hardware is not
   really
   great.
  
  
   On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma 
  scrapco...@gmail.com
   wrote:
  
   Hi Patrick and Matei,
  
    Was trying this out and followed the quick start guide, which says to
    do sbt/sbt assembly; like a few others I was also stuck for a few
    minutes on Linux. On the other hand, if I use sbt/sbt assembly/assembly
    it is much faster.
  
    Should we change the documentation to reflect this? It will not be
    great for first-time users to get stuck there.
  
  
   On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia 
   matei.zaha...@gmail.com
   wrote:
  
   +1
  
   Built and tested it on Mac OS X.
  
   Matei
  
  
   On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com
   wrote:
  
   Please vote on releasing the following candidate as Apache Spark
   (incubating) version 0.8.1.
  
   The tag to be voted on is v0.8.1-incubating (commit b87d31d):
  
  
  
  
 
 https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e
  
   The release files, including signatures, digests, etc can be
 found
   at:
   http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
  
  https://repository.apache.org/content/repositories/orgapachespark-040/
  
   The documentation corresponding to this release can be found at:
  
  http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/
  
   For information about the contents of this release see:
  
  
  
  
 
 https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8
  
   Please vote on releasing this package as Apache Spark
   0.8.1-incubating!
  
   The vote is open until Saturday, December 14th at 01:00 UTC and
   passes if a majority of at least 3 +1 PPMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 0.8.1-incubating
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.incubator.apache.org/
  
  
  
  
   --
   s
  
  
  
  
   --
   s
  
  
  
  
  
   --
   s
  
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com

Re: [ANNOUNCE] Welcoming two new Spark committers: Tom Graves and Prashant Sharma

2013-11-14 Thread Evan Chan
Congrats!

On Wed, Nov 13, 2013 at 9:21 PM, karthik tunga karthik.tu...@gmail.com wrote:
 Congrats Tom and Prashant :)

 Cheers,
 Karthik
 On Nov 13, 2013 7:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Hi folks,

 The Apache Spark PPMC is happy to welcome two new PPMC members and
 committers: Tom Graves and Prashant Sharma.

 Tom has been maintaining and expanding the YARN support in Spark over the
 past few months, including adding big features such as support for YARN
 security, and recently contributed a major patch that adds security to all
 of Spark’s internal communication services as well (
 https://github.com/apache/incubator-spark/pull/120). It will be great to
 have him continue expanding these and other features as a committer.

 Prashant created and has maintained the Scala 2.10 branch of Spark for 6+
 months now, including tackling the hairy task of porting the Spark
 interpreter to 2.10, and debugging all the issues raised by that with
 third-party libraries. He's also contributed bug fixes and new input
 sources to Spark Streaming. The Scala 2.10 branch will be merged into
 master soon.

 We’re very excited to have both Tom and Prashant join the project as
 committers.

 The Apache Spark PPMC





-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: SPARK-942

2013-11-14 Thread Evan Chan
+1 for IteratorWithSizeEstimate.

I believe today only HadoopRDDs are able to give fine grained
progress;  with an enhanced iterator interface (which can still expose
the base Iterator trait) we can extend the possibility of fine grained
progress to all RDDs that implement the enhanced iterator.
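
Just to make the idea concrete, a hypothetical shape for such an interface;
the trait and method names below are made up, not an existing Spark API:

// Hypothetical enhanced iterator -- still usable wherever a plain Iterator is expected.
trait IteratorWithSizeEstimate[+A] extends Iterator[A] {
  /** Best-effort estimate of the total number of elements, if one is known. */
  def sizeEstimate: Option[Long]
  /** Number of elements consumed so far, for fine-grained progress reporting. */
  def consumedCount: Long
}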

On Tue, Nov 12, 2013 at 11:07 AM, Stephen Haberman
stephen.haber...@gmail.com wrote:

 The problem is that the iterator interface only defines 'hasNext' and
 'next' methods.

 Just a comment from the peanut gallery, but FWIW it seems like being
 able to ask "how much data is here" would be a useful thing for Spark
 to know, even if that means moving away from Iterator itself, or
 something like IteratorWithSizeEstimate/something/something.

 Not only for this, but so that, ideally, Spark could basically do
 dynamic partitioning.

 E.g. when we load a month's worth of data, it's X GB, but after a few
 maps and filters, it's X/100 GB, so could use X/100 partitions instead.

 But right now all partitioning decisions are made up-front,
 via .coalesce/etc. type hints from the programmer, and it seems if
 Spark could delay making partitioning decisions each until RDD could
 like lazily-eval/sample a few lines (hand waving), that would be super
 sexy from our perspective, in terms of doing automatic perf/partition
 optimization.

 Huge disclaimer that this is probably a big pita to implement, and
 could likely not be as worthwhile as I naively think it would be.

 - Stephen



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Getting failures in FileServerSuite

2013-10-30 Thread Evan Chan
Must be a local environment thing, because AmpLab Jenkins can't
reproduce it. :-p

On Wed, Oct 30, 2013 at 11:10 AM, Josh Rosen rosenvi...@gmail.com wrote:
 Someone on the users list also encountered this exception:

 https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C64474308D680D540A4D8151B0F7C03F7025EF289%40SHSMSX104.ccr.corp.intel.com%3E


 On Wed, Oct 30, 2013 at 9:40 AM, Evan Chan e...@ooyala.com wrote:

 I'm at the latest

 commit f0e23a023ce1356bc0f04248605c48d4d08c2d05
 Merge: aec9bf9 a197137
 Author: Reynold Xin r...@apache.org
 Date:   Tue Oct 29 01:41:44 2013 -0400


 and seeing this when I do a test-only FileServerSuite:

 13/10/30 09:35:04.300 INFO DAGScheduler: Completed ResultTask(0, 0)
 13/10/30 09:35:04.307 INFO LocalTaskSetManager: Loss was due to
 java.io.StreamCorruptedException
 java.io.StreamCorruptedException: invalid type code: AC
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
 at
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101)
 at
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
 at
 org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:94)
 at
 org.apache.spark.rdd.MapPartitionsWithContextRDD.compute(MapPartitionsWithContextRDD.scala:40)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:680)


 Anybody else seen this yet?

 I have a really simple PR and this fails without my change, so I may
 go ahead and submit it anyways.

 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Suggestion/Recommendation for language bindings

2013-10-16 Thread Evan Chan
+1 for Ruby, as Ruby already has a functional-ish collection API, so
mapping the RDD functional transforms over would be pretty idiomatic.
Would love to review this.

On Tue, Oct 15, 2013 at 10:13 AM, Patrick Wendell pwend...@gmail.com wrote:
 I think Ruby integration via JRuby would be a great idea.

 On Tue, Oct 15, 2013 at 9:45 AM, Ryan Weald r...@weald.com wrote:
 Writing a JRuby wrapper around the existing Java bindings would be pretty
 cool. Could help to get some of the Ruby community to start using the Spark
 platform.

 -Ryan


 On Mon, Oct 14, 2013 at 12:07 PM, Aaron Babcock 
 aaron.babc...@gmail.comwrote:

 Hey Laksh,

 Not sure if you are interested in groovy at all, but I've got the
 beginning of a project here:
 https://github.com/bunions1/groovy-spark-example

 The idea is to map groovy idioms: myRdd.collect{ row -> newRow } to
 spark api calls myRdd.map( row => newRow )  and support a good repl.

 Its not officially related to spark at all and is very early stage but
 maybe it will be a point of reference for you.






 On Mon, Oct 14, 2013 at 12:42 PM, Laksh Gupta glaks...@gmail.com wrote:
  Hi
 
  I am interested in contributing to the project and want to start with
  supporting a new programming language on Spark. I can see that Spark
  already support Java and Python. Would someone provide me some
  suggestion/references to start with? I think this would be a great
 learning
  experince for me. Thank you in advance.
 
  --
  - Laksh Gupta




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Development environments

2013-10-09 Thread Evan Chan
Matei,

I wonder if we can further optimize / reduce the size of the assembly.
One idea is to produce just a core assembly, and have the other projects
produce their own assemblies which exclude the core dependencies.

Also, DistributedSuite is pretty slow.  Would it make sense to tag certain
tests as the "core" tests and give them a separate build target? The
overall tests that include DistributedSuite can trigger assembly, but then
it would be much faster to run the core tests.

-Evan



On Wed, Oct 9, 2013 at 12:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 For most development, you might not need to do assembly. You can run most
 of the unit tests if you just do sbt compile -- only the ones that spawn
 processes, like DistributedSuite, won't work. That said, we are looking to
 optimize assembly by maybe having it only package the dependencies rather
 than Spark itself -- there were some messages on this earlier. For now I'd
 just recommend doing it in a RAMFS if possible (symlink the assembly/target
 directory to be a RAMFS).

 Matei

 On Oct 9, 2013, at 12:45 AM, Evan Chan e...@ooyala.com wrote:

  Once you have compiled everything the first time using SBT (assembly will
  do that for you), successive runs of assembly are much faster.  I just
 did
  it on my MacBook Pro in about 36 seconds.
 
  Running builds using IntelliJ or an IDE is wasted time, because the
  compiled classes go to a different place than SBT.   Maybe there's some
 way
  to symlink them.
 
  -Evan
 
 
 
  On Tue, Oct 8, 2013 at 6:29 AM, Markus Losoi markus.lo...@gmail.com
 wrote:
 
  Hi Markus,
 
  have a look at the bottom of this wiki page:
 
 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
  IntelliJ IDEA seems to be quite popular (that I am using myself)
  although Eclipse should work fine, too. There is another sbt plugin for
  generating Eclipse project files.
 
  The IDE seems to work nicely, but what is the fastest way to build
 Spark?
  If
  I make a change to the core module and choose Make Module 'core'
 from
  the Build menu in IntelliJ Idea, then the IDE compiles the source
 code.
  To
  create the spark-assembly-0.8.0-incubating-hadoop1.0.4.jar JAR file, I
  have run sbt assembly on the command line. However, this takes an
  impractically long time (843 s when I last ran it on my workstation
 with an
  Intel Core 2 Quad Q9400 and 8 GB of RAM). Is there any faster way?
 
  Best regards,
  Markus Losoi (markus.lo...@gmail.com)
 
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 
  http://www.ooyala.com/




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: Propose to Re-organize the scripts and configurations

2013-09-26 Thread Evan Chan
 driver context will
  override any
   options specified in default.
   
This sounds great to me! The one thing I'll add is that we might
 want
  to
prevent applications from overriding certain settings on each node,
  such as
work directories. The best way is to probably just ignore the app's
  version
of those settings in the Executor.
   
If you guys would like, feel free to write up this design on
  SPARK-544 and
start working on it. I think it looks good.
   
Matei
  
  
  
  
   --
   *Shane Huang *
    *Intel Asia-Pacific R&D Ltd.*
   *Email: shengsheng.hu...@intel.com*
 
 
 
 
  --
  *Shane Huang *
   *Intel Asia-Pacific R&D Ltd.*
  *Email: shengsheng.hu...@intel.com*
 
 


 --
 *Shane Huang *
  *Intel Asia-Pacific R&D Ltd.*
 *Email: shengsheng.hu...@intel.com*




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |

http://www.ooyala.com/


Re: off-heap RDDs

2013-09-05 Thread Evan Chan
Haoyuan,

Thanks, that sounds great, exactly what we are looking for.

We might be interested in integrating Tachyon with CFS (Cassandra File
System, the Cassandra-based implementation of HDFS).

-Evan



On Sat, Aug 31, 2013 at 3:33 PM, Haoyuan Li haoyuan...@gmail.com wrote:

 Evan,

 If I understand you correctly, you want to avoid network I/O as much as
 possible by caching the data on the node that already has the data on disk.
 Actually, what I meant by client caching would do this automatically. For
 example, suppose you have a cluster of machines with nothing cached in memory
 yet. Then a Spark application runs on it. Spark asks Tachyon where data X is.
 Since nothing is in memory yet, Tachyon would return disk locations the first
 time. The Spark program will then try to take advantage of disk data locality
 and load the data X on HDFS node N into the off-heap memory of node N. In
 the future, when Spark asks Tachyon for the location of X, Tachyon will return
 node N. There is no network I/O involved in the whole process. Let me know
 if I misunderstood something.

 Haoyuan


 On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan e...@ooyala.com wrote:

  Hey guys,
 
  I would also prefer to strengthen and get behind Tachyon, rather than
  implement a separate solution (though I guess if it's not officially
  supported, then nobody will ask questions).  But it's more that off-heap
  memory is difficult, so my feeling is that it's better to focus efforts on
  one project.
 
  Haoyuan,
 
  Tachyon brings cached HDFS data to the local client.  Have we thought about
  the opposite approach, which might be more efficient?
   - Load the data on HDFS node N into the off-heap memory of node N
   - In Spark, inform the framework (maybe via RDD partition/location info)
     that the data is located on node N (see the sketch below)
   - Bring the computation to node N
 
  This avoids network I/O and may be much more efficient for many types of
  applications.   I know this would be a big win for us.
 
  -Evan
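
To make the "inform the framework via RDD partition/location info" step
concrete, here is a rough Scala sketch against the public RDD extension points
(getPartitions, getPreferredLocations, compute). The off-heap reader and the
location table are hypothetical, exact signatures vary a bit across Spark
versions, and this is not actual Tachyon or Spark code:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition that records which node already holds its data
    // in off-heap memory.
    class OffHeapPartition(val index: Int, val host: String) extends Partition

    // Sketch of an RDD that steers each task to the node holding that
    // partition off-heap.  `locations` and `readPartition` stand in for
    // whatever off-heap store is actually used.
    class OffHeapBackedRDD(
        sc: SparkContext,
        locations: Seq[(Int, String)],        // (partition index, host)
        readPartition: Int => Iterator[Int])  // hypothetical off-heap reader
      extends RDD[Int](sc, Nil) {

      override def getPartitions: Array[Partition] =
        locations.map { case (i, host) => new OffHeapPartition(i, host): Partition }.toArray

      // The scheduler uses this hook for locality: report node N so the
      // computation is brought to the data instead of the data to the task.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[OffHeapPartition].host)

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        readPartition(split.index)
    }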
 
 
  On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li haoyuan...@gmail.com
 wrote:
 
    No problem. Like reading/writing data from/to an off-heap ByteBuffer, when
    a program reads/writes data from/to Tachyon, Spark/Shark needs to do
    ser/de. Efficient ser/de will help performance a lot, as people pointed
    out. One solution is that the application can do primitive operations
    directly on the ByteBuffer, like how Shark is handling it now. Most of the
    related code is located at
    https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
    and
    https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
  
   Haoyuan
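
For the "primitive operations directly on the ByteBuffer" idea, here is a
minimal Scala sketch (not Shark's actual memstore2 code) that packs an Int
column into a direct buffer and reads it back with primitive gets, so no
per-element Java serialization is involved:

    import java.nio.ByteBuffer

    object IntColumnSketch {
      // Pack the values as raw 4-byte ints behind a small count header.
      def write(values: Array[Int]): ByteBuffer = {
        val buf = ByteBuffer.allocateDirect(4 + 4 * values.length)
        buf.putInt(values.length)
        values.foreach(v => buf.putInt(v))
        buf.flip()                  // prepare the buffer for reading
        buf
      }

      // Read the values back lazily with primitive gets.
      def read(buf: ByteBuffer): Iterator[Int] = {
        val b = buf.duplicate()     // independent position for this reader
        val n = b.getInt()
        Iterator.fill(n)(b.getInt())
      }
    }

An RDD of such buffers (roughly an RDD[ByteBuffer]) lets an application keep a
large column as a handful of big allocations instead of millions of boxed
objects.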
  
  
   On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid im...@therashids.com
   wrote:
  
Thanks Haoyuan.  It seems like we should try out Tachyon, sounds like
it is what we are looking for.
   
On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li haoyuan...@gmail.com
   wrote:
 Response inline.


  On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid im...@therashids.com wrote:
 
  Thanks for all the great comments and discussion.  Let me expand a bit
  on our use case, and then I'm gonna combine responses to various
  questions.
 
  In general, when we use Spark, we have some really big RDDs that use
  up a lot of memory (10s of GB per node) that are really our core
  data sets.  We tend to start up a Spark application, immediately load
  all those data sets, and just leave them loaded for the lifetime of
  that process.  We definitely create a lot of other RDDs along the way,
  and lots of intermediate objects that we'd like to go through normal
  garbage collection.  But those all require much less memory, maybe
  1/10th of the big RDDs that we just keep around.  I know this is a bit
  of a special case, but it seems like it probably isn't that different
  from a lot of use cases.

  Reynold Xin wrote:
   This is especially attractive if the application can read directly
   from a byte buffer without generic serialization (like Shark).
 
  Interesting -- can you explain how this works in Shark?  Do you have
  some general way of storing data in byte buffers that avoids
  serialization?  Or do you mean that if the user is effectively
  creating an RDD of ints, you create an RDD[ByteBuffer] and then
  read/write the ints into the byte buffer yourself?
  Sorry, I'm familiar with the basic idea of Shark but not the code at
  all -- even a pointer to the code would be helpful.

  Haoyuan Li wrote:
   One possible solution is that you can use Tachyon
   (https://github.com/amplab/tachyon).
 
  This is a good idea that I had probably overlooked.  There are two
  potential issues that I can think of with this approach, though:
  1) I was under the impression that Tachyon is still not really tested
  in production

Re: [Licensing check] Spark 0.8.0-incubating RC1

2013-09-04 Thread Evan Chan
Matei,

On Tue, Sep 3, 2013 at 7:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 BTW, what docs are you planning to write? Something on make-distribution.sh 
 would be nice.

Yes, I was planning to write something on make-distribution.sh, and
reference it in the standalone and Mesos deploy guides.

I was also going to document public methods in SparkContext that have
not been documented before, such as getPersistentRdds,
getExecutorStatus, etc.   Some folks on my team didn't realize that such
methods existed because they were not in the docs.

-Evan
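
For reference, a minimal usage sketch of the kind of thing those docs would
cover. The method names are assumed as they appear in SparkContext
(getPersistentRDDs, getExecutorMemoryStatus -- check the exact names in the
version at hand), and sc is an existing SparkContext, e.g. the one in the
Spark shell:

    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println("RDD " + id + " persisted at level " + rdd.getStorageLevel)
    }
    sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
      println(executor + ": " + remaining + " of " + maxMem + " bytes free")
    }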


 Matei

 On Sep 3, 2013, at 4:57 PM, Evan Chan e...@ooyala.com wrote:

 Sorry one more clarification.

 For doc pull requests for 0.8 release, should these be done against
 the existing mesos/spark repo, or against the mirror at
 apache/incubator-spark ?
 I'm hoping to clear up a couple things in the docs before the release this week.

 thanks,
 -Evan


 On Tue, Sep 3, 2013 at 2:25 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Okay, so is there any way to get github's compare view to be happy with
 differently-named repositories?  What I want is to be able to compare and
 generate a pull request between github.com/apache/incubator-spark:master
 and, e.g., github.com/markhamstra/spark:myBranch and not need to create a
 new github.com/markhamstra/incubator-spark.  It's bad enough that the
 differently-named repos don't show up in the compare-view drop-down
 choices, but I also haven't been able to find a hand-crafted URL that will
 make this work.



 On Tue, Sep 3, 2013 at 1:38 PM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 Yup, the plan is as follows:

 - Make pull request against the mirror
 - Code review on GitHub as usual
 - Whoever merges it will simply merge it into the main Apache repo; when
 this propagates, the PR will be marked as merged

 I found at least one other Apache project that did this:
 http://wiki.apache.org/cordova/ContributorWorkflow.

 Matei

 On Sep 3, 2013, at 10:39 AM, Mark Hamstra m...@clearstorydata.com wrote:

 What is going to be the process for making pull requests?  Can they be
 made
 against the github mirror (https://github.com/apache/incubator-spark),
 or
 must we use some other way?


 On Tue, Sep 3, 2013 at 10:28 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hi guys,

 So are you planning to release 0.8 from the master branch (which is at
 a106ed8... now) or from branch-0.8?

 Right now the branches are the same in terms of content (though I might
 not have merged the latest changes into 0.8). If we add stuff into
 master
 that we won't want in 0.8 we'll break that.

 My recommendation is that we start to use the Incubator release
 doc/guide:

 http://incubator.apache.org/guides/releasemanagement.html

 Cool, thanks for the pointer. I'll try to follow the steps there about
 signing.

 Are we locking pull requests to github repo by tomorrow?
 Meaning no more push to GitHub repo for Spark.

 From your email seems like there will be more potential pull requests
 for
 github repo to be merged back to ASF Git repo.

 We'll probably use the GitHub repo for the last few changes in this
 release and then switch. The reason is that there's a bit of work to do
 pull requests against the Apache one.

 Matei





 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [Licensing check] Spark 0.8.0-incubating RC1

2013-09-04 Thread Evan Chan
Sorry, just to be super clear: the GitHub repo (i.e., PRs for 0.8
docs) refers to:

a)  github.com/mesos/spark
b)  github.com/apache/incubator-spark

I'm assuming a) but don't want to be wrong -- thanks!

On Tue, Sep 3, 2013 at 7:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Yes, please do the PRs against the GitHub repo for now.

 Matei

 On Sep 3, 2013, at 4:57 PM, Evan Chan e...@ooyala.com wrote:

 Sorry one more clarification.

 For doc pull requests for 0.8 release, should these be done against
 the existing mesos/spark repo, or against the mirror at
 apache/incubator-spark ?
 I'm hoping to clear up a couple things in the docs before the release this week.

 thanks,
 -Evan


 On Tue, Sep 3, 2013 at 2:25 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Okay, so is there any way to get github's compare view to be happy with
 differently-named repositories?  What I want is to be able to compare and
 generate a pull request between github.com/apache/incubator-spark:master
 and, e.g., github.com/markhamstra/spark:myBranch and not need to create a
 new github.com/markhamstra/incubator-spark.  It's bad enough that the
 differently-named repos don't show up in the compare-view drop-down
 choices, but I also haven't been able to find a hand-crafted URL that will
 make this work.



 On Tue, Sep 3, 2013 at 1:38 PM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 Yup, the plan is as follows:

 - Make pull request against the mirror
 - Code review on GitHub as usual
 - Whoever merges it will simply merge it into the main Apache repo; when
 this propagates, the PR will be marked as merged

 I found at least one other Apache project that did this:
 http://wiki.apache.org/cordova/ContributorWorkflow.

 Matei

 On Sep 3, 2013, at 10:39 AM, Mark Hamstra m...@clearstorydata.com wrote:

 What is going to be the process for making pull requests?  Can they be
 made
 against the github mirror (https://github.com/apache/incubator-spark),
 or
 must we use some other way?


 On Tue, Sep 3, 2013 at 10:28 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hi guys,

 So are you planning to release 0.8 from the master branch (which is at
 a106ed8... now) or from branch-0.8?

 Right now the branches are the same in terms of content (though I might
 not have merged the latest changes into 0.8). If we add stuff into
 master
 that we won't want in 0.8 we'll break that.

 My recommendation is that we start to use the Incubator release
 doc/guide:

 http://incubator.apache.org/guides/releasemanagement.html

 Cool, thanks for the pointer. I'll try to follow the steps there about
 signing.

 Are we locking pull requests to github repo by tomorrow?
 Meaning no more push to GitHub repo for Spark.

 From your email seems like there will be more potential pull requests
 for
 github repo to be merged back to ASF Git repo.

 We'll probably use the GitHub repo for the last few changes in this
 release and then switch. The reason is that there's a bit of work to do
 pull requests against the Apache one.

 Matei





 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |