Re: proposal: replace lift-json with spray-json
Pascal, Ah, I stand corrected, thanks. On Mon, Feb 10, 2014 at 11:49 PM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: Evan, Excuse me but that's WRONG that play-json pulls all play deps! PLAY/JSON has NO HEAVY DEP ON PLAY! I personally worked to make it an independent module in play! So play/json has just one big dep, which is Jackson! I agree that Jackson is the right way to go as a beginning. But for Scala developers, a thin higher-level layer like play/json is useful to bring typesafety... Pascal On Tue, Feb 11, 2014 at 1:31 AM, Evan Chan e...@ooyala.com wrote: By the way, I did a benchmark on JSON parsing performance recently. Based on that, spray-json was about 10x slower than the Jackson-based parsers. I recommend json4s-jackson, because Jackson is almost certainly already a dependency of Spark (many other Java libraries use it), so the dependencies are very lightweight. I didn't benchmark Argonaut or play-json, partly because play-json pulled in all the Play dependencies, though as someone else commented in this thread, they plan to split it out. On Mon, Feb 10, 2014 at 12:22 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: Hi again, If Spark just needs JSON with serialization/deserialization of basic structures and some potential simple validations for the web UI, remember that play/json (without any other dependency than Jackson) is about 200-300 lines of code... the only dependency is Jackson, which is the best JSON parser that I know. The rest of the code is about typesafety and composition... If Spark needs some very, very performant JSON (de)serialization, we will have to look at things like pickling and potentially some streaming parsers (I think this is a domain under work right now...) Pascal On Mon, Feb 10, 2014 at 1:50 AM, Will Benton wi...@redhat.com wrote: Matei, sorry if I was unclear: I'm referring to downstream operating system distributions (like Fedora or Debian) that have policies requiring that all packages are built from source (using only tools already packaged in the distribution). So end-users (and distributions with different policies) don't have to build Lift to get the lift-json artifact, but it is a concern for many open-source communities. best, wb - Original Message - From: Matei Zaharia matei.zaha...@gmail.com To: dev@spark.incubator.apache.org Sent: Sunday, February 9, 2014 4:29:20 PM Subject: Re: proposal: replace lift-json with spray-json Will, why are you saying that downstream distributions need to build all of Lift to package lift-json? Spark just downloads it from Maven Central, where it's a JAR with no external dependencies. We don't have any dependency on the rest of Lift. Matei On Feb 9, 2014, at 11:28 AM, Will Benton wi...@redhat.com wrote: lift-json is a nice library, but Lift is a pretty heavyweight dependency to track just for its JSON support. (lift-json is relatively self-contained as a dependency from an end-user's perspective, but downstream distributors need to build all of Lift in order to package the JSON support.) I understand that this has come up before (cf. SPARK-883) and that the uncertain future of JSON support in the Scala standard library is the motivator for relying on an external library. I'm proposing replacing lift-json in Spark with something more lightweight. I've evaluated apparent project liveness and dependency scope for most of the current Scala JSON libraries and believe the best candidate is spray-json (https://github.com/spray/spray-json), the JSON library used by the Spray HTTP toolkit.
spray-json is Apache-licensed, actively developed, and builds and works independently of Spray with only one external dependency. It looks to me like a pretty straightforward change (although JsonProtocol.scala would be a little more verbose since it couldn't use the Lift JSON DSL), and I'd like to do it. I'm writing now to ask for some community feedback before making the change (and submitting a JIRA and PR). If no one has any serious objections (to the effort in general or to the choice of spray-json in particular), I'll go ahead and do it, but if anyone has concerns, I'd be happy to discuss and address them before getting started. thanks, wb -- -- Evan Chan Staff Engineer e...@ooyala.com |
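To make the comparison concrete, here is a minimal sketch of what spray-json usage typically looks like (the JobSummary case class is an invented placeholder, not an actual Spark type, and this is not taken from the proposed patch):

    import spray.json._

    case class JobSummary(jobId: Int, name: String, numTasks: Int)

    object JobSummaryProtocol extends DefaultJsonProtocol {
      // jsonFormat3 derives a (de)serializer for the three-field case class
      implicit val jobSummaryFormat = jsonFormat3(JobSummary)
    }

    object SprayJsonDemo {
      import JobSummaryProtocol._

      def main(args: Array[String]): Unit = {
        val summary = JobSummary(42, "count at Demo.scala:10", 8)
        val json: JsValue = summary.toJson   // AST value, analogous to lift-json's JValue
        println(json.compactPrint)           // {"jobId":42,"name":"count at Demo.scala:10","numTasks":8}
        val roundTripped = json.convertTo[JobSummary]
        assert(roundTripped == summary)
      }
    }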
Fast Serialization
Any interest in adding Fast Serialization (or possibly replacing the default of Java Serialization)? https://code.google.com/p/fast-serialization/ -- -- Evan Chan Staff Engineer e...@ooyala.com |
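For context, Spark already exposes a pluggable serializer through the spark.serializer property, which is presumably where a fast-serialization backend would hook in; a minimal sketch using the existing Kryo option (class names as of the 0.8/0.9 line) might look like this:

    import org.apache.spark.SparkContext

    object SerializerDemo {
      def main(args: Array[String]): Unit = {
        // Select a non-default serializer before the SparkContext is created.
        // A hypothetical fast-serialization backend would be another Serializer
        // implementation referenced the same way.
        System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

        val sc = new SparkContext("local[4]", "serializer-demo")
        val sum = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
        println(s"sum = $sum")
        sc.stop()
      }
    }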
Re: Proposal: Clarifying minor points of Scala style
+1 to the proposal. On Mon, Feb 10, 2014 at 2:56 PM, Michael Armbrust mich...@databricks.com wrote: +1 to Shivaram's proposal. I think we should try to avoid functions with many args as much as possible, so having a high vertical cost here isn't the worst thing. I also like the visual consistency. FWIW, (based on a cursory inspection) in the Scala compiler they don't seem to ever orphan the return type from the closing parenthesis. It seems there are two main accepted styles: def mkSlowPathDef(clazz: Symbol, lzyVal: Symbol, cond: Tree, syncBody: List[Tree], stats: List[Tree], retVal: Tree): Tree = { and def tryToSetIfExists( cmd: String, args: List[String], setter: (Setting) => (List[String] => Option[List[String]]) ): Option[List[String]] = On Mon, Feb 10, 2014 at 2:36 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah that was my proposal - essentially we can just have two styles: the entire function + parameter list + return type fits on one line, or when it doesn't, we wrap parameters onto separate lines. I agree that it makes the code a bit more verbose, but it'll make the code style more consistent. Shivaram On Mon, Feb 10, 2014 at 2:13 PM, Aaron Davidson ilike...@gmail.com wrote: Shivaram, is your recommendation to wrap the parameter list even if it fits, but just the return type doesn't? Personally, I think the cost of moving from a single-line parameter list to an n-line list is pretty high, as it takes up a lot more space. I am even in favor of allowing a parameter list to overflow into a second line (but not a third) instead of spreading them out, if it's a private helper method (where the parameters are probably not as important as the implementation, unlike a public API). On Mon, Feb 10, 2014 at 1:42 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: For the 1st case wouldn't it be better to just wrap the parameters to the next line as we do in other cases? For example def longMethodName( param1, param2, ...) : Long = { } Are there a lot of functions which use the old format? Can we just stick to the above for new functions? Thanks Shivaram On Mon, Feb 10, 2014 at 11:33 AM, Reynold Xin r...@databricks.com wrote: +1 on both On Mon, Feb 10, 2014 at 1:34 AM, Aaron Davidson ilike...@gmail.com wrote: There are a few bits of the Scala style that are underspecified by both the Scala style guide http://docs.scala-lang.org/style/ and our own supplemental notes https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. Often, this leads to inconsistent formatting within the codebase, so I'd like to propose some general guidelines which we can add to the wiki and use in the future: 1) Line-wrapped method return type is indented with two spaces: def longMethodName(... long param list ...) : Long = { ... } *Justification:* I think this is the most commonly used style in Spark today. It's also similar to the extends style used in classes, with the same justification: it is visually distinguished from the 4-indented parameter list. 2) URLs and code examples in comments should not be line-wrapped. Here https://github.com/apache/incubator-spark/pull/557/files#diff-c338f10f3567d4c1d7fec4bf9e2677e1L29 is an example of the latter. *Justification*: Line-wrapping can cause confusion when trying to copy-paste a URL or command. It can additionally cause IDE issues or, avoidably, Javadoc issues. Any thoughts on these, or additional style issues not explicitly covered in either the Scala style guide or Spark wiki? -- -- Evan Chan Staff Engineer e...@ooyala.com |
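Rendered with line breaks restored, the two styles being discussed look roughly like this (the method names and bodies are invented purely for illustration):

    // Style 1: signature, parameter list and return type all fit on one line.
    def weightedSum(values: Seq[Double], weights: Seq[Double]): Double =
      values.zip(weights).map { case (v, w) => v * w }.sum

    // Style 2 (the proposal): parameters wrapped with a 4-space indent, and the
    // line-wrapped return type indented with two spaces so it stays visually
    // distinguished from the parameter list.
    def updateRunningAverage(
        previousAverage: Double,
        previousCount: Long,
        newValue: Double,
        newCount: Long)
      : Double = {
      (previousAverage * previousCount + newValue * newCount) / (previousCount + newCount)
    }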
Re: Proposal for Spark Release Strategy
as a multi-module project. Each Spark release will be versioned: [MAJOR].[MINOR].[MAINTENANCE] All releases with the same major version number will have API compatibility, defined as in [1]. Major version numbers will remain stable over long periods of time. For instance, 1.X.Y may last 1 year or more. Minor releases will typically contain new features and improvements. The target frequency for minor releases is every 3-4 months. One change we'd like to make is to announce fixed release dates and merge windows for each release, to facilitate coordination. Each minor release will have a merge window where new patches can be merged, a QA window when only fixes can be merged, then a final period where voting occurs on release candidates. These windows will be announced immediately after the previous minor release to give people plenty of time, and over time, we might make the whole release process more regular (similar to Ubuntu). At the bottom of this document is an example window for the 1.0.0 release. Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general these releases are designed to patch bugs. However, higher level libraries may introduce small features, such as a new algorithm, provided they are entirely additive and isolated from existing code paths. Spark core may not introduce any features. When new components are added to Spark, they may initially be marked as alpha. Alpha components do not have to abide by the above guidelines; however, to the maximum extent possible, they should try to. Once they are marked stable they have to follow these guidelines. At present, GraphX is the only alpha component of Spark. [1] API compatibility: An API is any public class or interface exposed in Spark that is not marked as semi-private or experimental. Release A is API compatible with release B if code compiled against release A *compiles cleanly* against B. This does not guarantee that a compiled application that is linked against version A will link cleanly against version B without re-compiling. Link-level compatibility is something we'll try to guarantee as well, and we might make it a requirement in the future, but challenges with things like Scala versions have made this difficult to guarantee in the past. == Merging Pull Requests == To merge pull requests, committers are encouraged to use this tool [2] to collapse the request into one commit rather than manually performing git merges. It will also format the commit message nicely in a way that can be easily parsed later when writing credits. Currently it is maintained in a public utility repository, but we'll merge it into mainline Spark soon. [2] https://github.com/pwendell/spark-utils/blob/master/apache_pr_merge.py == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 == Deviations == For now, the proposal is to consider these tentative guidelines. We can vote to formalize these as project rules at a later time after some experience working with them. Once formalized, any deviation from these guidelines will be subject to a lazy majority vote. - Patrick -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Proposal for Spark Release Strategy
The other reason for waiting is stability. It would be great to have as a goal for 1.0.0 that under most heavy use scenarios, workers and executors don't just die, which is not true today. Also, there should be minimal silent failures which are difficult to debug. On Thu, Feb 6, 2014 at 11:54 AM, Evan Chan e...@ooyala.com wrote: +1 for 0.10.0. It would give more time to study things (such as the new SparkConf) and let the community decide if any breaking API changes are needed. Also, a +1 for minor revisions not breaking code compatibility, including Scala versions. (I guess this would mean that 1.x would stay on Scala 2.10.x) On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Bleh, hit send too early again. My second paragraph was to argue for 1.0.0 instead of 0.10.0, not to hammer on the binary compatibility point. On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote: *Would it make sense to put in something that strongly discourages binary incompatible changes when possible? On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Not codifying binary compatibility as a hard rule sounds fine to me. Would it make sense to put something in that... I.e. avoid making needless changes to class hierarchies. Whether Spark considers itself stable or not, users are beginning to treat it so. A responsible project will acknowledge this and provide the stability needed by its user base. I think some projects have made the mistake of waiting too long to release a 1.0.0. It allows them to put off making the hard decisions, but users and downstream projects suffer. If Spark needs to go through dramatic changes, there's always the option of a 2.0.0 that allows for this. -Sandy On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia matei.zaha...@gmail.com wrote: I think it's important to do 1.0 next. The project has been around for 4 years, and I'd be comfortable maintaining the current codebase for a long time in an API and binary compatible way through 1.x releases. Over the past 4 years we haven't actually had major changes to the user-facing API -- the only ones were changing the package to org.apache.spark, and upgrading the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for example, or later cross-building it for Scala 2.11. Updating to 1.0 says two things: it tells users that they can be confident that version will be maintained for a long time, which we absolutely want to do, and it lets outsiders see that the project is now fairly mature (for many people, pre-1.0 might still cause them not to try it). I think both are good for the community. Regarding binary compatibility, I agree that it's what we should strive for, but it just seems premature to codify now. Let's see how it works between, say, 1.0 and 1.1, and then we can codify it. Matei On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks Patrick for initiating the discussion about the next road map for Apache Spark. I am +1 for 0.10.0 for the next version. It will give us as a community some time to digest the process and the vision and make adjustments accordingly. Releasing a 1.0.0 is a huge milestone, and if we do need to break the API somehow or modify internal behavior dramatically, we could take advantage of that as a good step on the way to 1.0.0. - Henry On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com wrote: Agree on timeboxed releases as well.
Is there a vision for where we want to be as a project before declaring the first 1.0 release? While we're in the 0.x days per semver we can break backcompat at will (though we try to avoid it where possible), and that luxury goes away with 1.x. I just don't want to release a 1.0 simply because it seems to follow after 0.9, rather than making an intentional decision that we're at the point where we can stand by the current APIs and binary compatibility for the next year or so of the major release. Until that decision is made as a group, I'd rather we do an immediate version bump to 0.10.0-SNAPSHOT and then, if discussion warrants it later, replace that with 1.0.0-SNAPSHOT. It's very easy to go from 0.10 to 1.0 but not the other way around. https://github.com/apache/incubator-spark/pull/542 Cheers! Andrew On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com wrote: +1 on time boxed releases and compatibility guidelines Am 06.02.2014 um 01:20 schrieb Patrick Wendell pwend...@gmail.com : Hi Everyone, In an effort to coordinate development amongst the growing list of Spark contributors, I've taken some time to write up a proposal to formalize various pieces of the development process. The next release of Spark will likely be Spark 1.0.0, so this message is intended in part to coordinate the release plan for 1.0.0
ApacheCon
I might have missed it earlier, but is anybody planning to present at ApacheCon? I think it's in Denver this year, April 7-9. Thinking of submitting a talk about how we use Spark and Cassandra. -Evan -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations?
By the way, is there any plan to make a pluggable backend for checkpointing? We might be interested in writing, for example, a Cassandra backend. On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan junluan@intel.com wrote: Hi all, The description of this bug, submitted by Matei, is as follows: The tipping point seems to be around 50. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS installed, which not all users will have. We might also be able to fix DAGScheduler to not be recursive. regards, Andrew -- -- Evan Chan Staff Engineer e...@ooyala.com |
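As a rough illustration of the checkpoint-every-N-iterations workaround mentioned in the bug description (the RDD contents, interval, and directory below are placeholders, not the actual ALS code):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object CheckpointEveryN {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "checkpoint-demo")
        // Checkpointing needs a reliable directory (HDFS in a real cluster).
        sc.setCheckpointDir("/tmp/spark-checkpoints")

        var current: RDD[Double] = sc.parallelize(1 to 100000).map(_.toDouble).cache()
        val checkpointInterval = 10  // the 10-20 range suggested above

        for (i <- 1 to 60) {
          current = current.map(_ * 0.99).cache()
          if (i % checkpointInterval == 0) {
            current.checkpoint()   // truncates the lineage once the RDD is materialized
          }
          current.count()          // force evaluation so the checkpoint actually happens
        }
        sc.stop()
      }
    }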
Re: [DISCUSS] Graduating as a TLP
+1 to both! On Thu, Jan 23, 2014 at 11:32 PM, Austen McRae austen.mc...@gmail.com wrote: +1 to both (awesome job all!) On Thu, Jan 23, 2014 at 10:25 PM, Haoyuan Li haoyuan...@gmail.com wrote: +1 to both. On Thu, Jan 23, 2014 at 8:33 PM, Mosharaf Chowdhury mosharafka...@gmail.com wrote: +1 +1 On Thu, Jan 23, 2014 at 8:26 PM, Josh Rosen rosenvi...@gmail.com wrote: +1 to both! On Thu, Jan 23, 2014 at 8:22 PM, prabeesh k prabsma...@gmail.com wrote: +1 On Fri, Jan 24, 2014 at 6:55 AM, Hossein fal...@gmail.com wrote: +1 to both. --Hossein On Thu, Jan 23, 2014 at 4:40 PM, Mark Hamstra m...@clearstorydata.com wrote: Is there any other choice that makes any kind of sense? Matei for VP: +1. On Thu, Jan 23, 2014 at 4:32 PM, Andy Konwinski andykonwin...@gmail.com wrote: +2 (1 for graduating + 1 for Matei as VP)! On Thu, Jan 23, 2014 at 4:11 PM, Chris Mattmann mattm...@apache.org wrote: +1 from me. I'll throw Matei's name into the hat for VP. He's done a great job and has stood out to me with his report filing and tenacity and would make an excellent chair. Being a chair entails: 1. being the eyes and ears of the board on the project. 2. filing a monthly (first 3 months, then quarterly) board report similar to the incubator report. Not too bad. +1 for graduation from me, binding, when the VOTE comes. We need our mentors and IPMC members to chime in, and we should be in time for the February 2014 board meeting. Cheers, Chris -Original Message- From: Matei Zaharia matei.zaha...@gmail.com Reply-To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Date: Thursday, January 23, 2014 2:45 PM To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Subject: [DISCUSS] Graduating as a TLP Hi folks, We've been working on the transition to Apache for a while, and our last shepherd's report says the following: Spark Alan Cabrera (acabrera): Seems like a nice active project. IMO, there's no need to wait for the import to JIRA to graduate. Seems like they can graduate now. What do you think about graduating to a top-level project? As far as I can tell, we've completed all the requirements for graduating: - Made 2 releases (working on a third now) - Added new committers and PPMC members (4 of them) - Did IP clearance - Moved infrastructure over to Apache, except for the JIRA above, which INFRA is working on and which shouldn't block us. If everything is okay, I'll call a VOTE on graduating in 48 hours. The one final thing missing is that we'll need to nominate an initial VP for the project. Matei -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Is there any plan to develop an application level fair scheduler?
What is the reason that standalone mode doesn't support the fair scheduler? Does that mean that Mesos coarse mode also doesn't support the fair scheduler? On Tue, Jan 14, 2014 at 8:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote: This is true for now; we didn't want to replicate those systems. But it may change if we see demand for fair scheduling in our standalone cluster manager. Matei On Jan 14, 2014, at 6:32 PM, Xia, Junluan junluan@intel.com wrote: Yes, Spark depends on YARN or Mesos for application-level scheduling. -Original Message- From: Nan Zhu [mailto:zhunanmcg...@gmail.com] Sent: Tuesday, January 14, 2014 9:43 PM To: dev@spark.incubator.apache.org Subject: Re: Is there any plan to develop an application level fair scheduler? Hi, Junluan, Thank you for the reply, but for the long-term plan, will Spark depend on YARN and Mesos for application-level scheduling in the coming versions? Best, -- Nan Zhu On Tuesday, January 14, 2014 at 12:56 AM, Xia, Junluan wrote: Are you sure that you must deploy Spark in standalone mode? (It currently only supports FIFO.) If you can set up Spark on YARN or Mesos, then it already supports a fair scheduler at the application level. -Original Message- From: Nan Zhu [mailto:zhunanmcg...@gmail.com] Sent: Tuesday, January 14, 2014 10:13 AM To: dev@spark.incubator.apache.org (mailto: dev@spark.incubator.apache.org) Subject: Is there any plan to develop an application level fair scheduler? Hi, All Is there any plan to develop an application-level fair scheduler? I think it will have more value than a fair scheduler within the application (actually I didn't understand why we want to fairly share resources among jobs within an application; usually, users submit different applications, not jobs)… Best, -- Nan Zhu -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
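For reference, the fair scheduling that does exist today is the within-application job scheduler mentioned above, which is separate from the cross-application scheduling being asked about; a minimal sketch of enabling it (property names as documented for the 0.8+ line) might look like:

    import org.apache.spark.SparkContext

    object FairPoolsDemo {
      def main(args: Array[String]): Unit = {
        // Switch the within-application job scheduler from FIFO to FAIR
        // before the SparkContext is created.
        System.setProperty("spark.scheduler.mode", "FAIR")

        val sc = new SparkContext("local[4]", "fair-pools-demo")

        // Jobs submitted from this thread go into a named pool; other threads
        // can use other pools and share the application's resources fairly.
        sc.setLocalProperty("spark.scheduler.pool", "interactive")  // pool name is arbitrary
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }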
Re: spark code formatter?
BTW, we also run Scalariform, but we don't turn it on automatically. We find that for the most part it is good, but there are a few places where it reformats things and doesn't look good, and requires cleanup. I think Scalariform requires some more rules to make it more generally useful. -Evan On Thu, Jan 9, 2014 at 12:23 PM, DB Tsai dbt...@alpinenow.com wrote: Initially, we also had the same concern, so we started from a limited set of rules. Gradually, we found that it increases the productivity and readability of our codebase. PS, Scalariform is compatible with the Scala Style Guide in the sense that, given the right preference settings, source code that is initially compliant with the Style Guide will not become uncompliant after formatting. In a number of cases, running the formatter will make uncompliant source more compliant. I added the configuration option in the latest PR to limit the set of rules. The options are listed at https://github.com/mdr/scalariform When developers want to choose their own style for whatever reason, they can use source directives to turn it off with `// format: OFF`. I just quickly ran the formatter, and found that Spark is in general in good shape; most of the changes are extra space after semicolon. - def run[K: Manifest, V <: Vertex : Manifest, M <: Message[K] : Manifest, C: Manifest]( + def run[K: Manifest, V <: Vertex: Manifest, M <: Message[K]: Manifest, C: Manifest]( - def addFile(file: File) : String = { + def addFile(file: File): String = { Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Thu, Jan 9, 2014 at 12:32 AM, Patrick Wendell pwend...@gmail.com wrote: I'm also very wary of using a code formatter for the reasons already mentioned by Reynold. Does Scalariform have a mode where it just provides style checks rather than reformatting the code? This is something we really need for, e.g., reviewing the many submissions to the project. - Patrick On Wed, Jan 8, 2014 at 11:51 PM, Reynold Xin r...@databricks.com wrote: Thanks for doing that, DB. Not sure about others, but I'm actually strongly against blanket automatic code formatters, given that they can be disruptive. Often humans would intentionally choose to style things in a certain way for more clear semantics and better readability. Code formatters don't capture these nuances. It is pretty dangerous to just auto-format everything. Maybe it'd be ok if we restrict the code formatters to a very limited set of things, such as indenting function parameters, etc. On Wed, Jan 8, 2014 at 10:28 PM, DB Tsai dbt...@alpinenow.com wrote: A pull request for Scalariform: https://github.com/apache/incubator-spark/pull/365 Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Jan 8, 2014 at 10:09 PM, DB Tsai dbt...@alpinenow.com wrote: We use sbt-scalariform in our company, and it can automatically format the coding style when running `sbt compile`. https://github.com/sbt/sbt-scalariform We ask our developers to run `sbt compile` before committing, and it's really nice to see everyone using the same spacing and indentation. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Jan 8, 2014 at 9:50 PM, Reynold Xin r...@databricks.com wrote: We have a Scala style configuration file in Shark: https://github.com/amplab/shark/blob/master/scalastyle-config.xml However, the scalastyle project is still pretty primitive and doesn't cover most of the use cases.
It is still great to include it to cover basic checks such as 100-char wide lines. On Wed, Jan 8, 2014 at 8:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Not that I know of. This would be very useful to add, especially if we can make SBT automatically check the code style (or we can somehow plug this into Jenkins). Matei On Jan 8, 2014, at 11:00 AM, Michael Allman m...@allman.ms wrote: Hi, I've read the spark code style guide for contributors here: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide For scala code, do you have a scalariform configuration that you use to format your code to these specs? Cheers, Michael -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala
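The `// format: OFF` escape hatch DB mentions is a Scalariform source directive; a small sketch of how it might be used to protect intentionally hand-aligned code from the formatter:

    object FormatDirectiveDemo {
      // format: OFF
      // Scalariform leaves this deliberately aligned block untouched.
      val lookup = Map(
        "one"   -> 1,
        "two"   -> 2,
        "three" -> 3
      )
      // format: ON

      // Outside the OFF/ON region the formatter applies its normal rules,
      // e.g. normalizing spacing around a result type annotation.
      def addFile(file: java.io.File): String = file.getName
    }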
scala-graph
http://www.scala-graph.org/ Have you guys seen the above site? I wonder if this will ever be merged into the Scala standard library, but it might be interesting to see if this fits into GraphX at all, or to add a Spark backend to it. -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Option folding idiom
+1 for using more functional idioms in general. That's a pretty clever use of `fold`, but putting the default condition first there makes it not as intuitive. What about the following, which are more readable? option.map { a => someFuncMakesB() }.getOrElse(b) option.map { a => someFuncMakesB() }.orElse { a => otherDefaultB() }.get On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra m...@clearstorydata.com wrote: In code added to Spark over the past several months, I'm glad to see more use of `foreach`, `for`, `map` and `flatMap` over `Option` instead of pattern matching boilerplate. There are opportunities to push `Option` idioms even further now that we are using Scala 2.10 in master, but I want to discuss the issue here a little bit before committing code whose form may be a little unfamiliar to some Spark developers. In particular, I really like the use of `fold` with `Option` to cleanly and concisely express the "do something if the Option is None; do something else with the thing contained in the Option if it is Some" code fragment. An example: Instead of... val driver = drivers.find(_.id == driverId) driver match { case Some(d) => if (waitingDrivers.contains(d)) { waitingDrivers -= d } else { d.worker.foreach { w => w.actor ! KillDriver(driverId) } } val msg = s"Kill request for $driverId submitted" logInfo(msg) sender ! KillDriverResponse(true, msg) case None => val msg = s"Could not find running driver $driverId" logWarning(msg) sender ! KillDriverResponse(false, msg) } ...using fold we end up with... driver.fold { val msg = s"Could not find running driver $driverId" logWarning(msg) sender ! KillDriverResponse(false, msg) } { d => if (waitingDrivers.contains(d)) { waitingDrivers -= d } else { d.worker.foreach { w => w.actor ! KillDriver(driverId) } } val msg = s"Kill request for $driverId submitted" logInfo(msg) sender ! KillDriverResponse(true, msg) } So the basic pattern (and my proposed formatting standard) for folding over an `Option[A]` from which you need to produce a B (which may be Unit if you're only interested in side effects) is: anOption.fold { // something that evaluates to a B if anOption == None } { a => // something that transforms `a` into a B if anOption == Some(a) } Any thoughts? Does anyone really, really hate this style of coding and oppose its use in Spark? -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
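A small self-contained version of the comparison, with the Spark-internal driver/actor types replaced by a plain Option[Int] placeholder, may make the trade-off easier to see:

    object OptionFoldDemo {
      def main(args: Array[String]): Unit = {
        val maybeCount: Option[Int] = Some(3)

        // Pattern-matching boilerplate:
        val viaMatch = maybeCount match {
          case Some(n) => s"found $n"
          case None    => "nothing found"
        }

        // The fold idiom: the None branch comes first, then the Some branch.
        val viaFold = maybeCount.fold("nothing found") { n => s"found $n" }

        // The map/getOrElse alternative suggested in the reply, which keeps the
        // default at the end:
        val viaMap = maybeCount.map(n => s"found $n").getOrElse("nothing found")

        assert(viaMatch == viaFold && viaFold == viaMap)
        println(viaFold)
      }
    }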
Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st
to institutionally guarantee this. This is because discussion on mailing lists is one of the core things that defines an Apache community. So at a minimum this means Apache owning the master copy of the bits. All that said, we are discussing the possibility of having a google group that subscribes to each list that would provide an easier to use and prettier archive for each list (so far we haven't gotten that to work). I hope this was helpful. It has taken me a few years now, and a lot of conversations with experienced (and patient!) Apache mentors, to internalize some of the nuance about the Apache way. That's why I wanted to share. Andy On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts masp...@gmail.com wrote: I notice that there are still a lot of active topics in this group: and also activity on the apache mailing list (which is a really horrible experience!). Is it a firm policy on apache's front to disallow external groups? I'm going to be ramping up on spark, and I really hate the idea of having to rely on the apache archives and my mail client. Also: having to search for topics/keywords both in old threads (here) as well as new threads in apache's (clunky) archive, is going to be a pain! I almost feel like I must be missing something because the current solution seems unfeasibly awkward! -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: Akka problem when using scala command to launch Spark applications in the current 0.9.0-SNAPSHOT
Hi Reynold, The default, documented methods of starting Spark all use the assembly jar, and thus java, right? -Evan On Fri, Dec 20, 2013 at 11:36 PM, Reynold Xin r...@databricks.com wrote: It took me hours to debug a problem yesterday on the latest master branch (0.9.0-SNAPSHOT), and I would like to share with the dev list in case anybody runs into this Akka problem. A little background for those of you who haven't followed closely the development of Spark and YARN 2.2: YARN 2.2 uses protobuf 2.5, and Akka uses an older version of protobuf that is not binary compatible. In order to have a single build that is compatible for both YARN 2.2 and pre-2.2 YARN/Hadoop, we published a special version of Akka that builds with protobuf shaded (i.e. using a different package name for the protobuf stuff). However, it turned out Scala 2.10 includes a version of Akka jar in its default classpath (look at the lib folder in Scala 2.10 binary distribution). If you use the scala command to launch any Spark application on the current master branch, there is a pretty high chance that you wouldn't be able to create the SparkContext (stack trace at the end of the email). The problem is that the Akka packaged with Scala 2.10 takes precedence in the classloader over the special Akka version Spark includes. Before we have a good solution for this, the workaround is to use java to launch the application instead of scala. All you need to do is to include the right Scala jars (scala-library and scala-compiler) in the classpath. Note that the scala command is really just a simple script that calls java with the right classpath. Stack trace: java.lang.NoSuchMethodException: akka.remote.RemoteActorRefProvider.<init>(java.lang.String, akka.actor.ActorSystem$Settings, akka.event.EventStream, akka.actor.Scheduler, akka.actor.DynamicAccess) at java.lang.Class.getConstructor0(Class.java:2763) at java.lang.Class.getDeclaredConstructor(Class.java:2021) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:77) at scala.util.Try$.apply(Try.scala:161) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:74) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85) at scala.util.Success.flatMap(Try.scala:200) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:85) at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:546) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:79) at org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:120) at org.apache.spark.SparkContext.<init>(SparkContext.scala:106) -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: Re : Scala 2.10 Merge
changes). Please open new threads on the dev list to report and discuss any issues. This merge will temporarily drop support for YARN 2.2 on the master branch. This is because the workaround we used was only compiled for Scala 2.9. We are going to come up with a more robust solution to YARN 2.2 support before releasing 0.9. Going forward, we will continue to make maintenance releases on branch-0.8 which will remain compatible with Scala 2.9. For those interested, the primary code changes in this merge are upgrading the akka version, changing the use of Scala 2.9's ClassManifest construct to Scala 2.10's ClassTag, and updating the spark shell to work with Scala 2.10's repl. - Patrick -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala
Re: Spark API - support for asynchronous calls - Reactive style [I]
Mark, Thanks. The FutureAction API looks awesome. On Mon, Dec 9, 2013 at 9:31 AM, Mark Hamstra m...@clearstorydata.comwrote: Spark has already supported async jobs for awhile now -- https://github.com/apache/incubator-spark/pull/29, and they even work correctly after https://github.com/apache/incubator-spark/pull/232 There are now implicit conversions from RDD to AsyncRDDActions https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala , where async actions like countAsync are defined. On Mon, Dec 9, 2013 at 5:46 AM, Deenar Toraskar deenar.toras...@db.com wrote: Classification: For internal use only Hi developers Are there any plans to have Spark (and Shark) APIs that are asynchronous and non blocking? APIs that return Futures and Iteratee/Enumerators would be very useful to users building scalable apps using Spark, specially when combined with a fully asynchronous/non-blocking framework like Play!. Something along the lines of ReactiveMongo http://stephane.godbillon.com/2012/08/30/reactivemongo-for-scala-unleashing-mongodb-streaming-capabilities-for-realtime-web Deenar --- This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and delete this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. Please refer to http://www.db.com/en/content/eu_disclosures.htm for additional EU corporate and regulatory disclosures. -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala
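As a pointer for anyone landing on this thread, a minimal sketch of the async API being referenced (the implicit conversion to AsyncRDDActions comes in via SparkContext._ in the 0.8/0.9 line) could look like:

    import scala.concurrent.Await
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.util.{Failure, Success}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // brings in the RDD -> AsyncRDDActions implicit

    object AsyncCountDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "async-count-demo")
        val rdd = sc.parallelize(1 to 1000000)

        val pending = rdd.countAsync()   // returns a FutureAction[Long] without blocking
        pending.onComplete {
          case Success(n)  => println(s"counted $n elements")
          case Failure(ex) => println(s"count failed: $ex")
        }

        // FutureAction also allows cancelling the underlying Spark job:
        // pending.cancel()

        val total = Await.result(pending, Duration.Inf)  // or block until it finishes
        println(s"blocking result: $total")
        sc.stop()
      }
    }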
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
I'd be personally fine with a standard workflow of assemble-deps + packaging just the Spark files as separate packages, if it speeds up everyone's development time. On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: I don't know how to make sense of the numbers, but here's what I've got from a very small sample size. For both v0.8.0-incubating and v0.8.1-incubating, building separate assemblies is faster than `./sbt/sbt assembly`, and the times for building separate assemblies for 0.8.0 and 0.8.1 are about the same. For v0.8.0-incubating, `./sbt/sbt assembly` takes about 2.5x as long as the sum of the separate assemblies. For v0.8.1-incubating, `./sbt/sbt assembly` takes almost 8x as long as the sum of the separate assemblies. Weird. On Wed, Dec 11, 2013 at 11:49 AM, Patrick Wendell pwend...@gmail.com wrote: I'll +1 myself also. For anyone who has the slow build problem: does this issue happen when building v0.8.0-incubating also? Trying to figure out whether it's related to something we added in 0.8.1 or if it's a long-standing issue. - Patrick On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Woah, weird, but definitely good to know. If you’re doing Spark development, there’s also a more convenient option added by Shivaram in the master branch. You can do sbt assemble-deps to package *just* the dependencies of each project in a special assembly JAR, and then use sbt compile to update the code. This will use the classes directly out of the target/scala-2.9.3/classes directories. You have to redo assemble-deps only if your external dependencies change. Matei On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (an MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: Forgot to mention: after running sbt/sbt assembly/assembly, running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, I was trying this out and followed the quick start guide, which says to do sbt/sbt assembly; like a few others, I was also stuck for a few minutes on Linux. On the other hand, if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this? It will not be great for first-time users to get stuck there. On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1.
The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8 Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Saturday, December 14th at 01:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/ -- s -- s -- s -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com
Re: [ANNOUNCE] Welcoming two new Spark committers: Tom Graves and Prashant Sharma
Congrats! On Wed, Nov 13, 2013 at 9:21 PM, karthik tunga karthik.tu...@gmail.com wrote: Congrats Tom and Prashant :) Cheers, Karthik On Nov 13, 2013 7:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, The Apache Spark PPMC is happy to welcome two new PPMC members and committers: Tom Graves and Prashant Sharma. Tom has been maintaining and expanding the YARN support in Spark over the past few months, including adding big features such as support for YARN security, and recently contributed a major patch that adds security to all of Spark’s internal communication services as well ( https://github.com/apache/incubator-spark/pull/120). It will be great to have him continue expanding these and other features as a committer. Prashant created and has maintained the Scala 2.10 branch of Spark for 6+ months now, including tackling the hairy task of porting the Spark interpreter to 2.10, and debugging all the issues raised by that with third-party libraries. He's also contributed bug fixes and new input sources to Spark Streaming. The Scala 2.10 branch will be merged into master soon. We’re very excited to have both Tom and Prashant join the project as committers. The Apache Spark PPMC -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: SPARK-942
+1 for IteratorWithSizeEstimate. I believe today only HadoopRDDs are able to give fine-grained progress; with an enhanced iterator interface (which can still expose the base Iterator trait) we can extend the possibility of fine-grained progress to all RDDs that implement the enhanced iterator. On Tue, Nov 12, 2013 at 11:07 AM, Stephen Haberman stephen.haber...@gmail.com wrote: The problem is that the iterator interface only defines 'hasNext' and 'next' methods. Just a comment from the peanut gallery, but FWIW it seems like being able to ask "how much data is here" would be a useful thing for Spark to know, even if that means moving away from Iterator itself, or something like IteratorWithSizeEstimate/something/something. Not only for this, but so that, ideally, Spark could basically do dynamic partitioning. E.g. when we load a month's worth of data, it's X GB, but after a few maps and filters, it's X/100 GB, so we could use X/100 partitions instead. But right now all partitioning decisions are made up-front, via .coalesce/etc. type hints from the programmer, and it seems that if Spark could delay making partitioning decisions until each RDD could lazily eval/sample a few lines (hand waving), that would be super sexy from our perspective, in terms of doing automatic perf/partition optimization. Huge disclaimer that this is probably a big pita to implement, and could likely not be as worthwhile as I naively think it would be. - Stephen -- -- Evan Chan Staff Engineer e...@ooyala.com |
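Nothing like this exists in Spark today; purely to make the idea concrete, a sketch of what an IteratorWithSizeEstimate-style interface might look like (all names here are hypothetical):

    // Hypothetical: an iterator that still exposes the base Iterator trait but can
    // also report how much data it expects to produce, enabling progress reporting
    // and, potentially, dynamic partitioning decisions.
    trait IteratorWithSizeEstimate[+A] extends Iterator[A] {
      /** Best-effort estimate of the total number of elements, if known. */
      def estimatedTotalSize: Option[Long]
    }

    class CountingIterator[A](underlying: Iterator[A], total: Long)
      extends IteratorWithSizeEstimate[A] {
      private var seen = 0L
      override def hasNext: Boolean = underlying.hasNext
      override def next(): A = { seen += 1; underlying.next() }
      override def estimatedTotalSize: Option[Long] = Some(total)
      /** Fraction of the partition consumed so far -- the fine-grained progress hook. */
      def progress: Double = if (total == 0) 1.0 else seen.toDouble / total
    }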
Re: Getting failures in FileServerSuite
Must be a local environment thing, because AmpLab Jenkins can't reproduce it. :-p On Wed, Oct 30, 2013 at 11:10 AM, Josh Rosen rosenvi...@gmail.com wrote: Someone on the users list also encountered this exception: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C64474308D680D540A4D8151B0F7C03F7025EF289%40SHSMSX104.ccr.corp.intel.com%3E On Wed, Oct 30, 2013 at 9:40 AM, Evan Chan e...@ooyala.com wrote: I'm at the latest commit f0e23a023ce1356bc0f04248605c48d4d08c2d05 Merge: aec9bf9 a197137 Author: Reynold Xin r...@apache.org Date: Tue Oct 29 01:41:44 2013 -0400 and seeing this when I do a test-only FileServerSuite: 13/10/30 09:35:04.300 INFO DAGScheduler: Completed ResultTask(0, 0) 13/10/30 09:35:04.307 INFO LocalTaskSetManager: Loss was due to java.io.StreamCorruptedException java.io.StreamCorruptedException: invalid type code: AC at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:94) at org.apache.spark.rdd.MapPartitionsWithContextRDD.compute(MapPartitionsWithContextRDD.scala:40) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237) at org.apache.spark.rdd.RDD.iterator(RDD.scala:226) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:680) Anybody else seen this yet? I have a really simple PR and this fails without my change, so I may go ahead and submit it anyways. -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Suggestion/Recommendation for language bindings
+1 for Ruby, as Ruby already has a functional-ish collection API, so mapping the RDD functional transforms over would be pretty idiomatic. Would love to review this. On Tue, Oct 15, 2013 at 10:13 AM, Patrick Wendell pwend...@gmail.com wrote: I think Ruby integration via JRuby would be a great idea. On Tue, Oct 15, 2013 at 9:45 AM, Ryan Weald r...@weald.com wrote: Writing a JRuby wrapper around the existing Java bindings would be pretty cool. Could help to get some of the Ruby community to start using the Spark platform. -Ryan On Mon, Oct 14, 2013 at 12:07 PM, Aaron Babcock aaron.babc...@gmail.com wrote: Hey Laksh, Not sure if you are interested in Groovy at all, but I've got the beginning of a project here: https://github.com/bunions1/groovy-spark-example The idea is to map Groovy idioms: myRdd.collect{ row -> newRow } to Spark API calls myRdd.map( row => newRow ) and support a good repl. It's not officially related to Spark at all and is very early stage, but maybe it will be a point of reference for you. On Mon, Oct 14, 2013 at 12:42 PM, Laksh Gupta glaks...@gmail.com wrote: Hi I am interested in contributing to the project and want to start with supporting a new programming language on Spark. I can see that Spark already supports Java and Python. Would someone provide me with some suggestions/references to start with? I think this would be a great learning experience for me. Thank you in advance. -- - Laksh Gupta -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Development environments
Matei, I wonder if we can further optimize / reduce the size of the assembly. One idea is to produce just a core assembly, and have the other projects produce their own assemblies which exclude the core dependencies. Also, DistributedSuite is pretty slow. would it make sense to tag certain tests as the core tests and give it a separate build target? The overall tests that include DistributedSuite can trigger assembly, but then it would be much faster to run the core tests. -Evan On Wed, Oct 9, 2013 at 12:54 AM, Matei Zaharia matei.zaha...@gmail.comwrote: For most development, you might not need to do assembly. You can run most of the unit tests if you just do sbt compile -- only the ones that spawn processes, like DistributedSuite, won't work. That said, we are looking to optimize assembly by maybe having it only package the dependencies rather than Spark itself -- there were some messages on this earlier. For now I'd just recommend doing it in a RAMFS if possible (symlink the assembly/target directory to be a RAMFS). Matei On Oct 9, 2013, at 12:45 AM, Evan Chan e...@ooyala.com wrote: Once you have compiled everything the first time using SBT (assembly will do that for you), successive runs of assembly are much faster. I just did it on my MacBook Pro in about 36 seconds. Running builds using IntelliJ or an IDE is wasted time, because the compiled classes go to a different place than SBT. Maybe there's some way to symlink them. -Evan On Tue, Oct 8, 2013 at 6:29 AM, Markus Losoi markus.lo...@gmail.com wrote: Hi Markus, have a look at the bottom of this wiki page: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark IntelliJ IDEA seems to be quite popular (that I am using myself) although Eclipse should work fine, too. There is another sbt plugin for generating Eclipse project files. The IDE seems to work nicely, but what is the fastest way to build Spark? If I make a change to the core module and choose Make Module 'core' from the Build menu in IntelliJ Idea, then the IDE compiles the source code. To create the spark-assembly-0.8.0-incubating-hadoop1.0.4.jar JAR file, I have run sbt assembly on the command line. However, this takes an impractically long time (843 s when I last ran it on my workstation with an Intel Core 2 Quad Q9400 and 8 GB of RAM). Is there any faster way? Best regards, Markus Losoi (markus.lo...@gmail.com) -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala
Re: Propose to Re-organize the scripts and configurations
driver context will override any options specified in default. This sounds great to me! The one thing I'll add is that we might want to prevent applications from overriding certain settings on each node, such as work directories. The best way is probably to just ignore the app's version of those settings in the Executor. If you guys would like, feel free to write up this design on SPARK-544 and start working on it. I think it looks good. Matei -- *Shane Huang * *Intel Asia-Pacific R&D Ltd.* *Email: shengsheng.hu...@intel.com* -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: off-heap RDDs
Haoyuan, Thanks, that sounds great, exactly what we are looking for. We might be interested in integrating Tachyon with CFS (Cassandra File System, the Cassandra-based implementation of HDFS). -Evan On Sat, Aug 31, 2013 at 3:33 PM, Haoyuan Li haoyuan...@gmail.com wrote: Evan, If I understand you correctly, you want to avoid network I/O as much as possible by caching the data on the node that has the data on disk. Actually, what I meant is that client caching would automatically do this. For example, suppose you have a cluster of machines, nothing cached in memory yet. Then a Spark application runs on it. Spark asks Tachyon where data X is. Since nothing is in memory yet, Tachyon would return disk locations for the first time. Then the Spark program will try to take advantage of disk data locality, and load the data X in HDFS node N into the off-heap memory of node N. In the future, when Spark asks Tachyon for the location of X, Tachyon will return node N. There is no network I/O involved in the whole process. Let me know if I misunderstood something. Haoyuan On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan e...@ooyala.com wrote: Hey guys, I would also prefer to strengthen and get behind Tachyon, rather than implement a separate solution (though I guess if it's not officially supported, then nobody will ask questions). But it's more that off-heap memory is difficult, so it's better to focus efforts on one project, is my feeling. Haoyuan, Tachyon brings cached HDFS data to the local client. Have we thought about the opposite approach, which might be more efficient? - Load the data in HDFS node N into the off-heap memory of node N - In Spark, inform the framework (maybe via RDD partition/location info) of where the data is, that it is located in node N - Bring the computation to node N This avoids network IO and may be much more efficient for many types of applications. I know this would be a big win for us. -Evan On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li haoyuan...@gmail.com wrote: No problem. Like reading/writing data from/to an off-heap ByteBuffer, when a program reads/writes data from/to Tachyon, Spark/Shark needs to do ser/de. Efficient ser/de will help performance a lot, as people pointed out. One solution is that the application can do primitive operations directly on a ByteBuffer, like how Shark is handling it now. Most related code is located at https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2 and https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon . Haoyuan On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid im...@therashids.com wrote: Thanks Haoyuan. It seems like we should try out Tachyon, sounds like it is what we are looking for. On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li haoyuan...@gmail.com wrote: Response inline. On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid im...@therashids.com wrote: Thanks for all the great comments and discussion. Let me expand a bit on our use case, and then I'm gonna combine responses to various questions. In general, when we use Spark, we have some really big RDDs that use up a lot of memory (10s of GB per node) that are really our core data sets. We tend to start up a Spark application, immediately load all those data sets, and just leave them loaded for the lifetime of that process. We definitely create a lot of other RDDs along the way, and lots of intermediate objects that we'd like to go through normal garbage collection. But those all require much less memory, maybe 1/10th of the big RDDs that we just keep around.
I know this is a bit of a special case, but it seems like it probably isn't that different from a lot of use cases. Reynold Xin wrote: This is especially attractive if the application can read directly from a byte buffer without generic serialization (like Shark). Interesting -- can you explain how this works in Shark? Do you have some general way of storing data in byte buffers that avoids serialization? Or do you mean that if the user is effectively creating an RDD of ints, you create an RDD[ByteBuffer], and then you read / write ints into the byte buffer yourself? Sorry, I'm familiar with the basic idea of Shark but not the code at all -- even a pointer to the code would be helpful. Haoyuan Li wrote: One possible solution is that you can use Tachyon https://github.com/amplab/tachyon. This is a good idea that I had probably overlooked. There are two potential issues that I can think of with this approach, though: 1) I was under the impression that Tachyon is still not really tested in production
Re: [Licensing check] Spark 0.8.0-incubating RC1
Matei, On Tue, Sep 3, 2013 at 7:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, what docs are you planning to write? Something on make-distribution.sh would be nice. Yes, I was planning to write something on make-distribution.sh, and reference it in the standalone and Mesos deploy guides. I was also going to document public methods in SparkContext that have not been documented before, such as getPersistentRdds, getExecutorStatus etc. Some folks on my team didn't realize that such methods existed, as they were not in the docs. -Evan Matei On Sep 3, 2013, at 4:57 PM, Evan Chan e...@ooyala.com wrote: Sorry one more clarification. For doc pull requests for the 0.8 release, should these be done against the existing mesos/spark repo, or against the mirror at apache/incubator-spark? I'm hoping to clear up a couple things in the docs before the release this week. thanks, -Evan On Tue, Sep 3, 2013 at 2:25 PM, Mark Hamstra m...@clearstorydata.com wrote: Okay, so is there any way to get github's compare view to be happy with differently-named repositories? What I want is to be able to compare and generate a pull request between github.com/apache/incubator-spark:master and, e.g., github.com/markhamstra/spark:myBranch and not need to create a new github.com/markhamstra/incubator-spark. It's bad enough that the differently-named repos don't show up in the compare-view drop-down choices, but I also haven't been able to find a hand-crafted URL that will make this work. On Tue, Sep 3, 2013 at 1:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, the plan is as follows: - Make pull request against the mirror - Code review on GitHub as usual - Whoever merges it will simply merge it into the main Apache repo; when this propagates, the PR will be marked as merged I found at least one other Apache project that did this: http://wiki.apache.org/cordova/ContributorWorkflow. Matei On Sep 3, 2013, at 10:39 AM, Mark Hamstra m...@clearstorydata.com wrote: What is going to be the process for making pull requests? Can they be made against the github mirror (https://github.com/apache/incubator-spark), or must we use some other way? On Tue, Sep 3, 2013 at 10:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi guys, So are you planning to release 0.8 from the master branch (which is at a106ed8... now) or from branch-0.8? Right now the branches are the same in terms of content (though I might not have merged the latest changes into 0.8). If we add stuff into master that we won't want in 0.8 we'll break that. My recommendation is that we start to use the Incubator release doc/guide: http://incubator.apache.org/guides/releasemanagement.html Cool, thanks for the pointer. I'll try to follow the steps there about signing. Are we locking pull requests to the GitHub repo by tomorrow? Meaning no more pushes to the GitHub repo for Spark. From your email it seems like there will be more potential pull requests for the GitHub repo to be merged back to the ASF Git repo. We'll probably use the GitHub repo for the last few changes in this release and then switch. The reason is that there's a bit of work to do pull requests against the Apache one. Matei -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: [Licensing check] Spark 0.8.0-incubating RC1
Sorry just to be super clear but the GitHub repo (ie PRs for 0.8 docs) refers to: a) github.com/mesos/spark b) github.com/apache/incubator-spark I'm assuming a) but don't want to be wrong thanks! On Tue, Sep 3, 2013 at 7:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, please do the PRs against the GitHub repo for now. Matei On Sep 3, 2013, at 4:57 PM, Evan Chan e...@ooyala.com wrote: Sorry one more clarification. For doc pull requests for 0.8 release, should these be done against the existing mesos/spark repo, or against the mirror at apache/incubator-spark ? I'm hoping to clear up a couple things in the docs before the release this week. thanks, -Evan On Tue, Sep 3, 2013 at 2:25 PM, Mark Hamstra m...@clearstorydata.com wrote: Okay, so is there any way to get github's compare view to be happy with differently-named repositories? What I want is to be able to compare and generate a pull request between github.com/apache/incubator-spark:masterand, e.g., github.com/markhamstra/spark:myBranch and not need to create a new github.com/markhamstra/incubator-spark. It's bad enough that the differently-named repos don't show up in the compare-view drop-down choices, but I also haven't been able to find a hand-crafted URL that will make this work. On Tue, Sep 3, 2013 at 1:38 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Yup, the plan is as follows: - Make pull request against the mirror - Code review on GitHub as usual - Whoever merges it will simply merge it into the main Apache repo; when this propagates, the PR will be marked as merged I found at least one other Apache project that did this: http://wiki.apache.org/cordova/ContributorWorkflow. Matei On Sep 3, 2013, at 10:39 AM, Mark Hamstra m...@clearstorydata.com wrote: What is going to be the process for making pull requests? Can they be made against the github mirror (https://github.com/apache/incubator-spark), or must we use some other way? On Tue, Sep 3, 2013 at 10:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi guys, So are you planning to release 0.8 from the master branch (which is at a106ed8... now) or from branch-0.8? Right now the branches are the same in terms of content (though I might not have merged the latest changes into 0.8). If we add stuff into master that we won't want in 0.8 we'll break that. My recommendation is that we start to use the Incubator release doc/guide: http://incubator.apache.org/guides/releasemanagement.html Cool, thanks for the pointer. I'll try to follow the steps there about signing. Are we locking pull requests to github repo by tomorrow? Meaning no more push to GitHub repo for Spark. From your email seems like there will be more potential pull requests for github repo to be merged back to ASF Git repo. We'll probably use the GitHub repo for the last few changes in this release and then switch. The reason is that there's a bit of work to do pull requests against the Apache one. Matei -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |