Re: Anyone wants to look at SPARK-1123?
Hi, what KeyClass and ValueClass are you trying to save as the keys/values of your dataset? On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all I found a weird thing with saveAsNewAPIHadoopFile in PairRDDFunctions.scala when working on another issue: saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time. I checked the commit history of the file, and it seems that the API has existed for a long time; has no one else found this? (that's the reason I'm confused) Best, -- Nan Zhu
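For reference, a minimal sketch of the kind of call under discussion, with illustrative path and Writable types. As a general Hadoop note, java.lang.InstantiationException usually points at an output format that cannot be instantiated reflectively (e.g. an abstract class, or an old-API org.apache.hadoop.mapred format passed where the new API is expected); whether that was the cause in SPARK-1123 is not settled in this thread:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // brings in rddToPairRDDFunctions

object SaveSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "save-sketch")
    // Convert to Writable key/value types before saving.
    val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
      .map { case (k, v) => (new IntWritable(k), new Text(v)) }
    // KeyClass, ValueClass, and a concrete new-API OutputFormat are passed explicitly.
    pairs.saveAsNewAPIHadoopFile(
      "/tmp/out", // illustrative path
      classOf[IntWritable],
      classOf[Text],
      classOf[SequenceFileOutputFormat[IntWritable, Text]])
    sc.stop()
  }
}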
Re: Spark 0.8.1 on Amazon Elastic MapReduce
Thanks Parviz, this looks great and good to see it getting updated. Look forward to 0.9.0! A perhaps stupid question - where does the KinesisWordCount example live? Is that an Amazon example, since I don't see it under the streaming examples included in the Spark project. If it's a third-party example, is it possible to get the code? Thanks Nick On Fri, Feb 14, 2014 at 6:53 PM, Deyhim, Parviz parv...@amazon.com wrote: Spark community, Wanted to let you know that the version of Spark and Shark on Amazon Elastic MapReduce has been updated to 0.8.1. This new version provides a much better experience in terms of stability and performance, and also supports the following features: - Integration with Amazon Cloudwatch - Integration of Spark Streaming with Amazon Kinesis - Automatic log shipping to S3 For complete details of the features Spark on EMR provides, please see the following article: http://aws.amazon.com/articles/4926593393724923 And yes I'm working hard to push another update to support 0.9.0 :) What would be great is to hear from the community on what other features you'd like to see on Spark on EMR. For example, how useful is autoscaling for Spark? Any other features you'd like to see? Thanks, *Parviz Deyhim* Solutions Architect *Amazon Web Services http://aws.amazon.com/* E: parv...@amazon.com M: 408.315.2305
Re: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
@fommil @mengxr I think it's always worth a shot at a license change. The scikit-learn devs have been successful before in getting such things over the line. Assuming we can make that happen, what do folks think about MTJ vs Breeze vs JBLAS + commons-math, since these seem like the viable alternatives? — Sent from Mailbox for iPhone On Fri, Feb 14, 2014 at 1:21 AM, mengxr g...@git.apache.org wrote: Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35038739 @fommil I don't quite understand what "roll their own" means exactly here. I didn't propose to re-implement one or half of a linear algebra library in the PR. For the license issue, it would be great if the original author of MTJ agrees to change the license to Apache. With the LGPL license, there is not much we can do.
Re: [VOTE] Graduation of Apache Spark from the Incubator
+1 On Tue, Feb 11, 2014 at 9:17 AM, Matt Massie mas...@berkeley.edu wrote: +1 -- Matt Massie UC, Berkeley AMPLab Twitter: @matt_massie https://twitter.com/matt_massie, @amplab https://twitter.com/amplab https://amplab.cs.berkeley.edu/ On Mon, Feb 10, 2014 at 11:12 PM, Zongheng Yang zonghen...@gmail.com wrote: +1 On Mon, Feb 10, 2014 at 10:21 PM, Reynold Xin r...@databricks.com wrote: Actually I made a mistake by saying binding. Just +1 here. On Mon, Feb 10, 2014 at 10:20 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Nathan, anybody is welcome to VOTE. Thank you. Only VOTEs from the Incubator PMC are considered binding, but I welcome and will tally all VOTEs provided. Cheers, Chris -Original Message- From: Nathan Kronenfeld nkronenf...@oculusinfo.com Reply-To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Date: Monday, February 10, 2014 9:44 PM To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Subject: Re: [VOTE] Graduation of Apache Spark from the Incubator Who is allowed to vote on stuff like this? On Mon, Feb 10, 2014 at 11:27 PM, Chris Mattmann mattm...@apache.org wrote: Hi Everyone, This is a new VOTE to decide if Apache Spark should graduate from the Incubator. Please VOTE on the resolution pasted below the ballot. I'll leave this VOTE open for at least 72 hours. Thanks! [ ] +1 Graduate Apache Spark from the Incubator. [ ] +0 Don't care. [ ] -1 Don't graduate Apache Spark from the Incubator because... Here is my +1 binding for graduation. Cheers, Chris snip WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software, for distribution at no charge to the public, related to fast and flexible large-scale data analysis on clusters.
NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Spark Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Spark Project be and hereby is responsible for the creation and maintenance of software related to fast and flexible large-scale data analysis on clusters; and be it further RESOLVED, that the office of Vice President, Apache Spark be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Spark Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Spark Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Spark Project: * Mosharaf Chowdhury mosha...@apache.org * Jason Dai jason...@apache.org * Tathagata Das t...@apache.org * Ankur Dave ankurd...@apache.org * Aaron Davidson a...@apache.org * Thomas Dudziak to...@apache.org * Robert Evans bo...@apache.org * Thomas Graves tgra...@apache.org * Andy Konwinski and...@apache.org * Stephen Haberman steph...@apache.org * Mark Hamstra markhams...@apache.org * Shane Huang shane_hu...@apache.org * Ryan LeCompte ryanlecom...@apache.org * Haoyuan Li haoy...@apache.org * Sean McNamara mcnam...@apache.org * Mridul Muralidharam mridul...@apache.org * Kay Ousterhout kayousterh...@apache.org * Nick Pentreath mln...@apache.org * Imran Rashid iras...@apache.org * Charles Reiss wog...@apache.org * Josh Rosen joshro...@apache.org * Prashant Sharma prash...@apache.org * Ram Sriharsha har...@apache.org * Shivaram Venkataraman shiva...@apache.org * Patrick Wendell pwend...@apache.org * Andrew Xia xiajunl...@apache.org * Reynold Xin r...@apache.org * Matei Zaharia ma...@apache.org NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matei Zaharia be appointed to the office of Vice President, Apache Spark, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed; and be it further RESOLVED, that the Apache Spark Project be and hereby is tasked with the migration and rationalization of the Apache Incubator Spark podling; and be it further RESOLVED, that all responsibilities pertaining to the Apache Incubator Spark podling encumbered upon the Apache Incubator Project are hereafter discharged
Fwd: Represent your project at ApacheCon
Is Spark active in submitting anything for this? -- Forwarded message -- From: Rich Bowen rbo...@redhat.com Date: Mon, Jan 27, 2014 at 4:20 PM Subject: Represent your project at ApacheCon To: committ...@apache.org Folks, 5 days from the end of the CFP, we have only 50 talks submitted. We need three times that just to fill the space, and preferably a lot more so that we have some variety to choose from to put together a schedule. I know that we usually have over half the content submitted in the last 48 hours, so I'm not panicking yet, but it's worrying. More worrying, however, is that 2/3 of those submissions are from the Usual Suspects (i.e., httpd and Tomcat), and YOUR project isn't represented. We would love to have a whole day of Lucene, and of OpenOffice, and of Cordova, and of Felix and Celix and Helix and Nelix. Or a half day. We need your talk submissions. We need you to come tell the world why your project matters, why you spend your time working on it, and what exciting new thing you hacked into it during the snow storms. (Or heat wave, as the case may be.) Please help us get the word out to your developer and user communities that we're looking for quality talks about their favorite Apache project, about related technologies, about ways that it's being used, and plans for its future. Help us make this ApacheCon amazing. --rcb -- Rich Bowen - rbo...@redhat.com OpenStack Community Liaison http://openstack.redhat.com/
Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations?
If you want to spend the time running 50 iterations, you're better off re-running 5x10 iterations with different random starts to get a better local minimum... — Sent from Mailbox for iPhone On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote: I looked into this after I opened that JIRA and it’s actually a bit harder to fix. While changing these visit() calls to use a stack manually instead of being recursive helps avoid a StackOverflowError there, you still get a StackOverflowError when you send the task to a worker node, because Java serialization uses recursion. The only real fix with the current codebase is therefore to increase your JVM stack size. Longer-term, I’d like us to automatically call checkpoint() to break lineage graphs when they exceed a certain size, which would avoid the problems in both DAGScheduler and Java serialization. We could also manually add this to ALS now without having a solution for other programs. That would be a great change to make to fix this JIRA. Matei On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava m...@ewencp.org wrote: The three obvious ones in DAGScheduler.scala are in getParentStages, getMissingParentStages, and stageDependsOn. They all follow the same pattern though (def visit(), followed by visit(root)), so they should be easy to replace with a Scala stack in place of the call stack. Shao, Saisai January 25, 2014 at 10:52 PM In my test I found this phenomenon might be caused by the RDD's long dependency chain: the dependency chain is serialized into the task and sent to each executor, and deserializing the task causes the stack overflow. This happens especially in iterative jobs, like:

var rdd = ...
for (i <- 0 to 100) rdd = rdd.map(x => x)
rdd = rdd.cache

Here the rdd's dependencies will be chained, and at some point a stack overflow will occur. You can check (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ) and (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ) for details. The current workaround is to cut the dependency chain by checkpointing the RDD; maybe a better way is to clean the dependency chain after the materialized stage is executed. Thanks Jerry -Original Message- From: Reynold Xin [mailto:r...@databricks.com] Sent: Sunday, January 26, 2014 2:04 PM To: dev@spark.incubator.apache.org Subject: Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations? I'm not entirely sure, but two candidates are the visit function in stageDependsOn and submitStage. Reynold Xin January 25, 2014 at 10:03 PM I'm not entirely sure, but two candidates are the visit function in stageDependsOn and submitStage. Aaron Davidson January 25, 2014 at 10:01 PM I'm an idiot, but which part of the DAGScheduler is recursive here? Seems like processEvent shouldn't have inherently recursive properties. Reynold Xin January 25, 2014 at 9:57 PM It seems to me fixing DAGScheduler to make it not recursive is the better solution here, given the cost of checkpointing. Xia, Junluan January 25, 2014 at 9:49 PM Hi all The description of this bug submitted by Matei is as follows: "The tipping point seems to be around 50. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS installed, which not all users will have. We might also be able to fix DAGScheduler to not be recursive." regards, Andrew
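A minimal sketch of the periodic-checkpoint workaround discussed in this thread, applied to an iterative job like Jerry's example. The checkpoint interval, directory, and job itself are illustrative; on a real cluster the checkpoint directory should live on a reliable filesystem such as HDFS:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LineageDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "lineage-demo")
    // Checkpoint data must go somewhere reliable (HDFS in production);
    // a local path is used here purely for illustration.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    var rdd: RDD[Int] = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(x => x + 1)
      // Break the lineage chain every 10 iterations so that neither the
      // DAGScheduler's recursive visit() nor Java task serialization has
      // to walk a 100-deep dependency graph.
      if (i % 10 == 0) {
        rdd.cache()      // keep the data around so checkpointing is cheap
        rdd.checkpoint() // truncates the dependencies at the next action
        rdd.count()      // force materialization and the checkpoint now
      }
    }
    println(rdd.count())
    sc.stop()
  }
}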
Re: Option folding idiom
+1 for getOrElse When I was new to Scala I tended to use match almost like if/else statements with Option. These days I try to use map/flatMap instead and use getOrElse extensively, and I for one find it very intuitive. I also agree that the fold syntax seems way less intuitive, and I certainly prefer readable Scala code to that which might be more idiomatic but which I honestly tend to find very inscrutable and hard to grok quickly. — Sent from Mailbox for iPhone On Fri, Dec 27, 2013 at 9:06 AM, Matei Zaharia matei.zaha...@gmail.com wrote: I agree about using getOrElse instead. In choosing which code style and idioms to use, my goal has always been to maximize the ease of *other developers* understanding the code, and most developers today still don’t know Scala. It’s fine to use maps or matches, because their meaning is obvious, but fold on Option is not obvious (even foreach is kind of weird for new people). In this case the benefit is so small that it doesn’t seem worth it. Note that if you use getOrElse, you can even throw exceptions in the “else” part if you’d like. (This is because Nothing is a subtype of every type in Scala.) So for example you can do val stuff = option.getOrElse(throw new Exception(“It wasn’t set”)). It looks a little weird, but note how the meaning is obvious even if you don’t know anything about the type system. Matei On Dec 27, 2013, at 12:12 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: I agree with what Reynold said -- there's not a big benefit in terms of lines of code (esp. compared to using getOrElse) and I think it hurts code readability. One of the great things about the current Spark codebase is that it's very accessible for newcomers -- something that would be less true with this use of fold. On Thu, Dec 26, 2013 at 8:11 PM, Holden Karau hol...@pigscanfly.ca wrote: I personally agree with Evan in that I prefer map with getOrElse over fold with options (but that's just my personal preference) :) On Thu, Dec 26, 2013 at 7:58 PM, Reynold Xin r...@databricks.com wrote: I'm not strongly against Option.fold, but I find the readability getting worse for the use case you brought up. For the use case of if/else, I find Option.fold pretty confusing because it reverses the order of Some vs None. Also, when code gets long, the lack of an obvious boundary (the only boundary is } {) with two closures is pretty confusing. On Thu, Dec 26, 2013 at 4:23 PM, Mark Hamstra m...@clearstorydata.com wrote: On the contrary, it is the completely natural place for the initial value of the accumulator, and provides the expected result of folding over an empty collection:

scala> val l: List[Int] = List()
l: List[Int] = List()
scala> l.fold(42)(_ + _)
res0: Int = 42
scala> val o: Option[Int] = None
o: Option[Int] = None
scala> o.fold(42)(_ + 1)
res1: Int = 42

On Thu, Dec 26, 2013 at 5:51 PM, Evan Chan e...@ooyala.com wrote: +1 for using more functional idioms in general. That's a pretty clever use of `fold`, but putting the default condition first there makes it not as intuitive. What about the following, which are more readable?

option.map { a => someFuncMakesB() }.getOrElse(b)
option.map { a => someFuncMakesB() }.orElse { a => otherDefaultB() }.get

On Thu, Dec 26, 2013 at 12:33 PM, Mark Hamstra m...@clearstorydata.com wrote: In code added to Spark over the past several months, I'm glad to see more use of `foreach`, `for`, `map` and `flatMap` over `Option` instead of pattern matching boilerplate.
There are opportunities to push `Option` idioms even further now that we are using Scala 2.10 in master, but I want to discuss the issue here a little bit before committing code whose form may be a little unfamiliar to some Spark developers. In particular, I really like the use of `fold` with `Option` to cleanly and concisely express the "do something if the Option is None; do something else with the thing contained in the Option if it is Some" code fragment. An example: Instead of...

val driver = drivers.find(_.id == driverId)
driver match {
  case Some(d) =>
    if (waitingDrivers.contains(d)) { waitingDrivers -= d }
    else { d.worker.foreach { w => w.actor ! KillDriver(driverId) } }
    val msg = s"Kill request for $driverId submitted"
    logInfo(msg)
    sender ! KillDriverResponse(true, msg)
  case None =>
    val msg = s"Could not find running driver $driverId"
    logWarning(msg)
    sender ! KillDriverResponse(false, msg)
}

...using fold we end up with...

driver.fold {
  val msg = s"Could not find running driver $driverId"
  logWarning(msg)
  sender ! KillDriverResponse(false, msg)
} { d =>
  if (waitingDrivers.contains(d)) { waitingDrivers -= d }
  else { d.worker.foreach { w => w.actor ! KillDriver(driverId) } }
  val msg = s"Kill request
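To make the trade-off debated in this thread concrete, here is a small self-contained comparison of the two idioms on a made-up lookup; the names (findUser, User) are illustrative and not from the Spark codebase. Note how fold puts the None case first, the ordering Reynold found confusing, while map/getOrElse keeps the happy path first and names the default explicitly:

object OptionIdioms {
  case class User(name: String)

  def findUser(id: Int): Option[User] =
    if (id == 1) Some(User("matei")) else None

  def main(args: Array[String]) {
    val user = findUser(2)

    // Option.fold: the None (default) case comes first, then the Some case.
    val viaFold = user.fold("<unknown>")(u => u.name)

    // map + getOrElse: the Some case comes first; getOrElse names the default.
    val viaMap = user.map(_.name).getOrElse("<unknown>")

    assert(viaFold == viaMap)
    println(viaFold) // prints "<unknown>" because findUser(2) is None
  }
}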
Re: Spark development for undergraduate project
Or if you're extremely ambitious, work on implementing Spark Streaming in Python — Sent from Mailbox for iPhone On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Matt, If you want to get started looking at Spark, I recommend the following resources: - Our issue tracker at http://spark-project.atlassian.net contains some issues marked “Starter” that are good places to jump into. You might be able to take one of those and extend it into a bigger project. - The “contributing to Spark” wiki page covers how to send patches and set up development: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark - This talk has an intro to Spark internals (video and slides are in the comments): http://www.meetup.com/spark-users/events/94101942/ For a longer project, here are some possible ones: - Create a tool that automatically checks which Scala API methods are missing in Python. We had a similar one for Java that was very useful. Even better would be to automatically create wrappers for the Scala ones. - Extend the Spark monitoring UI with profiling information (to sample the workers and say where they’re spending time, or what data structures consume the most memory). - Pick and implement a new machine learning algorithm for MLlib. Matei On Dec 17, 2013, at 10:43 AM, Matthew Cheah mcch...@uwaterloo.ca wrote: Hi everyone, During my most recent internship, I worked extensively with Apache Spark, integrating it into a company's data analytics platform. I've now become interested in contributing to Apache Spark. I'm returning to undergraduate studies in January and there is an academic course which is simply a standalone software engineering project. I was thinking that some contribution to Apache Spark would satisfy my curiosity, help continue support the company I interned at, and give me academic credits required to graduate, all at the same time. It seems like too good an opportunity to pass up. With that in mind, I have the following questions: 1. At this point, is there any self-contained project that I could work on within Spark? Ideally, I would work on it independently, in about a three month time frame. This time also needs to accommodate ramping up on the Spark codebase and adjusting to the Scala programming language and paradigms. The company I worked at primarily used the Java APIs. The output needs to be a technical report describing the project requirements, and the design process I took to engineer the solution for the requirements. In particular, it cannot just be a series of haphazard patches. 2. How can I get started with contributing to Spark? 3. Is there a high-level UML or some other design specification for the Spark architecture? Thanks! I hope to be of some help =) -Matt Cheah
Re: Spark development for undergraduate project
Some good things to look at, though hopefully #2 will be largely addressed by: https://github.com/apache/incubator-spark/pull/230 — Sent from Mailbox for iPhone On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash and...@andrewash.com wrote: I think there are also some improvements that could be made to deployability in an enterprise setting. From my experience: 1. Most places I deploy Spark in don't have internet access. So I can't build from source, compile against a different version of Hadoop, etc. without doing it locally and then getting that onto my servers manually. This is less of a problem with Spark now that there are binary distributions, but it's still a problem for using Mesos with Spark. 2. Configuration of Spark is confusing -- you can make configuration via Java system properties, environment variables, and command line parameters, and for the standalone cluster deployment mode you need to worry about whether these need to be set on the master, the worker, the executor, or the application/driver program. Also, because spark-shell automatically instantiates a SparkContext, you have to set up any system properties in the init scripts or on the command line with JAVA_OPTS=-Dspark.executor.memory=8g etc. (see the sketch at the end of this thread). I'm not sure what needs to be done, but it feels like there are gains to be made in configuration options here. Ideally, I would have one configuration file that can be used in all 4 places and that's the only place to make configuration changes. 3. Standalone cluster mode could use improved resiliency for starting, stopping, and keeping alive a service -- there are custom init scripts that call each other in a mess of ways: spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh, spark-executor, spark-class, run-example, and several others in the bin/ directory. I would love it if Spark used the Tanuki Service Wrapper, which is widely-used for Java service daemons, supports retries, installation as init scripts that can be chkconfig'd, etc. Let's not re-solve the "how do I keep a service running?" problem when it's been done so well by Tanuki -- we use it at my day job for all our services, plus it's used by Elasticsearch. This would help solve the problem where a quick bounce of the master causes all the workers to self-destruct. 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is entirely an Akka bug based on previous mailing list discussion with Matei, but it'd be awesome if you could use either the hostname or the FQDN or the IP address in the Spark URL and not have Akka barf at you. I've been telling myself I'd look into these at some point but just haven't gotten around to them myself yet. Some day! I would prioritize these requests from most- to least-important as 3, 2, 4, 1. Andrew On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Or if you're extremely ambitious, work on implementing Spark Streaming in Python — Sent from Mailbox for iPhone On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Matt, If you want to get started looking at Spark, I recommend the following resources: - Our issue tracker at http://spark-project.atlassian.net contains some issues marked “Starter” that are good places to jump into. You might be able to take one of those and extend it into a bigger project.
- The “contributing to Spark” wiki page covers how to send patches and set up development: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark - This talk has an intro to Spark internals (video and slides are in the comments): http://www.meetup.com/spark-users/events/94101942/ For a longer project, here are some possible ones: - Create a tool that automatically checks which Scala API methods are missing in Python. We had a similar one for Java that was very useful. Even better would be to automatically create wrappers for the Scala ones. - Extend the Spark monitoring UI with profiling information (to sample the workers and say where they’re spending time, or what data structures consume the most memory). - Pick and implement a new machine learning algorithm for MLlib. Matei On Dec 17, 2013, at 10:43 AM, Matthew Cheah mcch...@uwaterloo.ca wrote: Hi everyone, During my most recent internship, I worked extensively with Apache Spark, integrating it into a company's data analytics platform. I've now become interested in contributing to Apache Spark. I'm returning to undergraduate studies in January and there is an academic course which is simply a standalone software engineering project. I was thinking that some contribution to Apache Spark would satisfy my curiosity, help continue support the company I interned at, and give me academic credits required to graduate, all at the same time. It seems like too good an opportunity to pass up. With that in mind, I have
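To illustrate Andrew's point 2 above, a sketch of how configuration looked in this era (before a unified config object): settings are plain JVM system properties and must be set before the SparkContext is constructed. The master URL and memory value are placeholders:

import org.apache.spark.SparkContext

object ConfigSketch {
  def main(args: Array[String]) {
    // Must happen *before* the SparkContext is created, or it is ignored.
    System.setProperty("spark.executor.memory", "8g")
    // In spark-shell the context already exists at startup, so the
    // equivalent has to go into JAVA_OPTS (e.g. -Dspark.executor.memory=8g),
    // as Andrew notes above.
    val sc = new SparkContext("spark://master:7077", "config-sketch")
    // ... job code ...
    sc.stop()
  }
}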
Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st
One option that is 3rd party that works nicely for the Hadoop project and its related projects is http://search-hadoop.com - managed by Sematext. Perhaps we can plead with Otis to add the Spark lists to a search-spark.com, or to the existing site? Just throwing it out there as a potential solution to at least searching and navigating the Apache lists Sent from my iPad On 20 Dec 2013, at 6:46 AM, Aaron Davidson ilike...@gmail.com wrote: I'd be fine with one-way mirrors here (Apache threads being reflected in Google groups) -- I have no idea how one is supposed to navigate the Apache list to look for historic threads. On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts maspo...@gmail.com wrote: Thanks very much for the prompt and comprehensive reply! I appreciate the overarching desire to integrate with Apache: I'm very happy to hear that there's a move to use the existing groups as mirrors: that will overcome all of my objections: particularly if it's bidirectional! :) On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote: Hey Mike, As you probably noticed when you CC'd spark-de...@googlegroups.com, that list has already been reconfigured so that it no longer allows posting (and bounces emails sent to it). We will be doing the same thing to the spark...@googlegroups.com list too (we'll announce a date for that soon). That may sound very frustrating, and you are *not* alone feeling that way. We've had a long conversation with our mentors about this, and I've felt very similar to you, so I'd like to give you some background. As I'm coming to see it, part of becoming an Apache project is moving the community *fully* over to Apache infrastructure, and more generally the Apache way of organizing the community. This applies in both the nuts-and-bolts sense of being on Apache infra, but possibly more importantly, it is also a guiding principle and way of thinking. In various ways, moving to Apache infra can be a painful process, and IMO the loss of all the great mailing list functionality that comes with using Google Groups is perhaps the most painful step. But basically, the de facto mailing lists need to be the Apache ones, and not Google Groups. The underlying reason is that Apache needs to take full accountability for recording and publishing the mailing lists; it has to be able to institutionally guarantee this. This is because discussion on mailing lists is one of the core things that defines an Apache community. So at a minimum this means Apache owning the master copy of the bits. All that said, we are discussing the possibility of having a Google group that subscribes to each list that would provide an easier to use and prettier archive for each list (so far we haven't gotten that to work). I hope this was helpful. It has taken me a few years now, and a lot of conversations with experienced (and patient!) Apache mentors, to internalize some of the nuance about the Apache way. That's why I wanted to share. Andy On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts masp...@gmail.com wrote: I notice that there are still a lot of active topics in this group, and also activity on the Apache mailing list (which is a really horrible experience!). Is it a firm policy on Apache's front to disallow external groups? I'm going to be ramping up on Spark, and I really hate the idea of having to rely on the Apache archives and my mail client. Also: having to search for topics/keywords both in old threads (here) as well as new threads in Apache's (clunky) archive is going to be a pain!
I almost feel like I must be missing something because the current solution seems unfeasibly awkward!
Re: Spark development for undergraduate project
Another option would be: 1. Add another recommendation model based on mrec's SGD-based model: https://github.com/mendeley/mrec 2. Look at the streaming k-means from Mahout and see if that might be integrated or adapted into MLlib 3. Work on adding to or refactoring the existing linear model framework, for example adaptive learning rate schedules and the adaptive norm stuff from John Langford et al 4. Adding sparse vector/matrix support to MLlib? Sent from my iPad On 20 Dec 2013, at 3:46 AM, Tathagata Das tathagata.das1...@gmail.com wrote: +1 to that (assuming by 'online' Andrew meant an MLLib algorithm from Spark Streaming) Something you can look into is implementing a streaming KMeans. Maybe you can re-use a lot of the offline KMeans code in MLLib. TD On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash and...@andrewash.com wrote: Sounds like a great choice. It would be particularly impressive if you could add the first online learning algorithm (all the current ones are offline I believe) to pave the way for future contributions. On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah mcch...@uwaterloo.ca wrote: Thanks a lot everyone! I'm looking into adding an algorithm to MLib for the project. Nice and self-contained. -Matt Cheah On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen c...@adatao.com wrote: +1 to most of Andrew's suggestions here, and while we're in that neighborhood, how about generalizing something like wtf-spark from the Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high academic interest, but it's something people would use many times a debugging day. Or am I behind and something like that is already there in 0.8? -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash and...@andrewash.com wrote: I think there are also some improvements that could be made to deployability in an enterprise setting. From my experience: 1. Most places I deploy Spark in don't have internet access. So I can't build from source, compile against a different version of Hadoop, etc. without doing it locally and then getting that onto my servers manually. This is less of a problem with Spark now that there are binary distributions, but it's still a problem for using Mesos with Spark. 2. Configuration of Spark is confusing -- you can make configuration via Java system properties, environment variables, and command line parameters, and for the standalone cluster deployment mode you need to worry about whether these need to be set on the master, the worker, the executor, or the application/driver program. Also, because spark-shell automatically instantiates a SparkContext, you have to set up any system properties in the init scripts or on the command line with JAVA_OPTS=-Dspark.executor.memory=8g etc. I'm not sure what needs to be done, but it feels like there are gains to be made in configuration options here. Ideally, I would have one configuration file that can be used in all 4 places and that's the only place to make configuration changes. 3. Standalone cluster mode could use improved resiliency for starting, stopping, and keeping alive a service -- there are custom init scripts that call each other in a mess of ways: spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh, spark-executor, spark-class, run-example, and several others in the bin/ directory.
I would love it if Spark used the Tanuki Service Wrapper, which is widely-used for Java service daemons, supports retries, installation as init scripts that can be chkconfig'd, etc. Let's not re-solve the "how do I keep a service running?" problem when it's been done so well by Tanuki -- we use it at my day job for all our services, plus it's used by Elasticsearch. This would help solve the problem where a quick bounce of the master causes all the workers to self-destruct. 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is entirely an Akka bug based on previous mailing list discussion with Matei, but it'd be awesome if you could use either the hostname or the FQDN or the IP address in the Spark URL and not have Akka barf at you. I've been telling myself I'd look into these at some point but just haven't gotten around to them myself yet. Some day! I would prioritize these requests from most- to least-important as 3, 2, 4, 1. Andrew On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Or if you're extremely ambitious, work on implementing Spark Streaming in Python — Sent from Mailbox for iPhone On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Matt, If you want to get started looking at Spark, I recommend the following resources: - Our issue tracker at http://spark
Re: Intellij IDEA build issues
Thanks Evan, I tried it and the new SBT direct import seems to work well, though I did run into issues with some YARN imports on Spark. n On Thu, Dec 12, 2013 at 7:03 PM, Evan Chan e...@ooyala.com wrote: Nick, have you tried using the latest Scala plug-in, which features native SBT project imports? i.e. you no longer need to run gen-idea. On Sat, Dec 7, 2013 at 4:15 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi Spark Devs, Hoping someone can help me out. No matter what I do, I cannot get Intellij to build Spark from source. I am using IDEA 13. I run sbt gen-idea and everything seems to work fine. When I try to build using IDEA, everything compiles but I get the error below. Have any of you come across the same? ==

Internal error: (java.lang.AssertionError) java/nio/channels/FileChannel$MapMode already declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
at ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:105)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at sbt.compiler.AggressiveCompile$$anonfun$6.compileScala$1(AggressiveCompile.scala:102)
at sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:151)
at sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:89)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala
Re: Scala 2.10 Merge
Whoohoo! Great job everyone, especially Prashant! — Sent from Mailbox for iPhone On Sat, Dec 14, 2013 at 10:59 AM, Patrick Wendell pwend...@gmail.com wrote: Alright I just merged this in - so Spark is officially Scala 2.10 from here forward. For reference I cut a new branch called scala-2.9 with the commit immediately prior to the merge: https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=shortlog;h=refs/heads/scala-2.9 - Patrick On Thu, Dec 12, 2013 at 8:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, Let's move this discussion out of this thread and into the associated JIRA. I'll write up our current approach over there. https://spark-project.atlassian.net/browse/SPARK-995 - Patrick On Thu, Dec 12, 2013 at 5:56 PM, Liu, Raymond raymond@intel.com wrote: Hi Patrick So what's the plan for supporting YARN 2.2 in 0.9? As far as I can see, if you want to support both 2.2 and 2.0, then due to the protobuf version incompatibility issue you need two versions of Akka anyway. Akka 2.3-M1 looks like it has a little bit of API change; we could probably isolate the code like what we did for the YARN part of the API. I remember it was mentioned that using reflection for the different APIs is preferred. So the purpose of using reflection is to have one release binary jar support both versions of Hadoop/YARN at runtime, instead of building different binary jars at compile time? Then all code related to Hadoop would also be built in separate modules for loading on demand? This sounds to me like it involves a lot of work. And you would still need a shim layer and separate code for the different API versions, depending on different Akka versions etc. That sounds like an even stricter demand versus our current approach on master, with a dynamic class loader in addition, and the problems we are facing now would still be there? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 5:13 PM To: dev@spark.incubator.apache.org Subject: Re: Scala 2.10 Merge Also - the code is still there because of a recent merge that took in some newer changes... we'll be removing it for the final merge. On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, This won't work because AFAIK akka 2.3-M1 is not binary compatible with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need to still use the older protobuf library, so we'd need to support both. I'd also be concerned about having a reference to a non-released version of akka. Akka is the source of our hardest-to-find bugs and simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting. Of course, if you are building off of master you can maintain a fork that uses this. - Patrick On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond raymond@intel.com wrote: Hi Patrick What does that mean for dropping YARN 2.2? It seems the code is still there. You mean if built upon 2.2 it will break and won't work, right? Since the homemade Akka build on Scala 2.10 isn't there. If that's the case, can we just use Akka 2.3-M1, which runs on protobuf 2.5, as a replacement? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 4:21 PM To: dev@spark.incubator.apache.org Subject: Scala 2.10 Merge Hi Developers, In the next few days we are planning to merge Scala 2.10 support into Spark. For those that haven't been following this, Prashant Sharma has been maintaining the scala-2.10 branch of Spark for several months.
This branch is current with master and has been reviewed for merging: https://github.com/apache/incubator-spark/tree/scala-2.10 Scala 2.10 support is one of the most requested features for Spark - it will be great to get this into Spark 0.9! Please note that *Scala 2.10 is not binary compatible with Scala 2.9*. With that in mind, I wanted to give a few heads-up/requests to developers: If you are developing applications on top of Spark's master branch, those will need to migrate to Scala 2.10. You may want to download and test the current scala-2.10 branch in order to make sure you will be okay as Spark developments move forward. Of course, you can always stick with the current master commit and be fine (I'll cut a tag when we do the merge in order to delineate where the version changes). Please open new threads on the dev list to report and discuss any issues. This merge will temporarily drop support for YARN 2.2 on the master branch. This is because the workaround we used was only compiled for Scala 2.9. We are going to come up with a more robust solution to YARN 2.2 support before releasing 0.9. Going forward, we will continue to make maintenance releases on branch-0.8
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
- Successfully built via sbt/sbt assembly/assembly on Mac OS X, as well as on a dev Ubuntu EC2 box - Successfully tested via sbt/sbt test locally - Successfully built and tested using mvn package locally - I've tested my own Spark jobs (built against 0.8.0-incubating) on this RC and all works fine, as well as tested with my job server (also built against 0.8.0-incubating) - Ran a few Spark examples and the shell and PySpark shell - For my part, tested the MLlib implicit code I added, and checked docs I'm +1 On Wed, Dec 11, 2013 at 11:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (a MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: Forgot to mention: after running sbt/sbt assembly/assembly, running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, I was trying this out and followed the quick start guide, which says to do sbt/sbt assembly; like a few others I was also stuck for a few minutes on Linux. On the other hand, if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this? It will not be great for first-time users to get stuck there. On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8 Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Saturday, December 14th at 01:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Intellij IDEA build issues
Hi Spark Devs, Hoping someone can help me out. No matter what I do, I cannot get Intellij to build Spark from source. I am using IDEA 13. I run sbt gen-idea and everything seems to work fine. When I try to build using IDEA, everything compiles but I get the error below. Have any of you come across the same? ==

Internal error: (java.lang.AssertionError) java/nio/channels/FileChannel$MapMode already declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
java.lang.AssertionError: java/nio/channels/FileChannel$MapMode already declared as ch.epfl.lamp.fjbg.JInnerClassesAttribute$Entry@1b5b798b
at ch.epfl.lamp.fjbg.JInnerClassesAttribute.addEntry(JInnerClassesAttribute.java:74)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:738)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator$$anonfun$addInnerClasses$3.apply(GenJVM.scala:733)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.addInnerClasses(GenJVM.scala:733)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.emitClass(GenJVM.scala:200)
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.genClass(GenJVM.scala:355)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase$$anonfun$run$4.apply(GenJVM.scala:86)
at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:104)
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:86)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:99)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply$mcV$sp(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3$$anonfun$apply$1.apply(AggressiveCompile.scala:106)
at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:173)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:105)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1$$anonfun$apply$3.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at sbt.compiler.AggressiveCompile$$anonfun$6$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:102)
at scala.Option.foreach(Option.scala:236)
at sbt.compiler.AggressiveCompile$$anonfun$6.compileScala$1(AggressiveCompile.scala:102)
at sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:151)
at sbt.compiler.AggressiveCompile$$anonfun$6.apply(AggressiveCompile.scala:89)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:39)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:37)
at sbt.inc.Incremental$.cycle(Incremental.scala:75)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:34)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:33)
at sbt.inc.Incremental$.manageClassfiles(Incremental.scala:42)
at sbt.inc.Incremental$.compile(Incremental.scala:33)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:27)
at sbt.compiler.AggressiveCompile.compile2(AggressiveCompile.scala:164)
at sbt.compiler.AggressiveCompile.compile1(AggressiveCompile.scala:73)
at org.jetbrains.jps.incremental.scala.local.CompilerImpl.compile(CompilerImpl.scala:61)
at
PySpark - Dill serialization
Hi devs, I came across Dill (http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python serialization. I was wondering if it might be a replacement for the cloudpickle stuff (and remove that piece of code that needs to be maintained within PySpark)? Josh, have you looked into Dill? Any thoughts? N
Re: [PySpark]: reading arbitrary Hadoop InputFormats
Thanks Josh, Patrick for the feedback. Based on Josh's pointers I have something working for JavaPairRDD -> PySpark RDD[(String, String)]. This just calls the toString method on each key and value as before, but without the need for a delimiter. For SequenceFile, it uses SequenceFileAsTextInputFormat, which itself calls toString to convert keys and values to Text. We then call toString (again) ourselves to get Strings to feed to writeAsPickle. Details here: https://gist.github.com/MLnick/7230588 This also illustrates where the wrapper function API would fit in. All that is required is to define a T => String for key and value. I started playing around with MsgPack and can sort of get things to work in Scala, but am struggling with getting the raw bytes to be written properly in PythonRDD (I think it is treating them as pickled byte arrays when they are not, but when I removed the 'stripPickle' calls and amended the length (-6) I got UnpicklingError: invalid load key, ' '.). Another issue is that MsgPack does well at writing structures - like Java classes with public fields that are fairly simple - but for example the Writables have private fields, so you end up with nothing being written. This looks like it would require custom Templates (effectively serialization functions) for many classes, which means a lot of custom code for a user to write to use it. Fortunately, for most of the common Writables a toString does the job. Will keep looking into it though. Anyway, Josh, if you have ideas or examples on the wrapper API from Python that you mentioned, I'd be interested to hear them. If you think this is worth working up as a Pull Request covering SequenceFiles and custom InputFormats with default toString conversions and the ability to specify wrapper functions, I can clean things up more, add some functionality and tests, and also test to see if common things like the normal Writables and reading from things like HBase and Cassandra can be made to work nicely (any other common use cases that you think make sense?). Thoughts, comments etc welcome. Nick On Fri, Oct 25, 2013 at 11:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a starting point, a version where people just write their own wrapper functions to convert various HadoopFiles into String K, V files could go a long way. We could even have a few built-in versions, such as dealing with Sequence files that are String, String. Basically, the user needs to write a translator in Java/Scala that produces textual records from whatever format they want. Then, they make sure this is included in the classpath when running PySpark. As Josh is saying, I'm pretty sure this is already possible, but we may want to document it for users. In many organizations they might have 1-2 people who can write the Java/Scala to do this but then many more people who are comfortable using Python once it's set up. - Patrick On Fri, Oct 25, 2013 at 11:00 AM, Josh Rosen rosenvi...@gmail.com wrote: Hi Nick, I've seen several requests for SequenceFile support in PySpark, so there's definitely demand for this feature. I like the idea of passing MsgPack'ed data (or some other structured format) from Java to the Python workers. My early prototype of custom serializers (described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals#PySparkInternals-customserializers ) might be useful for implementing this.
Proper custom serializer support would handle the bookkeeping for tracking each stage's input and output formats and supplying the appropriate deserialization functions to the Python worker, so the Python worker would be able to directly read the MsgPack'd data that's sent to it. Regarding a wrapper API, it's actually possible to initially transform data using Scala/Java and perform the remainder of the processing in PySpark. This involves adding the appropriate compiled classes to the Java classpath and a bit of work in Py4J to create the Java/Scala RDD and wrap it for use by PySpark. I can hack together a rough example of this if anyone's interested, but it would need some work to be developed into a user-friendly API. If you wanted to extend your proof-of-concept to handle the cases where keys and values have parseable toString() values, I think you could remove the need for a delimiter by creating a PythonRDD from the newHadoopFile JavaPairRDD and adding a new method to writeAsPickle ( https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L224 ) to dump its contents as a pickled pair of strings. (Aside: most of writeAsPickle() would probably need to be eliminated or refactored when adding general custom serializer support.) - Josh On Thu, Oct 24, 2013 at 11:18 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi Spark Devs I was wondering what appetite there may be to add
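A rough sketch of the toString-based wrapper described in this thread, done on the Scala side with SequenceFileAsTextInputFormat as in the gist. The path is illustrative and error handling is omitted; the resulting RDD[(String, String)] is the shape PySpark would consume:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
import org.apache.spark.SparkContext

object ToStringWrapper {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "tostring-wrapper")
    // SequenceFileAsTextInputFormat converts every key/value to Text via
    // toString, as described above; we then map to plain Strings so the
    // pairs can be handed across to the Python side.
    val pairs = sc.hadoopFile(
      "/tmp/data.seq", // illustrative path
      classOf[SequenceFileAsTextInputFormat],
      classOf[Text],
      classOf[Text]
    ).map { case (k, v) => (k.toString, v.toString) }
    pairs.take(5).foreach(println)
    sc.stop()
  }
}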
Re: MLI dependency exception
Is MLI available? Where is the repo located? — Sent from Mailbox for iPhone On Tue, Sep 10, 2013 at 10:45 PM, Gowtham N gowtham.n.m...@gmail.com wrote: It worked. I was using an old master for Spark, which I forked many days ago. On Tue, Sep 10, 2013 at 1:25 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Some more notes on how to debug this: after you do publish-local in Spark, you should have a file in ~/.ivy2 that you can check for using `ls ~/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar`. Or `sbt/sbt publish-local` also prints something like this on the console: [info] published spark-core_2.9.3 to /home/shivaram/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar After that, MLI's build should be able to pick this jar up. Thanks Shivaram On Tue, Sep 10, 2013 at 1:14 PM, Gowtham N gowtham.n.m...@gmail.com wrote: I did it as publish-local. I forked mesos/spark to gowthamnatarajan/spark, and I am using that. I forked a few days ago, but did an upstream update today. For safety, I will directly clone from mesos now. On Tue, Sep 10, 2013 at 1:10 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Did you check out Spark from the master branch of github.com/mesos/spark? The package names changed recently, so you might need to pull. Also just checking that you did publish-local in Spark (not public-local as specified in the email)? Thanks Shivaram On Tue, Sep 10, 2013 at 1:01 PM, Gowtham N gowtham.n.m...@gmail.com wrote: Still getting the same error. I have the Spark and MLI folders within a folder called git. I did clean, package and public-local for Spark. Then for MLI I did clean, and then package. I am still getting the error:

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES          ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
[warn] :: org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[error] {file:/Users/gowthamn/git/MLI/}default-0b9403/*:update: sbt.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found
[error] unresolved dependency: org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found

Should I modify the contents of build.sbt? Currently it's:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
  "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test"
)

resolvers ++= Seq(
  "Typesafe" at "http://repo.typesafe.com/typesafe/releases",
  "Scala Tools Snapshots" at "http://scala-tools.org/repo-snapshots/",
  "ScalaNLP Maven2" at "http://repo.scalanlp.org/repo",
  "Spray" at "http://repo.spray.cc"
)

On Tue, Sep 10, 2013 at 11:58 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Hi Gowtham, You'll need to do sbt/sbt publish-local in the spark directory before trying to build MLI. - Evan On Tue, Sep 10, 2013 at 11:37 AM, Gowtham N gowtham.n.m...@gmail.com wrote: I cloned MLI, but am unable to compile it. I get the following dependency exception: org.apache.spark#spark-core_2.9.3;0.8.0-SNAPSHOT: not found org.apache.spark#spark-mllib_2.9.3;0.8.0-SNAPSHOT: not found Why am I getting this error? I did not change anything in build.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
  "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-SNAPSHOT",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test"
)

-- Gowtham Natarajan
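Condensing the fix this thread converges on: MLI's build resolves the 0.8.0-SNAPSHOT Spark artifacts from the local Ivy repository, so Spark must be published there first. Assuming the Spark and MLI checkouts sit side by side as described above, the sequence is:

# in the Spark checkout (master of github.com/mesos/spark)
sbt/sbt clean publish-local
# verify the artifact landed in the local Ivy repo
ls ~/.ivy2/local/org.apache.spark/spark-core_2.9.3/0.8.0-SNAPSHOT/jars/spark-core_2.9.3.jar
# then in the MLI checkout
sbt/sbt clean package

Note it is publish-local, not public-local; no change to MLI's build.sbt is needed.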
Re: Adding support for implicit feedback to ALS
In (3), are you saying that some cross-validation support for picking the best lambda and alpha should be in there? Or that the preference weightings of different event types should also be learnt? (Maybe both.) I agree that there should be support for this, by optimising for the best RMSE, MAP or whatever; I'm just not sure whether this functionality should live in MLlib or MLI. Until MLI is released it's sort of hard to know. For (4), my frame of reference has been Mahout and my own port of Mahout's ALS to Spark, and against those this blocked approach is far superior, though I'm sure more efficiencies can be gained in this approach and in alternatives. It would certainly be great to further improve the approach as you mention in (5). I'm not sure precisely what you mean by task reformulation - how would you propose to do so? Nick — Sent from Mailbox for iPhone On Mon, Sep 9, 2013 at 8:28 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Sorry, not directly aimed at the PR but at the implementation as a whole. See if the following is useful from my experience:

1. Implicit feedback is just a corner case of a more general problem: given a preference matrix P with P_ij in {0, 1}, a weight (confidence) matrix C with C_ij in R, and a regularization rate lambda, compute the L2-regularized ALS fit.

2. Since the default confidence is never zero (in the paper it is assumed to be 1; I will denote this quantity c_0), we have C = C_0 + C' where (C_0)_ij = c_0. Hence, rewrite the input in terms of (P, C', c_0), since C' becomes a severely sparse matrix in real life.

3. It is nice when the input C is known. But there are a lot of cases where individual confidence is derived from a finite set of hyperparameters corresponding to particular event types (search, click, transaction etc.). Hence, convex optimization over a small set of hyperparameters is desired (this might be outside the scope of ALS itself, but weighting and lambda per se aren't). Still, cross-validation largely relies on taking held-out data that follows the existing entries in C', so cross-validation helpers would be naturally coupled with this method and should be provided.

4. I actually used Pregel to avoid the shuffle-and-sort programming model. Matrix operations do not require the guarantees produced by reducers, only a full-group guarantee. I did not benchmark this approach on really substantial datasets though; there are known Bagel limitations, IMO, which may create problems for sufficiently large/skewed datasets. I guess I am interested in a GraphX release to replace the reliance on Bagel.

5. If the task reformulation is accepted, there are further optimizations that could be applied to blocking - but this implementation gets the gist of what I did in that regard.

On Sun, Sep 8, 2013 at 10:58 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi I know everyone's pretty busy with getting 0.8.0 out, but as and when folks have time it would be great to get your feedback on this PR adding support for the 'implicit feedback' model variant to ALS: https://github.com/apache/incubator-spark/pull/4 In particular any potential efficiency improvements, issues, and testing it out locally and on a cluster and on some datasets! Comments and feedback welcome. Many thanks Nick
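For readers following Dmitriy's points (1)-(2): the objective being discussed is the weighted-regularized least-squares problem from the implicit-feedback ALS paper (Hu, Koren & Volinsky, 2008). In LaTeX, with notation matching the thread (this transcription is ours, not text from the thread):

\min_{X,Y} \sum_{i,j} C_{ij} \bigl( P_{ij} - x_i^\top y_j \bigr)^2
  + \lambda \Bigl( \sum_i \lVert x_i \rVert^2 + \sum_j \lVert y_j \rVert^2 \Bigr),
\qquad C = C_0 + C', \quad (C_0)_{ij} = c_0

Since the dense part of C is the constant c_0, each ALS sweep only needs the sparse C' plus a rank-style correction for c_0, rather than the dense C itself. That is the reformulation Dmitriy suggests exploiting.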
Adding support for implicit feedback to ALS
Hi I know everyone's pretty busy with getting 0.8.0 out, but as and when folks have time it would be great to get your feedback on this PR adding support for the 'implicit feedback' model variant to ALS: https://github.com/apache/incubator-spark/pull/4 In particular: any potential efficiency improvements, issues, and testing it out locally, on a cluster, and on some datasets! Comments and feedback welcome. Many thanks Nick
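As a usage illustration for the feature under review (a sketch, not code from the PR): the API shape this work eventually took in MLlib exposes an ALS.trainImplicit method, where the "rating" is an interaction count or weight and alpha scales it into a confidence. The file path and hyperparameter values below are made up for the example:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ImplicitALSExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "ImplicitALSExample")

    // Each line: userId,itemId,interactionCount (implicit feedback,
    // e.g. view or click counts, not explicit ratings).
    val interactions = sc.textFile("/path/to/interactions.csv").map { line =>
      val Array(user, item, count) = line.split(',')
      Rating(user.toInt, item.toInt, count.toDouble)
    }

    // rank = number of latent factors; alpha scales the confidence,
    // c = 1 + alpha * r, as in the implicit-feedback formulation.
    val model = ALS.trainImplicit(interactions, 10, 10, 0.01, 40.0)

    // Score a (user, item) pair; higher = stronger predicted preference.
    println(model.predict(1, 42))

    sc.stop()
  }
}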