Re: Is uberjar a recommended way of running Spark/Scala applications?
You might want to look at another great plugin: sbt-pack (https://github.com/xerial/sbt-pack). It collects all the dependency JARs and creates launch scripts for *nix (including Mac OS) and Windows.

HTH,
Pierre

On 02 Jun 2014, at 17:29, Andrei <faithlessfri...@gmail.com> wrote:

Thanks! This is even closer to what I am looking for. I'm on a trip now, so I'm going to give it a try when I come back.

On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao <ngocdaoth...@gmail.com> wrote:

Alternative solution: https://github.com/xitrum-framework/xitrum-package. It collects all the dependency .jar files of your Scala program into a directory. It doesn't merge the .jar files together; they are left as is.

On Sat, May 31, 2014 at 3:42 AM, Andrei <faithlessfri...@gmail.com> wrote:

Thanks, Stephen. I have eventually decided to go with assembly, but to leave out the Spark and Hadoop jars and instead use `spark-submit` to provide these dependencies automatically. This way no resource conflicts arise and mergeStrategy needs no modification. To record this stable setup and share it with the community, I've created a project [1] with a minimal working config. It is an SBT project with the assembly plugin, Spark 1.0 and Cloudera's Hadoop client. Hope it helps somebody get a Spark setup working quicker. Though I'm fine with this setup for final builds, I'm still looking for a more interactive dev setup - something that doesn't require a full rebuild.

[1]: https://github.com/faithlessfriend/sample-spark-project

Thanks and have a good weekend,
Andrei

On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch <java...@gmail.com> wrote:

The MergeStrategy combined with sbt assembly did work for me. This is not painless: expect some trial and error, and the assembly may take multiple minutes. You will likely want to filter out some additional classes from the generated jar file. Here is an SOF answer explaining that, with (IMHO) the best answer snippet included here (in this case the OP understandably did not want to include javax.servlet.Servlet):

http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar

```scala
mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
  ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
}
```

There is also a setting to not include the project files in the assembly, but I do not recall it at this moment.

2014-05-29 10:13 GMT-07:00 Andrei <faithlessfri...@gmail.com>:

Thanks, Jordi, your gist looks pretty much like what I have in my project currently (with a few exceptions that I'm going to borrow). I like the idea of using sbt package, since it doesn't require third-party plugins and, most importantly, doesn't create a mess of classes and resources. But in this case I'll have to handle the jar list manually via the Spark context. Is there a way to automate this process? E.g. when I was a Clojure guy, I could run lein deps (lein is a build tool similar to sbt) to download all dependencies and then just enumerate them from my app. Maybe you have heard of something like that for Spark/SBT?

Thanks,
Andrei

On Thu, May 29, 2014 at 3:48 PM, jaranda <jordi.ara...@bsc.es> wrote:

Hi Andrei, I think the preferred way to deploy Spark jobs is by using the sbt package task instead of the sbt assembly plugin. In any case, as you comment, the mergeStrategy in combination with some dependency exclusions should fix your problems. Have a look at this gist (https://gist.github.com/JordiAranda/bdbad58d128c14277a05) for further details (I just followed some recommendations from the sbt assembly plugin documentation).
Up to now I haven't found a proper way to combine my development/deployment phases, although I must say my experience with Spark is still pretty limited (it really depends on your deployment requirements as well). In this case, I think someone else could give you further insights.

Best,
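As a rough illustration of the setup Andrei describes (not copied from the linked project), a build.sbt along these lines marks Spark and the Hadoop client as "provided" so that sbt-assembly leaves them out of the uber jar and spark-submit supplies them at runtime; the versions, the Cloudera artifact coordinates and the resolver URL below are assumptions:

```scala
// Illustrative build.sbt sketch; versions and coordinates are assumptions, not taken
// from the sample project. The sbt-assembly plugin itself goes in project/plugins.sbt.
name := "sample-spark-app"

scalaVersion := "2.10.4"

// Assumed Cloudera repository for the CDH-flavoured hadoop-client artifact.
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  // "provided": kept out of the assembly jar, supplied by spark-submit / the cluster at runtime
  "org.apache.spark"  %% "spark-core"    % "1.0.0"          % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.3.0-cdh5.0.0" % "provided"
)
```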
Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
+1, same question here...

On 2 June 2014 at 10:08, Kexin Xie <kexin@bigcommerce.com> wrote:

Hi, Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw org.apache.hadoop.mapred.FileAlreadyExistsException when the file already exists. Is there a way I can allow Spark to overwrite the existing file?

Cheers,
Kexin
Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
Indeed, the behaviour has changed, for good or for bad. I mean, I agree with the danger you mention, but I'm not sure it's happening like that. Isn't there a mechanism for overwrite in Hadoop that automatically removes the part files, writes to a _temporary folder, and only then writes the part files along with the _SUCCESS marker? In any case, this change of behaviour should be documented, IMO.

Cheers,
Pierre

On 2 June 2014 at 17:42, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is that files get overwritten automatically. There is one danger to this, though: if I save to a directory that already has 20 part- files, but this time around I'm only saving 15 part- files, then there will be 5 leftover part- files from the previous set mixed in with the 15 newer files. This is potentially dangerous. I haven't checked whether this behavior has changed in 1.0.0. Are you saying it has, Pierre?

On Mon, Jun 2, 2014 at 9:41 AM, Pierre B <pierre.borckm...@realimpactanalytics.com> wrote:

Hi Michaël, thanks for this. We could indeed do that. But I guess the question is more about the change of behaviour from 0.9.1 to 1.0.0. We never had to care about that in previous versions. Does that mean we have to manually remove existing files, or is there a way to automatically overwrite when using saveAsTextFile?
Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
I'm a bit confused, because the PR mentioned by Patrick seems to address all these issues: https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

Was it not accepted? Or is the description of this PR not completely implemented?

On 2 June 2014 at 23:08, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

OK, thanks for confirming. Is there something we can do about that leftover part- files problem in Spark, or is that for the Hadoop team?

On Monday, June 2, 2014, Aaron Davidson <ilike...@gmail.com> wrote:

Yes.

On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part- files may be left over from previous saves, which is dangerous.

Is this correct?

On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson <ilike...@gmail.com> wrote:

+1, please re-add this feature.

On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:

Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix.

On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

Hi Patrick, I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the same thing? How about assigning it to me? I think I missed the configuration part in my previous commit, though I declared that in the PR description.

Best,
-- Nan Zhu

On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:

Hey there, the issue was that the old behavior could cause users to silently overwrite data, which is pretty bad, so to be conservative we decided to enforce the same checks that Hadoop does. This was documented by this JIRA:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

However, it would be very easy to add an option that allows preserving the old behavior. Is anyone here interested in contributing that? I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-1993

- Patrick
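Until such an option lands (see SPARK-1993 above), one way to reproduce the old clobbering behaviour by hand is to delete the output directory through the Hadoop FileSystem API before saving. The helper below is only an illustrative sketch, not code from Spark or from this thread:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: remove an existing output directory before writing, emulating the
// pre-1.0 overwrite behaviour. `sc` is an existing SparkContext and `outputPath`
// is the directory about to be overwritten (both assumed for illustration).
def saveOverwriting(sc: SparkContext, rdd: RDD[String], outputPath: String): Unit = {
  val out = new Path(outputPath)
  val fs = out.getFileSystem(sc.hadoopConfiguration)
  if (fs.exists(out)) {
    fs.delete(out, true) // recursive delete, so no stale part- files are left behind
  }
  rdd.saveAsTextFile(outputPath)
}
```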
Re: Using sbt-pack with Spark 1.0.0
You're right, Patrick! I just had a chat with the sbt-pack creator, and indeed dependencies with classifiers are ignored to avoid problems with a dirty cache... This should be fixed in the next version of the plugin.

Cheers,
Pierre

On 1 June 2014 at 20:04, Patrick Wendell <pwend...@gmail.com> wrote:

https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350

On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell <pwend...@gmail.com> wrote:

One potential issue here is that mesos is now using classifiers to publish their jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that is using classifiers, which is why I mention it.

On Sun, Jun 1, 2014 at 2:34 AM, Pierre B <pierre.borckm...@realimpactanalytics.com> wrote:

Hi all! We've been using the sbt-pack sbt plugin (https://github.com/xerial/sbt-pack) to build our standalone Spark application for a while now. Until version 1.0.0, that worked nicely. For those who don't know the sbt-pack plugin: it basically copies all the dependency JARs from your local ivy/maven cache into your target folder (in target/pack/lib) and creates launch scripts (in target/pack/bin) for your application (notably putting all these jars on the classpath).

Now, since Spark 1.0.0 was released, we are encountering a weird error where running our project with sbt run is fine, but running our app with the launch scripts generated by sbt-pack fails. After a (quite painful) investigation, it turns out some JARs are NOT copied from the local ivy2 cache to the lib folder. I noticed that all the missing jars contain "shaded" in their file name (but not all jars with such a name are missing). One of the missing JARs comes explicitly from the Spark build definition (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``. This file is clearly present in my local ivy cache, but is not copied by sbt-pack. Is there an obvious reason for that? I don't know much about the shading mechanism; maybe I'm missing something here? Any help would be appreciated!

Cheers,
Pierre
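For reference, this is roughly what a classifier-based dependency looks like in sbt; the coordinates below are an assumption inferred from the jar name quoted above, and the authoritative definition is the SparkBuild.scala line Patrick linked:

```scala
// Assumed coordinates, inferred from "mesos-0.18.1-shaded-protobuf.jar".
// The "classifier" call is what distinguishes this artifact from the plain mesos jar,
// and is the kind of dependency sbt-pack was silently skipping.
libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"
```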
Re: Use SparkListener to get overall progress of an action
That would be great, Mayur, thanks! Anyhow, to be more specific, my question really was the following: is there any way to link events in the SparkListener to an action triggered in your code?

Cheers,
Pierre Borckmans
Software team
RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
FR +32 485 91 87 31 | Skype pierre.borckmans

On 23 May 2014, at 10:17, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

We have an internal patched version of the Spark web UI which exports application-related data as JSON. We use monitoring systems, as well as an alternate UI, on top of that JSON data for our specific application. Found it much cleaner. I can provide a 0.9.1 version and will submit it as a pull request soon.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Fri, May 23, 2014 at 10:57 AM, Chester <chesterxgc...@yahoo.com> wrote:

This is something we are interested in as well. We are planning to investigate this further. If someone has suggestions, we would love to hear them.

Chester
Sent from my iPad

On May 22, 2014, at 8:02 AM, Pierre B <pierre.borckm...@realimpactanalytics.com> wrote:

Hi Andy! Yes, the Spark UI provides a lot of interesting information for debugging purposes. Here I'm trying to integrate simple progress monitoring into my app's UI. I'm typically running a few "jobs" (or rather actions), and I'd like to be able to display the progress of each of those in my UI. I don't really see how I could do that using SparkListener for the moment… Thanks for your help!

Cheers!
Pierre Borckmans

On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] wrote:

SparkListener offers good stuff. But I also complemented it with metrics machinery of my own that uses Akka to aggregate metrics from anywhere I'd like to collect them (with no dependency on Ganglia, only on Codahale). However, this was useful for gathering custom metrics (from within the tasks, then), not really for collecting overall monitoring information about the Spark internals themselves. For that, the Spark UI already offers pretty good insight, no?

Cheers,
aℕdy ℙetrella
about.me/noootsab

On Thu, May 22, 2014 at 4:51 PM, Pierre B <[hidden email]> wrote:

Is there a simple way to monitor the overall progress of an action using SparkListener or anything else? I see that one can name an RDD... Could that be used to determine which action triggered a stage...?

Thanks
Pierre
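For what it's worth, here is a minimal sketch (assuming the Spark 1.0-era listener API; exact field names may differ between versions) of counting finished tasks per stage with a SparkListener, and of using SparkContext.setJobGroup to tie the resulting events back to a specific action in application code:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Sketch only: counts finished tasks per stage so an application UI can show
// coarse progress. Linking events to a specific action is approximated by tagging
// that action's jobs with SparkContext.setJobGroup before triggering it.
class ProgressListener extends SparkListener {
  private val tasksDone = scala.collection.mutable.Map[Int, Int]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val stage = taskEnd.stageId
    tasksDone(stage) = tasksDone.getOrElse(stage, 0) + 1
    println(s"stage $stage: ${tasksDone(stage)} tasks finished")
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"stage ${stageCompleted.stageInfo.stageId} (${stageCompleted.stageInfo.name}) completed")
}

// Usage sketch, assuming an existing SparkContext `sc` and RDD `myRdd`:
//   sc.addSparkListener(new ProgressListener())
//   sc.setJobGroup("my-count-action", "counting the input RDD")
//   val n = myRdd.count()  // listener events fired by this action belong to the group above
```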
Re: Spark-ec2 asks for password
We’ve been experiencing this as well, and our simple solution is to actually keep retrying the ssh connection instead of just waiting. Something like this:

```python
import subprocess
import time

# `u` and `s` are assumed aliases for the helper modules in our modified copy of
# the ec2 scripts (providing message() and ssh_command()).
def wait_for_ssh_connection(opts, host):
    u.message("Waiting for ssh connection to host {}".format(host))
    connected = False
    while not connected:
        try:
            # 'ls' is just a cheap remote command to prove the connection works
            if subprocess.check_call(s.ssh_command(opts) +
                                     ['-t', '-t', '%s@%s' % (opts.user, host), 'ls']) == 0:
                connected = True
        except subprocess.CalledProcessError:
            print "Ssh connection to host {} failed, retrying in 10 seconds".format(host)
            time.sleep(10)
    print "Ssh connection to host {} successfully established!".format(host)
```

HTH,
Pierre Borckmans
RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
FR +32 485 91 87 31 | Skype pierre.borckmans

On 19 Apr 2014, at 06:51, Patrick Wendell <pwend...@gmail.com> wrote:

Unfortunately, I think a lot of this is due to generally increased latency on EC2 itself. I've noticed that it's way more common than it used to be for instances to come online past the wait timeout in the ec2 script.

On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT <fnoth...@berkeley.edu> wrote:

Aureliano, I've been noticing this error recently as well:

ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: Connection refused
Error 255 while executing remote command, retrying after 30 seconds

However, this isn't an issue with the spark-ec2 scripts. After the scripts fail, if you wait a bit longer (e.g., another 2 minutes), the EC2 hosts will finish launching and port 22 will open up. Until the EC2 host has launched and opened port 22 for SSH, SSH cannot succeed, and the spark-ec2 scripts will fail. I've noticed that EC2 machine launch latency seems to be highest in Oregon; I haven't run into this problem on either the California or Virginia EC2 farms. To work around this issue, I've manually modified my copy of the EC2 scripts to wait for 6 failures (i.e., 3 minutes), which seems to work OK. Might be worth a try on your end. I can't comment on the password request; I haven't seen that on my end.

Regards,
Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

Hi, since 0.9.0, spark-ec2 has become unstable. During launch it throws many errors like:

ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: Connection refused
Error 255 while executing remote command, retrying after 30 seconds

... and recently, it prompts for passwords!:

Warning: Permanently added '' (RSA) to the list of known hosts.
Password:

Note that the hostname in "Permanently added ''" is missing in the log, which is probably why it asks for a password. Is this a known bug?
Re: programmatic way to tell Spark version
I see that this was fixed using a fixed string in SparkContext.scala. Wouldn't it be better to use something like getClass.getPackage.getImplementationVersion to get the version from the jar manifest (and thus from the sbt definition)? The same holds for SparkILoopInit.scala in the welcome message (printWelcome). This would avoid having to modify these strings at each release.

Cheers,
Pierre Borckmans
RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
FR +32 485 91 87 31 | Skype pierre.borckmans

On 10 Apr 2014, at 23:05, Patrick Wendell <pwend...@gmail.com> wrote:

I think this was solved in a recent merge: https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779

Is that what you are looking for? If so, mind marking the JIRA as resolved?

On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

Hey Patrick, I've created SPARK-1458 to track this request, in case the team/community wants to implement it in the future.

Nick

On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

No use case at the moment. What prompted the question: I was going to ask a different question on this list and wanted to note my version of Spark. I assumed there would be a getVersion method on SparkContext or something like that, but I couldn't find one in the docs. I also couldn't find an environment variable with the version. After futzing around a bit I realized it was printed out (quite conspicuously) in the shell startup banner.

On Sat, Feb 22, 2014 at 7:15 PM, Patrick Wendell <pwend...@gmail.com> wrote:

AFAIK we don't have any way to do this right now. Maybe we could add a getVersion method to SparkContext that would tell you. Just wondering - what is the use case here?

- Patrick

On Sat, Feb 22, 2014 at 4:04 PM, nicholas.chammas <nicholas.cham...@gmail.com> wrote:

Is there a programmatic way to tell what version of Spark I'm running? I know I can look at the banner when the Spark shell starts up, but I'm curious to know if there's another way.

Nick
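In code, the two approaches discussed here look roughly like this (sc.version being the field added by the linked merge; the manifest lookup may return null if the jar was built without an Implementation-Version entry):

```scala
// Assumes an existing SparkContext `sc` (e.g. in the spark-shell).
val fromContext  = sc.version  // version string exposed on SparkContext as of Spark 1.0
val fromManifest = classOf[org.apache.spark.SparkContext].getPackage.getImplementationVersion
println(s"Spark version: $fromContext (manifest: $fromManifest)")
```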
Changing number of workers for benchmarking purposes
Hi there! I was performing some tests for benchmarking purposes, among other things to observe how performance evolves with the number of workers. In that context, I was wondering whether there is any easy way to choose the number of workers to be used in standalone mode, without having to change the "slaves" file, dispatch it, and restart the cluster?

Cheers,
Pierre
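One common approximation (a sketch, not an answer given in this thread): rather than changing how many workers are running, cap the total number of cores the benchmark application may use on the standalone cluster via spark.cores.max and vary that cap between runs. The master URL and core count below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: limit the cores this application may grab from the standalone cluster.
// Varying spark.cores.max between runs roughly emulates a smaller cluster without
// touching the slaves file or restarting anything.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // placeholder master URL
  .setAppName("scaling-benchmark")
  .set("spark.cores.max", "8")            // placeholder: vary per benchmark run
val sc = new SparkContext(conf)
```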