Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-03 Thread Pierre Borckmans
You might want to look at another great plugin: “sbt-pack” 
https://github.com/xerial/sbt-pack.

It collects all the dependency JARs and creates launch scripts for *nix 
(including Mac OS) and Windows.
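
For reference, wiring it in looks roughly like this (a sketch only; the plugin
version is illustrative and the exact settings syntax has changed across
sbt-pack releases):

// project/plugins.sbt -- version number is illustrative
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")

// build.sbt -- older sbt-pack releases use packSettings + packMain,
// newer ones use enablePlugins(PackPlugin) instead
packSettings

packMain := Map("my-app" -> "com.example.Main")  // hypothetical script name -> main class

Running `sbt pack` then leaves the dependency jars in target/pack/lib and the
launch scripts in target/pack/bin.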

HTH

Pierre


On 02 Jun 2014, at 17:29, Andrei faithlessfri...@gmail.com wrote:

 Thanks! This is even closer to what I am looking for. I'm on a trip now, so 
 I'm going to give it a try when I come back. 
 
 
 On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao ngocdaoth...@gmail.com wrote:
 Alternative solution:
 https://github.com/xitrum-framework/xitrum-package
 
 It collects all dependency .jar files in your Scala program into a
 directory. It doesn't merge the .jar files together; the .jar files
 are left as is.
 
 
 On Sat, May 31, 2014 at 3:42 AM, Andrei faithlessfri...@gmail.com wrote:
  Thanks, Stephen. I have eventually decided to go with assembly, but to leave
  the Spark and Hadoop jars out of it and instead let `spark-submit` provide these
  dependencies automatically. This way no resource conflicts arise and
  mergeStrategy needs no modification. To capture this stable setup and also
  share it with the community I've put together a project [1] with a minimal
  working config. It is an SBT project with the assembly plugin, Spark 1.0 and
  Cloudera's Hadoop client. I hope it will help somebody get a Spark setup running quicker.
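
  For illustration, the core of that setup boils down to marking Spark and Hadoop
  as provided in build.sbt (a sketch; the versions and the Cloudera artifact shown
  here are illustrative, see the linked project for the actual config):

  libraryDependencies ++= Seq(
    // supplied by spark-submit at runtime, so left out of the assembled jar
    "org.apache.spark"  %% "spark-core"    % "1.0.0"          % "provided",
    "org.apache.hadoop" %  "hadoop-client" % "2.3.0-cdh5.0.0" % "provided"
  )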
 
  Though I'm fine with this setup for final builds, I'm still looking for a
  more interactive dev setup - something that doesn't require a full rebuild.
 
  [1]: https://github.com/faithlessfriend/sample-spark-project
 
  Thanks and have a good weekend,
  Andrei
 
  On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  The MergeStrategy combined with sbt assembly did work for me.  This is not
  painless: some trial and error and the assembly may take multiple minutes.
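
  A typical merge strategy of that era looks something like this (just a sketch;
  the key names differ between sbt-assembly releases, and newer ones use
  assemblyMergeStrategy instead):

  // build.sbt (assumes the sbt-assembly plugin and its settings are already in place)
  mergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop conflicting metadata
    case "reference.conf"              => MergeStrategy.concat   // merge Typesafe config defaults
    case _                             => MergeStrategy.first
  }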
 
  You will likely want to filter out some additional classes from the
  generated jar file. Here is a Stack Overflow answer explaining that, with (IMHO)
  the best answer snippet included here (in this case the OP understandably did
  not want to include javax.servlet.Servlet):
 
  http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar
 
 
  mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
    ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
  }
 
  There is a setting to not include the project files in the assembly but I
  do not recall it at this moment.
 
 
 
  2014-05-29 10:13 GMT-07:00 Andrei faithlessfri...@gmail.com:
 
  Thanks, Jordi, your gist looks pretty much like what I have in my project
  currently (with a few exceptions that I'm going to borrow).
 
  I like the idea of using sbt package, since it doesn't require third
  party plugins and, most importantly, doesn't create a mess of classes and
  resources. But in this case I'll have to handle the jar list manually via the
  Spark context. Is there a way to automate this process? E.g. when I was a
  Clojure guy, I could run lein deps (lein is a build tool similar to sbt) to
  download all dependencies and then just enumerate them from my app. Maybe
  you have heard of something like that for Spark/SBT?
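
  A sketch of one way to automate that with plain sbt package, assuming
  retrieveManaged := true is set in build.sbt so that sbt copies the dependency
  jars under lib_managed/ (the app name and jar path below are illustrative):

  import java.io.File
  import org.apache.spark.{SparkConf, SparkContext}

  // recursively collect the jars sbt placed under lib_managed/
  def listJars(dir: File): Seq[String] =
    Option(dir.listFiles).toSeq.flatten.flatMap { f =>
      if (f.isDirectory) listJars(f)
      else if (f.getName.endsWith(".jar")) Seq(f.getAbsolutePath)
      else Seq.empty
    }

  val conf = new SparkConf()
    .setAppName("my-app")  // hypothetical
    .setJars(listJars(new File("lib_managed")) :+ "target/scala-2.10/my-app_2.10-0.1.jar")
  val sc = new SparkContext(conf)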
 
  Thanks,
  Andrei
 
 
  On Thu, May 29, 2014 at 3:48 PM, jaranda jordi.ara...@bsc.es wrote:
 
  Hi Andrei,
 
  I think the preferred way to deploy Spark jobs is by using the sbt package
  task instead of using the sbt assembly plugin. In any case, as you comment,
  the mergeStrategy in combination with some dependency exclusions should fix
  your problems. Have a look at this gist
  https://gist.github.com/JordiAranda/bdbad58d128c14277a05 for further
  details (I just followed some recommendations from the sbt assembly
  plugin documentation).
 
  Up to now I haven't found a proper way to combine my development/deployment
  phases, although I must say my experience with Spark is pretty limited (it
  really depends on your deployment requirements as well). On that point, I think
  someone else could give you some further insights.
 
  Best,
 
 
 
 
 
 
 
 



Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
+1 Same question here...

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 10:08, Kexin Xie kexin@bigcommerce.com wrote:
 
 Hi,
 
  Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw 
  org.apache.hadoop.mapred.FileAlreadyExistsException when the file already exists. 
 
 Is there a way I can allow Spark to overwrite the existing file?
 
 Cheers,
 Kexin
 


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
Indeed, the behavior has changed, for good or for bad. I mean, I agree with the 
danger you mention, but I'm not sure it happens like that. Isn't there an 
overwrite mechanism in Hadoop that automatically removes the old part files, 
writes to a _temporary folder, and only then moves the part files into place 
along with the _SUCCESS marker? 

In any case this change of behavior should be documented IMO.

Cheers 
Pierre

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  What I’ve found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is 
  that files get overwritten automatically. There is one danger to this, though. 
  If I save to a directory that already has 20 part- files, but this time 
  around I’m only saving 15 part- files, then there will be 5 leftover part- 
  files from the previous set mixed in with the 15 newer files. This is 
  potentially dangerous.
 
 I haven’t checked to see if this behavior has changed in 1.0.0. Are you 
 saying it has, Pierre?
 
 On Mon, Jun 2, 2014 at 9:41 AM, Pierre B 
 [pierre.borckm...@realimpactanalytics.com](mailto:pierre.borckm...@realimpactanalytics.com)
  wrote:
 
 Hi Michaël,
 
 Thanks for this. We could indeed do that.
 
 But I guess the question is more about the change of behaviour from 0.9.1 to
 1.0.0.
 We never had to care about that in previous versions.
 
  Does that mean we have to manually remove existing files, or is there a way
  to automatically overwrite when using saveAsTextFile?
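
  A minimal sketch of the manual-removal workaround, assuming an existing
  SparkContext sc and an RDD rdd (the output path is hypothetical):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val output = new Path("hdfs:///tmp/my-output")  // hypothetical path
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(output)) {
    fs.delete(output, true)  // recursively delete the old part- files
  }
  rdd.saveAsTextFile(output.toString)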
 
 
 


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
I'm a bit confused because the PR mentioned by Patrick seems to address all 
these issues:
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

Was it not accepted? Or is the description of this PR not completely 
implemented?

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 23:08, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
 OK, thanks for confirming. Is there something we can do about that leftover 
 part- files problem in Spark, or is that for the Hadoop team?
 
 
  On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote:
 Yes.
 
 
 On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
  So in summary:
  - As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
  - There is an open JIRA issue to add an option to allow clobbering.
  - Even when clobbering, part- files may be left over from previous saves, 
    which is dangerous.
  Is this correct?
 
 
 On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson ilike...@gmail.com wrote:
 +1 please re-add this feature
 
 
 On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
 I accidentally assigned myself way back when I created it). This
 should be an easy fix.
 
 On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
  Hi, Patrick,
 
  I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about
  the same thing?
 
  How about assigning it to me?
 
  I think I missed the configuration part in my previous commit, though I
  declared that in the PR description
 
  Best,
 
  --
  Nan Zhu
 
  On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
 
  Hey There,
 
  The issue was that the old behavior could cause users to silently
  overwrite data, which is pretty bad, so to be conservative we decided
  to enforce the same checks that Hadoop does.
 
  This was documented by this JIRA:
  https://issues.apache.org/jira/browse/SPARK-1100
  https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
 
  However, it would be very easy to add an option that allows preserving
  the old behavior. Is anyone here interested in contributing that? I
  created a JIRA for it:
 
  https://issues.apache.org/jira/browse/SPARK-1993
 
  - Patrick
 
  On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
  pierre.borckm...@realimpactanalytics.com wrote:
 
   Indeed, the behavior has changed, for good or for bad. I mean, I agree with
   the danger you mention, but I'm not sure it happens like that. Isn't there
   an overwrite mechanism in Hadoop that automatically removes the old part
   files, writes to a _temporary folder, and only then moves the part files
   into place along with the _SUCCESS marker?
 
  In any case this change of behavior should be documented IMO.
 
  Cheers
  Pierre
 
  Message sent from a mobile device - excuse typos and abbreviations
 
   On 2 Jun 2014, at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
   What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is
   that files get overwritten automatically. There is one danger to this, though.
   If I save to a directory that already has 20 part- files, but this time
   around I'm only saving 15 part- files, then there will be 5 leftover part-
   files from the previous set mixed in with the 15 newer files. This is
   potentially dangerous.
 
  I haven't checked to see if this behavior has changed in 1.0.0. Are you


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre Borckmans
You're right Patrick! 

Just had a chat with the sbt-pack creator, and indeed dependencies with classifiers 
are ignored to avoid problems with a dirty cache...

Should be fixed in next version of the plugin.

Cheers

Pierre 

Message sent from a mobile device - excuse typos and abbreviations 

 On 1 Jun 2014, at 20:04, Patrick Wendell pwend...@gmail.com wrote:
 
 https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
 
 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote:
  One potential issue here is that mesos is using classifiers now to
  publish their jars. It might be that sbt-pack has trouble with
 dependencies that are published using classifiers. I'm pretty sure
 mesos is the only dependency in Spark that is using classifiers, so
 that's why I mention it.
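
  For reference, a dependency published with a classifier is declared in sbt
  roughly like this (a sketch mirroring the mesos line Patrick links to in
  SparkBuild.scala):

  libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"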
 
 On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com wrote:
 Hi all!
 
  We've been using the sbt-pack sbt plugin
  (https://github.com/xerial/sbt-pack) for building our standalone Spark
  application for a while now. Until version 1.0.0, that worked nicely.
  
  For those who don't know the sbt-pack plugin, it basically copies all the
  dependency JARs from your local ivy/maven cache to your target folder
  (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
  your application (notably setting all these jars on the classpath).
 
  Now, since Spark 1.0.0 was released, we are encountering a weird error where
  running our project with `sbt run` works fine, but running our app with the
  launch scripts generated by sbt-pack fails.
 
  After a (quite painful) investigation, it turns out some JARs are NOT copied
  from the local ivy2 cache to the lib folder. I noticed that all the missing
  jars contain “shaded” in their file name (though not all jars with such
  names are missing).
  One of the missing JARs is explicitly from the Spark definition
  (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
 
 This file is clearly present in my local ivy cache, but is not copied by
 sbt-pack.
 
 Is there an evident reason for that?
 
  I don't know much about the shading mechanism; maybe I'm missing something
  here?
 
 
 Any help would be appreciated!
 
 Cheers
 
 Pierre
 
 
 


Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre Borckmans
That would be great, Mayur, thanks!

Anyhow, to be more specific, my question really was the following:

Is there any way to link events in the SparkListener to an action triggered in 
your code?
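
For context, a bare-bones listener looks something like the sketch below (event
and field names shifted a bit between Spark versions, so treat this as an
assumption-laden outline); note that nothing in it directly identifies which
user-level action triggered a given stage, which is exactly the gap in question:

import org.apache.spark.scheduler._
import java.util.concurrent.atomic.AtomicInteger

class ProgressListener extends SparkListener {
  private val tasksDone = new AtomicInteger(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    tasksDone.incrementAndGet()  // rough task-level progress counter
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    println("Stage finished; " + tasksDone.get + " tasks completed so far")
  }
}

// registration, given an existing SparkContext sc:
// sc.addSparkListener(new ProgressListener)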

Cheers




Pierre Borckmans
Software team

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






On 23 May 2014, at 10:17, Mayur Rustagi mayur.rust...@gmail.com wrote:

  We have an internal patched version of the Spark web UI which exports 
  application-related data as JSON. We use monitoring systems as well as an 
  alternate UI on that JSON data for our specific application. We found it much 
  cleaner. We can provide the 0.9.1 version, and would submit it as a pull 
  request soon. 
 
 
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi
 
 
 
 On Fri, May 23, 2014 at 10:57 AM, Chester chesterxgc...@yahoo.com wrote:
  This is something we are interested in as well. We are planning to investigate 
  this further. If someone has suggestions, we would love to hear them.
 
 Chester
 
 Sent from my iPad
 
 On May 22, 2014, at 8:02 AM, Pierre B 
 pierre.borckm...@realimpactanalytics.com wrote:
 
 Hi Andy!
 
  Yes, the Spark UI provides a lot of interesting information for debugging 
  purposes.
 
  Here I’m trying to integrate simple progress monitoring into my app UI.
  
  I’m typically running a few “jobs” (or rather actions), and I’d like to be 
  able to display the progress of each of those in my UI.
  
  I don’t really see how I could do that using SparkListener for the moment …
 
 Thanks for your help!
 
 Cheers!
 
 
 
 
 Pierre Borckmans
 Software team
 
 RealImpact Analytics | Brussels Office
 www.realimpactanalytics.com | [hidden email]
 
 FR +32 485 91 87 31 | Skype pierre.borckmans
 
 
 
 
 
 
 On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] 
 [hidden email] wrote:
 
  SparkListener offers good stuff.
  But I also complemented it with my own metrics machinery that uses Akka 
  to aggregate metrics from anywhere I'd like to collect them (without any 
  deps on Ganglia, only on Codahale).
  However, this was useful for gathering custom metrics (from within the 
  tasks), not really for collecting overall monitoring information about the 
  Spark machinery itself.
  For that, the Spark UI already offers pretty good insight, no?
 
 Cheers,
 
 aℕdy ℙetrella
 about.me/noootsab
 
 
 
 
  On Thu, May 22, 2014 at 4:51 PM, Pierre B [hidden email] wrote:
 Is there a simple way to monitor the overall progress of an action using
 SparkListener or anything else?
 
 I see that one can name an RDD... Could that be used to determine which
 action triggered a stage, ... ?
 
 
 Thanks
 
 Pierre
 
 
 
 
 
 
 
 
 



Re: Spark-ec2 asks for password

2014-04-22 Thread Pierre Borckmans
We’ve been experiencing this as well, and our simple solution is to actually 
keep trying the ssh connection instead of just waiting:

Something like this:


def wait_for_ssh_connection(opts, host):
    u.message("Waiting for ssh connection to host {}".format(host))
    connected = False
    while not connected:
        try:
            if subprocess.check_call(s.ssh_command(opts) +
                                     ['-t', '-t', '%s@%s' % (opts.user, host), 'ls']) == 0:
                connected = True
        except subprocess.CalledProcessError:
            print "Ssh connection to host {} failed, retrying in 10 seconds".format(host)
            time.sleep(10)
    print "Ssh connection to host {} successfully established!".format(host)


HTH

Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans





On 19 Apr 2014, at 06:51, Patrick Wendell pwend...@gmail.com wrote:

 Unfortunately - I think a lot of this is due to generally increased latency 
 on ec2 itself. I've noticed that it's way more common than it used to be for 
 instances to come online past the wait timeout in the ec2 script.
 
 
 On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu 
 wrote:
 Aureliano,
 
 I've been noticing this error recently as well:
 
 ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: 
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 
 However, this isn't an issue with the spark-ec2 scripts. After the scripts 
 fail, if you wait a bit longer (e.g., another 2 minutes), the EC2 hosts will 
 finish launching and port 22 will open up. Until the EC2 host has launched 
 and opened port 22 for SSH, SSH cannot succeed, and the Spark-ec2 scripts 
 will fail. I've noticed that EC2 machine launch latency seems to be highest 
 in Oregon; I haven't run into this problem on either the California or 
  Virginia EC2 farms. To work around this issue, I've manually modified my copy 
 of the EC2 scripts to wait for 6 failures (i.e., 3 minutes), which seems to 
 work OK. Might be worth a try on your end. I can't comment about the password 
 request; I haven't seen that on my end.
 
 Regards,
 
 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466
 
 
 On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia buendia...@gmail.com 
 wrote:
 Hi,
 
  Since 0.9.0, spark-ec2 has become unstable. During launch it throws many errors 
  like:
 
 ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: 
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 
  ... and recently, it prompts for a password:
 
 Warning: Permanently added '' (RSA) to the list of known hosts.
 Password:
 
 Note that the hostname in Permanently added '' is missing in the log, which 
 is probably why it asks for a password.
 
 Is this a known bug?
 
 



Re: programmatic way to tell Spark version

2014-04-10 Thread Pierre Borckmans
I see that this was fixed using a hard-coded string in SparkContext.scala.
Wouldn’t it be better to use something like:

getClass.getPackage.getImplementationVersion

to get the version from the jar manifest (and thus from the sbt definition)?

The same holds for SparkILoopInit.scala in the welcome message (printWelcome).

This would avoid having to modify these strings at each release.
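
For illustration, reading it from the manifest would look roughly like this (a
sketch; getImplementationVersion returns null when the manifest lacks the
Implementation-Version attribute, so the build would have to set it):

// version as recorded in the jar manifest of the spark-core artifact
val sparkVersion =
  Option(classOf[org.apache.spark.SparkContext].getPackage)
    .flatMap(p => Option(p.getImplementationVersion))
    .getOrElse("unknown")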

cheers



Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans





On 10 Apr 2014, at 23:05, Patrick Wendell pwend...@gmail.com wrote:

 I think this was solved in a recent merge:
 
 https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779
 
 Is that what you are looking for? If so, mind marking the JIRA as resolved?
 
 
 On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 Hey Patrick, 
 
 I've created SPARK-1458 to track this request, in case the team/community 
 wants to implement it in the future.
 
 Nick
 
 
 On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 No use case at the moment.
 
 What prompted the question: I was going to ask a different question on this 
 list and wanted to note my version of Spark. I assumed there would be a 
 getVersion method on SparkContext or something like that, but I couldn't find 
 one in the docs. I also couldn't find an environment variable with the 
 version. After futzing around a bit I realized it was printed out (quite 
 conspicuously) in the shell startup banner.
 
 
 On Sat, Feb 22, 2014 at 7:15 PM, Patrick Wendell pwend...@gmail.com wrote:
  AFAIK we don't have any way to do this right now. Maybe we could add
  a getVersion method to SparkContext that would tell you. Just
  wondering - what is the use case here?
 
 - Patrick
 
 On Sat, Feb 22, 2014 at 4:04 PM, nicholas.chammas
 nicholas.cham...@gmail.com wrote:
  Is there a programmatic way to tell what version of Spark I'm running?
 
  I know I can look at the banner when the Spark shell starts up, but I'm
  curious to know if there's another way.
 
  Nick
 
 
  
 
 
 



Changing number of workers for benchmarking purposes

2014-03-12 Thread Pierre Borckmans
Hi there!

I was performing some tests for benchmarking purposes, among other things to 
observe how performance evolves with the number of workers. 

In that context, I was wondering: is there any easy way to choose the number of 
workers to be used in standalone mode, without having to change the “slaves” 
file, dispatch it, and restart the cluster?
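
One related knob (a sketch only - it caps the cores an application takes rather 
than the number of worker daemons) is spark.cores.max, which lets each benchmark 
run use only part of the standalone cluster without touching the slaves file; 
the master URL below is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // hypothetical standalone master
  .setAppName("scaling-benchmark")
  .set("spark.cores.max", "8")            // vary this between runs
val sc = new SparkContext(conf)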


Cheers,

Pierre