Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-03 Thread Pierre Borckmans
You might want to look at another great plugin: “sbt-pack” 
https://github.com/xerial/sbt-pack.

It collects all the dependency JARs and creates launch scripts for *nix 
(including Mac OS) and Windows.
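
For reference, wiring it in looks roughly like this (a sketch only; the plugin
version is illustrative and the exact settings syntax has changed across
sbt-pack releases):

// project/plugins.sbt -- version number is illustrative
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")

// build.sbt -- older sbt-pack releases use packSettings + packMain,
// newer ones use enablePlugins(PackPlugin) instead
packSettings

packMain := Map("my-app" -> "com.example.Main")  // hypothetical script name -> main class

Running `sbt pack` then leaves the dependency jars in target/pack/lib and the
launch scripts in target/pack/bin.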

HTH

Pierre


On 02 Jun 2014, at 17:29, Andrei faithlessfri...@gmail.com wrote:

 Thanks! This is even closer to what I am looking for. I'm on a trip now, so 
 I'm going to give it a try when I come back. 
 
 
 On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao ngocdaoth...@gmail.com wrote:
 Alternative solution:
 https://github.com/xitrum-framework/xitrum-package
 
 It collects all dependency .jar files in your Scala program into a
 directory. It doesn't merge the .jar files together; the .jar files
 are left as is.
 
 
 On Sat, May 31, 2014 at 3:42 AM, Andrei faithlessfri...@gmail.com wrote:
  Thanks, Stephen. I have eventually decided to go with assembly, but to leave
  the Spark and Hadoop jars out of it and instead let `spark-submit` provide these
  dependencies automatically. This way no resource conflicts arise and
  mergeStrategy needs no modification. To capture this stable setup and also
  share it with the community I've put together a project [1] with a minimal
  working config. It is an SBT project with the assembly plugin, Spark 1.0 and
  Cloudera's Hadoop client. I hope it will help somebody get a Spark setup running quicker.
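
  For illustration, the core of that setup boils down to marking Spark and Hadoop
  as provided in build.sbt (a sketch; the versions and the Cloudera artifact shown
  here are illustrative, see the linked project for the actual config):

  libraryDependencies ++= Seq(
    // supplied by spark-submit at runtime, so left out of the assembled jar
    "org.apache.spark"  %% "spark-core"    % "1.0.0"          % "provided",
    "org.apache.hadoop" %  "hadoop-client" % "2.3.0-cdh5.0.0" % "provided"
  )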
 
  Though I'm fine with this setup for final builds, I'm still looking for a
  more interactive dev setup - something that doesn't require a full rebuild.
 
  [1]: https://github.com/faithlessfriend/sample-spark-project
 
  Thanks and have a good weekend,
  Andrei
 
  On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  The MergeStrategy combined with sbt assembly did work for me.  This is not
  painless: some trial and error and the assembly may take multiple minutes.
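
  A typical merge strategy of that era looks something like this (just a sketch;
  the key names differ between sbt-assembly releases, and newer ones use
  assemblyMergeStrategy instead):

  // build.sbt (assumes the sbt-assembly plugin and its settings are already in place)
  mergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop conflicting metadata
    case "reference.conf"              => MergeStrategy.concat   // merge Typesafe config defaults
    case _                             => MergeStrategy.first
  }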
 
  You will likely want to filter out some additional classes from the
  generated jar file. Here is a Stack Overflow answer explaining that, with (IMHO)
  the best answer snippet included here (in this case the OP understandably did
  not want to include javax.servlet.Servlet):
 
  http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar
 
 
  mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
    ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
  }
 
  There is a setting to not include the project files in the assembly but I
  do not recall it at this moment.
 
 
 
  2014-05-29 10:13 GMT-07:00 Andrei faithlessfri...@gmail.com:
 
  Thanks, Jordi, your gist looks pretty much like what I have in my project
  currently (with a few exceptions that I'm going to borrow).
 
  I like the idea of using sbt package, since it doesn't require third
  party plugins and, most importantly, doesn't create a mess of classes and
  resources. But in this case I'll have to handle the jar list manually via the
  Spark context. Is there a way to automate this process? E.g. when I was a
  Clojure guy, I could run lein deps (lein is a build tool similar to sbt) to
  download all dependencies and then just enumerate them from my app. Maybe
  you have heard of something like that for Spark/SBT?
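
  A sketch of one way to automate that with plain sbt package, assuming
  retrieveManaged := true is set in build.sbt so that sbt copies the dependency
  jars under lib_managed/ (the app name and jar path below are illustrative):

  import java.io.File
  import org.apache.spark.{SparkConf, SparkContext}

  // recursively collect the jars sbt placed under lib_managed/
  def listJars(dir: File): Seq[String] =
    Option(dir.listFiles).toSeq.flatten.flatMap { f =>
      if (f.isDirectory) listJars(f)
      else if (f.getName.endsWith(".jar")) Seq(f.getAbsolutePath)
      else Seq.empty
    }

  val conf = new SparkConf()
    .setAppName("my-app")  // hypothetical
    .setJars(listJars(new File("lib_managed")) :+ "target/scala-2.10/my-app_2.10-0.1.jar")
  val sc = new SparkContext(conf)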
 
  Thanks,
  Andrei
 
 
  On Thu, May 29, 2014 at 3:48 PM, jaranda jordi.ara...@bsc.es wrote:
 
  Hi Andrei,
 
  I think the preferred way to deploy Spark jobs is by using the sbt package
  task instead of using the sbt assembly plugin. In any case, as you comment,
  the mergeStrategy in combination with some dependency exclusions should fix
  your problems. Have a look at this gist
  https://gist.github.com/JordiAranda/bdbad58d128c14277a05 for further
  details (I just followed some recommendations from the sbt assembly
  plugin documentation).
 
  Up to now I haven't found a proper way to combine my development/deployment
  phases, although I must say my experience with Spark is pretty limited (it
  really depends on your deployment requirements as well). On that point, I think
  someone else could give you some further insights.
 
  Best,
 
 
 
 
 
 
 
 



Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
+1 Same question here...

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 10:08, Kexin Xie kexin@bigcommerce.com wrote:
 
 Hi,
 
  Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw 
  org.apache.hadoop.mapred.FileAlreadyExistsException when the file already exists. 
 
 Is there a way I can allow Spark to overwrite the existing file?
 
 Cheers,
 Kexin
 


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
Indeed, the behavior has changed, for good or for bad. I mean, I agree with the 
danger you mention, but I'm not sure it happens like that. Isn't there an 
overwrite mechanism in Hadoop that automatically removes the old part files, 
writes to a _temporary folder, and only then moves the part files into place 
along with the _SUCCESS marker? 

In any case this change of behavior should be documented IMO.

Cheers 
Pierre

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  What I’ve found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is 
  that files get overwritten automatically. There is one danger to this, though. 
  If I save to a directory that already has 20 part- files, but this time 
  around I’m only saving 15 part- files, then there will be 5 leftover part- 
  files from the previous set mixed in with the 15 newer files. This is 
  potentially dangerous.
 
 I haven’t checked to see if this behavior has changed in 1.0.0. Are you 
 saying it has, Pierre?
 
 On Mon, Jun 2, 2014 at 9:41 AM, Pierre B 
 [pierre.borckm...@realimpactanalytics.com](mailto:pierre.borckm...@realimpactanalytics.com)
  wrote:
 
 Hi Michaël,
 
 Thanks for this. We could indeed do that.
 
 But I guess the question is more about the change of behaviour from 0.9.1 to
 1.0.0.
 We never had to care about that in previous versions.
 
  Does that mean we have to manually remove existing files, or is there a way
  to automatically overwrite when using saveAsTextFile?
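
  A minimal sketch of the manual-removal workaround, assuming an existing
  SparkContext sc and an RDD rdd (the output path is hypothetical):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val output = new Path("hdfs:///tmp/my-output")  // hypothetical path
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(output)) {
    fs.delete(output, true)  // recursively delete the old part- files
  }
  rdd.saveAsTextFile(output.toString)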
 
 
 


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
I'm a bit confused because the PR mentioned by Patrick seems to address all 
these issues:
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

Was it not accepted? Or is the description of this PR not completely 
implemented?

Message sent from a mobile device - excuse typos and abbreviations

 On 2 Jun 2014, at 23:08, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
 OK, thanks for confirming. Is there something we can do about that leftover 
 part- files problem in Spark, or is that for the Hadoop team?
 
 
  On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote:
 Yes.
 
 
 On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
  So in summary:
  - As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
  - There is an open JIRA issue to add an option to allow clobbering.
  - Even when clobbering, part- files may be left over from previous saves, 
    which is dangerous.
  Is this correct?
 
 
 On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson ilike...@gmail.com wrote:
 +1 please re-add this feature
 
 
 On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
 I accidentally assigned myself way back when I created it). This
 should be an easy fix.
 
 On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
  Hi, Patrick,
 
  I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about
  the same thing?
 
  How about assigning it to me?
 
  I think I missed the configuration part in my previous commit, though I
  declared that in the PR description
 
  Best,
 
  --
  Nan Zhu
 
  On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
 
  Hey There,
 
  The issue was that the old behavior could cause users to silently
  overwrite data, which is pretty bad, so to be conservative we decided
  to enforce the same checks that Hadoop does.
 
  This was documented by this JIRA:
  https://issues.apache.org/jira/browse/SPARK-1100
  https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
 
  However, it would be very easy to add an option that allows preserving
  the old behavior. Is anyone here interested in contributing that? I
  created a JIRA for it:
 
  https://issues.apache.org/jira/browse/SPARK-1993
 
  - Patrick
 
  On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
  pierre.borckm...@realimpactanalytics.com wrote:
 
   Indeed, the behavior has changed, for good or for bad. I mean, I agree with
   the danger you mention, but I'm not sure it happens like that. Isn't there
   an overwrite mechanism in Hadoop that automatically removes the old part
   files, writes to a _temporary folder, and only then moves the part files
   into place along with the _SUCCESS marker?
 
  In any case this change of behavior should be documented IMO.
 
  Cheers
  Pierre
 
  Message sent from a mobile device - excuse typos and abbreviations
 
   On 2 Jun 2014, at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
   What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is
   that files get overwritten automatically. There is one danger to this, though.
   If I save to a directory that already has 20 part- files, but this time
   around I'm only saving 15 part- files, then there will be 5 leftover part-
   files from the previous set mixed in with the 15 newer files. This is
   potentially dangerous.
 
  I haven't checked to see if this behavior has changed in 1.0.0. Are you


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre Borckmans
You're right Patrick! 

Just had a chat with the sbt-pack creator, and indeed dependencies with classifiers 
are ignored to avoid problems with a dirty cache...

Should be fixed in next version of the plugin.

Cheers

Pierre 

Message sent from a mobile device - excuse typos and abbreviations 

 On 1 Jun 2014, at 20:04, Patrick Wendell pwend...@gmail.com wrote:
 
 https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
 
 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote:
  One potential issue here is that mesos is using classifiers now to
  publish their jars. It might be that sbt-pack has trouble with
 dependencies that are published using classifiers. I'm pretty sure
 mesos is the only dependency in Spark that is using classifiers, so
 that's why I mention it.
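
  For reference, a dependency published with a classifier is declared in sbt
  roughly like this (a sketch mirroring the mesos line Patrick links to in
  SparkBuild.scala):

  libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"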
 
 On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com wrote:
 Hi all!
 
  We've been using the sbt-pack sbt plugin
  (https://github.com/xerial/sbt-pack) for building our standalone Spark
  application for a while now. Until version 1.0.0, that worked nicely.
  
  For those who don't know the sbt-pack plugin, it basically copies all the
  dependency JARs from your local ivy/maven cache to your target folder
  (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
  your application (notably setting all these jars on the classpath).
 
  Now, since Spark 1.0.0 was released, we are encountering a weird error where
  running our project with `sbt run` works fine, but running our app with the
  launch scripts generated by sbt-pack fails.
 
  After a (quite painful) investigation, it turns out some JARs are NOT copied
  from the local ivy2 cache to the lib folder. I noticed that all the missing
  jars contain “shaded” in their file name (though not all jars with such
  names are missing).
  One of the missing JARs is explicitly from the Spark definition
  (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
 
 This file is clearly present in my local ivy cache, but is not copied by
 sbt-pack.
 
 Is there an evident reason for that?
 
  I don't know much about the shading mechanism; maybe I'm missing something
  here?
 
 
 Any help would be appreciated!
 
 Cheers
 
 Pierre
 
 
 


Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre Borckmans
That would be great, Mayur, thanks!

Anyhow, to be more specific, my question really was the following:

Is there any way to link events in the SparkListener to an action triggered in 
your code?
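
For context, a bare-bones listener looks something like the sketch below (event
and field names shifted a bit between Spark versions, so treat this as an
assumption-laden outline); note that nothing in it directly identifies which
user-level action triggered a given stage, which is exactly the gap in question:

import org.apache.spark.scheduler._
import java.util.concurrent.atomic.AtomicInteger

class ProgressListener extends SparkListener {
  private val tasksDone = new AtomicInteger(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    tasksDone.incrementAndGet()  // rough task-level progress counter
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    println("Stage finished; " + tasksDone.get + " tasks completed so far")
  }
}

// registration, given an existing SparkContext sc:
// sc.addSparkListener(new ProgressListener)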

Cheers




Pierre Borckmans
Software team

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






On 23 May 2014, at 10:17, Mayur Rustagi mayur.rust...@gmail.com wrote:

  We have an internal patched version of the Spark web UI which exports 
  application-related data as JSON. We use monitoring systems as well as an 
  alternate UI on that JSON data for our specific application. We found it much 
  cleaner. We can provide the 0.9.1 version, and would submit it as a pull 
  request soon. 
 
 
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi
 
 
 
 On Fri, May 23, 2014 at 10:57 AM, Chester chesterxgc...@yahoo.com wrote:
  This is something we are interested in as well. We are planning to investigate 
  this further. If someone has suggestions, we would love to hear them.
 
 Chester
 
 Sent from my iPad
 
 On May 22, 2014, at 8:02 AM, Pierre B 
 pierre.borckm...@realimpactanalytics.com wrote:
 
 Hi Andy!
 
  Yes, the Spark UI provides a lot of interesting information for debugging 
  purposes.
 
  Here I’m trying to integrate simple progress monitoring into my app UI.
  
  I’m typically running a few “jobs” (or rather actions), and I’d like to be 
  able to display the progress of each of those in my UI.
  
  I don’t really see how I could do that using SparkListener for the moment …
 
 Thanks for your help!
 
 Cheers!
 
 
 
 
 Pierre Borckmans
 Software team
 
 RealImpact Analytics | Brussels Office
 www.realimpactanalytics.com | [hidden email]
 
 FR +32 485 91 87 31 | Skype pierre.borckmans
 
 
 
 
 
 
 On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] 
 [hidden email] wrote:
 
  SparkListener offers good stuff.
  But I also complemented it with my own metrics machinery that uses Akka 
  to aggregate metrics from anywhere I'd like to collect them (without any 
  deps on Ganglia, only on Codahale).
  However, this was useful for gathering custom metrics (from within the 
  tasks), not really for collecting overall monitoring information about the 
  Spark machinery itself.
  For that, the Spark UI already offers pretty good insight, no?
 
 Cheers,
 
 aℕdy ℙetrella
 about.me/noootsab
 
 
 
 
  On Thu, May 22, 2014 at 4:51 PM, Pierre B [hidden email] wrote:
 Is there a simple way to monitor the overall progress of an action using
 SparkListener or anything else?
 
 I see that one can name an RDD... Could that be used to determine which
 action triggered a stage, ... ?
 
 
 Thanks
 
 Pierre
 
 
 
 
 
 
 
 
 



Re: Spark-ec2 asks for password

2014-04-22 Thread Pierre Borckmans
We’ve been experiencing this as well, and our simple solution is to actually 
keep trying the ssh connection instead of just waiting:

Something like this:


def wait_for_ssh_connection(opts, host):
    u.message("Waiting for ssh connection to host {}".format(host))
    connected = False
    while not connected:
        try:
            if subprocess.check_call(s.ssh_command(opts) +
                                     ['-t', '-t', '%s@%s' % (opts.user, host), 'ls']) == 0:
                connected = True
        except subprocess.CalledProcessError:
            print "Ssh connection to host {} failed, retrying in 10 seconds".format(host)
            time.sleep(10)
    print "Ssh connection to host {} successfully established!".format(host)


HTH

Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans





On 19 Apr 2014, at 06:51, Patrick Wendell pwend...@gmail.com wrote:

 Unfortunately - I think a lot of this is due to generally increased latency 
 on ec2 itself. I've noticed that it's way more common than it used to be for 
 instances to come online past the wait timeout in the ec2 script.
 
 
 On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu 
 wrote:
 Aureliano,
 
 I've been noticing this error recently as well:
 
 ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: 
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 
 However, this isn't an issue with the spark-ec2 scripts. After the scripts 
 fail, if you wait a bit longer (e.g., another 2 minutes), the EC2 hosts will 
 finish launching and port 22 will open up. Until the EC2 host has launched 
 and opened port 22 for SSH, SSH cannot succeed, and the Spark-ec2 scripts 
 will fail. I've noticed that EC2 machine launch latency seems to be highest 
 in Oregon; I haven't run into this problem on either the California or 
  Virginia EC2 farms. To work around this issue, I've manually modified my copy 
 of the EC2 scripts to wait for 6 failures (i.e., 3 minutes), which seems to 
 work OK. Might be worth a try on your end. I can't comment about the password 
 request; I haven't seen that on my end.
 
 Regards,
 
 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466
 
 
 On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia buendia...@gmail.com 
 wrote:
 Hi,
 
  Since 0.9.0, spark-ec2 has become unstable. During launch it throws many errors 
  like:
 
 ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: 
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 
  ... and recently, it prompts for a password:
 
 Warning: Permanently added '' (RSA) to the list of known hosts.
 Password:
 
 Note that the hostname in Permanently added '' is missing in the log, which 
 is probably why it asks for a password.
 
 Is this a known bug?
 
 



Re: programmatic way to tell Spark version

2014-04-10 Thread Pierre Borckmans
I see that this was fixed using a hard-coded string in SparkContext.scala.
Wouldn’t it be better to use something like:

getClass.getPackage.getImplementationVersion

to get the version from the jar manifest (and thus from the sbt definition)?

The same holds for SparkILoopInit.scala in the welcome message (printWelcome).

This would avoid having to modify these strings at each release.
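
For illustration, reading it from the manifest would look roughly like this (a
sketch; getImplementationVersion returns null when the manifest lacks the
Implementation-Version attribute, so the build would have to set it):

// version as recorded in the jar manifest of the spark-core artifact
val sparkVersion =
  Option(classOf[org.apache.spark.SparkContext].getPackage)
    .flatMap(p => Option(p.getImplementationVersion))
    .getOrElse("unknown")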

cheers



Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans





On 10 Apr 2014, at 23:05, Patrick Wendell pwend...@gmail.com wrote:

 I think this was solved in a recent merge:
 
 https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779
 
 Is that what you are looking for? If so, mind marking the JIRA as resolved?
 
 
 On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 Hey Patrick, 
 
 I've created SPARK-1458 to track this request, in case the team/community 
 wants to implement it in the future.
 
 Nick
 
 
 On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 No use case at the moment.
 
 What prompted the question: I was going to ask a different question on this 
 list and wanted to note my version of Spark. I assumed there would be a 
 getVersion method on SparkContext or something like that, but I couldn't find 
 one in the docs. I also couldn't find an environment variable with the 
 version. After futzing around a bit I realized it was printed out (quite 
 conspicuously) in the shell startup banner.
 
 
 On Sat, Feb 22, 2014 at 7:15 PM, Patrick Wendell pwend...@gmail.com wrote:
  AFAIK we don't have any way to do this right now. Maybe we could add
  a getVersion method to SparkContext that would tell you. Just
  wondering - what is the use case here?
 
 - Patrick
 
 On Sat, Feb 22, 2014 at 4:04 PM, nicholas.chammas
 nicholas.cham...@gmail.com wrote:
  Is there a programmatic way to tell what version of Spark I'm running?
 
  I know I can look at the banner when the Spark shell starts up, but I'm
  curious to know if there's another way.
 
  Nick
 
 
  
 
 
 



Changing number of workers for benchmarking purposes

2014-03-12 Thread Pierre Borckmans
Hi there!

I was performing some tests for benchmarking purposes, among other things to 
observe how performance evolves with the number of workers. 

In that context, I was wondering: is there any easy way to choose the number of 
workers to be used in standalone mode, without having to change the “slaves” 
file, dispatch it, and restart the cluster?
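
One related knob (a sketch only - it caps the cores an application takes rather 
than the number of worker daemons) is spark.cores.max, which lets each benchmark 
run use only part of the standalone cluster without touching the slaves file; 
the master URL below is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // hypothetical standalone master
  .setAppName("scaling-benchmark")
  .set("spark.cores.max", "8")            // vary this between runs
val sc = new SparkContext(conf)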


Cheers,

Pierre