Re: Help reading Spark UI tea leaves..

2015-05-23 Thread Shay Seng
and otherData use the same partitioner, spark knows it doesn't need to re-sort d3 each time. It can use the existing shuffle output it already has sitting on disk. So you'll see the stage is skipped in the UI (except for the first job): On Fri, May 22, 2015 at 11:59 AM, Shay Seng s
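The co-partitioning point above can be sketched in plain Scala (no Spark required; the dataset contents and partition count are illustrative assumptions): keys hashed by the same partition function land in the same partition index in both datasets, which is why the join's shuffle stage can be skipped.

```scala
// Sketch: the same partition function co-locates equal keys across two
// datasets, so a join between them needs no fresh shuffle.
// Contents and numPartitions are illustrative assumptions.
def partitionOf(key: String, numPartitions: Int): Int =
  math.abs(key.hashCode) % numPartitions

val numPartitions = 4
val d3        = Seq("a" -> 1, "b" -> 2, "c" -> 3)
val otherData = Seq("a" -> 10, "c" -> 30)

val d3Parts    = d3.map { case (k, _) => k -> partitionOf(k, numPartitions) }.toMap
val otherParts = otherData.map { case (k, _) => k -> partitionOf(k, numPartitions) }.toMap

// Every shared key gets the same partition index in both datasets.
val colocated = (d3Parts.keySet intersect otherParts.keySet)
  .forall(k => d3Parts(k) == otherParts(k))
```

In real Spark code the analogue is calling `partitionBy` with the same `Partitioner` instance on both RDDs and persisting the result before the repeated joins.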

Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Shay Seng
Hi. I have a job that takes ~50min with Spark 0.9.3 and ~1.8hrs on Spark 1.3.1 on the same cluster. The only code difference between the two code bases is to fix the Seq -> Iter changes that happened in the Spark 1.x series. Are there any other changes in the defaults from spark 0.9.3 -> 1.3.1
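One common source of exactly this kind of slowdown when porting across the Seq -> Iter API change: adapting an old Seq-based partition function by calling `toSeq` on the new Iterator materializes each whole partition in memory. A hypothetical plain-Scala sketch (function names are mine, not from the thread):

```scala
// Hypothetical sketch of porting a Seq-based partition function to the
// Iterator-based signature. Names and data are illustrative.

// Old-style function written against Seq:
def perPartitionOld(xs: Seq[Int]): Seq[Int] = xs.map(_ * 2)

// Easy-but-costly port: forces the whole partition into memory at once.
def perPartitionEager(it: Iterator[Int]): Iterator[Int] =
  perPartitionOld(it.toSeq).iterator

// Lazy port: stays streaming, one element at a time.
def perPartitionLazy(it: Iterator[Int]): Iterator[Int] =
  it.map(_ * 2)

val out = perPartitionLazy(Iterator(1, 2, 3)).toList
```

Both versions return the same elements, but the eager variant changes the memory profile per partition, which can show up as GC pressure or spilling on large partitions.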

Help reading Spark UI tea leaves..

2015-05-22 Thread Shay Seng
Hi. I have an RDD that I use repeatedly through many iterations of an algorithm. To prevent recomputation, I persist the RDD (and incidentally I also persist and checkpoint its parents) val consCostConstraintMap = consCost.join(constraintMap).map { case (cid, (costs,(mid1,_,mid2,_,_))) => {

Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
Hi, I've been trying to move up from spark 0.9.2 to 1.1.0. I'm getting a little confused with the setup for a few different use cases, grateful for any pointers... (1) spark-shell + with jars that are only required by the driver (1a) I added spark.driver.extraClassPath /mypath/to.jar to my
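For reference, the two driver-side class path mechanisms being discussed can be set like this (the `/mypath/to.jar` path is the placeholder from the message):

```
# spark-defaults.conf -- extra entries prepended to the driver's class path
spark.driver.extraClassPath  /mypath/to.jar

# equivalently, per-invocation on the command line:
#   spark-shell --driver-class-path /mypath/to.jar
```

Note that these only affect the driver JVM; JARs that executors also need go through `--jars` (or `spark.executor.extraClassPath` for paths already present on the workers).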

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the dependencies into one assembly JAR. Matei On Oct 30, 2014, at 5:24 PM, Shay Seng s...@urbanengines.com wrote: Hi
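Matei's suggestion of a shell script that lists a directory can be sketched like this (the directory and JAR names are illustrative stand-ins; use ',' as the separator for `spark-submit --jars` and ':' when building a class path):

```shell
# Build a single separator-joined list from every JAR in a directory.
# The directory and JAR names below are illustrative stand-ins.
jar_dir=/tmp/spark_jar_demo
mkdir -p "$jar_dir"
touch "$jar_dir/a.jar" "$jar_dir/b.jar"

# Join with commas (for --jars); swap ',' for ':' to build a class path.
jars=$(printf '%s,' "$jar_dir"/*.jar)
jars=${jars%,}
echo "$jars"
```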

RDD save as Seq File

2014-09-24 Thread Shay Seng
Hi, Why does RDD.saveAsObjectFile() to S3 leave a bunch of *_$folder$ empty files around? Is it possible for the saveAs to clean up? tks

persist before or after checkpoint?

2014-09-24 Thread Shay Seng
Hey, I actually have 2 questions (1) I want to generate unique IDs for each RDD element and I want to assign them in parallel so I do rdd.mapPartitionsWithIndex((index, s) => { var count = 0L s.zipWithIndex.map { case (t, i) => { count += 1 (index *
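The per-partition ID scheme in the question can be run as plain Scala (the `OFFSET` multiplier below is a hypothetical stand-in; the message is truncated before the real value). Spark 1.x also offers `RDD.zipWithUniqueId`, which uses a similar partition-index-based scheme without a counter.

```scala
// Plain-Scala sketch of per-partition unique IDs (no Spark needed).
// OFFSET is a hypothetical per-partition capacity; the original message
// is truncated before its actual multiplier.
val OFFSET = 1000000L

def idsForPartition[T](index: Int, s: Iterator[T]): Iterator[(T, Long)] = {
  var count = 0L
  s.map { t =>
    count += 1
    // Partition index picks the block; the counter picks the slot within it.
    (t, index * OFFSET + count)
  }
}

val part0 = idsForPartition(0, Iterator("a", "b")).toList
val part1 = idsForPartition(1, Iterator("c")).toList
```

IDs are unique across partitions as long as no partition holds more than `OFFSET` elements, which is the implicit assumption of this scheme.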

Odd saveAsSequenceFile bug

2014-08-28 Thread Shay Seng
Hey Sparkies... I have an odd bug. I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on

Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
Hi, I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge machines The cluster is running Spark standalone and is launched with the ec2 scripts. In my Spark job, I am using ephemeral HDFS to checkpoint some of my RDDs. I'm also reading and writing to S3. My jobs also involve a large

Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
it should be a pretty similar setup. On Thu, Aug 21, 2014 at 1:23 PM, Shay Seng s...@urbanengines.com wrote: Hi, I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge machines The cluster is running Spark standalone and is launched with the ec2 scripts. In my Spark job, I am