Preserving conf files when restarting ec2 cluster

2014-09-12 Thread jerryye
Hi, I'm using --use-existing-master to launch a previously stopped ec2 cluster with spark-ec2. However, my configuration files are overwritten once the cluster is set up. What's the best way of preserving existing configuration files in spark/conf? Alternatively, what I'm trying to do is set

Re: Serialize input path

2014-09-05 Thread jerryye
Thanks for the response, Sean. As a correction, the code I provided actually ended up working. I tried to reduce my code down, but I was being overzealous, and running count actually works. The minimal code that triggers the problem is this: val userProfiles = lines.map(line =

Serialize input path

2014-09-04 Thread jerryye
Hi, I have a quick serialization issue. I'm trying to read a date range of input files and I'm getting a serialization issue when using an input path that has an object generate a date range. Specifically, my code uses DateTimeFormat in the Joda-Time package, which is not serializable. How do I get

Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update, I'm still getting the same issue. I ended up doing a coalesce instead of a cache to get around the memory issue, but saveAsTextFile still won't proceed without the coalesce or cache first.

minPartitions ignored for bz2?

2014-08-27 Thread jerryye
Hi, I'm running on the master branch and I noticed that textFile ignores minPartitions for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1, whereas doing it for a non-bz2 file worked. Not sure if this matters or not

saveAsTextFile makes no progress without caching RDD

2014-08-21 Thread jerryye
Hi, I'm running on branch-1.1 and trying to do a simple transformation to a relatively small dataset of 64GB and saveAsTextFile essentially hangs and tasks are stuck in running mode with the following code: // Stalls with tasks running for over an hour with no tasks finishing. Smallest partition

Re: How to debug: Runs locally but not on cluster

2014-08-14 Thread jerryye
I've isolated this to a memory issue but I don't know what parameter I need to tweak. If I sample my samples RDD with 35% of the data, everything runs to completion, with 35%, it fails. In standalone mode, I can run on the full RDD without any problems. // works val samples =

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye
bump. same problem here.

How to debug: Runs locally but not on cluster

2014-08-13 Thread jerryye
Hi all, I have an issue where I'm able to run my code in standalone mode but not on my cluster. I've isolated it to a few things but am at a loss as to how to debug this. Below is the code. Any suggestions would be much appreciated. Thanks! 1) RDD size is causing the problem. The code below as is