from:"jerryye"

Preserving conf files when restarting ec2 cluster

2014-09-12 Thread jerryye

Hi, I'm using --use-existing-master to launch a previously stopped ec2 cluster with spark-ec2. However, my configuration files are overwritten once the cluster is set up. What's the best way of preserving existing configuration files in spark/conf? Alternatively, what I'm trying to do is set

Re: Serialize input path

2014-09-05 Thread jerryye

Thanks for the response, Sean. As a correction, the code I provided actually ended up working. I tried to reduce my code down, but I was overzealous: running count actually works. The minimal code that triggers the problem is this: val userProfiles = lines.map(line =

Serialize input path

2014-09-04 Thread jerryye

Hi, I have a quick serialization issue. I'm trying to read a date range of input files, and I'm getting a serialization error when the input path is generated by an object that builds the date range. Specifically, my code uses DateTimeFormat from the Joda-Time package, which is not serializable. How do I get
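One common way around this kind of problem is to build the input paths on the driver, so that only plain strings (never the formatter) end up in anything Spark has to serialize. The sketch below is not the poster's code: it swaps in `java.time` for Joda-Time to stay dependency-free, and the object name `DateRangePaths`, the helper `pathsFor`, the `yyyy-MM-dd` directory layout, and the `hdfs:///logs` base path are all hypothetical.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: builds one input path per day, entirely on the driver.
object DateRangePaths {
  // Returns base/<yyyy-MM-dd> for every day in [start, end]; empty if start is after end.
  def pathsFor(base: String, start: LocalDate, end: LocalDate): Seq[String] = {
    // The formatter is a local value used only here, so it is never
    // captured by an RDD closure and never needs to be serialized.
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(!_.isAfter(end))
      .map(d => s"$base/${fmt.format(d)}")
      .toList
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout: one directory per day under a hypothetical base path.
    val paths = pathsFor("hdfs:///logs", LocalDate.of(2014, 9, 1), LocalDate.of(2014, 9, 3))
    // SparkContext.textFile accepts a comma-separated list of paths,
    // so this plain String could be passed to it directly.
    val input = paths.mkString(",")
    println(input)
  }
}
```

With Joda-Time the same idea should apply: keep the formatter in a local value or a driver-side object and pass only the resulting strings into `textFile` and any closures.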

Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye

As an update, I'm still getting the same issue. I ended up doing a coalesce instead of a cache to get around the memory issue, but saveAsTextFile still won't proceed without the coalesce or cache first.

minPartitions ignored for bz2?

2014-08-27 Thread jerryye

Hi, I'm running on the master branch and I noticed that textFile ignores minPartitions for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1, whereas doing it for a non-bz2 file worked. Not sure if this matters or not

saveAsTextFile makes no progress without caching RDD

2014-08-21 Thread jerryye

Hi, I'm running on branch-1.1 and trying to do a simple transformation to a relatively small dataset of 64GB and saveAsTextFile essentially hangs and tasks are stuck in running mode with the following code: // Stalls with tasks running for over an hour with no tasks finishing. Smallest partition

Re: How to debug: Runs locally but not on cluster

2014-08-14 Thread jerryye

I've isolated this to a memory issue but I don't know what parameter I need to tweak. If I sample my samples RDD with 35% of the data, everything runs to completion, with 35%, it fails. In standalone mode, I can run on the full RDD without any problems. // works val samples =

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye

bump. same problem here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-due-to-stage-failure-TID-x-failed-for-unknown-reasons-tp10187p12095.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to debug: Runs locally but not on cluster

2014-08-13 Thread jerryye

Hi all, I have an issue where I'm able to run my code in standalone mode but not on my cluster. I've isolated it to a few things but am at a loss as to how to debug this. Below is the code. Any suggestions would be much appreciated. Thanks! 1) RDD size is causing the problem. The code below as is

9 matches