Hi,
I'm using --use-existing-master to launch a previously stopped EC2 cluster
with spark-ec2. However, my configuration files are overwritten once the
cluster is set up. What's the best way of preserving existing configuration
files in spark/conf?
Alternatively, what I'm trying to do is set
Thanks for the response, Sean.
As a correction: the code I provided actually ended up working. I tried to
reduce my code down, but I was being overzealous; running count actually
works.
The minimal code that triggers the problem is this:
val userProfiles = lines.map(line =>
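(The snippet above is cut off in the archive. Purely to illustrate the shape of that transformation, a hypothetical sketch follows; the UserProfile case class, the tab delimiter, and the field layout are assumptions, not the original code.)

// Hypothetical sketch only -- the original map body is truncated above.
case class UserProfile(userId: String, name: String)

val userProfiles = lines.map { line =>
  val fields = line.split("\t")        // assumed delimiter
  UserProfile(fields(0), fields(1))    // assumed field layout
}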
Hi,
I have a quick serialization issue. I'm trying to read a date range of input
files, and I'm hitting a serialization error when the input path is generated
by an object that produces a date range. Specifically, my code uses
DateTimeFormat from the Joda-Time package, which is not serializable. How do I
get
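(A common way around this kind of "not serializable" problem is to keep the formatter out of the task closure, either by constructing it inside the closure or by holding it in an object so each JVM builds its own copy. A minimal sketch, assuming the formatter is only used to build path strings; the names DateFormats and inputPaths are made up here.)

import org.joda.time.LocalDate
import org.joda.time.format.DateTimeFormat

// The formatter lives in an object, so it is created per JVM and never
// has to be serialized into a closure.
object DateFormats {
  val daily = DateTimeFormat.forPattern("yyyy-MM-dd")
}

// Hypothetical helper: build one input path per day in the range.
def inputPaths(start: LocalDate, days: Int, base: String): Seq[String] =
  (0 until days).map(i => s"$base/${DateFormats.daily.print(start.plusDays(i))}")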
As an update: I'm still getting the same issue. I ended up doing a coalesce
instead of a cache to get around the memory issue, but saveAsTextFile still
won't proceed without the coalesce or cache first.
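(For reference, the two workarounds described above look roughly like this; samples is the RDD from the earlier messages, and the output path and partition count are placeholders.)

val outputPath = "hdfs:///path/to/output"  // placeholder

// Workaround (a): mark the RDD for caching; the saveAsTextFile action materializes it.
samples.cache()
samples.saveAsTextFile(outputPath)

// Workaround (b): shrink the number of output partitions before writing (no shuffle).
samples.coalesce(200).saveAsTextFile(outputPath)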
Hi,
I'm running on the master branch and I noticed that textFile ignores
minPartitions for bz2 files. Is anyone else seeing the same thing? I tried
varying minPartitions for a bz2 file and rdd.partitions.size was always 1,
whereas doing the same for a non-bz2 file worked.
Not sure if this matters or not
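(For anyone reproducing this, the check and an interim workaround look something like the following; it assumes a SparkContext sc, and the path and partition counts are placeholders. Note that repartition forces a shuffle, so it is only a stopgap.)

val rdd = sc.textFile("data/input.txt.bz2", 8)  // minPartitions hint
println(rdd.partitions.size)                    // observed: always 1 for bz2

// Interim workaround: explicitly repartition after reading (incurs a shuffle).
val split = rdd.repartition(8)
println(split.partitions.size)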
Hi,
I'm running on branch-1.1 and trying to do a simple transformation on a
relatively small dataset of 64 GB, but saveAsTextFile essentially hangs and
tasks are stuck in the running state with the following code:
// Stalls with tasks running for over an hour with no tasks finishing.
Smallest partition
I've isolated this to a memory issue, but I don't know which parameter I need
to tweak. If I sample my samples RDD down to 35% of the data, everything runs
to completion; with a larger fraction, it fails. In standalone mode, I can run
on the full RDD without any problems.
// works
val samples =
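(For context, the sampling step described above is roughly the following; the 0.35 fraction comes from the message, while the seed and output path are placeholders.)

// Take roughly 35% of the records; at this fraction the job runs to completion.
val sampled = samples.sample(withReplacement = false, fraction = 0.35, seed = 42L)
sampled.saveAsTextFile("hdfs:///path/to/output")  // placeholder output path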
bump. same problem here.
Hi all,
I have an issue where I'm able to run my code in standalone mode but not on
my cluster. I've isolated it to a few things but am at a loss as to how to
debug this. Below is the code. Any suggestions would be much appreciated.
Thanks!
1) RDD size is causing the problem. The code below as is