This may sound like an obvious question, but are you sure that the program
is doing any work when you don't have a saveAsTextFile? If there are
transformations but no actions to actually collect the data, there's no
need for Spark to execute the transformations.
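To illustrate the point, a minimal sketch (the paths and the parse step are placeholders, and sc is the usual SparkContext):

val lines  = sc.textFile("s3n://some-bucket/input/*.gz")   // transformation: nothing runs yet
val parsed = lines.map(parseRecord)                        // still lazy: parseRecord is a hypothetical function
parsed.saveAsTextFile("s3n://some-bucket/output")          // action: only now does Spark execute the job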
As to the question of 'is this
AM, Johan Beisser j...@caustic.org wrote:
Yes.
We're looking at bootstrapping in EMR...
On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote:
I used Spark on EC2 a while ago
I'm running a cluster of 3 Amazon EC2 machines (small number because it's
expensive when experiments keep crashing after a day!).
Today's crash looks like this (stacktrace at end of message).
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
On my
this happen? what operation?
On Thu, Feb 19, 2015 at 9:35 AM, Joe Wass jw...@crossref.org wrote:
On the advice of some recent discussions on this list, I thought I would
try and consume gz files directly. I'm reading them, doing a preliminary
map, then repartitioning, then doing normal spark things.
As I understand it, gzipped files aren't splittable because of the format, so each
file comes in as a single partition; hence the repartition.
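A minimal sketch of that pipeline (the bucket, parse function and partition count are all placeholders):

val raw = sc.textFile("s3n://some-bucket/logs/*.gz")   // one .gz file = one partition, since gzip isn't splittable
val prepped = raw.map(parseLine).repartition(200)      // preliminary map, then spread the records across the cluster
// downstream stages now work on 200 partitions instead of one per file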
Imran
On Thu, Feb 19, 2015 at 4:43 AM, Joe Wass jw...@crossref.org wrote:
Thanks for your reply Sean.
Looks like it's happening in a map:
15/02/18 20:35:52 INFO scheduler.DAGScheduler: Submitting 100 missing
tasks from Stage 1 (MappedRDD[17] at mapToPair at
NativeMethodAccessorImpl.java:-2)
I've updated to Spark 1.2.0 and the EC2 and the persistent-hdfs behaviour
appears to have changed.
My launch script is
spark-1.2.0-bin-hadoop2.4/ec2/spark-ec2 --instance-type=m3.xlarge -s 5
--ebs-vol-size=1000 launch myproject
When I ssh into master I get:
$ df -h
Filesystem      Size
Looks like this is caused by issue SPARK-5008:
https://issues.apache.org/jira/browse/SPARK-5008
I'm running on EC2 and I want to set the directory to use on the slaves
(mounted EBS volumes).
I have set:
spark.local.dir /vol3/my-spark-dir
in
/root/spark/conf/spark-defaults.conf
and replicated to all nodes. I have verified that in the console the value
in the config corresponds. I
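For what it's worth, a sketch of setting the same property programmatically rather than in spark-defaults.conf (the app name is a placeholder; note that on a standalone cluster a SPARK_LOCAL_DIRS entry in spark-env.sh takes precedence over spark.local.dir if both are set):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")                          // placeholder
  .set("spark.local.dir", "/vol3/my-spark-dir")  // same property as in spark-defaults.conf
val sc = new SparkContext(conf)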
In the web UI you will be able to see how many operations have happened so far.
Thanks
Best Regards
On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass jw...@crossref.org wrote:
I'm sitting here looking at my application crunching gigabytes of data on a
cluster and I have no idea if it's an hour away from completion or a
minute. The web UI shows progress through each stage, but not how many
stages remain. How can I work out how many stages my program will take?
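One rough way to get a feel for this up front (a suggestion, not something the web UI offers) is to print the lineage of the final RDD before triggering the action; each shuffle dependency in the output corresponds roughly to a stage boundary:

// finalRdd stands in for whatever RDD the job's last action runs on
println(finalRdd.toDebugString)   // shuffle dependencies in this lineage ~ stage boundaries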
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS somehow.
I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk.
Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
If I want to process 800 GB of data (assuming
of the cluster needed to process the data from the size of the data.
DR
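To make the sizing arithmetic concrete, a back-of-the-envelope sketch assuming HDFS's default replication factor of 3 (the thread doesn't say what replication is in use):

val inputGB       = 800
val replication   = 3                                    // HDFS default
val usablePerNode = 73                                   // GB reported per m3.xlarge node above
val rawNeeded     = inputGB * replication                // 2400 GB of raw HDFS capacity
val nodesNeeded   = math.ceil(rawNeeded.toDouble / usablePerNode).toInt   // = 33 nodes of this size
// versus the current 5 nodes (~370 GB total), so either many more nodes or bigger disks are needed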
I have about 500 MB of data and I'm trying to process it on a single
`local` instance. I'm getting an Out of Memory exception. Stack trace at
the end.
Spark 1.1.1
My JVM has -Xmx2g
spark.driver.memory = 1000M
spark.executor.memory = 1000M
spark.kryoserializer.buffer.mb = 256
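A sketch of how those limits might be passed at launch for a single local run (the jar name is a placeholder; in local mode the driver and executors share one JVM, so the driver heap is the limit that matters):

spark/bin/spark-submit --master "local[*]" --driver-memory 2g \
  --conf spark.kryoserializer.buffer.mb=256 \
  --class mything.core my-job.jar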
... YMMV.
:-)
HTH,
DR
On 02/03/2015 12:32 PM, Joe Wass wrote:
The data is coming from S3 in the first place, and the results will be
uploaded back there. But even in the same availability zone, fetching 170
GB (that's gzipped) is slow. From what I understand of the pipelines,
multiple
generation, so this particular
type of problem and tuning is not needed. You might consider running
on Java 8.
So I had a Spark job with various failures, and I decided to kill it and
start again. I clicked the 'kill' link in the web console, restarted the
job on the command line and headed back to the web console and refreshed to
see how my job was doing... the URL at the time was:
I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW
I'm using the Flambo Clojure wrapper which uses the Java API but I don't
think that should make any difference. I'm running with the following
command:
spark/bin/spark-submit --class mything.core --name "My Thing" --conf
I have a Spark job running on about 300 GB of log files, on Amazon EC2,
with 10 x Large instances (each with 600 GB disk). The job hasn't yet
completed.
So far, 18 stages have completed (2 of which have retries) and 3 stages
have failed. In each failed stage there are ~160 successful tasks, but