From: Daniel Mahler dmah...@gmail.com
Date: Monday, October 20, 2014 at 5:22 PM
To: Nicholas Chammas nicholas.cham...@gmail.com
Cc: user user@spark.apache.org
Subject: Re: Getting spark to use more than 4 cores on Amazon EC2

I am using globs though

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files, so 1 file per partition should not be a problem.
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures Spark to use the available resources.
I can see that Spark will use the available memory on larger instance types.
However, I have never seen Spark running at more than 400% (i.e., 100% on each of 4 cores).
How are you launching the cluster, and how are you submitting the job to
it? Can you list any Spark configuration parameters you provide?
On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that
Perhaps your RDD is not partitioned enough to utilize all the cores in your
system.
Could you post a simple code snippet and explain what kind of parallelism
you are seeing for it? And can you report on how many partitions your RDDs
have?
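One mechanical reason a job can stall at 400% is that Spark runs at most one task per partition at a time, so a stage can never keep more cores busy than its RDD has partitions. A minimal pure-Python sketch of that bound (illustrative names, not Spark API):

```python
def max_concurrent_tasks(num_partitions: int, total_cores: int) -> int:
    """A stage runs one task per partition, so the number of cores it
    can keep busy at once is capped by the partition count."""
    return min(num_partitions, total_cores)

# A 4-partition RDD on a 32-core cluster keeps at most 4 cores busy:
print(max_concurrent_tasks(4, 32))  # -> 4
```

In an actual shell, rdd.getNumPartitions() (PySpark) reports the partition count to compare against the cluster's total core count.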
On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all the workers
busy.
When I first launch the cluster, I do:

cat <<EOF >> ~/spark/conf/spark-defaults.conf
spark.serializer
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and instance type
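The spark-defaults.conf snippet above is cut off after the spark.serializer key. For illustration only, a typical entry pairs that key with the Kryo serializer; the value shown here is an assumption, not necessarily what the original poster used:

```
spark.serializer  org.apache.spark.serializer.KryoSerializer
```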
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all
Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?
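Explicit repartitioning just re-deals the existing records across a new number of partitions. A pure-Python sketch of the idea (round-robin, illustrative only, not Spark's actual shuffle implementation):

```python
def round_robin_repartition(records, num_partitions):
    """Deal records out to num_partitions buckets round-robin, which is
    the effect repartitioning aims for: every partition, and hence every
    core, ends up with a comparable share of the work."""
    buckets = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        buckets[i % num_partitions].append(record)
    return buckets

parts = round_robin_repartition(range(10), 4)
print([len(p) for p in parts])  # -> [3, 3, 2, 2]
```

In Spark itself this corresponds to rdd.repartition(n), for example with n set to the total number of cores in the cluster.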
On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler
The biggest danger with gzipped files is this:

>>> raw = sc.textFile("/path/to/file.gz", 8)
>>> raw.getNumPartitions()
1

You think you're telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition.
It might
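The splitting rule behind that behaviour can be modelled in a few lines of plain Python (a sketch of the principle, not Hadoop's actual InputFormat code): gzip has no record boundaries a reader can seek to, so the whole file must go to one task, while plain text can be cut at arbitrary byte offsets.

```python
def partitions_for_file(filename: str, requested: int) -> int:
    """Non-splittable compressed files (e.g. .gz) always become exactly
    one partition; splittable files honour the requested parallelism."""
    if filename.endswith(".gz"):
        return 1
    return requested

print(partitions_for_file("/path/to/file.gz", 8))   # -> 1
print(partitions_for_file("/path/to/file.txt", 8))  # -> 8
```

A common workaround after reading gzipped input is an explicit rdd.repartition(n) so downstream stages regain parallelism.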
I am using globs though

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files, so 1 file per partition should not be a problem.
On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
The biggest danger with gzipped files is this:
raw =