Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Andy Davidson
On a related note, how are you submitting your job?

I have a simple streaming proof of concept and noticed that everything runs
on my master. I wonder if I do not have enough load for spark to push tasks
to the slaves. 

Thanks

Andy

From:  Daniel Mahler dmah...@gmail.com
Date:  Monday, October 20, 2014 at 5:22 PM
To:  Nicholas Chammas nicholas.cham...@gmail.com
Cc:  user user@spark.apache.org
Subject:  Re: Getting spark to use more than 4 cores on Amazon EC2

 I am using globs though
 
 raw = sc.textFile("/path/to/dir/*/*")
 
 and I have tons of files so 1 file per partition should not be a problem.
 
 On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:
 The biggest danger with gzipped files is this:
 >>> raw = sc.textFile("/path/to/file.gz", 8)
 >>> raw.getNumPartitions()
 1
 You think you’re telling Spark to parallelize the reads on the input, but
 Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets
 assigned to 1 partition.
 
 It might be nice if Spark warned the user when parallelism is disabled by
 the input format.
 
 Nick
 
 On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:
 Hi Nicholas,
 
 Gzipping is an impressive guess! Yes, they are.
 My data sets are too large to make repartitioning viable, but I could try it
 on a subset.
 I generally have many more partitions than cores.
 This was happening before I started setting those configs.
 
 thanks
 Daniel
 
 
 On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 Are you dealing with gzipped files by any chance? Does explicitly
 repartitioning your RDD to match the number of cores in your cluster help
 at all? How about if you don't specify the configs you listed and just go
 with defaults all around?
 
 On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com wrote:
 I launch the cluster using vanilla spark-ec2 scripts.
 I just specify the number of slaves and instance type
 
 On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:
 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the
 workers busy.
 When I first launch the cluster I first do:
 
 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF
 
 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +
 
 before starting the spark-shell or running any jobs.
 
 
 
 
 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 Perhaps your RDD is not partitioned enough to utilize all the cores in
 your system.
 
 Could you post a simple code snippet and explain what kind of
 parallelism you are seeing for it? And can you report on how many
 partitions your RDDs have?
 
 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com
 wrote:
 
 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using 100% on
 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances
 get I/O starved when running spark? It would be strange if that
 consistently produced a 400% hard limit though.
 
 thanks
 Daniel
 
 
 
 
 
 
 




Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Aaron Davidson
Another wild guess: if your data is stored in S3, you might be running into
an issue where the default jets3t configuration limits the number of parallel
S3 connections to 4. Consider increasing the max-thread-counts from here:
http://www.jets3t.org/toolkit/configuration.html.
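
As a rough sketch of what that change might look like (the property names
come from the jets3t configuration page linked above; the values are only
illustrative, not recommended defaults), a jets3t.properties file on the
classpath could contain:

  # jets3t.properties -- illustrative values, tune to your workload
  s3service.max-thread-count=20
  threaded-service.max-thread-count=20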

On Tue, Oct 21, 2014 at 10:39 AM, Andy Davidson 
a...@santacruzintegration.com wrote:

 On a related note, how are you submitting your job?

 I have a simple streaming proof of concept and noticed that everything
 runs on my master. I wonder if I do not have enough load for spark to push
 tasks to the slaves.

 Thanks

 Andy

 From: Daniel Mahler dmah...@gmail.com
 Date: Monday, October 20, 2014 at 5:22 PM
 To: Nicholas Chammas nicholas.cham...@gmail.com
 Cc: user user@spark.apache.org
 Subject: Re: Getting spark to use more than 4 cores on Amazon EC2

 I am using globs though

 raw = sc.textFile("/path/to/dir/*/*")

 and I have tons of files so 1 file per partition should not be a problem.

 On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 The biggest danger with gzipped files is this:

 >>> raw = sc.textFile("/path/to/file.gz", 8)
 >>> raw.getNumPartitions()
 1

 You think you’re telling Spark to parallelize the reads on the input, but
 Spark cannot parallelize reads against gzipped files. So 1 gzipped file
 gets assigned to 1 partition.

 It might be nice if Spark warned the user when parallelism is disabled
 by the input format.

 Nick

 On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:

 Hi Nicholas,

 Gzipping is an impressive guess! Yes, they are.
 My data sets are too large to make repartitioning viable, but I could
 try it on a subset.
 I generally have many more partitions than cores.
 This was happening before I started setting those configs.

 thanks
 Daniel


 On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Are you dealing with gzipped files by any chance? Does explicitly
 repartitioning your RDD to match the number of cores in your cluster help
 at all? How about if you don't specify the configs you listed and just go
 with defaults all around?

 On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com
 wrote:

 I launch the cluster using vanilla spark-ec2 scripts.
 I just specify the number of slaves and instance type

 On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com
 wrote:

 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the
 workers busy.
 When I first launch the cluster I first do:

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +

 before starting the spark-shell or running any jobs.




 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores
 in your system.

 Could you post a simple code snippet and explain what kind of
 parallelism you are seeing for it? And can you report on how many
 partitions your RDDs have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com
 wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger
 instance types.
 However I have never seen spark running at more than 400% (using
 100% on 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2
 instances get I/O starved when running spark? It would be strange if 
 that
 consistently produced a 400% hard limit though.

 thanks
 Daniel











Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures spark to use the available
resources.
I can see that spark will use the available memory on larger instance types.
However I have never seen spark running at more than 400% (using 100% on 4
cores)
on machines with many more cores.
Am I misunderstanding the docs? Is it just that high end ec2 instances get
I/O starved when running spark? It would be strange if that consistently
produced a 400% hard limit though.

thanks
Daniel


Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniil Osipov
How are you launching the cluster, and how are you submitting the job to
it? Can you list any Spark configuration parameters you provide?
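
For reference, the submission itself can cap core usage on a standalone
cluster. A hedged sketch (the host name, core count, and script name are
placeholders; the flags are standard spark-submit options):

  spark-submit \
    --master spark://ec2-master-hostname:7077 \
    --total-executor-cores 32 \
    my_job.py

If no such cap is set, the standalone scheduler should hand every available
core to the application by default.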

On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance types.
 However I have never seen spark running at more than 400% (using 100% on 4
 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances get
 I/O starved when running spark? It would be strange if that consistently
 produced a 400% hard limit though.

 thanks
 Daniel



Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Perhaps your RDD is not partitioned enough to utilize all the cores in your
system.

Could you post a simple code snippet and explain what kind of parallelism
you are seeing for it? And can you report on how many partitions your RDDs
have?
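
As a concrete way to report those numbers, a minimal PySpark sketch (the
path is a placeholder):

  raw = sc.textFile("/path/to/data")  # placeholder path
  print(raw.getNumPartitions())       # partitions available to a full scan
  print(sc.defaultParallelism)        # parallelism Spark inferred for the cluster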

On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance types.
 However I have never seen spark running at more than 400% (using 100% on 4
 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances get
 I/O starved when running spark? It would be strange if that consistently
 produced a 400% hard limit though.

 thanks
 Daniel



Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all the workers
busy.
When I first launch the cluster I first do:

+
cat <<EOF >> ~/spark/conf/spark-defaults.conf
spark.serializer  org.apache.spark.serializer.KryoSerializer
spark.rdd.compress  true
spark.shuffle.consolidateFiles  true
spark.akka.frameSize  20
EOF

copy-dir /root/spark/conf
spark/sbin/stop-all.sh
sleep 5
spark/sbin/start-all.sh
+

before starting the spark-shell or running any jobs.




On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores in
 your system.

 Could you post a simple code snippet and explain what kind of parallelism
 you are seeing for it? And can you report on how many partitions your RDDs
 have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using 100% on
 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances
 get I/O starved when running spark? It would be strange if that
 consistently produced a 400% hard limit though.

 thanks
 Daniel





Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and instance type

On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:

 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the workers
 busy.
 When I first launch the cluster I first do:

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +

 before starting the spark-shell or running any jobs.




 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores in
 your system.

 Could you post a simple code snippet and explain what kind of parallelism
 you are seeing for it? And can you report on how many partitions your RDDs
 have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using 100% on
 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances
 get I/O starved when running spark? It would be strange if that
 consistently produced a 400% hard limit though.

 thanks
 Daniel






Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?

On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I launch the cluster using vanilla spark-ec2 scripts.
 I just specify the number of slaves and instance type

 On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:

 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the
 workers busy.
 When I first launch the cluster I first do:

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +

 before starting the spark-shell or running any jobs.




 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores in
 your system.

 Could you post a simple code snippet and explain what kind of
 parallelism you are seeing for it? And can you report on how many
 partitions your RDDs have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com
 wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using 100%
 on 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances
 get I/O starved when running spark? It would be strange if that
 consistently produced a 400% hard limit though.

 thanks
 Daniel







Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
The biggest danger with gzipped files is this:

 >>> raw = sc.textFile("/path/to/file.gz", 8)
 >>> raw.getNumPartitions()
 1

You think you’re telling Spark to parallelize the reads on the input, but
Spark cannot parallelize reads against gzipped files. So 1 gzipped file
gets assigned to 1 partition.
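
A common workaround is to accept the single-partition read and repartition
immediately, so everything after the shuffle can use every core; a minimal
PySpark sketch (the path and partition count are placeholders):

  raw = sc.textFile("/path/to/file.gz")  # gzip is unsplittable: 1 partition
  spread = raw.repartition(64)           # pick roughly the cluster's core count
  print(spread.getNumPartitions())       # 64

The initial decompression is still a single task, but downstream stages then
run in parallel.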

It might be nice if Spark warned the user when parallelism is disabled
by the input format.

Nick

On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:

 Hi Nicholas,

 Gzipping is an impressive guess! Yes, they are.
 My data sets are too large to make repartitioning viable, but I could try
 it on a subset.
 I generally have many more partitions than cores.
 This was happening before I started setting those configs.

 thanks
 Daniel


 On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Are you dealing with gzipped files by any chance? Does explicitly
 repartitioning your RDD to match the number of cores in your cluster help
 at all? How about if you don't specify the configs you listed and just go
 with defaults all around?

 On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I launch the cluster using vanilla spark-ec2 scripts.
 I just specify the number of slaves and instance type

 On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com
 wrote:

 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the
 workers busy.
 When I first launch the cluster I first do:

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +

 before starting the spark-shell or running any jobs.




 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores in
 your system.

 Could you post a simple code snippet and explain what kind of
 parallelism you are seeing for it? And can you report on how many
 partitions your RDDs have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com
 wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using 100%
 on 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2
 instances get I/O starved when running spark? It would be strange if that
 consistently produced a 400% hard limit though.

 thanks
 Daniel









Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am using globs though

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files so 1 file per partition should not be a problem.
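
A quick sanity check for that setup, as a hedged PySpark sketch (the glob is
a placeholder): many plain-text files do yield many partitions, but each
gzipped file still contributes exactly one unsplittable partition.

  raw = sc.textFile("/path/to/dir/*/*")  # placeholder glob
  print(raw.getNumPartitions())          # should be at least the number of input files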

On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 The biggest danger with gzipped files is this:

 >>> raw = sc.textFile("/path/to/file.gz", 8)
 >>> raw.getNumPartitions()
 1

 You think you’re telling Spark to parallelize the reads on the input, but
 Spark cannot parallelize reads against gzipped files. So 1 gzipped file
 gets assigned to 1 partition.

 It might be nice if Spark warned the user when parallelism is disabled
 by the input format.

 Nick

 On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:

 Hi Nicholas,

 Gzipping is an impressive guess! Yes, they are.
 My data sets are too large to make repartitioning viable, but I could try
 it on a subset.
 I generally have many more partitions than cores.
 This was happening before I started setting those configs.

 thanks
 Daniel


 On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Are you dealing with gzipped files by any chance? Does explicitly
 repartitioning your RDD to match the number of cores in your cluster help
 at all? How about if you don't specify the configs you listed and just go
 with defaults all around?

 On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com
 wrote:

 I launch the cluster using vanilla spark-ec2 scripts.
 I just specify the number of slaves and instance type

 On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com
 wrote:

 I usually run interactively from the spark-shell.
 My data definitely has more than enough partitions to keep all the
 workers busy.
 When I first launch the cluster I first do:

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer  org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress  true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize  20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 +

 before starting the spark-shell or running any jobs.




 On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Perhaps your RDD is not partitioned enough to utilize all the cores
 in your system.

 Could you post a simple code snippet and explain what kind of
 parallelism you are seeing for it? And can you report on how many
 partitions your RDDs have?

 On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com
 wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance
 types.
 However I have never seen spark running at more than 400% (using
 100% on 4 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2
 instances get I/O starved when running spark? It would be strange if 
 that
 consistently produced a 400% hard limit though.

 thanks
 Daniel