Re: Getting spark to use more than 4 cores on Amazon EC2
On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I do not have enough load for Spark to push tasks to the slaves.

Thanks,
Andy

From: Daniel Mahler dmah...@gmail.com
Date: Monday, October 20, 2014 at 5:22 PM
To: Nicholas Chammas nicholas.cham...@gmail.com
Cc: user user@spark.apache.org
Subject: Re: Getting spark to use more than 4 cores on Amazon EC2
Re: Getting spark to use more than 4 cores on Amazon EC2
Another wild guess: if your data is stored in S3, you might be running into an issue where the default jets3t properties limit the number of parallel S3 connections to 4. Consider increasing the max-thread-count settings described here: http://www.jets3t.org/toolkit/configuration.html.

On Tue, Oct 21, 2014 at 10:39 AM, Andy Davidson a...@santacruzintegration.com wrote:
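For anyone hitting this, the jets3t settings live in a jets3t.properties file that just needs to be on Spark's classpath (on a spark-ec2 cluster, dropping it into ~/spark/conf/ and running copy-dir should do, since that directory is on the classpath). The property names below are the ones described on the jets3t configuration page linked above; treat the values as illustrative and double-check the exact names and defaults against that page:

# jets3t.properties -- illustrative values, not verified defaults
s3service.max-thread-count=20
threaded-service.max-thread-count=20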
Getting spark to use more than 4 cores on Amazon EC2
I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures Spark to use the available resources. I can see that Spark will use the available memory on larger instance types. However, I have never seen Spark running at more than 400% CPU (i.e. 100% on 4 cores) on machines with many more cores.

Am I misunderstanding the docs? Is it just that high-end EC2 instances get I/O starved when running Spark? It would be strange if that consistently produced a 400% hard limit, though.

thanks
Daniel
Re: Getting spark to use more than 4 cores on Amazon EC2
How are you launching the cluster, and how are you submitting the job to it? Can you list any Spark configuration parameters you provide?

On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:
Re: Getting spark to use more than 4 cores on Amazon EC2
Perhaps your RDD is not partitioned enough to utilize all the cores in your system. Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report on how many partitions your RDDs have?

On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote:
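For concreteness, a minimal PySpark check that would answer these questions (the path is a placeholder):

raw = sc.textFile("/path/to/dir/*/*")   # placeholder path
print(raw.getNumPartitions())           # partitions in the input RDD
print(sc.defaultParallelism)            # cores the scheduler thinks it has available

In standalone mode defaultParallelism is the total number of cores across the workers, so if it comes back as 4 on a large cluster, the scheduler itself only sees 4 cores and the partition count is not the issue.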
Re: Getting spark to use more than 4 cores on Amazon EC2
I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all the workers busy. When I launch the cluster, I first do:

cat <<EOF >> ~/spark/conf/spark-defaults.conf
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.rdd.compress               true
spark.shuffle.consolidateFiles   true
spark.akka.frameSize             20
EOF
copy-dir /root/spark/conf
spark/sbin/stop-all.sh
sleep 5
spark/sbin/start-all.sh

before starting the spark-shell or running any jobs.

On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
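As an aside, a quick way to confirm after the restart that these settings were actually picked up (a hypothetical check using standard SparkContext methods, shown in PySpark):

print(sc.getConf().get("spark.serializer"))   # expect org.apache.spark.serializer.KryoSerializer
print(sc.defaultParallelism)                  # total cores the scheduler sees across the cluster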
Re: Getting spark to use more than 4 cores on Amazon EC2
I launch the cluster using the vanilla spark-ec2 scripts. I just specify the number of slaves and the instance type.

On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:
Re: Getting spark to use more than 4 cores on Amazon EC2
Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with the defaults all around?

On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler dmah...@gmail.com wrote:
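A minimal sketch of the repartitioning experiment suggested here, in PySpark (the path is a placeholder, and repartition forces a shuffle, so it is worth trying on a subset first):

raw = sc.textFile("/path/to/dir/*/*")            # placeholder path
evened = raw.repartition(sc.defaultParallelism)  # aim for one partition per core the scheduler sees
evened.count()                                   # any action works; watch per-core CPU on the workers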
Re: Getting spark to use more than 4 cores on Amazon EC2
The biggest danger with gzipped files is this:

raw = sc.textFile("/path/to/file.gz", 8)
raw.getNumPartitions()
1

You think you’re telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files, so 1 gzipped file gets assigned to 1 partition. It might be a nice user hint if Spark warned when parallelism is disabled by the input format.

Nick

On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:

Hi Nicholas,

Gzipping is an impressive guess! Yes, they are. My data sets are too large to make repartitioning viable, but I could try it on a subset. I generally have many more partitions than cores. This was happening before I started setting those configs.

thanks
Daniel
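A hedged sketch of the usual workaround when a single gzipped file is the input (paths and numbers are placeholders; repartition shuffles the decompressed records, so it only pays off if there is enough downstream work to spread out):

raw = sc.textFile("/path/to/file.gz")   # unsplittable, so this is 1 partition no matter what
chunks = raw.repartition(64)            # redistribute the records after the single-threaded read
print(chunks.getNumPartitions())        # 64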
Re: Getting spark to use more than 4 cores on Amazon EC2
I am using globs, though:

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files, so 1 file per partition should not be a problem.

On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: