On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I simply don't have enough load for Spark to push tasks to the slaves.
Thanks
Andy

From: Daniel Mahler <dmah...@gmail.com>
Date: Monday, October 20, 2014 at 5:22 PM
To: Nicholas Chammas <nicholas.cham...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Getting spark to use more than 4 cores on Amazon EC2

> I am using globs though
>
>     raw = sc.textFile("/path/to/dir/*/*")
>
> and I have tons of files, so one file per partition should not be a problem.
>
> On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> The biggest danger with gzipped files is this:
>>
>>     >>> raw = sc.textFile("/path/to/file.gz", 8)
>>     >>> raw.getNumPartitions()
>>     1
>>
>> You think you're telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition.
>>
>> It might be a nice user hint if Spark warned when parallelism is disabled by the input format.
>>
>> Nick
>>
>> On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>> Hi Nicholas,
>>>
>>> Gzipping is an impressive guess! Yes, they are.
>>> My data sets are too large to make repartitioning viable, but I could try it on a subset.
>>> I generally have many more partitions than cores.
>>> This was happening before I started setting those configs.
>>>
>>> thanks
>>> Daniel
>>>
>>> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>> Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with defaults all around?
>>>>
>>>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>> I launch the cluster using vanilla spark-ec2 scripts.
>>>>> I just specify the number of slaves and the instance type.
>>>>>
>>>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>> I usually run interactively from the spark-shell.
>>>>>> My data definitely has more than enough partitions to keep all the workers busy.
>>>>>> When I first launch the cluster I do:
>>>>>>
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> cat <<EOF >> ~/spark/conf/spark-defaults.conf
>>>>>> spark.serializer               org.apache.spark.serializer.KryoSerializer
>>>>>> spark.rdd.compress             true
>>>>>> spark.shuffle.consolidateFiles true
>>>>>> spark.akka.frameSize           20
>>>>>> EOF
>>>>>>
>>>>>> copy-dir /root/spark/conf
>>>>>> spark/sbin/stop-all.sh
>>>>>> sleep 5
>>>>>> spark/sbin/start-all.sh
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>> before starting the spark-shell or running any jobs.
>>>>>>
>>>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in your system.
>>>>>>>
>>>>>>> Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report how many partitions your RDDs have?
>>>>>>>
>>>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>>>
>>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>>> My understanding is that this configures Spark to use the available resources.
>>>>>>> I can see that Spark will use the available memory on larger instance types.
>>>>>>> However, I have never seen Spark running at more than 400% (100% on each of 4 cores) on machines with many more cores.
>>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances get I/O starved when running Spark?
>>>>>>> It would be strange if that consistently produced a 400% hard limit, though.
>>>>>>>
>>>>>>> thanks
>>>>>>> Daniel
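[On the 400% ceiling itself: in standalone mode, `spark.cores.max` bounds how many cores an application may claim across the cluster, so it is worth confirming nothing is setting it to 4 — the master web UI on port 8080 shows the cores each worker advertises and each application uses. A sketch of an explicit setting, in the same spark-defaults.conf Daniel appends to; 16 is an illustrative placeholder for the instance's actual core count, not a recommendation:]

```
spark.cores.max    16
```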
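[A side note on why Nick's gzip observation holds: the DEFLATE stream inside a .gz file can only be decompressed sequentially from the header onward, so a reader dropped at an arbitrary byte offset has no valid entry point — which is exactly why Spark must give each whole .gz file to a single task, ignoring the minPartitions hint. A minimal stdlib-only sketch of that property, no Spark required:]

```python
import gzip
import zlib

# Compress a reasonably large payload so the stream has real structure.
payload = b"the quick brown fox jumps over the lazy dog\n" * 1000
compressed = gzip.compress(payload)

# Decompressing from the start of the file works fine.
assert gzip.decompress(compressed) == payload

# Decompressing from an arbitrary mid-stream offset does not: there is no
# gzip header there and no way to resynchronize the DEFLATE stream. This is
# the reason one gzipped file always becomes one partition in Spark.
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    splittable = True
except (OSError, zlib.error):  # gzip.BadGzipFile is a subclass of OSError
    splittable = False

assert not splittable  # a gzip stream has no mid-file entry points
```

[This also suggests the workaround Nicholas hinted at: read the .gz files as-is, then call `repartition(n)` on the resulting RDD to fan the decompressed data out before any heavy transformations.]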