On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I simply don't have enough load for Spark to push tasks to the slaves.
Thanks
Andy

From: Daniel Mahler <dmah...@gmail.com>
Date: Monday, October 20, 2014 at 5:22 PM
To: Nicholas Chammas <nicholas.cham...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Getting spark to use more than 4 cores on Amazon EC2

> I am using globs though
>
>     raw = sc.textFile("/path/to/dir/*/*")
>
> and I have tons of files, so one file per partition should not be a problem.
>
> On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> The biggest danger with gzipped files is this:
>>
>>     >>> raw = sc.textFile("/path/to/file.gz", 8)
>>     >>> raw.getNumPartitions()
>>     1
>>
>> You think you're telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition.
>>
>> It might be a nice user hint if Spark warned when parallelism is disabled by the input format.
>>
>> Nick
>>
>> On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>> Hi Nicholas,
>>>
>>> Gzipping is an impressive guess! Yes, they are.
>>> My data sets are too large to make repartitioning viable, but I could try it on a subset.
>>> I generally have many more partitions than cores.
>>> This was happening before I started setting those configs.
>>>
>>> thanks
>>> Daniel
>>>
>>> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>> Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with defaults all around?
>>>>
>>>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>> I launch the cluster using vanilla spark-ec2 scripts.
>>>>> I just specify the number of slaves and the instance type.
>>>>>
>>>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>> I usually run interactively from the spark-shell.
>>>>>> My data definitely has more than enough partitions to keep all the workers busy.
>>>>>> When I first launch the cluster I do:
>>>>>>
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> cat <<EOF >> ~/spark/conf/spark-defaults.conf
>>>>>> spark.serializer               org.apache.spark.serializer.KryoSerializer
>>>>>> spark.rdd.compress             true
>>>>>> spark.shuffle.consolidateFiles true
>>>>>> spark.akka.frameSize           20
>>>>>> EOF
>>>>>>
>>>>>> copy-dir /root/spark/conf
>>>>>> spark/sbin/stop-all.sh
>>>>>> sleep 5
>>>>>> spark/sbin/start-all.sh
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>> before starting the spark-shell or running any jobs.
>>>>>>
>>>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in your system.
>>>>>>>
>>>>>>> Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report how many partitions your RDDs have?
>>>>>>>
>>>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>>>
>>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>>> My understanding is that this configures Spark to use the available resources.
>>>>>>> I can see that Spark will use the available memory on larger instance types.
>>>>>>> However, I have never seen Spark running at more than 400% (100% on each of 4 cores) on machines with many more cores.
>>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances get I/O starved when running Spark?
>>>>>>> It would be strange if that consistently produced a 400% hard limit, though.
>>>>>>>
>>>>>>> thanks
>>>>>>> Daniel
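[On the 400% ceiling itself: in standalone mode, `spark.cores.max` bounds how many cores an application may claim across the cluster, so it is worth confirming nothing is setting it to 4 — the master web UI on port 8080 shows the cores each worker advertises and each application uses. A sketch of an explicit setting, in the same spark-defaults.conf Daniel appends to; 16 is an illustrative placeholder for the instance's actual core count, not a recommendation:]

```
spark.cores.max    16
```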
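[A side note on why Nick's gzip observation holds: the DEFLATE stream inside a .gz file can only be decompressed sequentially from the header onward, so a reader dropped at an arbitrary byte offset has no valid entry point — which is exactly why Spark must give each whole .gz file to a single task, ignoring the minPartitions hint. A minimal stdlib-only sketch of that property, no Spark required:]

```python
import gzip
import zlib

# Compress a reasonably large payload so the stream has real structure.
payload = b"the quick brown fox jumps over the lazy dog\n" * 1000
compressed = gzip.compress(payload)

# Decompressing from the start of the file works fine.
assert gzip.decompress(compressed) == payload

# Decompressing from an arbitrary mid-stream offset does not: there is no
# gzip header there and no way to resynchronize the DEFLATE stream. This is
# the reason one gzipped file always becomes one partition in Spark.
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    splittable = True
except (OSError, zlib.error):  # gzip.BadGzipFile is a subclass of OSError
    splittable = False

assert not splittable  # a gzip stream has no mid-file entry points
```

[This also suggests the workaround Nicholas hinted at: read the .gz files as-is, then call `repartition(n)` on the resulting RDD to fan the decompressed data out before any heavy transformations.]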