Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?
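
(For context on the gzip question: gzip is not a splittable compression format, so each .gz file is read as a single partition no matter how large it is, which can leave most cores idle. A minimal spark-shell sketch, with a hypothetical path and a hypothetical core count of 32 — this requires a running Spark cluster, so treat it as illustration only:)

// Gzipped input: one partition per file, regardless of file size.
val raw = sc.textFile("s3n://my-bucket/logs/*.gz")  // hypothetical path
println(raw.partitions.size)  // often equals the number of input files

// Spread the data across roughly as many partitions as total cores.
val spread = raw.repartition(32)  // 32 is an assumed core count
println(spread.partitions.size)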

On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:

> I launch the cluster using vanilla spark-ec2 scripts.
> I just specify the number of slaves and instance type
>
> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> I usually run interactively from the spark-shell.
>> My data definitely has more than enough partitions to keep all the
>> workers busy.
>> After launching the cluster, I first do:
>>
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>> spark.serializer        org.apache.spark.serializer.KryoSerializer
>> spark.rdd.compress      true
>> spark.shuffle.consolidateFiles  true
>> spark.akka.frameSize  20
>> EOF
>>
>> copy-dir /root/spark/conf
>> spark/sbin/stop-all.sh
>> sleep 5
>> spark/sbin/start-all.sh
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> before starting the spark-shell or running any jobs.
>>
>>
>>
>>
>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>> your system.
>>>
>>> Could you post a simple code snippet and explain what kind of
>>> parallelism you are seeing for it? And can you report on how many
>>> partitions your RDDs have?
>>>
>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>> My understanding is that this configures Spark to use the available
>>>> resources.
>>>> I can see that Spark will use the available memory on larger instance
>>>> types.
>>>> However, I have never seen Spark running at more than 400% CPU (100%
>>>> on each of 4 cores) on machines with many more cores.
>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>>> get I/O starved when running Spark? It would be strange if that
>>>> consistently produced a 400% hard limit, though.
>>>>
>>>> thanks
>>>> Daniel
>>>>
>>>
>>>
>>
>
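
(To answer the partition question raised above, the counts can be checked directly from the spark-shell. A sketch with a hypothetical input path — it needs a live Spark context, so it is illustrative only:)

val rdd = sc.textFile("s3n://my-bucket/data")  // hypothetical input
println(rdd.partitions.size)     // how many partitions this RDD actually has
println(sc.defaultParallelism)   // the default Spark uses for operations like reduceByKey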
