Hi Pedro,

It seems that you disabled maximizeResourceAllocation in 5.16 but enabled
it in 5.20.
This configuration can differ depending on how you start the EMR cluster
(via the quick wizard, the advanced wizard in the console, or the CLI/API).
You can check it on the Configuration tab in the EMR console.
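
For reference, when creating a cluster from the CLI, it is typically set
with a configuration classification like this (a minimal sketch; see the
EMR docs linked later in this thread for details):

    [
      {
        "Classification": "spark",
        "Properties": {
          "maximizeResourceAllocation": "true"
        }
      }
    ]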

Please compare the Spark properties (especially spark.executor.cores,
spark.executor.memory, spark.dynamicAllocation.enabled, etc.) between your
two Spark clusters running the different EMR versions.
You can see them in the Spark web UI's Environment tab or in the log files.
Then please try with the same properties against the same dataset and the
same deployment mode (cluster or client).
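
For example, pinning the properties on spark-submit (a sketch only; the
class and jar names are placeholders, and the values should match whatever
your 5.16 cluster reports):

    spark-submit \
      --deploy-mode cluster \
      --conf spark.executor.cores=4 \
      --conf spark.executor.memory=10g \
      --conf spark.dynamicAllocation.enabled=true \
      --class com.example.YourJob your-job.jar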

Even on EMR, you can configure the number of cores and the memory of the
driver/executors in config files, via spark-submit arguments, or inside the
Spark application itself if you need to, as sketched below.
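
For the in-app route, it would look roughly like this in Scala (a sketch
only; the values are examples, and such settings must be applied before the
SparkSession is created):

    import org.apache.spark.sql.SparkSession

    // Sketch: hard-coding executor resources in the app; example values only.
    val spark = SparkSession.builder()
      .appName("tuning-example")
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "10g")
      .getOrCreate()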


Warm regards,
Nori

On Fri, Feb 8, 2019 at 8:16, Hiroyuki Nagata <idiotpan...@gmail.com> wrote:

> Hi,
> thank you Pedro
>
> I tested the maximizeResourceAllocation option. When it's enabled, it seems
> Spark utilizes the cores fully. However, the performance is not much
> different from the default setting.
>
> I'm considering using s3-distcp for uploading files, and I think
> table (dataframe) caching is also effective.
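>
> For example, caching would be just this (a sketch; the input path is a
> placeholder):
>
>   val df = spark.read.parquet("s3://your-bucket/path")
>   df.cache()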
>
> Regards,
> Hiroyuki
>
> On Sat, Feb 2, 2019 at 1:12, Pedro Tuero <tuerope...@gmail.com> wrote:
>
>> Hi Hiroyuki, thanks for the answer.
>>
>> I found a solution for the cores-per-executor configuration:
>> I set this option to true:
>>
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
>> It was probably true by default in version 5.16, but I couldn't find when
>> that changed.
>> The same link says that dynamic allocation is true by default. I thought
>> it would do the trick, but reading it again I think it relates to the
>> number of executors rather than the number of cores.
>>
>> But the jobs are still taking longer than before.
>> Watching the application history, I see these differences
>> for the same job, the same instance types, and the default (AWS-managed)
>> configuration for executors, cores, and memory:
>> Instances:
>> 6 r5.xlarge: 4 vCPUs, 32 GB of mem each (so there are 24 cores: 6
>> instances * 4 cores).
>>
>> With 5.16:
>> - 24 executors (4 in each instance, including the one that also hosted
>> the driver).
>> - 4 cores each.
>> - 2.7 GB * 2 (storage + on-heap storage) memory each.
>> - 1 executor per core, but at the same time 4 cores per executor (?).
>> - Total mem in executors per instance: 21.6 GB (2.7 * 2 * 4)
>> - Total elapsed time: 6 minutes
>> With 5.20:
>> - 5 executors (1 in each instance, 0 in the instance with the driver).
>> - 4 cores each.
>> - 11.9 GB * 2 (storage + on-heap storage) memory each.
>> - Total mem in executors per instance: 23.8 GB (11.9 * 2 * 1)
>> - Total elapsed time: 8 minutes
>>
>>
>> I don't understand the 5.16 configuration, but it works better.
>> It seems that in 5.20 a full instance is wasted on the driver alone, when
>> it could also host an executor.
>>
>>
>> Regards,
>> Pedro.
>>
>>
>>
>> On Thu, Jan 31, 2019 at 20:16, Hiroyuki Nagata <idiotpan...@gmail.com>
>> wrote:
>>
>>> Hi Pedro,
>>>
>>> I've also started using AWS EMR, with Spark 2.4.0, and I'm looking for
>>> performance-tuning methods.
>>>
>>> Do you configure dynamic allocation?
>>>
>>> FYI:
>>>
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>>
>>> I haven't tested it yet. I guess spark-submit needs to specify the
>>> number of executors.
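>>>
>>> For example, in spark-defaults.conf it would be something like this
>>> (untested on my side; per the docs above, the external shuffle service
>>> is also required, and the min/max values are just examples):
>>>
>>>   spark.dynamicAllocation.enabled true
>>>   spark.shuffle.service.enabled true
>>>   spark.dynamicAllocation.minExecutors 1
>>>   spark.dynamicAllocation.maxExecutors 20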
>>>
>>> Regards,
>>> Hiroyuki
>>>
>>> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero <tuerope...@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>> I usually run Spark jobs on AWS EMR.
>>>> Recently I switched from EMR release label 5.16 to 5.20 (which uses
>>>> Spark 2.4.0).
>>>> I've noticed that a lot of steps are taking longer than before.
>>>> I think it is related to the automatic configuration of cores per
>>>> executor.
>>>> In version 5.16, some executors took more cores if the instance allowed
>>>> it.
>>>> Say an instance had 8 cores and 40 GB of RAM, and the RAM configured
>>>> per executor was 10 GB; then AWS EMR automatically assigned 2 cores per
>>>> executor (40 GB / 10 GB = 4 executors, and 8 cores / 4 executors = 2
>>>> cores each).
>>>> Now with label 5.20, unless I configure the number of cores manually
>>>> (e.g., as below), only one core is assigned per executor.
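>>>>
>>>> (Configuring manually means, for example, adding
>>>>
>>>>   --conf spark.executor.cores=2
>>>>
>>>> to spark-submit, where 2 is just an illustrative value.)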
>>>>
>>>> I don't know if it is related to Spark 2.4.0 or if it is something
>>>> managed by AWS...
>>>> Does anyone know if there is a way to automatically use more cores when
>>>> it is physically possible?
>>>>
>>>> Thanks,
>>>> Peter.
>>>>
>>>
