Here is the list that I would probably work through:

   1. Check GC on the offending executor while the task is running. Maybe
   you need even more memory (see the first sketch after this list for
   turning on GC logging).
   2. Go back to a previous successful run of the job, open the Spark UI
   for the offending stage, and check the max task time / max input / max
   shuffle read/write for the largest task. That will tell you the degree
   of skew in this stage (see the per-partition count sketch below).
   3. Take a thread dump of the executor from the Spark UI and verify
   whether the task is really doing any work or is stuck in a deadlock.
   Some of the Hive SerDes are not really usable from multi-threaded /
   multi-task Spark executors.
   4. Take a thread dump of the executor from the Spark UI and verify
   whether the task is spilling to disk. Playing with the storage and
   memory fractions, or simply increasing executor memory, can help (see
   the memory sketch below).
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event-loss messages in the logs caused by the event queue
   filling up. Losing events can send some Spark components into really
   bad states (see the listener bus sketch below).
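
For 1, a minimal sketch of one way to surface executor GC activity,
assuming a HotSpot JVM (Java 8) on Spark 2.x. The flags and app name are
only illustrative, and in practice these options usually go on the
spark-submit command line or in spark-defaults.conf:

    import org.apache.spark.sql.SparkSession

    // Illustrative only: verbose GC logging on the executors, so the
    // container stdout shows pause times and heap occupancy.
    val spark = SparkSession.builder()
      .appName("gc-debug")  // hypothetical app name
      .config("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      .getOrCreate()

Per-task GC time is also shown directly in the task table of the Spark
UI, which is often enough to spot a memory-starved executor.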
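
For 2, besides the stage page in the UI, a quick way to gauge skew is to
count records per partition of whatever feeds the offending stage. Here
`df` is just a placeholder for that Dataset, and note that the check
runs a small job of its own:

    // Rough skew check: record count per partition, largest first.
    // `df` is a placeholder for the Dataset feeding the offending stage.
    val perPartition = df.rdd
      .mapPartitionsWithIndex { case (idx, it) => Iterator((idx, it.size)) }
      .collect()
      .sortBy { case (_, n) => -n }

    perPartition.take(10).foreach { case (idx, n) =>
      println(s"partition $idx -> $n records")
    }

If one partition is orders of magnitude larger than the rest, that is
usually the task that never finishes.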
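
For 4, the knobs I mean are the unified memory manager settings in
Spark 2.x. The values below are illustrative, not recommendations, and
they are normally set at submit time rather than in code:

    import org.apache.spark.SparkConf

    // Illustrative values only: more heap overall, and a smaller share
    // reserved for storage, which leaves execution more room before it
    // has to spill.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")          // or simply raise this
      .set("spark.memory.fraction", "0.7")         // default 0.6
      .set("spark.memory.storageFraction", "0.3")  // default 0.5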
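
For 6, if you do find dropped-event warnings, the listener bus queue can
be enlarged. To the best of my knowledge the property is
spark.scheduler.listenerbus.eventqueue.size up to 2.2 and
spark.scheduler.listenerbus.eventqueue.capacity from 2.3 on, so please
check the docs for your version; a sketch:

    import org.apache.spark.SparkConf

    // Illustrative only: a larger event queue makes the UI/metrics
    // listeners less likely to drop events under load (default 10000).
    val conf = new SparkConf()
      .set("spark.scheduler.listenerbus.eventqueue.size", "100000")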


thanks,
rohitk



On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> Please try to use the Spark UI in the way that AWS EMR recommends; it
> should be available from the resource manager. I have never had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> Sadly, I cannot be of much help unless we go for a screen share session
> over google chat or skype.
>
> Also, I ALWAYS prefer the EMR maximizeResourceAllocation setting to be
> set to true.
>
> Besides that, there is a metric in the EMR console which graphs the number
> of containers allocated to your job.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <bluedasya...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Just a quick update, as I have not made much progress yet.
>>
>> On 28 Dec 2017, at 21:09, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>> > can you then try to use EMR version 5.10 or EMR version 5.11 instead?
>>
>> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>>
>> > can you please try selecting a subnet which is in a different
>> availability zone?
>>
>> I have not tried this yet. But why should that make a difference?
>>
>> > if possible just try to increase the number of task instances and see
>> the difference?
>>
>> I tried with 512 partitions -- no difference.
>>
>> > also in case you are using caching,
>>
>> No caching used.
>>
>> > Also can you please report the number of containers that your job is
>> creating by looking at the metrics in the EMR console?
>>
>> 8 containers, if I trust the directories in j-xxx/containers/application_xxx/.
>>
>> > Also, if you look at the Spark UI you can easily see which particular
>> step is taking the longest period of time - you just have to drill in a bit
>> to see that. Generally, if shuffling is an issue it shows up clearly in the
>> SPARK UI as I drill into the steps and see which particular one is taking
>> the longest.
>>
>> I always have issues with the Spark UI on EC2 -- it never seems to be up
>> to date.
>>
>> JM
>>
>>
>
