Thanks all for the inputs.

@Yogi: I do not have any operators that are dynamically partitioned. I have
not implemented definePartitions() in any of my operators.

@Bhupesh: I am not using the JSON parser operator from Malhar. I do use a
Jackson parser as an instance inside my operator, which does some
application-level logic. The stack trace seems to be coming from the Apex
pubsub codec handler.

@Ashwin: The window ID seems to be moving forward.

I would like to understand better what we mean by container failure. I am
assuming that Apex automatically relaunches a container if it fails for
whatever reason. In fact, I do see operators getting killed (and on
clicking the details button, I see the message posted at the beginning of
this thread).
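
As a side note on that message: exit code 143 follows the common POSIX
convention of 128 plus the signal number, so it corresponds to SIGTERM
(signal 15), i.e. the process was asked to terminate rather than crashing on
its own. Below is a minimal Python sketch of that decoding. The helper name
describe_exit_code is hypothetical and is not part of Apex or YARN.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit code using the 128 + signal-number convention.

    Hypothetical helper for illustration only; not an Apex or YARN API.
    """
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error status {code}"


print(describe_exit_code(143))
```

This only explains how the process ended. Why the application master
requested the kill still has to come from the STRAM logs.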

One thing I want to note is that the operators are recreated automatically
when they fail, but after a couple of days even this recovery process seems
to break, i.e. new instances of the operators are no longer created
automatically after they die, and the app runs with a reduced operator
count (and hence some data does not get processed).

I observed this behavior on a non-HA-enabled cluster (CDH 5.7), and hence
I do not suspect that YARN HA is causing this. I am currently ruling out
network issues, as that would mean all operators should exhibit some sort of
blips. (Please correct me if I am wrong in this assumption.)

Regards,
Ananth



On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogideven...@apache.org>
wrote:

> There are some instances of "Heartbeat for unknown operator" in the log.
> So, it looks like the operators are sending heartbeats, but STRAM is not
> able to identify the operator.
>
> In the past, I observed similar behavior when I was trying to define the
> dynamic partitioning for some operator.
>
>
> ~ Yogi
>
> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchand...@gmail.com>
> wrote:
>
>> Ananth,
>>
>> The heartbeat timeout means that the operator is not sending back the
>> window heartbeat information to the app master. It usually happens because
>> of one of two reasons.
>>
>> 1. System failure - container died, network failure etc.
>> 2. Windows not moving forward in the operator. Some business logic in the
>> operator is blocking the windows. You can observe the window IDs on the UI
>> for the given operator when it is running to quickly find out if this is
>> the issue.
>>
>> Regards,
>> Ashwin.
>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <agundabatt...@gmail.com>
>> wrote:
>>
>> Hello Sandeep,
>>
>> Thanks for the response. Please find attached the app master log.
>>
>> It looks like it got killed due to a heartbeat timeout. I will have to
>> see why I am getting a heartbeat timeout. I also see a JSON parser
>> exception in the attached log. Is it a harmless exception?
>>
>>
>> Regards,
>> Ananth
>>
>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <
>> sand...@datatorrent.com> wrote:
>>
>>> Dear Ananth,
>>>
>>> Could you please check the STRAM logs for any details about these
>>> containers? My first guess would be a container running out of memory.
>>>
>>> Regards,
>>> Sandeep
>>>
>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <
>>> agundabatt...@gmail.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I was wondering what would cause a container to be killed by
>>>> the application master?
>>>>
>>>> I see the following in the UI when I click on details:
>>>>
>>>> "
>>>>
>>>> Container killed by the ApplicationMaster.
>>>> Container killed on request. Exit code is 143
>>>> Container exited with a non-zero exit code 143
>>>>
>>>> "
>>>>
>>>> I see some exceptions in the dtgateway.log and am not sure if they are
>>>> related.
>>>>
>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (HA for YARN as well
>>>> as HDFS is enabled).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
