Hi Ananth,

Do the containers that are getting killed belong to any specific operator,
or are they getting killed at random?
I'd suggest having a look at the operator / container logs.
You can also fetch them using: yarn logs -applicationId <App Id>
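
If the aggregated logs are large, a small filter can narrow them down. The following is only a minimal sketch, not an Apex or YARN tool: the helper names (interesting_lines, describe_exit_code, fetch_logs) are hypothetical, the search phrases are simply the messages quoted in this thread, and it assumes the yarn CLI is on the PATH. It also decodes the exit code using the usual convention that a code above 128 means the process was ended by signal (code - 128), so 143 corresponds to signal 15.

```python
import re
import signal
import subprocess

# Phrases taken from the messages quoted in this thread; extend as needed.
PATTERNS = (
    "Heartbeat for unknown operator",
    "Container killed by the ApplicationMaster",
    "exit code",
)

# Matches both "Exit code is 143" and "exit code 143".
EXIT_CODE_RE = re.compile(r"exit code (?:is )?(\d+)", re.IGNORECASE)


def interesting_lines(lines):
    """Return the lines that contain any of PATTERNS (case-insensitive)."""
    lowered = [p.lower() for p in PATTERNS]
    return [ln for ln in lines if any(p in ln.lower() for p in lowered)]


def describe_exit_code(code):
    """Map an exit code above 128 to the name of signal (code - 128)."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return "unknown signal"
    return "not a signal exit"


def fetch_logs(app_id):
    """Hypothetical helper: run `yarn logs` for app_id and return its lines."""
    result = subprocess.run(
        ["yarn", "logs", "-applicationId", app_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

To use it, pass the output of fetch_logs("<App Id>") to interesting_lines(), then run EXIT_CODE_RE.search() on each hit and hand the captured number to describe_exit_code().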

~Bhupesh

On Wed, May 18, 2016 at 12:22 AM, Ananth Gundabattula <
agundabatt...@gmail.com> wrote:

> Thanks all for the inputs.
>
> @Yogi: I do not have any operators that are dynamically partitioned. I
> have not implemented definePartitions() in any of my operators.
>
> @Bhupesh: I am not using the JSON parser operator from Malhar. I do use a
> Jackson parser instance inside my operator, which does some
> application-level logic. The stack trace seems to be coming from the Apex
> pubsub codec handler.
>
> @Ashwin : The window ID seems to be moving forward.
>
> I would like to understand more about what we mean by container failure.
> I am assuming that Apex automatically relaunches a container if it fails
> for whatever reason. In fact, I do see operators getting killed (and on
> clicking the details button, I see the message posted at the beginning of
> this thread).
>
> One thing I want to note is that the operators are recreated automatically
> when they fail, but after a couple of days even this recovery process seems
> to break, i.e. new instances of the operators are not created
> automatically after they die, and the app runs with a lower operator
> count (and hence some data is not processed).
>
> I observed this behavior on a non-HA cluster (CDH 5.7), and hence
> I do not suspect YARN HA is causing this. I am currently ruling out network
> issues, as those would mean all operators exhibit some sort of blips.
> (Please correct me if I am wrong in this assumption.)
>
> Regards,
> Ananth
>
>
>
> On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogideven...@apache.org>
> wrote:
>
>> There are some instances of "Heartbeat for unknown operator" in the log.
>> So, it looks like the operators are sending heartbeats, but STRAM is not
>> able to identify the operator.
>>
>> In the past, I observed similar behavior when I was trying to define
>> dynamic partitioning for some operator.
>>
>>
>> ~ Yogi
>>
>> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchand...@gmail.com>
>> wrote:
>>
>>> Ananth,
>>>
>>> The heartbeat timeout means that the operator is not sending back the
>>> window heartbeat information to the app master. It usually happens because
>>> of one of two reasons.
>>>
>>> 1. System failure - container died, network failure etc.
>>> 2. Windows not moving forward in the operator. Some business logic in
>>> the operator is blocking the windows. You can observe the window IDs on the
>>> UI for the given operator when it is running to quickly find out if this is
>>> the issue.
>>>
>>> Regards,
>>> Ashwin.
>>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <agundabatt...@gmail.com>
>>> wrote:
>>>
>>> Hello Sandeep,
>>>
>>> Thanks for the response. Please find attached the app master log.
>>>
>>> It looks like it got killed due to a heartbeat timeout. I will have to
>>> see why I am getting a heartbeat timeout. I also see a JSON parser
>>> exception in the attached log. Is it a harmless exception?
>>>
>>>
>>> Regards,
>>> Ananth
>>>
>>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <
>>> sand...@datatorrent.com> wrote:
>>>
>>>> Dear Ananth,
>>>>
>>>> Could you please check the STRAM logs for any details of these
>>>> containers? My first guess would be a container running out of memory.
>>>>
>>>> Regards,
>>>> Sandeep
>>>>
>>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <
>>>> agundabatt...@gmail.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I was wondering what would cause a container to be killed by
>>>>> the application master?
>>>>>
>>>>> I see the following in the UI when I click on details :
>>>>>
>>>>> "
>>>>>
>>>>> Container killed by the ApplicationMaster.
>>>>> Container killed on request. Exit code is 143
>>>>> Container exited with a non-zero exit code 143
>>>>>
>>>>> "
>>>>>
>>>>> I see some exceptions in the dtgateway.log and am not sure if they are
>>>>> related.
>>>>>
>>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (for both YARN and
>>>>> HDFS).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
