Hi Ananth,

Do the containers that are getting killed belong to any specific operator, or are they getting killed randomly? I'd suggest having a look at the operator / container logs. You can also fetch these with: yarn logs -applicationId <App Id>
~Bhupesh

On Wed, May 18, 2016 at 12:22 AM, Ananth Gundabattula <agundabatt...@gmail.com> wrote:
> Thanks all for the inputs.
>
> @Yogi: I do not have any operators that are dynamically partitioned. I have not implemented definePartition() in any of my operators.
>
> @Bhupesh: I am not using the JSON parser operator from Malhar. I do use a Jackson parser instance inside my operator that does some application-level logic. The stack trace seems to be coming from the Apex pubsub codec handler.
>
> @Ashwin: The window ID seems to be moving forward.
>
> I would like to understand more about what we mean by container failure. I am assuming that Apex automatically relaunches a container if it fails for whatever reason. In fact, I do see operators getting killed (and on clicking the details button, I see the message posted at the beginning of this thread).
>
> One thing I want to note is that the operators are recreated automatically when they fail, but after a couple of days even this recovery process seems to break; i.e. new instances of the operators are not created automatically after they die, and the app runs with a lower operator count (and hence some data is not getting processed).
>
> I observed this behavior on a non-HA-enabled cluster (CDH 5.7), and hence I do not suspect YARN HA is causing this. I am currently ruling out network issues, as those would mean all operators should exhibit some sort of blips. (Please correct me if I am wrong in this assumption.)
>
> Regards,
> Ananth
>
> On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogideven...@apache.org> wrote:
>> There are some instances of "Heartbeat for unknown operator" in the log. So it looks like the operators are sending heartbeats, but STRAM is not able to identify the operator.
>>
>> In the past, I observed similar behavior when I was trying to define dynamic partitioning for some operator.
>>
>> ~ Yogi
>>
>> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchand...@gmail.com> wrote:
>>> Ananth,
>>>
>>> The heartbeat timeout means that the operator is not sending the window heartbeat information back to the app master. It usually happens for one of two reasons:
>>>
>>> 1. System failure - the container died, a network failure, etc.
>>> 2. Windows not moving forward in the operator - some business logic in the operator is blocking the windows. You can observe the window IDs in the UI for the given operator while it is running to quickly find out whether this is the issue.
>>>
>>> Regards,
>>> Ashwin.
>>>
>>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <agundabatt...@gmail.com> wrote:
>>> Hello Sandeep,
>>>
>>> Thanks for the response. Please find attached the app master log.
>>>
>>> It looks like the container got killed due to a heartbeat timeout. I will have to see why I am getting a heartbeat timeout. I also see a JSON parser exception in the attached log. Is it a harmless exception?
>>>
>>> Regards,
>>> Ananth
>>>
>>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <sand...@datatorrent.com> wrote:
>>>> Dear Ananth,
>>>>
>>>> Could you please check the STRAM logs for any details of these containers? The first guess would be the container going out of memory.
>>>>
>>>> Regards,
>>>> Sandeep
>>>>
>>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <agundabatt...@gmail.com> wrote:
>>>>> Hello All,
>>>>>
>>>>> I was wondering what would be the cause for a container to be killed by the application master?
>>>>>
>>>>> I see the following in the UI when I click on details:
>>>>>
>>>>> "
>>>>> Container killed by the ApplicationMaster.
>>>>> Container killed on request. Exit code is 143
>>>>> Container exited with a non-zero exit code 143
>>>>> "
>>>>>
>>>>> I see some exceptions in the dtgateway.log and am not sure if they are related.
>>>>>
>>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (HA for YARN as well as HDFS).
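[Editor's note] Ashwin's explanation above can be illustrated with a toy simulation. This is plain Python, not Apex code; the class name `Master`, the method names, and the 30-second timeout are assumptions made for illustration only. The point is the mechanism: an app master tracks the last heartbeat time per operator, and an operator whose container died, or whose business logic is blocking the event loop, stops reporting and eventually exceeds the timeout.

```python
import time

# Toy model of heartbeat bookkeeping (illustrative only; names and the
# timeout value are assumptions, not the Apex/STRAM implementation).
HEARTBEAT_TIMEOUT = 30.0  # seconds

class Master:
    def __init__(self):
        self.last_heartbeat = {}  # operator id -> timestamp of last report

    def heartbeat(self, operator_id, now):
        # Called whenever an operator reports its window progress.
        self.last_heartbeat[operator_id] = now

    def expired_operators(self, now):
        # Operators that have not reported within the timeout are treated
        # as dead; their containers would be killed and relaunched.
        return [op for op, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

master = Master()
master.heartbeat("healthy_op", now=0.0)
master.heartbeat("blocked_op", now=0.0)

# "healthy_op" keeps reporting; "blocked_op" is stuck in business logic
# (or its container died), so its heartbeats stop arriving.
master.heartbeat("healthy_op", now=40.0)

print(master.expired_operators(now=40.0))  # -> ['blocked_op']
```

Either failure mode from Ashwin's list looks identical from the master's side: no heartbeat arrives, the timeout fires, and the container is killed, which matches the "Container killed by the ApplicationMaster" message in the original post.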
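[Editor's note] The "Exit code is 143" in the UI message above is the standard Unix convention for a process terminated by SIGTERM: 128 + 15 = 143. When the ApplicationMaster decides a container is dead (e.g. after a heartbeat timeout), the container process receives a SIGTERM and YARN reports the resulting status. A minimal sketch of where the number comes from (Python's subprocess module reports signal deaths as a negative code instead):

```python
import signal
import subprocess

# Start a long-running child, then terminate it the way a container
# process gets terminated: with SIGTERM.
proc = subprocess.Popen(["sleep", "30"])
proc.terminate()   # sends SIGTERM (signal 15)
proc.wait()

# Python reports death-by-signal as a negative return code; the shell
# (and YARN's container status) uses 128 + signal number instead.
print(proc.returncode)       # -> -15
print(128 + signal.SIGTERM)  # -> 143
```

So exit code 143 by itself only says the container was told to stop; the reason (heartbeat timeout, memory limits, etc.) has to come from the STRAM / container logs, as suggested earlier in the thread.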