Thanks all for the inputs.

@Yogi: I do not have any operators that are dynamically partitioned. I have not implemented definePartitions() in any of my operators.

@Bhupesh: I am not using the JSON parser operator from Malhar. I do use a Jackson parser as an instance inside my operator, which does some application-level logic. The stack trace seems to be coming from the Apex pubsub codec handler.

@Ashwin: The window ID seems to be moving forward. I would like to understand more about what we mean by container failure. I am assuming that Apex automatically relaunches a container if it fails for whatever reason. In fact, I do see operators getting killed (and on clicking the details button, I see the message posted at the beginning of this thread).

One thing I want to note is that the operators are recreated automatically when they fail, but after a couple of days even this recovery process seems to break: new instances of the operators are not created after they die, and the app runs with a lower operator count (and hence some data is not processed).

I observed this behavior on a non-HA cluster (CDH 5.7), and hence I do not suspect YARN HA is causing this.

I am currently ruling out network issues, as that would mean all operators should exhibit some sort of blips. (Please correct me if I am wrong in this assumption.)

Regards,
Ananth

On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogideven...@apache.org> wrote:

> There are some instances of "Heartbeat for unknown operator" in the log.
> So, it looks like operators are sending the heartbeats, but STRAM is not
> able to identify the operator.
>
> In the past, I observed similar behavior when I was trying to define the
> dynamic partitioning for some operator.
>
> ~ Yogi
>
> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchand...@gmail.com>
> wrote:
>
>> Ananth,
>>
>> The heartbeat timeout means that the operator is not sending back the
>> window heartbeat information to the app master. It usually happens because
>> of one of two reasons.
>>
>> 1. System failure - container died, network failure etc.
>> 2. Windows not moving forward in the operator. Some business logic in the
>> operator is blocking the windows. You can observe the window IDs on the UI
>> for the given operator when it is running to quickly find out if this is
>> the issue.
>>
>> Regards,
>> Ashwin.
>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <agundabatt...@gmail.com>
>> wrote:
>>
>> Hello Sandeep,
>>
>> Thanks for the response. Please find attached the app master log.
>>
>> It looks like it got killed due to a heartbeat timeout. I will have to
>> see why I am getting a heartbeat timeout. I also see a JSON parser
>> exception in the attached log. Is it a harmless exception?
>>
>> Regards,
>> Ananth
>>
>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <
>> sand...@datatorrent.com> wrote:
>>
>>> Dear Ananth,
>>>
>>> Could you please check the STRAM logs for any details of these
>>> containers. The first guess would be the container going out of memory.
>>>
>>> Regards,
>>> Sandeep
>>>
>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <
>>> agundabatt...@gmail.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I was wondering what would cause a container to be killed by
>>>> the application master?
>>>>
>>>> I see the following in the UI when I click on details:
>>>>
>>>> "
>>>>
>>>> Container killed by the ApplicationMaster.
>>>> Container killed on request. Exit code is 143
>>>> Container exited with a non-zero exit code 143
>>>>
>>>> "
>>>>
>>>> I see some exceptions in the dtgateway.log and am not sure if they are
>>>> related.
>>>>
>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (HA for YARN as well as
>>>> HDFS is enabled).
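
A note on the exit code quoted at the start of the thread: by the usual Unix convention, a process stopped by signal N reports exit code 128 + N, so 143 corresponds to signal 15 (SIGTERM). That matches the "Container killed on request" message, but it only says how the process was stopped, not why the application master asked for it. The sketch below decodes such codes; the class and method names (ExitCodes, signalFor, describe) are hypothetical helpers for illustration and are not part of Apex or YARN.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not an Apex or YARN API.
final class ExitCodes {
    private ExitCodes() {}

    // Returns the signal number if the exit code follows the 128 + N convention.
    // The 1..64 signal range is an assumption based on typical Linux systems.
    static OptionalInt signalFor(int exitCode) {
        if (exitCode > 128 && exitCode <= 128 + 64) {
            return OptionalInt.of(exitCode - 128);
        }
        return OptionalInt.empty();
    }

    // Builds a short human-readable description of an exit code.
    static String describe(int exitCode) {
        OptionalInt signal = signalFor(exitCode);
        if (!signal.isPresent()) {
            return "exit code " + exitCode + " (not a signal-style code)";
        }
        int n = signal.getAsInt();
        String name;
        switch (n) {
            case 9:  name = "SIGKILL"; break;
            case 15: name = "SIGTERM"; break;
            default: name = "signal " + n; break;
        }
        return "exit code " + exitCode + " = 128 + " + n + " (" + name + ")";
    }

    public static void main(String[] args) {
        // The code reported in the UI for the killed containers.
        System.out.println(describe(143));
    }
}
```

A code of 137 would decode the same way to signal 9 (SIGKILL). In either case the STRAM log is still the place to look for the reason behind the kill, such as the heartbeat timeout discussed above.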