Hello Bhupesh,

The Kafka operator seems to be the one crashing. I am using the Kafka 0.9 operator from Malhar against a Kafka broker cluster running CDH Kafka 2.x.
Attaching the logs of this particular operator for reference. Please note that there is an exception from the netty driver; I believe this is not the root cause, as I have observed the same exception being thrown from Cassandra across other stacks as well. However, the following lines in the log for the killed operator look suspicious:

2016-05-18 07:17:17,556 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking the coordinator 2147483576 dead.
2016-05-18 07:17:17,556 WARN org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in committing offsets eventdetails_ingestion-8=OffsetAndMetadata{offset=1375350, metadata=''} : org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not the correct coordinator for this group.
2016-05-18 07:21:23,611 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and retry
2016-05-18 07:21:23,611 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking the coordinator 2147483577 dead.
2016-05-18 07:21:23,611 WARN org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in committing offsets eventdetails_ingestion-8=OffsetAndMetadata{offset=1377033, metadata=''} : org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not the correct coordinator for this group.
2016-05-18 07:21:23,612 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and retry
2016-05-18 07:21:23,612 WARN org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in committing offsets eventdetails_ingestion-8=OffsetAndMetadata{offset=1377073, metadata=''} : org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not the correct coordinator for this group.
2016-05-18 07:22:17,950 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and retry
2016-05-18 07:22:17,950 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking the coordinator 2147483576 dead.
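For what it's worth, the pattern above (coordinator marked dead, followed by "will find new coordinator and retry") looks like the 0.9 consumer's normal handling of a retriable commit failure: NotCoordinatorForGroupException extends RetriableException, and since committed offsets are cumulative, a commit that is dropped this way should be covered by the next successful one. To convince myself of the semantics I mocked up a standalone commit callback along these lines; the class name and logging are mine, not from the Malhar operator, and this is only a sketch against the plain 0.9 KafkaConsumer API:

    import java.util.Map;

    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.consumer.OffsetCommitCallback;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.errors.RetriableException;

    public class LenientCommitCallback implements OffsetCommitCallback {
      @Override
      public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception e) {
        if (e == null) {
          return; // commit succeeded
        }
        if (e instanceof RetriableException) {
          // NotCoordinatorForGroupException lands here: the consumer marks the old
          // coordinator dead, discovers a new one on its next request, and the next
          // periodic commit covers the offsets this attempt failed to record.
          System.err.println("Transient offset commit failure, covered by next commit: " + e);
        } else {
          // Non-retriable failures (e.g. a commit rejected after a rebalance)
          // need real handling rather than a silent retry.
          throw new IllegalStateException("Offset commit failed permanently for " + offsets, e);
        }
      }
    }

Registered via consumer.commitAsync(offsets, new LenientCommitCallback()). The Malhar operator has its own commit path, so this is only to reason about the exception semantics, not a patch; my reading is that these WARN lines by themselves should be survivable, and the question is whether they correlate with the heartbeat timeout that kills the operator.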
Regards,
Ananth

On Wed, May 18, 2016 at 5:29 PM, Bhupesh Chawda <bhup...@datatorrent.com> wrote:

> Hi Ananth,
>
> Do the containers that are getting killed belong to any specific operator,
> or are they getting killed randomly?
> I'd suggest having a look at the operator / container logs.
> You can also check this using: yarn logs -applicationId <App Id>
>
> ~Bhupesh
>
> On Wed, May 18, 2016 at 12:22 AM, Ananth Gundabattula <agundabatt...@gmail.com> wrote:
>
>> Thanks all for the inputs.
>>
>> @Yogi: I do not have any operators that are dynamically partitioned. I
>> have not implemented definePartitions() in any of my operators.
>>
>> @Bhupesh: I am not using the JSON parser operator from Malhar. I do use a
>> Jackson parser instance inside an operator that does some application-level
>> logic. The stack trace seems to be coming from the Apex pubsub codec
>> handler.
>>
>> @Ashwin: The window ID seems to be moving forward.
>>
>> I would like to understand more as to what we mean by container failure.
>> I am assuming that Apex automatically relaunches a container if it fails
>> for whatever reason. In fact, I do see operators getting killed (and on
>> clicking the details button, I see the message posted at the beginning of
>> this thread).
>>
>> One thing I want to note is that the operators are recreated
>> automatically when they fail, but after a couple of days even this
>> recovery process seems to be broken. That is, new instances of the
>> operators are not created automatically after they die, and the app runs
>> with a lower operator count (and hence some data is not getting
>> processed).
>>
>> I observed this behavior on a non-HA-enabled cluster (CDH 5.7) as well,
>> and hence I do not suspect YARN HA is causing this. I am currently ruling
>> out network issues, as those would mean all operators should exhibit some
>> sort of blips. (Please correct me if I am wrong in this assumption.)
>>
>> Regards,
>> Ananth
>>
>> On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogideven...@apache.org> wrote:
>>
>>> There are some instances of "Heartbeat for unknown operator" in the log.
>>> So it looks like the operators are sending heartbeats, but STRAM is not
>>> able to identify the operator.
>>>
>>> In the past, I observed similar behavior when I was trying to define
>>> dynamic partitioning for some operator.
>>>
>>> ~ Yogi
>>>
>>> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchand...@gmail.com> wrote:
>>>
>>>> Ananth,
>>>>
>>>> The heartbeat timeout means that the operator is not sending the window
>>>> heartbeat information back to the app master. It usually happens for one
>>>> of two reasons:
>>>>
>>>> 1. System failure - container died, network failure, etc.
>>>> 2. Windows not moving forward in the operator - some business logic in
>>>> the operator is blocking the windows. You can observe the window IDs for
>>>> the given operator in the UI while it is running to quickly find out if
>>>> this is the issue.
>>>>
>>>> Regards,
>>>> Ashwin.
>>>>
>>>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <agundabatt...@gmail.com> wrote:
>>>>
>>>> Hello Sandeep,
>>>>
>>>> Thanks for the response. Please find attached the app master log.
>>>>
>>>> It looks like it got killed due to a heartbeat timeout. I will have to
>>>> see why I am getting a heartbeat timeout. I also see a JSON parser
>>>> exception in the attached log. Is it a harmless exception?
>>>>
>>>> Regards,
>>>> Ananth
>>>>
>>>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <sand...@datatorrent.com> wrote:
>>>>
>>>>> Dear Ananth,
>>>>>
>>>>> Could you please check the STRAM logs for any details of these
>>>>> containers? The first guess would be the container going out of memory.
>>>>>
>>>>> Regards,
>>>>> Sandeep
>>>>>
>>>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <agundabatt...@gmail.com> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> I was wondering what would cause a container to be killed by the
>>>>>> application master?
>>>>>>
>>>>>> I see the following in the UI when I click on details:
>>>>>>
>>>>>> "
>>>>>> Container killed by the ApplicationMaster.
>>>>>> Container killed on request. Exit code is 143
>>>>>> Container exited with a non-zero exit code 143
>>>>>> "
>>>>>>
>>>>>> I see some exceptions in the dtgateway.log and am not sure if they
>>>>>> are related.
>>>>>>
>>>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (HA for YARN as
>>>>>> well as HDFS).
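P.S. While I keep digging, one mitigation I am considering is giving operators more headroom before STRAM declares a heartbeat timeout. A minimal sketch, assuming the HEARTBEAT_TIMEOUT_MILLIS attribute on Apex's DAGContext is the knob STRAM uses for this; the application class name and the 90-second value below are placeholders, not our actual app:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.Context;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;

    public class EventIngestionApp implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        // Give slow operators more time before STRAM treats a missing
        // heartbeat as a dead container. 90s is an arbitrary debugging
        // value, not a recommendation.
        dag.setAttribute(Context.DAGContext.HEARTBEAT_TIMEOUT_MILLIS, 90000);

        // ... operators and streams for the pipeline would be added here ...
      }
    }

If I am reading the configuration docs correctly, the same attribute should also be settable from the properties file as dt.attr.HEARTBEAT_TIMEOUT_MILLIS, which avoids a rebuild. This obviously only masks whatever is blocking the windows, but it might keep the pipeline alive long enough to catch the real cause.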
dt-log-for-killed-operator.log