Thanks Isha for analyzing the issue.
I am adding your analysis to the JIRA.
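For the JIRA, the rethrow that Isha describes could be sketched roughly as below. This is plain Java, not Apex code: the class RethrowSketch and its methods startWorker and rethrowIfFailed are hypothetical names used only for illustration. The assumption is that the operator calls rethrowIfFailed from a callback the main operator thread reaches in every window (endWindow, for example) rather than only from handleIdleTime.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an operator that does work on a helper thread.
// Plain Java only; none of these names come from the Apex API.
class RethrowSketch {
    // Holds the first failure seen on the helper thread.
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    // Runs the task on a helper thread and records any exception it throws.
    Thread startWorker(Runnable task) {
        Thread worker = new Thread(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Keep only the first failure; later ones are ignored.
                failure.compareAndSet(null, t);
            }
        });
        worker.start();
        return worker;
    }

    // Intended to be called from the main thread. Throws if the helper
    // thread has recorded a failure, otherwise does nothing.
    void rethrowIfFailed() {
        Throwable t = failure.get();
        if (t != null) {
            throw new RuntimeException("helper thread failed", t);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RethrowSketch op = new RethrowSketch();
        Thread worker = op.startWorker(() -> {
            throw new IllegalStateException("simulated downstream failure");
        });
        worker.join();
        try {
            op.rethrowIfFailed();
        } catch (RuntimeException e) {
            System.out.println("main thread saw: " + e.getCause().getMessage());
        }
    }
}
```

The point of the sketch is only that the exception surfaces on the calling thread, which is where Isha verified that the application master catches it.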
I observed one more issue in THREAD_LOCAL.
Let the DAG be as follows:
A -> B -> C
where A, B, and C are operators, and B and C are THREAD_LOCAL.
If the downstream operator (i.e. Operator C) throws an exception from the main
thread, the application master catches the exception and kills the container.
A new container is allocated for operators B and C. B is re-deployed into the
newly allocated container and its status is ACTIVE, but C is not
re-deployed.
After re-deployment of Operator B, the DAG is as follows:
A -> B.
I looked into the Stram logs and observed the following message:
"INFO com.datatorrent.stram.StreamingContainerManager: Affected operators
[PTOperator[id=2,name=B]]".
I think this is the issue: Operator C is missing from the affected
operators.
I created a sample application that reproduces this issue. It is here
<https://github.com/chaithu14/AppThreadLocal/tree/theadBranch>.
@Isha: Have you observed the same behavior?
I am creating a JIRA for this issue.
Regards,
Chaitanya
On Wed, Mar 2, 2016 at 9:34 AM, Sandeep Deshmukh <[email protected]>
wrote:
> Great finding Isha.
>
> In general, it is always advisable to do things in the main thread. We had some
> timing issues in dtIngest as we were emitting tuples in the Reconciler
> thread. Once we moved all emit statements to the main thread, there were no
> issues observed.
>
> Issue: When tuples were emitted in the Reconciler thread, some of them were
> emitted after endWindow but before checkpointing was done. Such tuples are
> not guaranteed to reach the downstream operator in the same window. The
> checkpoints of the two operators are therefore out of sync, which could
> result in a few tuples being replayed wrongly from the Reconciler-based
> operator.
>
> Regards,
> Sandeep
>
> On Wed, Mar 2, 2016 at 8:57 AM, Isha Arkatkar <[email protected]>
> wrote:
>
> > Hi,
> >
> > I checked the application https://github.com/chaithu14/AppThreadLocal
> >
> > In this example, the exception from the downstream operator is thrown in
> > a different thread in the AbstractReconciler operator, and the rethrow to
> > the main operator thread is done in handleIdleTime. This function is not
> > guaranteed to be invoked in every window. With THREAD_LOCAL locality I
> > checked that handleIdleTime did not get invoked, so the exception did not
> > get rethrown.
> >
> > Exceptions thrown from a thread other than the main operator thread are
> > not caught by the Application Master. We can probably add a note to the
> > troubleshooting guide recommending a rethrow in the main thread.
> >
> > I verified that if the downstream operator throws an exception in the
> > main thread, it is caught appropriately by the application master even in
> > the thread-local case.
> >
> > Thanks,
> > Isha
> >
> > On Thu, Feb 25, 2016 at 9:57 PM, Chaitanya Chebolu <
> > [email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I created a sample application for the THREAD_LOCAL issue. The
> > > application is here <https://github.com/chaithu14/AppThreadLocal>.
> > > Application has the following DAG:
> > >
> > > RandomEventGenerator -> OutputOperator.
> > >
> > > Both the operators are THREAD_LOCAL.
> > >
> > > OutputOperator throws an exception at every committed window, so the
> > > AppMaster is supposed to kill the container at every committed window.
> > > This is the expected behavior, but it is not happening with the current
> > > Apex.
> > >
> > > One more observation: if the upstream operator throws an exception at
> > > every committed window, the AppMaster kills the container
> > > continuously. This does not happen with the downstream operator.
> > >
> > > Created JIRA for this issue: APEXCORE-357
> > >
> > > Regards,
> > > Chaitanya
> > >
> > > On Thu, Feb 25, 2016 at 12:36 PM, Chaitanya Chebolu <
> > > [email protected]> wrote:
> > >
> > > > Hi ,
> > > >
> > > > I am facing issues with THREAD_LOCAL. I have two operators that are
> > > > thread local, and the downstream operator throws exceptions, but the
> > > > AppMaster is not catching those exceptions. I was unable to figure
> > > > out why the application is not working.
> > > > If the two operators are deployed in different containers, the
> > > > container is killed continuously by the AppMaster. This is the
> > > > expected behavior.
> > > >
> > > > For example, let the DAG be op1 -> op2, where op1 and op2 are two
> > > > operators that are thread local. When the downstream operator op2
> > > > throws an exception, the AppMaster does not catch it. I will create a
> > > > JIRA for this issue. Could someone please help with this?
> > > >
> > > > Regards,
> > > > Chaitanya
> > > >
> > >
> >
>