Opened DRILL-4595 [1] to track this issue. Thanks
[1] https://issues.apache.org/jira/browse/DRILL-4595

On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche <[email protected]> wrote:

> Hey John, thanks for sharing your experience. If you see this again, try
> collecting the jstack output for the foreman node of the query, and also
> check in the query profile which fragments are still marked as RUNNING.
>
> Thanks
>
> On Thu, Apr 7, 2016 at 2:29 PM, John Omernik <[email protected]> wrote:
>
>> Abdel -
>>
>> I think I've seen this on a MapR cluster I run, especially on CTAS. I
>> haven't brought it up because the cluster I'm running on has some serious
>> hardware issues (it's a test cluster on hardware nearly 7 years old), and
>> given the hard-to-reproduce nature of the problem, I've been reluctant to
>> create noise. What you've described seems very similar to CTAS hangs I've
>> seen but couldn't reliably reproduce.
>>
>> This doesn't add much to your post, but I wanted to give you a +1 for
>> outlining this potential problem. Once I move to more robust hardware and
>> find myself in a similar situation, I will post more verbose details from
>> my side.
>>
>> John
>>
>> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <[email protected]> wrote:
>>
>>> So, we've been seeing some queries hang. I've come up with a possible
>>> explanation, but so far it's really difficult to reproduce. Let me know
>>> if you think this explanation doesn't hold up, or if you have any ideas
>>> how we can reproduce it. Thanks.
>>>
>>> - Generally it's a CTAS running on a large cluster (lots of writers
>>>   running in parallel).
>>> - Logs show that the user channel was closed and UserServer caused the
>>>   root fragment to move to a FAILED state [1].
>>> - jstack shows that the root fragment is blocked in its receiver,
>>>   waiting for data [2].
>>> - jstack also shows that ALL other fragments are no longer running, and
>>>   the logs show that all of them succeeded [3].
>>> - The foreman waits *forever* for the root fragment to finish.
>>>
>>> [1] The only case I can think of is when the user channel closed while
>>> the fragment was waiting for an ack from the user client.
>>> [2] If a writer finishes earlier than the others, it will send a data
>>> batch to the root fragment, which forwards it to the user. The root will
>>> then immediately block on its receiver, waiting for the remaining
>>> writers to finish.
>>> [3] Once the root fragment moves to a FAILED state, the receiver will
>>> immediately release any received batch and return an OK to the sender
>>> without putting the batch in its blocking queue.
>>>
>>> Abdelhakim Deneche
>>>
>>> Software Engineer
>>>
>>> <http://www.mapr.com/>
>>>
>>> Now Available - Free Hadoop On-Demand Training
>>> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
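The race described in [2] and [3] can be sketched as a minimal model. This is a hypothetical illustration, not Drill's actual classes or APIs: `SketchReceiver`, `onBatch`, and `moveToFailed` are invented names. The point it shows is that once the receiver is in a FAILED state, it still acks incoming batches but drops them instead of enqueueing, so a consumer blocked on the queue never wakes up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the receiver behavior described in [2]/[3];
// names and structure are illustrative, not Drill's real implementation.
class SketchReceiver {
    private final BlockingQueue<String> batches = new LinkedBlockingQueue<>();
    private final AtomicBoolean failed = new AtomicBoolean(false);

    // Models the root fragment moving to FAILED after the user channel closes.
    void moveToFailed() {
        failed.set(true);
    }

    // Called when a sender delivers a batch; always acks OK.
    // In the FAILED state the batch is released without being queued [3].
    boolean onBatch(String batch) {
        if (failed.get()) {
            return true;  // ack OK, silently drop the batch
        }
        batches.add(batch);
        return true;
    }

    // Called by the root fragment, which blocks waiting for data [2];
    // a timed poll is used here so the model itself cannot hang.
    String next(long timeout, TimeUnit unit) throws InterruptedException {
        return batches.poll(timeout, unit);
    }
}
```

With this model, a batch sent after `moveToFailed()` is acked but never reaches the queue, so `next(...)` times out. In the real system there is no timeout: the root fragment blocks forever and the foreman keeps waiting for it.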
