There are many ways a query could hang. A jstack dump of the foreman node will definitely help confirm whether it's the same issue.

Thanks

On Fri, Apr 8, 2016 at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:

> It might just add to the mystery of this issue, but when we start
> getting those hanging CTAS queries, if we restart our Drill cluster
> the problem goes away.
>
> Next time we start getting this problem I will try to collect the JStack
> output of the foreman too.
>
> Thanks for looking into this.
>
> Francois
>
> On Fri, Apr 8, 2016 at 2:20 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
>
> > Opened DRILL-4595 [1] to track this issue.
> >
> > Thanks
> >
> > [1] https://issues.apache.org/jira/browse/DRILL-4595
> >
> > On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> >
> > > Hey John, thanks for sharing your experience. If you see this again, try
> > > collecting the jstack output for the foreman node of the query, and also
> > > check in the query profile which fragments are still marked as RUNNING.
> > >
> > > Thanks
> > >
> > > On Thu, Apr 7, 2016 at 2:29 PM, John Omernik <j...@omernik.com> wrote:
> > >
> > >> Abdel -
> > >>
> > >> I think I've seen this on a MapR cluster I run, especially on CTAS. For
> > >> me, I have not brought it up because the cluster I am running on has some
> > >> serious personal issues (like being hardware that's near 7 years old; it's a
> > >> test cluster) and given the "hard to reproduce" nature of the problem, I've
> > >> been reluctant to create noise. Given what you've described, it seems very
> > >> similar to CTAS hangs I've seen, but couldn't accurately reproduce.
> > >>
> > >> This didn't add much to your post, but I wanted to give you a +1 for
> > >> outlining this potential problem. Once I move to more robust hardware, and
> > >> I am in similar situations, I will post more verbose details from my side.
> > >>
> > >> John
> > >>
> > >> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > >>
> > >> > So, we've been seeing some queries hang. I've come up with a possible
> > >> > explanation, but so far it's really difficult to reproduce. Let me know if
> > >> > you think this explanation doesn't hold up or if you have any ideas how we
> > >> > can reproduce it. Thanks
> > >> >
> > >> > - generally it's a CTAS running on a large cluster (lots of writers
> > >> >   running in parallel)
> > >> > - logs show that the user channel was closed and UserServer caused the root
> > >> >   fragment to move to a FAILED state [1]
> > >> > - jstack shows that the root fragment is blocked in its receiver waiting
> > >> >   for data [2]
> > >> > - jstack also shows that ALL other fragments are no longer running, and the
> > >> >   logs show that all of them succeeded [3]
> > >> > - the foreman waits *forever* for the root fragment to finish
> > >> >
> > >> > [1] the only case I can think of is when the user channel closed while the
> > >> > fragment was waiting for an ack from the user client
> > >> > [2] if a writer finishes earlier than the others, it will send a data batch
> > >> > to the root fragment that will be sent to the user. The root will then
> > >> > immediately block on its receiver waiting for the remaining writers to
> > >> > finish
> > >> > [3] once the root fragment moves to a failed state, the receiver will
> > >> > immediately release any received batch and return an OK to the sender
> > >> > without putting the batch in its blocking queue.
> > >> >
> > >> > Abdelhakim Deneche
> > >> >
> > >> > Software Engineer
> > >> >
> > >> > <http://www.mapr.com/>
> > >> >
> > >> > Now Available - Free Hadoop On-Demand Training
> > >> > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>
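
To make the hypothesis from the original message easier to reason about, here is a minimal, single-threaded sketch of the suspected sequence: an early writer's batch is consumed, the user channel closes and the root moves to FAILED, and the remaining writers' batches are acknowledged but dropped. This is not Drill code. The names RootFragmentHangSketch, Receiver, onBatch, markFailed and batchesSeenByRoot are all hypothetical, and a non-blocking poll() stands in for the blocking wait so the example terminates instead of hanging.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RootFragmentHangSketch {

    /** Hypothetical stand-in for the root fragment's receiver (not a Drill class). */
    static final class Receiver {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private boolean failed = false;

        /** Models the root fragment moving to a FAILED state. */
        void markFailed() {
            failed = true;
        }

        /** Called on behalf of a sender. Always acknowledges with OK (true). */
        boolean onBatch(String batch) {
            if (!failed) {
                queue.add(batch);
            }
            // When failed, the batch is released without being queued,
            // yet the sender still gets an OK and reports success.
            return true;
        }

        /** Non-blocking read; the suspected real code path blocks here instead. */
        String poll() {
            return queue.poll();
        }
    }

    /** Returns how many of the writers' batches the root actually receives. */
    public static int batchesSeenByRoot(int writers, boolean userChannelCloses) {
        Receiver receiver = new Receiver();
        int seen = 0;

        // One writer finishes early; the root consumes its batch.
        receiver.onBatch("writer-0");
        if (receiver.poll() != null) {
            seen++;
        }

        // The user channel closes while the root waits for the client's ack.
        if (userChannelCloses) {
            receiver.markFailed();
        }

        // The remaining writers finish and all receive an OK.
        for (int i = 1; i < writers; i++) {
            receiver.onBatch("writer-" + i);
        }

        // The root waits for the remaining batches. A null here marks the point
        // where a blocking wait would never return.
        while (seen < writers && receiver.poll() != null) {
            seen++;
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("healthy run: root saw "
                + batchesSeenByRoot(3, false) + " of 3 batches");
        System.out.println("channel closed: root saw "
                + batchesSeenByRoot(3, true) + " of 3 batches");
    }
}
```

In the second case every writer has been told OK and has nothing left to send, so under this model nothing would ever wake a root that is blocked on its queue, which matches the symptom of the foreman waiting forever.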