We've been seeing some queries hang. I've come up with a possible
explanation, but so far it has been really difficult to reproduce. Let me
know if you think this explanation doesn't hold up, or if you have any ideas
on how we can reproduce it. Thanks.

- generally it's a CTAS running on a large cluster (lots of writers
running in parallel)
- logs show that the user channel was closed and UserServer caused the root
fragment to move to a FAILED state [1]
- jstack shows that the root fragment is blocked in its receiver waiting
for data [2]
- jstack also shows that ALL other fragments are no longer running, and the
logs show that all of them succeeded [3]
- the foreman waits *forever* for the root fragment to finish

[1] the only case I can think of is when the user channel is closed while
the fragment is waiting for an ack from the user client
[2] if a writer finishes earlier than the others, it will send a data batch
to the root fragment, which forwards it to the user. The root will then
immediately block on its receiver waiting for the remaining writers to
finish
[3] once the root fragment moves to a failed state, the receiver will
immediately release any received batch and return an OK to the sender
without putting the batch in its blocking queue, so the sender completes
successfully even though the root never sees its batch.
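
To make the suspected sequence in [1]-[3] easier to reason about, here is a
minimal standalone sketch of it. This is not Drill code: Receiver,
markFailed, onBatch and rootStillBlocked are hypothetical names made up for
illustration, and the sketch assumes the receiver is simply a blocking queue
guarded by a "failed" flag. It only models the ordering of events, not the
RPC layer or the foreman.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class RootFragmentHangSketch {

    /** Hypothetical stand-in for the root fragment's receiver (not a Drill class). */
    static final class Receiver {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final AtomicBoolean failed = new AtomicBoolean(false);

        /** Models the root fragment being moved to a FAILED state. */
        void markFailed() {
            failed.set(true);
        }

        /**
         * Models a sender delivering a batch. The sender is acked OK either way;
         * the return value only reports whether the batch was actually queued.
         */
        boolean onBatch(String batch) {
            if (failed.get()) {
                return false; // batch released, never reaches the queue
            }
            queue.add(batch);
            return true;
        }

        /** Blocks until a batch is available. */
        String take() throws InterruptedException {
            return queue.take();
        }
    }

    /** Returns true if the root thread is still blocked after every writer has finished. */
    static boolean rootStillBlocked(int writers) {
        Receiver receiver = new Receiver();
        CountDownLatch firstBatchConsumed = new CountDownLatch(1);

        Thread root = new Thread(() -> {
            try {
                receiver.take();                 // batch from the early writer
                firstBatchConsumed.countDown();
                for (int i = 1; i < writers; i++) {
                    receiver.take();             // waits for the remaining writers
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        root.setDaemon(true);
        root.start();

        try {
            receiver.onBatch("writer-0");        // one writer finishes early
            firstBatchConsumed.await();
            receiver.markFailed();               // user channel closes, root marked FAILED
            for (int i = 1; i < writers; i++) {
                receiver.onBatch("writer-" + i); // dropped, yet the writer "succeeds"
            }
            root.join(200);                      // give the root a chance to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        boolean blocked = root.isAlive();
        root.interrupt();                        // clean up the stuck thread
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("root still blocked: " + rootStillBlocked(4));
    }
}
```

If the real receiver behaves like this sketch, nothing ever wakes the root
thread once the failed flag is set, which would match what jstack shows.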

Abdelhakim Deneche

Software Engineer
