Hi Todd,

I believe you are right. There are a couple of other race conditions in the
HdfsScanNode which have resulted in broken builds in the last couple of weeks
(IMPALA-7418 and IMPALA-7335). While reviewing the change for IMPALA-7335,
Tim and Dan suggested that we clean up the state invariants of the
HdfsScanNode to find and fix other such races. I am going to create a JIRA to
track that work, and I could add this issue to the list of races to be fixed.
If this issue starts occurring more frequently, we could put in a temporary
fix for now. Let me know what you'd suggest.
Thanks,
Pooja

On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <[email protected]> wrote:

> I spent another 45 minutes looking at the core and think I figured this
> out. I found that for the stuck scanner, both of its scanner threads had
> run and exited, but for some reason it was never marked as "done".
>
> I think this might be a regression due to IMPALA-7096
> (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> following code:
>
>     // Stop extra threads if we're over a soft limit in order to free up memory.
>     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>       break;
>     }
>
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished. Indicate we are done.
>       SetDone();
>       break;
>     }
>
>     if (scan_range == nullptr && num_unqueued_files == 0) {
>       unique_lock<mutex> l(lock_);
>       // All ranges have been queued and DiskIoMgr has no more new ranges for this scan
>       // node to process. This means that every range is either done or being processed by
>       // another thread.
>       all_ranges_started_ = true;
>       break;
>     }
>   }
>
> What if we have the following scenario:
>
> T1) grab scan range 1 and start processing
> T2) grab scan range 2 and start processing
>
> T1) finish scan range 1 and see that 'progress_' is not done()
> T1) loop around, get no scan range (there are no more), so set
>     all_ranges_started_ and break
> T1) thread exits
>
> T2) finish scan range 2
> T2) happen to hit a soft memory limit error due to pressure from other exec
>     nodes, etc. Since we aren't the first thread, we break (even though the
>     first thread is no longer running).
> T2) thread exits
>
> Note that no one got to the point of calling SetDone(), because we break
> due to the memory limit error _before_ checking progress_.done().
>
> Thus, the query will hang forever.
>
> It seems to me that the order of these conditions needs to be rejiggered:
> the break due to the memory limit should be the last thing in the loop so
> that, if we were the thread to finish the last scan range, we properly
> SetDone before exiting.
>
> Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on
> how to write a test for this (do we have an easy facility to inject
> probabilistic mem-limit failures in that code path?)
>
> -Todd
>
> On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <[email protected]> wrote:
>
> > FWIW, please see the discussion at
> > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO"
> > may render cancellation in the HDFS scan node ineffective. We have seen
> > users report similar backtraces in the past due to hung RPCs to the HDFS
> > name node. Not saying this is necessarily the case here.
> >
> > A couple of things which may be interesting to look at:
> > - is there any scan node in the profile which doesn't finish any
> >   assigned scan ranges?
> > - if you happen to have a core, it may help to inspect the stack traces
> >   of the scanner threads and the disk io mgr threads to understand their
> >   states.
> >
> > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <[email protected]> wrote:
> >
> > > Hey folks,
> > >
> > > I ran into some issues with a local core test run last night. This is
> > > on my own branch with some uncommitted work, but I haven't touched the
> > > backend and these issues seem to be backend-focused.
> > >
> > > Initially, I noticed the problem because a test run that I started
> > > last night was still "stuck" this morning. I had three queries which
> > > had been running for 10 hours and not making progress. The 'fragments'
> > > page for the query showed that one of the backends had not reported to
> > > the coordinator for many hours. In attempting to debug this, I managed
> > > to get the StateStore to declare that node dead, and those queries
> > > eventually were cancelled due to that.
> > >
> > > I resumed the node that I had been debugging with gdb, and it was
> > > declared live again. I didn't restart the process, though, which might
> > > have led to further issues.
> > >
> > > Next, I also noticed that /queries on my coordinator node showed
> > > "Number of running queries with fragments on this host" at 100+ on all
> > > three nodes, even though no queries were running anymore. These
> > > numbers are stable. I can continue to issue new queries and they
> > > complete normally. While queries are running, the count of fragments
> > > goes up appropriately, and when the query finishes, it goes back down.
> > > But the "base" numbers are stuck at {108, 391, 402} with nothing
> > > running.
> > >
> > > I also found that the node that had had the original problems has
> > > three stuck fragments, all waiting on HdfsScanNode::GetNext:
> > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > > Looking in the logs, I see continuous spew that it's trying to cancel
> > > the fragment instances for this query, but apparently the cancellation
> > > is not working. I talked to Michael Ho about this and it sounds like
> > > this is a known issue with cancellation.
> > >
> > > So, there are still two mysteries:
> > > - why did it get stuck in the first place?
> > > - why are my "number of running queries" counters stuck at non-zero
> > >   values?
> > >
> > > Does anything above ring a bell for anyone?
> > >
> > > -Todd
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> >
> > --
> > Thanks,
> > Michael
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
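[Editor's sketch] For reference, a minimal sketch of the reordering Todd
describes above, with the soft-memory-limit check moved after the completion
checks. This is only an illustration of the proposed ordering, not the actual
patch: it reuses the member names from the snippet quoted in Todd's mail and
assumes the surrounding per-range loop in the scanner thread.

    // Sketch of the proposed order inside the scanner thread's per-range loop.
    // Member names are taken from the snippet quoted above.
    while (true) {
      // ... pick up and process a scan range ...

      // Check for overall completion first. If this thread just finished the
      // last range, it must be the one to call SetDone(), regardless of any
      // memory pressure.
      if (progress_.done()) {
        // All ranges are finished. Indicate we are done.
        SetDone();
        break;
      }

      if (scan_range == nullptr && num_unqueued_files == 0) {
        unique_lock<mutex> l(lock_);
        // All ranges have been queued and DiskIoMgr has no more new ranges
        // for this scan node to process; every remaining range is being
        // processed by another thread.
        all_ranges_started_ = true;
        break;
      }

      // Only after the completion checks: stop extra threads if we're over a
      // soft limit in order to free up memory. Breaking here can no longer
      // skip the SetDone() call for the last finished range.
      if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
        break;
      }
    }

The key point is the one Todd makes: any early-exit condition that can fire
on the thread that completed the final range must come after the
progress_.done() check, otherwise SetDone() is never called and the query
hangs.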
