On Thu, Aug 30, 2018 at 2:48 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar
> <pooja.nilange...@cloudera.com> wrote:
>> Hi Todd,
>>
>> I believe you are right. There are a couple of other race conditions in
>> the HdfsScanNode which have resulted in broken builds in the last couple
>> of weeks (IMPALA-7418 and IMPALA-7335). While reviewing the change for
>> IMPALA-7335, Tim and Dan suggested that we need to clean up the state
>> invariants of the HdfsScanNode to find and fix other such races. I am
>> going to create a JIRA to track that, and I could add this issue to the
>> list of races to be fixed. If this issue occurs more frequently, we
>> could add a temporary fix for now. Let me know what you'd suggest.
>
> Makes sense. I'm currently trying to put together a simple test and fix
> that targets this race in particular, but I definitely agree that the
> interactions between the scanner threads and the scan node are hard to
> reason about and could be improved more generally.
>
> For now I'm trying to use a DebugAction to inject probabilistic failures
> into the soft memory limit checks to see if I can reproduce it more
> easily.

Put this up here with a fix: https://gerrit.cloudera.org/c/11369/

-Todd

> -Todd
>
>> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <t...@cloudera.com> wrote:
>>> I spent another 45 minutes looking at the core and think I figured
>>> this out. I found that for the stuck scanner, both of its scanner
>>> threads had run and exited, but for some reason it was never marked
>>> as "done".
>>>
>>> I think this might be a regression due to IMPALA-7096
>>> (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has
>>> the following code:
>>>
>>>   // Stop extra threads if we're over a soft limit in order to free
>>>   // up memory.
>>>   if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>>>     break;
>>>   }
>>>
>>>   // Done with range and it completed successfully
>>>   if (progress_.done()) {
>>>     // All ranges are finished. Indicate we are done.
>>>     SetDone();
>>>     break;
>>>   }
>>>
>>>   if (scan_range == nullptr && num_unqueued_files == 0) {
>>>     unique_lock<mutex> l(lock_);
>>>     // All ranges have been queued and DiskIoMgr has no more new
>>>     // ranges for this scan node to process. This means that every
>>>     // range is either done or being processed by another thread.
>>>     all_ranges_started_ = true;
>>>     break;
>>>   }
>>> }
>>>
>>> What if we have the following scenario:
>>>
>>> T1) grab scan range 1 and start processing
>>> T2) grab scan range 2 and start processing
>>> T1) finish scan range 1 and see that 'progress_' is not done()
>>> T1) loop around, get no scan range (there are no more), so set
>>>     all_ranges_started_ and break
>>> T1) thread exits
>>> T2) finish scan range 2
>>> T2) happen to hit a soft memory limit error due to pressure from
>>>     other exec nodes, etc. Since we aren't the first thread, we break
>>>     (even though the first thread is no longer running)
>>> T2) thread exits
>>>
>>> Note that no one ever got to the point of calling SetDone(), because
>>> we break due to the memory limit error _before_ checking
>>> progress_.done(). Thus, the query will hang forever.
>>>
>>> It seems to me that the order of these conditions needs to be
>>> rejiggered: the break due to the memory limit should be the last
>>> thing in the loop, so that if we were the thread that finished the
>>> last scan range, we properly call SetDone() before exiting.
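>>> Roughly like this (untested sketch; the same three checks as above,
>>> just reordered so the soft-limit break can no longer shadow the
>>> progress_.done() check):
>>>
>>>   // Done with range and it completed successfully. Checked first, so
>>>   // whichever thread finishes the final range always calls SetDone().
>>>   if (progress_.done()) {
>>>     // All ranges are finished. Indicate we are done.
>>>     SetDone();
>>>     break;
>>>   }
>>>
>>>   if (scan_range == nullptr && num_unqueued_files == 0) {
>>>     unique_lock<mutex> l(lock_);
>>>     // All ranges have been queued and DiskIoMgr has no more new
>>>     // ranges for this scan node to process.
>>>     all_ranges_started_ = true;
>>>     break;
>>>   }
>>>
>>>   // Now the last thing in the loop: an extra thread exiting here is
>>>   // safe, because we already know above that it was not the thread
>>>   // that completed the final range.
>>>   if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>>>     break;
>>>   }
>>> }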
>>>
>>> Does that make sense? If so, I'll file a JIRA. Suggestions are also
>>> welcome on how to write a test for this (do we have an easy facility
>>> to inject probabilistic mem-limit failures in that code path?)
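>>> If not, something like the following might do (hypothetical sketch
>>> only; the helper and flag names below are made up, not an existing
>>> facility):
>>>
>>>   #include <random>
>>>
>>>   // Return true with probability 'p', to force the soft-limit branch
>>>   // to fire even without real memory pressure.
>>>   static bool ShouldInjectSoftLimitFailure(double p) {
>>>     static thread_local std::mt19937 gen{std::random_device{}()};
>>>     return std::bernoulli_distribution(p)(gen);
>>>   }
>>>
>>>   // In the scanner-thread loop, OR the injected failure into the real
>>>   // check ('FLAGS_inject_soft_limit_prob' is a made-up flag that would
>>>   // default to 0.0, so production behavior is unchanged):
>>>   if (!first_thread &&
>>>       (mem_tracker_->AnyLimitExceeded(MemLimit::SOFT) ||
>>>        ShouldInjectSoftLimitFailure(FLAGS_inject_soft_limit_prob))) {
>>>     break;
>>>   }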
>>>
>>> -Todd
>>>
>>> On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <k...@cloudera.com> wrote:
>>>> FWIW, please see the discussion at
>>>> https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck
>>>> IO" may render cancellation in the HDFS scan node ineffective. We
>>>> have seen users report similar backtraces in the past due to hung
>>>> RPCs to the HDFS name node. Not saying this is necessarily the case
>>>> here.
>>>>
>>>> A couple of things which may be interesting to look at:
>>>> - is there any scan node in the profile which doesn't finish any
>>>>   assigned scan ranges?
>>>> - if you happen to have a core, it may help to inspect the stack
>>>>   traces of the scanner threads and the disk IO mgr threads to
>>>>   understand their states.
>>>>
>>>> On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <t...@cloudera.com> wrote:
>>>>> Hey folks,
>>>>>
>>>>> I ran into some issues with a local core test run last night. This
>>>>> is on my own branch with some uncommitted work, but I haven't
>>>>> touched the backend, and these issues seem to be backend-focused.
>>>>>
>>>>> Initially, I noticed the problem because a test run that I started
>>>>> last night was still "stuck" this morning. I had three queries
>>>>> which had been running for 10 hours without making progress. The
>>>>> 'fragments' page for the query showed that one of the backends had
>>>>> not reported to the coordinator for many hours. In attempting to
>>>>> debug this, I managed to get the StateStore to declare that node
>>>>> dead, and those queries were eventually cancelled because of that.
>>>>>
>>>>> I resumed the node that I had been debugging with gdb, and it was
>>>>> declared live again. I didn't restart the process, though, which
>>>>> might have led to further issues:
>>>>>
>>>>> Next, I also noticed that /queries on my coordinator node showed
>>>>> "Number of running queries with fragments on this host" at 100+ on
>>>>> all three nodes, even though no queries were running anymore. These
>>>>> numbers are stable. I can continue to issue new queries and they
>>>>> complete normally. While queries are running, the count of
>>>>> fragments goes up appropriately, and when a query finishes, it goes
>>>>> back down. But the "base" numbers are stuck at {108, 391, 402} with
>>>>> nothing running.
>>>>>
>>>>> I also found that the node that had had the original problems has
>>>>> three stuck fragments, all waiting in HdfsScanNode::GetNext:
>>>>> https://gist.github.com/57a99bf4c00f1575810af924013c259d
>>>>> Looking in the logs, I see continuous spew that it's trying to
>>>>> cancel the fragment instances for this query, but apparently the
>>>>> cancellation is not working. I talked to Michael Ho about this and
>>>>> it sounds like this is a known issue with cancellation.
>>>>>
>>>>> So, there are still two mysteries:
>>>>> - why did it get stuck in the first place?
>>>>> - why are my "number of running queries" counters stuck at non-zero
>>>>>   values?
>>>>>
>>>>> Does anything above ring a bell for anyone?
>>>>>
>>>>> -Todd
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Thanks,
>>>> Michael
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera