Hi Todd,

I believe you are right. There are a couple of other race conditions in the
HdfsScanNode which have caused broken builds in the last couple of weeks
(IMPALA-7418 and IMPALA-7335). While reviewing the change for IMPALA-7335,
Tim and Dan suggested that we clean up the state invariants of the
HdfsScanNode to find and fix other such races. I am going to create a JIRA
to track that, and I could add this issue to the list of races to be fixed.
If this issue starts occurring more frequently, we could also add a
temporary fix for now. Let me know what you'd suggest.
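
For reference, here is a minimal, self-contained sketch (illustrative names
only, not the actual Impala code) of the reordering you describe below: the
soft-memory-limit early exit moves after the completion check, so whichever
thread finishes the last range always reaches SetDone().

    #include <atomic>
    #include <mutex>
    #include <optional>

    // Stand-ins for the HdfsScanNode state touched by the loop.
    struct ScanNodeSketch {
      std::atomic<int> unstarted_ranges_{2};    // ranges not yet handed out
      std::atomic<int> remaining_ranges_{2};    // progress_ stand-in
      std::atomic<bool> done_{false};
      std::atomic<bool> all_ranges_started_{false};
      std::mutex lock_;

      // Pretend memory pressure can show up at any time.
      bool SoftLimitExceeded() const { return true; }

      // Hands out the next range, or nullopt when none are left.
      std::optional<int> GetNextRange() {
        int left = unstarted_ranges_.fetch_sub(1);
        return left > 0 ? std::optional<int>(left) : std::nullopt;
      }

      void SetDone() { done_ = true; }

      void ScannerThread(bool first_thread) {
        while (true) {
          std::optional<int> scan_range = GetNextRange();
          if (scan_range) {
            // ... process the range ...
            --remaining_ranges_;
          }

          // 1. Completion check comes first: the thread that finishes the
          //    last range always marks the node done before any early exit.
          if (remaining_ranges_ == 0) {
            SetDone();
            break;
          }

          // 2. Nothing left to hand out: every remaining range is owned by
          //    some other thread.
          if (!scan_range) {
            std::unique_lock<std::mutex> l(lock_);
            all_ranges_started_ = true;
            break;
          }

          // 3. Only now shed extra threads under soft memory pressure; at
          //    this point we know work remains for the surviving threads.
          if (!first_thread && SoftLimitExceeded()) break;
        }
      }
    };

As for injecting probabilistic mem-limit failures for a test, one
hypothetical approach would be a test-only hook along these lines; the flag
and helper names below are invented for illustration and don't refer to an
existing facility.

    #include <random>

    // Hypothetical test-only knob, e.g. set from a stress-test flag.
    static double stress_soft_limit_failure_prob = 0.0;

    // Returns true with the configured probability, simulating the soft
    // memory limit being exceeded in the scanner thread loop.
    bool InjectedSoftLimitExceeded() {
      if (stress_soft_limit_failure_prob <= 0.0) return false;
      thread_local std::mt19937 rng{std::random_device{}()};
      std::bernoulli_distribution fail(stress_soft_limit_failure_prob);
      return fail(rng);
    }

    // Sketch of how the scanner thread check could consult it:
    //   if (!first_thread &&
    //       (mem_tracker_->AnyLimitExceeded(MemLimit::SOFT) ||
    //        InjectedSoftLimitExceeded())) {
    //     break;
    //   }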

Thanks,
Pooja


On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <[email protected]> wrote:

> I spent another 45 minutes looking at the core and think I figured this
> out. I found that for the stuck scanner, both of its scanner threads had
> run and exited, but for some reason it was never marked as "done".
>
> I think this might be a regression due to IMPALA-7096
> (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> following code:
>
>     // Stop extra threads if we're over a soft limit in order to free up
>     // memory.
>     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>       break;
>     }
>
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished.  Indicate we are done.
>       SetDone();
>       break;
>     }
>
>     if (scan_range == nullptr && num_unqueued_files == 0) {
>       unique_lock<mutex> l(lock_);
>       // All ranges have been queued and DiskIoMgr has no more new ranges
>       // for this scan node to process. This means that every range is
>       // either done or being processed by another thread.
>       all_ranges_started_ = true;
>       break;
>     }
>   }
>
> What if we have the following scenario:
>
> T1) grab scan range 1 and start processing
>
> T2) grab scan range 2 and start processing
>
> T1) finish scan range 1 and see that 'progress_' is not done()
> T1) loop around, get no scan range (there are no more), so set
> all_ranges_started_ and break
> T1) thread exits
>
> T2) finish scan range 2
> T2) happen to hit a soft memory limit error due to pressure from other exec
> nodes, etc. Since we aren't the first thread, we break (even though the
> first thread is no longer running).
> T2) thread exits
>
> Note that no one got to the point of calling SetDone() because we break due
> to the memory limit error _before_ checking progress_.done().
>
> Thus, the query will hang forever.
>
> It seems to me that the order of these conditions needs to be rejiggered:
> the break due to the memory limit should be the last thing in the loop so
> that, if we were the thread to finish the last scan range, we properly call
> SetDone() before exiting.
>
> Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on
> how to write a test for this (do we have an easy facility to inject
> probabilistic mem-limit failures in that code path?)
>
> -Todd
>
>
> On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <[email protected]> wrote:
>
> > FWIW, please see the discussion at
> > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
> > render cancellation in the HDFS scan node ineffective. We have seen users
> > report similar backtraces in the past due to hung RPCs to the HDFS name
> > node. Not saying this is necessarily the case here.
> >
> > A couple of things which may be interesting to look at:
> > - is there any scan node in the profile which doesn't finish any assigned
> >   scan ranges?
> > - if you happen to have a core, it may help to inspect the stack traces
> >   of the scanner threads and the disk io mgr threads to understand their
> >   states.
> >
> > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <[email protected]> wrote:
> >
> > > Hey folks,
> > >
> > > I ran into some issues with a local core test run last night. This is on
> > > my own branch with some uncommitted work, but I haven't touched the
> > > backend and these issues seem to be backend-focused.
> > >
> > > Initially, I noticed the problem because a test run that I started last
> > > night was still "stuck" this morning. I had three queries which had been
> > > running for 10 hours and not making progress. The 'fragments' page for
> > > the query showed that one of the backends had not reported to the
> > > coordinator for many hours. In attempting to debug this, I managed to
> > > get the StateStore to declare that node dead, and those queries were
> > > eventually cancelled due to that.
> > >
> > > I resumed the node that I had been debugging with gdb, and it was
> > > declared live again. I didn't restart the process, though, which might
> > > have led to further issues:
> > >
> > > Next, I also noticed that /queries on my coordinator node showed "Number
> > > of running queries with fragments on this host" at 100+ on all three
> > > nodes, even though no queries were running anymore. These numbers are
> > > stable. I can continue to issue new queries and they complete normally.
> > > While a query is running, the count of fragments goes up appropriately,
> > > and when the query finishes, it goes back down. But the "base" numbers
> > > are stuck at {108, 391, 402} with nothing running.
> > >
> > > I also found that the node that had had the original problems has three
> > > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > > Looking in the logs, I see continuous spew that it's trying to cancel
> > > the fragment instances for this query, but apparently the cancellation
> > > is not working. I talked to Michael Ho about this and it sounds like
> > > this is a known issue with cancellation.
> > >
> > > So, there are still two mysteries:
> > > - why did it get stuck in the first place?
> > > - why are my "number of running queries" counters stuck at non-zero
> > >   values?
> > >
> > > Does anything above ring a bell for anyone?
> > >
> > > -Todd
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> > --
> > Thanks,
> > Michael
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>