I spent another 45 minutes looking at the core and think I figured this
out. I found that for the stuck scanner, both of its scanner threads had
run and exited, but for some reason it was never marked as "done".

I think this might be a regression due to IMPALA-7096
(7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
following code:

    // Stop extra threads if we're over a soft limit in order to free up memory.
    if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
      break;
    }

    // Done with range and it completed successfully
    if (progress_.done()) {
      // All ranges are finished.  Indicate we are done.
      SetDone();
      break;
    }

    if (scan_range == nullptr && num_unqueued_files == 0) {
      unique_lock<mutex> l(lock_);
      // All ranges have been queued and DiskIoMgr has no more new ranges for
      // this scan node to process. This means that every range is either done
      // or being processed by another thread.
      all_ranges_started_ = true;
      break;
    }
  }

What if we have the following scenario:

T1) grab scan range 1 and start processing

T2) grab scan range 2 and start processing

T1) finish scan range 1 and see that 'progress_' is not done()
T1) loop around, get no scan range (there are no more), so set
all_ranges_started_ and break
T1) thread exits

T2) finish scan range 2
T2) happen to hit a soft memory limit error due to pressure from other exec
nodes, etc. Since we aren't the first thread, we break (even though the
first thread is no longer running).
T2) thread exits

Note that no one got to the point of calling SetDone(), because we break due
to the soft memory limit _before_ checking progress_.done().

Thus, the query will hang forever.
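
To make the interleaving concrete, here is a tiny standalone model of that
loop tail (simplified, hypothetical names; this is not Impala code) that
replays the T1/T2 steps above deterministically and ends with done == false
even though both ranges completed:

    // Simplified, hypothetical model of the scanner-thread loop tail above.
    // This is NOT Impala code; it just replays the T1/T2 steps deterministically.
    #include <cassert>
    #include <iostream>

    struct ScanNodeModel {
      int remaining_ranges = 2;        // stands in for progress_
      bool done = false;               // set by "SetDone()"
      bool all_ranges_started = false;

      // One pass through the loop tail in its current order: the soft-limit
      // check comes before the progress_.done() check. Returns true if the
      // thread breaks out of its loop.
      bool LoopTail(bool first_thread, bool soft_limit_exceeded, bool got_range) {
        if (!first_thread && soft_limit_exceeded) return true;    // early break
        if (remaining_ranges == 0) { done = true; return true; }  // "SetDone()"
        if (!got_range) { all_ranges_started = true; return true; }
        return false;  // keep looping
      }
    };

    int main() {
      ScanNodeModel node;

      // T1: finish range 1, loop around, find no more ranges, set
      // all_ranges_started, and exit.
      node.remaining_ranges--;
      node.LoopTail(/*first_thread=*/true, /*soft_limit_exceeded=*/false,
                    /*got_range=*/true);
      node.LoopTail(/*first_thread=*/true, /*soft_limit_exceeded=*/false,
                    /*got_range=*/false);

      // T2: finish range 2, then hit the soft limit before checking progress.
      node.remaining_ranges--;
      node.LoopTail(/*first_thread=*/false, /*soft_limit_exceeded=*/true,
                    /*got_range=*/true);

      std::cout << "done=" << node.done << std::endl;  // prints done=0
      assert(!node.done);  // nobody called "SetDone()" -> the query hangs
      return 0;
    }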

It seems to me that the order of these conditions needs to be rejiggered:
the break due to the soft memory limit should be the last thing in the loop
so that, if we were the thread that finished the last scan range, we
properly call SetDone() before exiting.
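
Concretely, I'm imagining something like the following reordering (just a
sketch against the snippet above, not a tested patch):

    // Done with range and it completed successfully
    if (progress_.done()) {
      // All ranges are finished.  Indicate we are done.
      SetDone();
      break;
    }

    if (scan_range == nullptr && num_unqueued_files == 0) {
      unique_lock<mutex> l(lock_);
      // All ranges have been queued and DiskIoMgr has no more new ranges for
      // this scan node to process. This means that every range is either done
      // or being processed by another thread.
      all_ranges_started_ = true;
      break;
    }

    // Moved to the end: only stop extra threads over the soft memory limit once
    // we know we weren't the thread that completed the last range.
    if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
      break;
    }
  }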

Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on
how to write a test for this (do we have an easy facility to inject
probabilistic mem-limit failures in that code path?)
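
In case we don't already have a hook on that path, maybe a flag-gated coin
flip would be enough for a stress test. A rough sketch (the flag and helper
below are made-up names, not existing code):

    // Hypothetical sketch only: neither this flag nor the helper exists today.
    // The idea is to make the soft-limit branch fire probabilistically in tests.
    #include <cstdlib>
    #include <gflags/gflags.h>

    DEFINE_double(stress_scanner_soft_limit_probability, 0.0,
        "Test-only probability of pretending the soft memory limit is exceeded "
        "in the scanner thread loop.");

    static bool ShouldInjectSoftLimitFailure() {
      return FLAGS_stress_scanner_soft_limit_probability > 0.0 &&
          rand() / (RAND_MAX + 1.0) < FLAGS_stress_scanner_soft_limit_probability;
    }

    // The scanner-thread check would then become something like:
    //   if (!first_thread &&
    //       (ShouldInjectSoftLimitFailure() ||
    //        mem_tracker_->AnyLimitExceeded(MemLimit::SOFT))) {
    //     break;
    //   }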

-Todd


On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <k...@cloudera.com> wrote:

> FWIW, please see the discussion at
> https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
> render cancellation in HDFS scan node ineffective. We have seen users
> report similar backtraces in the past due to hung RPCs to the HDFS name node.
> Not saying this is necessarily the case here.
>
> A couple of things which may be interesting to look at:
> - is there any scan node in the profile which doesn't finish any assigned
> scan ranges?
> - if you happen to have a core, it may help to inspect the stack traces of
> the scanner threads and the disk io mgr threads to understand their states.
>
> On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <t...@cloudera.com> wrote:
>
> > Hey folks,
> >
> > I ran into some issues with a local core test run last night. This is on my
> > own branch with some uncommitted work, but I haven't touched the backend
> > and these issues seem to be backend-focused.
> >
> > Initially, I noticed the problem because a test run that I started last
> > night was still "stuck" this morning. I had three queries which had been
> > running for 10 hours and not making progress. The 'fragments' page for the
> > query showed that one of the backends had not reported to the coordinator
> > for many hours. In attempting to debug this, I managed to get the
> > StateStore to declare that node dead, and those queries eventually were
> > cancelled due to that.
> >
> > I resumed the node that I had been debugging with gdb, and it was declared
> > live again. I didn't restart the process, though, which might have led to
> > further issues:
> >
> > Next, I also noticed that /queries on my coordinator node showed "Number of
> > running queries with fragments on this host" at 100+ on all three nodes,
> > even though no queries were running anymore. These numbers are stable. I
> > can continue to issue new queries and they complete normally. While a
> > query is running, the count of fragments goes up appropriately, and when
> > the query finishes, it goes back down. But the "base" numbers are stuck at
> > {108, 391, 402} with nothing running.
> >
> > I also found that the node that had had the original problems has three
> > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > Looking in the logs, I see continuous spew that it's trying to cancel the
> > fragment instances for this query, but apparently the cancellation is not
> > working. I talked to Michael Ho about this and it sounds like this is a
> > known issue with cancellation.
> >
> > So, there are still two mysteries:
> > - why did it get stuck in the first place?
> > - why are my "number of running queries" counters stuck at non-zero
> > values?
> >
> > Does anything above ring a bell for anyone?
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
> --
> Thanks,
> Michael
>



-- 
Todd Lipcon
Software Engineer, Cloudera
