On Thu, Aug 30, 2018 at 2:48 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <
> pooja.nilange...@cloudera.com> wrote:
>
>> Hi Todd,
>>
>> I believe you are right. There are a couple of other race conditions in
>> the HdfsScanNode which have resulted in broken builds in the last couple
>> of weeks (IMPALA-7418 and IMPALA-7335). While reviewing the change for
>> IMPALA-7335, Tim and Dan suggested that we need to clean up the state
>> invariants of the HdfsScanNode to find and fix other such races. I am
>> going to create a JIRA to track it, and I could add this issue to the list
>> of races to be fixed. If this issue occurs more frequently, we could add a
>> temporary fix for now. Let me know what you'd suggest.
>>
>
> Makes sense. I'm currently trying to put together a simple test and fix
> that targets this race in particular, but definitely agree that the
> interactions between the scanner threads and scan node are hard to reason
> about and could be improved more generally.
>
> For now I'm trying to use a DebugAction to inject probabilistic failures
> into the soft memory limit checks to see if I can reproduce it more easily.
>
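
As a rough illustration of that injection idea (hypothetical names here, not
the actual DebugAction plumbing), the hook just needs to pretend the soft
memory limit was exceeded with some probability, so the early-exit branch in
the scanner thread fires often enough to hit the race:

    #include <random>

    // Hypothetical sketch only -- not Impala's real DebugAction API.
    // Returns true with probability 'p', simulating a soft memory limit
    // being exceeded so the scanner thread's early-exit path is exercised.
    static bool InjectSoftLimitExceeded(double p) {
      static thread_local std::mt19937 gen{std::random_device{}()};
      return std::bernoulli_distribution(p)(gen);
    }

    // At the check site (names illustrative):
    //   if (!first_thread &&
    //       (mem_tracker_->AnyLimitExceeded(MemLimit::SOFT) ||
    //        InjectSoftLimitExceeded(0.05))) {
    //     break;
    //   }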

Put this up here with a fix: https://gerrit.cloudera.org/c/11369/
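
For reference, the reordering described in the quoted analysis below amounts
to something like the following sketch of the tail of the scanner thread loop
(illustrative only; the actual change is in the review above and may differ in
detail):

    // Check progress_ first, so whichever thread finishes the last range
    // always calls SetDone() before any early exit.
    if (progress_.done()) {
      // All ranges are finished. Indicate we are done.
      SetDone();
      break;
    }

    if (scan_range == nullptr && num_unqueued_files == 0) {
      unique_lock<mutex> l(lock_);
      // Every remaining range is either done or owned by another thread.
      all_ranges_started_ = true;
      break;
    }

    // Only now stop extra threads to free up memory; at this point we know
    // we were not the thread that completed the final range.
    if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
      break;
    }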

-Todd


>
> -Todd
>
>
>> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <t...@cloudera.com> wrote:
>>
>> > I spent another 45 minutes looking at the core and think I figured this
>> > out. I found that for the stuck scanner, both of its scanner threads had
>> > run and exited, but for some reason it was never marked as "done".
>> >
>> > I think this might be a regression due to IMPALA-7096
>> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
>> > following code:
>> >
>> >     // Stop extra threads if we're over a soft limit in order to free up
>> >     // memory.
>> >     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>> >       break;
>> >     }
>> >
>> >     // Done with range and it completed successfully
>> >     if (progress_.done()) {
>> >       // All ranges are finished.  Indicate we are done.
>> >       SetDone();
>> >       break;
>> >     }
>> >
>> >     if (scan_range == nullptr && num_unqueued_files == 0) {
>> >       unique_lock<mutex> l(lock_);
>> >       // All ranges have been queued and DiskIoMgr has no more new ranges
>> >       // for this scan node to process. This means that every range is
>> >       // either done or being processed by another thread.
>> >       all_ranges_started_ = true;
>> >       break;
>> >     }
>> >   }
>> >
>> > What if we have the following scenario:
>> >
>> > T1) grab scan range 1 and start processing
>> >
>> > T2) grab scan range 2 and start processing
>> >
>> > T1) finish scan range 1 and see that 'progress_' is not done()
>> > T1) loop around, get no scan range (there are no more), so set
>> > all_ranges_started_ and break
>> > T1) thread exits
>> >
>> > T2) finish scan range 2
>> > T2) happen to hit a soft memory limit error due to pressure from other
>> > exec nodes, etc. Since we aren't the first thread, we break (even though
>> > the first thread is no longer running).
>> > T2) thread exits
>> >
>> > Note that no one got to the point of calling SetDone(), because we break
>> > due to the memory limit error _before_ checking progress_.done().
>> >
>> > Thus, the query will hang forever.
>> >
>> > It seems to me that the order of these conditions needs to be
>> > rejiggered: the breaking due to the memory limit should be the last thing
>> > in the loop so that, if we were the thread to finish the last scan range,
>> > we properly call SetDone() before exiting.
>> >
>> > Does that make sense? If so, I'll file a JIRA. Suggestions are also
>> > welcome on how to write a test for this (do we have an easy facility to
>> > inject probabilistic mem-limit failures in that code path?)
>> >
>> > -Todd
>> >
>> >
>> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <k...@cloudera.com> wrote:
>> >
>> > > FWIW, please see the discussion at
>> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO"
>> > > may render cancellation in the HDFS scan node ineffective. We have seen
>> > > users report similar backtraces in the past due to hung RPCs to the HDFS
>> > > name node. Not saying this is necessarily the case here.
>> > >
>> > > A couple of things which may be interesting to look at:
>> > > - is there any scan node in the profile which doesn't finish any
>> > > assigned scan ranges?
>> > > - if you happen to have a core, it may help to inspect the stack traces
>> > > of the scanner threads and the disk io mgr threads to understand their
>> > > states.
>> > >
>> > > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <t...@cloudera.com> wrote:
>> > >
>> > > > Hey folks,
>> > > >
>> > > > I ran into some issues with a local core test run last night. This
>> > > > is on my own branch with some uncommitted work, but I haven't touched
>> > > > the backend and these issues seem to be backend-focused.
>> > > >
>> > > > Initially, I noticed the problem because a test run that I started
>> > > > last night was still "stuck" this morning. I had three queries which
>> > > > had been running for 10 hours and not making progress. The 'fragments'
>> > > > page for the query showed that one of the backends had not reported to
>> > > > the coordinator for many hours. In attempting to debug this, I managed
>> > > > to get the StateStore to declare that node dead, and those queries
>> > > > eventually were cancelled due to that.
>> > > >
>> > > > I resumed the node that I had been debugging with gdb, and it was
>> > > > declared live again. I didn't restart the process, though, which might
>> > > > have led to further issues:
>> > > >
>> > > > Next, I also noticed that /queries on my coordinator node showed
>> > > > "Number of running queries with fragments on this host" at 100+ on
>> > > > all three nodes, even though no queries were running anymore. These
>> > > > numbers are stable. I can continue to issue new queries and they
>> > > > complete normally. While a query is running, the count of fragments
>> > > > goes up appropriately, and when the query finishes, it goes back down.
>> > > > But the "base" numbers are stuck at {108, 391, 402} with nothing
>> > > > running.
>> > > >
>> > > > I also found that the node that had had the original problems has
>> > > > three stuck fragments, all waiting on HdfsScanNode::GetNext:
>> > > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
>> > > > Looking in the logs, I see continuous spew that it's trying to cancel
>> > > > the fragment instances for this query, but apparently the cancellation
>> > > > is not working. I talked to Michael Ho about this and it sounds like
>> > > > this is a known issue with cancellation.
>> > > >
>> > > > So, there are still two mysteries:
>> > > > - why did it get stuck in the first place?
>> > > > - why are my "number of running queries" counters stuck at non-zero
>> > > > values?
>> > > >
>> > > > Does anything above ring a bell for anyone?
>> > > >
>> > > > -Todd
>> > > > --
>> > > > Todd Lipcon
>> > > > Software Engineer, Cloudera
>> > > >
>> > >
>> > >
>> > > --
>> > > Thanks,
>> > > Michael
>> > >
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
