Thanks for the reply, Clint. It does look related.

We also noticed that historicals are taking a long time to download
segments after a restart. In 0.10.1, restarting a historical wasn't a big
deal: the segments it was responsible for serving were still available on
local disk, so the segment load process finished quickly after the restart.
With 0.12.1, I am seeing a restarted historical spend a lot of time loading
segments.

Sample log line:

@400000005b50ee941f5ead84 2018-07-19T20:03:22,526 INFO [Segment-Load-Startup-5] io.druid.server.coordination.SegmentLoadDropHandler - Loading segment[3681/4816][xyz_druid_raw_2015-10-11T00:00:00.000Z_2015-10-12T00:00:00.000Z_2018-03-14T23:58:34.091Z]
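
As a side note, one way to watch how deep the load/drop queues are getting while this happens is to poll the coordinator's loadqueue endpoint. A rough sketch (the coordinator address and the exact response field names here are assumptions on my part, so please adjust for your setup):

import json
import urllib.request

# Assumption: coordinator is reachable at this address.
COORDINATOR = "http://localhost:8081"

# ?simple returns per-server counts rather than full segment lists.
with urllib.request.urlopen(COORDINATOR + "/druid/coordinator/v1/loadqueue?simple") as resp:
    queues = json.load(resp)

for server, stats in queues.items():
    # Field names as I recall them; verify against your coordinator's actual response.
    print(server,
          "segmentsToLoad:", stats.get("segmentsToLoad"),
          "segmentsToDrop:", stats.get("segmentsToDrop"))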


Has something changed in how segment assignment works since 0.10.1 that
could cause this change in behavior? I could dig into the code and try to
figure it out, but since you have actively worked in this area recently, I
thought you might have an idea.
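
In the meantime, this is roughly how one could apply the two dynamic config settings you suggested, fetching the current config first so the fields we don't touch keep their values. The coordinator address and the specific numbers below are only placeholders, not recommendations:

import json
import urllib.request

# Assumption: coordinator is reachable at this address; the values below are illustrative.
COORDINATOR = "http://localhost:8081"
CONFIG_URL = COORDINATOR + "/druid/coordinator/v1/config"

# Fetch the current dynamic config so unspecified fields are preserved.
with urllib.request.urlopen(CONFIG_URL) as resp:
    config = json.load(resp)

config["maxSegmentsInNodeLoadingQueue"] = 100  # cap load queue depth per historical (placeholder)
config["maxSegmentsToMove"] = 5                # limit balancing moves per coordinator run (placeholder)

req = urllib.request.Request(
    CONFIG_URL,
    data=json.dumps(config).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)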

On Thu, Jul 19, 2018 at 1:26 PM, Clint Wylie <clint.wy...@imply.io> wrote:

> You might be running into something related to these issues
> https://github.com/apache/incubator-druid/issues/5531 and
> https://github.com/apache/incubator-druid/issues/5882, the former of which
> should be fixed in 0.12.2. The effects of these issues can be at least
> partially mitigated by setting maxSegmentsInNodeLoadingQueue and
> maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
> to limit how deep load queues get and to minimize the number of bad
> decisions the coordinator makes when a historical disappears due to a zk
> blip, upgrade, or anything else.
>
> On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <samarth.j...@gmail.com>
> wrote:
>
> > Hi Jihoon,
> >
> > I have a 6 node historical test cluster. 3 nodes are at ~80% and the
> > other two at ~60 and ~50% disk utilization.
> >
> > The interesting thing is that the 6th node ended up getting into a zk
> > timeout (because of a large GC pause) and is no longer part of the
> > cluster (which is a separate issue I am trying to figure out).
> > On this 6th node, I see that it is busy loading segments. However, once
> > it is done downloading, I am not sure if it will report back to zk as
> > being available.
> >
> > On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <ghoon...@gmail.com> wrote:
> >
> > > Hi Samarth,
> > >
> > > have you had a chance to check the segment balancing status of your
> > > cluster?
> > > Do you see any significant imbalance between historicals?
> > >
> > > Jihoon
> > >
> > > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <samarth.j...@gmail.com>
> > > wrote:
> > >
> > > > I am working on upgrading our internal cluster to 0.12.1 release and
> > > > seeing that a few data sources fail to load. Looking at coordinator
> > > > logs, I am seeing messages like this for the datasource:
> > > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon - Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop segment[*datasource*_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
> > > >
> > > >
> > > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No available [_default_tier] servers or node capacity to assign primary segment[*datasource*_2015-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]! Expected Replicants[1]
> > > >
> > > >
> > > > The datasource failed to load for a long time and then eventually was
> > > > loaded successfully. Has anyone else seen this? I see a few fixes
> > > > around segment loading and coordination in 0.12.2 (which I am hoping
> > > > will be out soon) but I am not sure if they address this issue.
> > > >
> > >
> >
>
