You might be running into something related to these issues:
https://github.com/apache/incubator-druid/issues/5531 and
https://github.com/apache/incubator-druid/issues/5882, the former of which
should be fixed in 0.12.2. The effects of these issues can be at least
partially mitigated by setting maxSegmentsInNodeLoadingQueue and
maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
to limit how deep load queues get and to minimize the number of bad
decisions the coordinator makes when a historical disappears due to a zk
blip, an upgrade, or anything else.
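As a sketch, both settings live in the coordinator's dynamic configuration, which can be updated at runtime via the coordinator API; the host/port below are placeholders for your cluster, and the values are only illustrative starting points, not tuned recommendations:

```shell
# Update the coordinator dynamic config (placeholder host/port; illustrative values).
# maxSegmentsInNodeLoadingQueue caps how many segments may queue on one historical;
# maxSegmentsToMove caps how many segments the balancer moves per coordinator run.
curl -X POST -H 'Content-Type: application/json' \
  http://coordinator-host:8081/druid/coordinator/v1/config \
  -d '{"maxSegmentsInNodeLoadingQueue": 100, "maxSegmentsToMove": 5}'
```

A GET against the same /druid/coordinator/v1/config endpoint returns the current dynamic config, so you can confirm the change took effect.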

On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <samarth.j...@gmail.com>
wrote:

> Hi Jihoon,
>
> I have a 6 node historical test cluster. 3 nodes are at ~80% and the other
> two at ~60 and ~50% disk utilization.
>
> The interesting thing is that the 6th node ended up hitting a zk timeout
> (because of a large GC pause) and is no longer part of the cluster (which is
> a separate issue I am trying to figure out).
> On this 6th node, I see that it is busy loading segments. However, once it
> is done downloading, I am not sure if it will report back to zk as being
> available.
>
>
> On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <ghoon...@gmail.com> wrote:
>
> > Hi Samarth,
> >
> > Have you had a chance to check the segment balancing status of your
> > cluster?
> > Do you see any significant imbalance between historicals?
> >
> > Jihoon
> >
> > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <samarth.j...@gmail.com>
> > wrote:
> >
> > > I am working on upgrading our internal cluster to the 0.12.1 release and
> > > seeing that a few data sources fail to load. Looking at coordinator logs,
> > > I am seeing messages like this for the datasource:
> > >
> > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > > segment[*datasource*_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
> > >
> > >
> > >
> > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > > available [_default_tier] servers or node capacity to assign primary
> > > segment[*datasource*_2015-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
> > > Expected Replicants[1]
> > >
> > > The datasource failed to load for a long time and then eventually was
> > > loaded successfully. Has anyone else seen this? I see a few fixes around
> > > segment loading and coordination in 0.12.2 (which I am hoping will be out
> > > soon) but I am not sure if they address this issue.
> > >
> >
>
