You might be running into something related to these issues: https://github.com/apache/incubator-druid/issues/5531 and https://github.com/apache/incubator-druid/issues/5882, the former of which should be fixed in 0.12.2. The effects of these issues can be at least partially mitigated by setting maxSegmentsInNodeLoadingQueue and maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html) to limit how deep the load queues get and to minimize the number of bad decisions the coordinator makes when a historical disappears due to a zk blip, an upgrade, or anything else.
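For what it's worth, both of those settings are coordinator dynamic configs, so they can be changed at runtime by POSTing to the coordinator rather than editing runtime properties and restarting. A minimal sketch follows; the host/port and the specific values are placeholders I made up for illustration, not recommendations:

```shell
# Sketch only: "localhost:8081" stands in for your coordinator host:port,
# and the values below are illustrative assumptions -- tune for your cluster.
# This POSTs the coordinator dynamic configuration as a JSON payload.
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8081/druid/coordinator/v1/config \
  -d '{
    "maxSegmentsToMove": 5,
    "maxSegmentsInNodeLoadingQueue": 100
  }'
```

You can GET the same endpoint afterwards to confirm the values took effect.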
On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <samarth.j...@gmail.com> wrote:
> Hi Jihoon,
>
> I have a 6 node historical test cluster. 3 nodes are at ~80% and the other
> two at ~60 and ~50% disk utilization.
>
> The interesting thing is that the 6th node ended up getting into zk timeout
> (because of a large GC pause) and is no longer part of the cluster (which is
> a separate issue I am trying to figure out).
> On this 6th node, I see that it is busy loading segments. However, once it
> is done downloading, I am not sure if it will report back to zk as being
> available.
>
> On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <ghoon...@gmail.com> wrote:
>
> > Hi Samarth,
> >
> > have you had a chance to check the segment balancing status of your
> > cluster?
> > Do you see any significant imbalance between historicals?
> >
> > Jihoon
> >
> > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <samarth.j...@gmail.com>
> > wrote:
> >
> > > I am working on upgrading our internal cluster to the 0.12.1 release
> > > and seeing that a few data sources fail to load. Looking at coordinator
> > > logs, I am seeing messages like this for the datasource:
> > >
> > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > > segment[*datasource*_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
> > >
> > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > > available [_default_tier] servers or node capacity to assign primary
> > > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
> > > Expected Replicants[1]
> > >
> > > The datasource failed to load for a long time and then eventually was
> > > loaded successfully. Has anyone else seen this? I see a few fixes
> > > around segment loading and coordination in 0.12.2 (which I am hoping
> > > will be out soon) but I am not sure if they address this issue.