Thanks for the reply, Clint. It does look related. We also noticed that historicals are taking a long time to download segments after a restart. In 0.10.1, restarting a historical wasn't a big deal: the segments it was responsible for serving were still available on local disk, so the segment load process finished quickly after a restart. With 0.12.1, I am seeing a restarted historical spend a lot of time loading segments.
Sample log line:

@400000005b50ee941f5ead84 2018-07-19T20:03:22,526 INFO [Segment-Load-Startup-5] io.druid.server.coordination.SegmentLoadDropHandler - Loading segment[3681/4816][xyz_druid_raw_2015-10-11T00:00:00.000Z_2015-10-12T00:00:00.000Z_2018-03-14T23:58:34.091Z]

Has something changed around how segment assignment works since 0.10.1 that could cause this change in behavior? I could always dig into the code and try to figure it out, but since you have actively worked in this area recently, I thought you might have an idea?

On Thu, Jul 19, 2018 at 1:26 PM, Clint Wylie <clint.wy...@imply.io> wrote:

> You might be running into something related to these issues
> https://github.com/apache/incubator-druid/issues/5531 and
> https://github.com/apache/incubator-druid/issues/5882, the former of which
> should be fixed in 0.12.2. The effects of these issues can be at least
> partially mitigated by setting maxSegmentsInNodeLoadingQueue and
> maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
> to limit how deep load queues get and to minimize the number of bad
> decisions the coordinator makes when a historical disappears due to a zk
> blip, upgrade, or anything else.
>
> On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <samarth.j...@gmail.com> wrote:
>
> > Hi Jihoon,
> >
> > I have a 6 node historical test cluster. 3 nodes are at ~80% and the
> > other two at ~60% and ~50% disk utilization.
> >
> > The interesting thing is that the 6th node ended up getting into a zk
> > timeout (because of a large GC pause) and is no longer part of the
> > cluster (which is a separate issue I am trying to figure out).
> > On this 6th node, I see that it is busy loading segments. However, once
> > it is done downloading, I am not sure if it will report back to zk as
> > being available.
> >
> > On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <ghoon...@gmail.com> wrote:
> >
> > > Hi Samarth,
> > >
> > > have you had a chance to check the segment balancing status of your
> > > cluster?
> > > Do you see any significant imbalance between historicals?
> > >
> > > Jihoon
> > >
> > > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <samarth.j...@gmail.com>
> > > wrote:
> > >
> > > > I am working on upgrading our internal cluster to the 0.12.1 release
> > > > and seeing that a few data sources fail to load. Looking at
> > > > coordinator logs, I am seeing messages like this for the datasource:
> > > >
> > > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> > > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > > > segment[*datasource*_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
> > > >
> > > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > > > available [_default_tier] servers or node capacity to assign primary
> > > > segment[*datasource*_2015-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
> > > > Expected Replicants[1]
> > > >
> > > > The datasource failed to load for a long time but eventually loaded
> > > > successfully. Has anyone else seen this? I see a few fixes around
> > > > segment loading and coordination in 0.12.2 (which I am hoping will be
> > > > out soon), but I am not sure if they address this issue.
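For anyone wanting to try the mitigation Clint suggests above, here is a rough sketch of setting those two coordinator dynamic config values. It is only illustrative: the coordinator host/port and the numeric values are placeholder assumptions, not recommendations from this thread, and the endpoint used is the coordinator dynamic config API described in the docs linked above.

    # Sketch: bump maxSegmentsInNodeLoadingQueue and maxSegmentsToMove via the
    # coordinator dynamic config endpoint. Host, port, and values are placeholders.
    import requests

    COORDINATOR = "http://localhost:8081"  # assumption: default coordinator address

    dynamic_config = {
        # Cap how deep any single historical's load queue can get.
        "maxSegmentsInNodeLoadingQueue": 100,
        # Cap how many segments the balancer moves per coordinator run.
        "maxSegmentsToMove": 5,
    }

    # POST the dynamic config. Note: this may replace other dynamic config
    # fields with defaults, so consider GETting the current config and
    # merging these values into it before posting.
    resp = requests.post(
        f"{COORDINATOR}/druid/coordinator/v1/config",
        json=dynamic_config,
    )
    resp.raise_for_status()
    print("Coordinator dynamic config updated:", resp.status_code)

The same update can of course be made with curl or through the coordinator console; the point is simply to bound load queue depth and balancer churn while the coordinator reacts to a historical dropping out.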