Hi Samarth,

IIRC, nothing has changed around loading local segments when historicals start up. The above log suggests that you have 4816 segments to load. How long does it take to load all of them? How slow is it compared to before?
Jihoon

On Thu, Jul 19, 2018 at 1:37 PM Samarth Jain <[email protected]> wrote:
> Thanks for the reply, Clint. It does look related.
>
> We also noticed that historicals are taking a long time to download
> segments after a restart. At least in 0.10.1, a restart of a historical
> wasn't a big deal: the segments it was responsible for serving were still
> available on local disk, so after a restart the segment load process would
> finish quickly. With 0.12.1, I am seeing that a restarted historical
> spends a lot of time loading segments.
>
> Sample log line:
>
> @400000005b50ee941f5ead84 2018-07-19T20:03:22,526 INFO
> [Segment-Load-Startup-5] io.druid.server.coordination.SegmentLoadDropHandler -
> Loading segment[3681/4816][xyz_druid_raw_2015-10-11T00:00:00.000Z_2015-10-12T00:00:00.000Z_2018-03-14T23:58:34.091Z]
>
> Has something changed around how segment assignment works since 0.10.1
> that could cause this change in behavior? I could always dig into the code
> and try to figure it out, but since you have actively worked in this area
> recently, I thought you might have an idea.
>
> On Thu, Jul 19, 2018 at 1:26 PM, Clint Wylie <[email protected]> wrote:
> > You might be running into something related to these issues:
> > https://github.com/apache/incubator-druid/issues/5531 and
> > https://github.com/apache/incubator-druid/issues/5882, the former of
> > which should be fixed in 0.12.2. The effects of these issues can be at
> > least partially mitigated by setting maxSegmentsInNodeLoadingQueue and
> > maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
> > to limit how deep the load queues get and to minimize the number of bad
> > decisions the coordinator makes when a historical disappears due to a zk
> > blip, an upgrade, or anything else.
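[Editor's note, not part of the original thread: in 0.12-era Druid, the two settings Clint mentions are coordinator *dynamic* configs, applied by POSTing JSON to the coordinator's config endpoint rather than via a properties file. A minimal sketch, with a placeholder host/port and purely illustrative values:]

```shell
# Hypothetical example: cap per-node load queues and balancer moves via the
# coordinator dynamic config API. Host, port, and the numeric values are
# placeholders -- tune them for your own cluster.
curl -X POST "http://coordinator-host:8081/druid/coordinator/v1/config" \
  -H 'Content-Type: application/json' \
  -d '{
        "maxSegmentsInNodeLoadingQueue": 100,
        "maxSegmentsToMove": 5
      }'
```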
> > On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <[email protected]> wrote:
> > > Hi Jihoon,
> > >
> > > I have a 6 node historical test cluster. 3 nodes are at ~80% disk
> > > utilization, and the other two at ~60% and ~50%.
> > >
> > > The interesting thing is that the 6th node ended up hitting a zk
> > > timeout (because of a large GC pause) and is no longer part of the
> > > cluster (which is a separate issue I am trying to figure out). On this
> > > 6th node, I see that it is busy loading segments. However, once it is
> > > done downloading, I am not sure if it will report back to zk as being
> > > available.
> > >
> > > On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <[email protected]> wrote:
> > > > Hi Samarth,
> > > >
> > > > Have you had a chance to check the segment balancing status of your
> > > > cluster? Do you see any significant imbalance between historicals?
> > > >
> > > > Jihoon
> > > >
> > > > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <[email protected]> wrote:
> > > > > I am working on upgrading our internal cluster to the 0.12.1
> > > > > release and seeing that a few data sources fail to load. Looking
> > > > > at coordinator logs, I am seeing messages like this for the
> > > > > datasource:
> > > > >
> > > > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > > > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> > > > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > > > > segment[*datasource*_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
> > > > >
> > > > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > > > > available [_default_tier] servers or node capacity to assign primary
> > > > > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
> > > > > Expected Replicants[1]
> > > > >
> > > > > The datasource failed to load for a long time and then eventually
> > > > > was loaded successfully. Has anyone else seen this? I see a few
> > > > > fixes around segment loading and coordination in 0.12.2 (which I
> > > > > am hoping will be out soon), but I am not sure if they address
> > > > > this issue.
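[Editor's note, not part of the original thread: for diagnosing situations like the ones above, the coordinator exposes HTTP endpoints for watching load progress and queue depth. Host and port below are placeholders; a rough sketch:]

```shell
# Percentage of each datasource's segments that are fully loaded,
# cluster-wide (placeholder coordinator host/port):
curl "http://coordinator-host:8081/druid/coordinator/v1/loadstatus"

# Per-server count of segments still sitting in the load/drop queues,
# useful for spotting a historical stuck working through a deep queue:
curl "http://coordinator-host:8081/druid/coordinator/v1/loadqueue?simple"
```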
