Harini, those are interesting findings. I'm not sure if the two pauses are
necessary, but my thought is that it ideally shouldn't matter because the
supervisor shouldn't be taking that long to handle its notices. A couple
things come to mind about that:

1) Did you see what specifically the supervisor is doing when it's handling
the notices? Maybe from a stack trace? We should look into optimizing it,
or making it asynchronous or something, depending on what it is.
2) That said, there isn't really a need to trigger a run for every single
task status change anyway; I think it's ok to coalesce them (rough sketch of
the idea below). This patch would do it: https://github.com/apache/druid/pull/12018
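
Something like this, schematically (a self-contained sketch only; made-up
class names, not the code in that PR or the real supervisor classes):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration of coalescing: however many task status changes arrive while
// a run is already pending, at most one "run" notice sits in the queue.
class CoalescingRunNotifier
{
  private final BlockingQueue<Runnable> notices = new LinkedBlockingQueue<>();
  private final AtomicBoolean runPending = new AtomicBoolean(false);

  // Called on every task status change; only enqueues if no run is pending yet.
  void requestRun(Runnable runNotice)
  {
    if (runPending.compareAndSet(false, true)) {
      notices.add(() -> {
        // Clear the flag before running so a status change that arrives during
        // the run schedules exactly one more run afterwards.
        runPending.set(false);
        runNotice.run();
      });
    }
  }

  // Notice-processing loop, standing in for the supervisor/manage thread.
  void processLoop() throws InterruptedException
  {
    while (!Thread.currentThread().isInterrupted()) {
      notices.take().run();
    }
  }
}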

Jason, also interesting findings! I took a crack at rebasing your patch on
master and adding a scale test for the TaskQueue with 1000 simulated tasks:
https://github.com/apache/druid/compare/master...gianm:tq-scale-test. When
I run the scale test, "doMassLaunchAndExit" passes quickly but
"doMassLaunchAndShutdown" times out. I suppose shutting down lots of tasks
is still a bottleneck.

Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should be pretty
straightforward to make the shutdown API asynchronous, which would help
speed up anything that is shutting down lots of tasks all at once. Would
that be helpful in your environments? Or are the changes to move shutdown
out of critical sections going to be enough?
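
For concreteness, the kind of change I have in mind looks roughly like this
(a schematic sketch; made-up names, not the actual RemoteTaskRunner /
HttpRemoteTaskRunner interfaces):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: instead of blocking on one HTTP shutdown call per task, push each
// call onto an executor and return a future, so callers shutting down many
// tasks can fan the requests out and wait on all of them at once.
class AsyncShutdownSketch
{
  private final ExecutorService shutdownExec = Executors.newFixedThreadPool(8);

  CompletableFuture<Void> shutdownAsync(String taskId, String reason)
  {
    return CompletableFuture.runAsync(() -> sendShutdownHttpRequest(taskId, reason), shutdownExec);
  }

  // Mass shutdown: issue all requests concurrently, then wait for completion.
  void shutdownAll(List<String> taskIds, String reason)
  {
    CompletableFuture
        .allOf(taskIds.stream()
                      .map(id -> shutdownAsync(id, reason))
                      .toArray(CompletableFuture[]::new))
        .join();
  }

  private void sendShutdownHttpRequest(String taskId, String reason)
  {
    // Placeholder for the POST to the worker's task shutdown endpoint.
  }
}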

On Wed, Dec 1, 2021 at 1:27 PM Jason Koch <jk...@netflix.com.invalid> wrote:

> Hi Harini,
>
> We have seen issues like this at task roll time, related to task queue
> notifications on Overlord instances; I have a patch running internally that
> resolves this.
>
> These are my internal triage notes:
> ======
> - Whenever task scheduling is happening (startup, ingest segment task
> rollover, redeployment of a datasource), the Overlord takes a long time to
> assign workers. This compounds because tasks sit so long before deployment
> that the Overlord starts failing tasks and having to relaunch them.
>
>    - TaskQueue: notifyStatus(), which receives updates from the
>    MiddleManagers, and the manage() loop, which controls services, both run
>    through a single lock. For example, a shutdown request involves
>    submitting downstream HTTP requests synchronously (while holding the
>    lock). A simplified sketch of this pattern follows these notes.
>    - This means that, for a cluster with ~700 tasks, the lock is held for
>    nearly 1 second, and only 1-2 notifications can be processed per trip
>    around the manage loop. For a fresh startup with 700 tasks and a
>    1-second delay, that is 300-600 or more seconds before the Overlord
>    realises all the tasks have been started by the MiddleManagers.
>    - Similar delays happen for any other operations.
>    - Sub-optimal logging code path (double-concatenating very long log
>    entries).
>    - ZkWorker: the worker fully deserializes all ZK payload data every time
>    it looks up task IDs, rather than only looking at the ID fields.
>    Similarly, it repeatedly re-fetches data on task assignment.
>
> =====
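>
> A simplified, self-contained illustration of the locking pattern described
> in the notes above, and the direction of the fix (schematic names only; not
> the actual TaskQueue code, nor the patch linked below):
>
> import java.util.concurrent.locks.ReentrantLock;
>
> // Schematic only: the notification path makes a blocking remote call while
> // holding the same lock the manage() loop needs, so every notification
> // serializes behind it.
> class NotifyStatusSketch
> {
>   private final ReentrantLock giantLock = new ReentrantLock();
>
>   // BEFORE: the HTTP round trip happens inside the critical section.
>   void notifyStatusHoldingLock(String taskId)
>   {
>     giantLock.lock();
>     try {
>       updateInMemoryState(taskId);
>       sendShutdownHttpRequest(taskId); // blocking network call under the lock
>     }
>     finally {
>       giantLock.unlock();
>     }
>   }
>
>   // AFTER: only in-memory bookkeeping happens under the lock; the slow
>   // remote call moves outside the critical section (or onto an executor).
>   void notifyStatusOutsideLock(String taskId)
>   {
>     giantLock.lock();
>     try {
>       updateInMemoryState(taskId);
>     }
>     finally {
>       giantLock.unlock();
>     }
>     sendShutdownHttpRequest(taskId);
>   }
>
>   private void updateInMemoryState(String taskId) { /* bookkeeping only */ }
>
>   private void sendShutdownHttpRequest(String taskId) { /* remote call to the worker */ }
> }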
>
> The patch I have is here:
> https://github.com/jasonk000/druid/pull/7/files
>
> It fixes a couple of things, most importantly the task queue notification
> system. The system is much more stable with high task counts and will
> easily restart many tasks concurrently.
>
> I have other perf issues I want to look at first before I can document it
> fully, build a test case, rebase it on apache/master, etc. If you test it
> out, and it works, we could submit a PR that would resolve it.
>
> PS - I have a queue of similar fixes I'd like to submit, but I need some
> time to do the documentation, build test cases, upstream them, etc. If
> anyone wants to collaborate, I could open some issues and share my partial
> notes.
>
> Thanks
> Jason
>
> On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran
> <hrajend...@confluent.io.invalid> wrote:
>
> > Hi all,
> >
> > I have been investigating this in the background for a few days now and
> > need some help from the community.
> >
> > We noticed that every hour, when the tasks roll, we see a spike in the
> > ingestion lag for about 2-4 minutes. We have 180 tasks running on this
> > datasource.
> > [image: Screen Shot 2021-12-01 at 9.14.23 AM.png]
> >
> > On further debugging of the task logs, we found that around the time the
> > ingestion lag spikes up, *the gap between the pause and resume commands
> > during checkpointing is wide, ranging from a few seconds to a couple of
> > minutes*. For example, in the following task logs you can see that it was
> > about 1.5 minutes.
> > {"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG", "message":"Received pause command, *pausing* ingestion until resumed.", "service.name":"druid/middleManager", "event.dataset":"druid/middleManager.log", "process.thread.name":"task-runner-0-priority-0", "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> > {"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG", "message":"Received pause command, *pausing* ingestion until resumed.", "service.name":"druid/middleManager", "event.dataset":"druid/middleManager.log", "process.thread.name":"task-runner-0-priority-0", "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> > {"@timestamp":"2021-11-18T*20:08:26.329Z*", "log.level":"DEBUG", "message":"Received resume command, *resuming* ingestion.", "service.name":"druid/middleManager", "event.dataset":"druid/middleManager.log", "process.thread.name":"task-runner-0-priority-0", "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> > So this explains why ingestion is lagging: the *tasks are paused for a
> > long time*.
> >
> > *Why are there 2 pauses during checkpointing, and why such a huge gap?*
> > As a next step, I wanted to see why there is such a wide gap. We realized
> > that the first pause is when the task pauses itself while requesting a
> > checkpoint from the supervisor, here:
> > https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728
> > The second pause is when the supervisor actually handles the checkpoint
> > notice, here:
> > https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548
> > And since the supervisor thread for this datasource takes such a long time
> > to process all the notices in the queue ahead of this checkpoint notice,
> > the ingestion task ends up being in the paused state for a long time.
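> >
> > To make the sequence concrete, here is a schematic of the two pauses as we
> > understand them (hypothetical method names; the real logic is in
> > SeekableStreamIndexTaskRunner and SeekableStreamSupervisor at the links
> > above):
> >
> > // Schematic only, not actual Druid code.
> > class CheckpointPausesSketch
> > {
> >   // Task side.
> >   void taskRequestsCheckpoint()
> >   {
> >     pauseIngestion();                     // pause #1 (20:06:58 in the logs above)
> >     submitCheckpointNoticeToSupervisor(); // the notice lands on the supervisor's noticesQueue
> >     // The task now sits paused until the supervisor reaches the notice.
> >   }
> >
> >   // Supervisor side: notices are handled one at a time, so the checkpoint
> >   // notice waits behind everything queued ahead of it.
> >   void supervisorHandlesCheckpointNotice()
> >   {
> >     drainEarlierNotices();                // hundreds of notices during a task roll
> >     pauseTaskGroup();                     // pause #2 (20:08:26 in the logs above)
> >     collectOffsetsAndResume();            // the resume arrives only after this point
> >   }
> >
> >   private void pauseIngestion() {}
> >   private void submitCheckpointNoticeToSupervisor() {}
> >   private void drainEarlierNotices() {}
> >   private void pauseTaskGroup() {}
> >   private void collectOffsetsAndResume() {}
> > }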
> >
> > *Why does the supervisor thread take such a long time to get to this
> > checkpoint notice?*
> > That was my next step in debugging. Confirming our theory, we noticed that
> > the *noticesQueue in the supervisor does get backed up with hundreds of
> > notices every hour when tasks roll*.
> > [image: Screen Shot 2021-12-01 at 9.32.59 AM.png]
> > And we saw that *run_notice takes between 5s and 7s during task rolls*.
> > This backs up the noticesQueue, so the checkpoint notice sits in the queue
> > for a long time, leading to the ingestion lag spike whenever tasks roll.
> > [image: Screen Shot 2021-12-01 at 9.34.29 AM.png]
> >
> > *Why does run_notice take 5-7s to finish?*
> > When a task starts, it takes about 5 seconds for its HTTP server to come
> > up. Until then, the supervisor thread is in a loop trying to get the task
> > status, which causes run_notice to take about 5-7 seconds to finish.
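> >
> > Schematically, the pattern looks like this (hypothetical names, not the
> > actual supervisor code):
> >
> > // Sketch of why run_notice stalls: it blocks in a retry loop until the
> > // newly started task's HTTP server answers, so ~5s of server startup
> > // turns into ~5-7s per run_notice.
> > class TaskStatusPollSketch
> > {
> >   String waitForTaskStatus(String taskId) throws InterruptedException
> >   {
> >     for (int attempt = 0; attempt < 10; attempt++) {
> >       try {
> >         return fetchStatusOverHttp(taskId); // fails until the task's HTTP server is up
> >       }
> >       catch (RuntimeException serverNotUpYet) {
> >         Thread.sleep(1_000); // retry delay; a few iterations add up to the observed 5-7s
> >       }
> >     }
> >     throw new IllegalStateException("task never became reachable: " + taskId);
> >   }
> >
> >   private String fetchStatusOverHttp(String taskId)
> >   {
> >     // Placeholder standing in for the real HTTP status call to the task.
> >     throw new RuntimeException("task HTTP server not up yet");
> >   }
> > }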
> >
> > *Questions to the community*
> > Do we need 2 pauses during checkpointing? Should the task pause itself
> > before requesting a checkpoint, given that the supervisor is going to
> > pause the task anyway while handling the notice? Or is it okay to remove
> > the pause in the task runner before it sends the checkpoint notice request
> > to the supervisor? That would immediately solve the ingestion lag issue
> > completely, since there would no longer be two pauses with such a huge gap
> > in between.
> >
> > Harini
> > Software Engineer, Confluent
> >
>
