Hi Jason and Gian,

I got some time to collect more data points regarding this issue and have
added them as a comment here
<https://github.com/apache/druid/issues/11414#issuecomment-1026275109>.
Can you take a look at it when you get a chance and let me know what you
think?

Harini
Software Engineer, Observability
+1 412 708 3872



On Tue, Jan 11, 2022 at 11:00 AM Jason Koch <jk...@netflix.com> wrote:

> > Thanks for the update, Jason. We shall wait for the builds to pass.
>
> Seems to me the builds are passing now (IMO). The remaining failures are
> related to the Docker image build; they look like issues with the test or
> build environment itself, but an expert opinion can perhaps correct me.
>
> > Also, are you planning to get #12099 PR reviewed by the community
> anytime soon?
>
> I think it would help if you could provide a real-world perspective on
> whether this patch set solves your issues, if that's at all possible.
>
> I'm not intending to actively seek any reviews until I have a few more PRs
> together. Gian, Frank, Xavier have all provided quality input and reviews
> on a few of the PRs so far, and it is a quiet / holiday period for many.
>
> Thanks
> Jason
>
> On Thu, Jan 6, 2022 at 8:19 AM Harini Rajendran <hrajend...@confluent.io>
> wrote:
>
>> Thanks for the update, Jason. We shall wait for the builds to pass.
>> Also, are you planning to get #12099 PR reviewed by the community
>> anytime soon?
>>
>> Harini
>> Software Engineer, Observability
>> +1 412 708 3872
>>
>>
>>
>> On Wed, Jan 5, 2022 at 8:25 PM Jason Koch <jk...@netflix.com> wrote:
>>
>>> Harini - these are as far as I can progress them: 12096 is good, 12097
>>> is good except for what seems to me to be a CI issue, and 12099 is "done"
>>> but has a CI issue as well; I'll ask for help below.
>>>
>>> Gian / Frank - looking for some help on these:
>>>
>>> - On #12099, it is being rejected because of an IntelliJ inspection
>>> issue. I've added the `noinspection` pragma, but it is being ignored by
>>> the linter. I'm not sure how to progress fixing this, apart from
>>> rewriting the solution, which would make it less "obvious" to reason about.
>>>
>>> - On #12097, the failures are at a libc-related package installation; I
>>> assume these are ignorable!
>>>
>>> Thanks!
>>> Jason
>>>
>>>
>>> On Wed, Jan 5, 2022 at 9:03 AM Harini Rajendran <hrajend...@confluent.io>
>>> wrote:
>>>
>>>> That's great! Thanks. I'll keep an eye out on those PRs.
>>>>
>>>> Harini
>>>> Software Engineer, Observability
>>>> +1 412 708 3872
>>>>
>>>>
>>>>
>>>> On Wed, Jan 5, 2022 at 11:00 AM Jason Koch <jk...@netflix.com> wrote:
>>>>
>>>>> Thanks for the prompt - yes, I'll get these fixed. They are code
>>>>> coverage / linter fixes; I had mistakenly assumed they were flaky tests.
>>>>> I'll aim to fix these today.
>>>>>
>>>>> On Wed, Jan 5, 2022 at 7:25 AM Harini Rajendran <
>>>>> hrajend...@confluent.io> wrote:
>>>>>
>>>>>> Hi Jason,
>>>>>>
>>>>>> I was taking a look at your PRs and see that the CI build is failing
>>>>>> for 2 of them. Do you know why those are failing, and are you planning
>>>>>> to fix them?
>>>>>>
>>>>>> Harini
>>>>>> Software Engineer, Observability
>>>>>> +1 412 708 3872
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 3, 2022 at 7:59 PM Harini Rajendran <
>>>>>> hrajend...@confluent.io> wrote:
>>>>>>
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>> I shall take a look at these 3 PRs and see if we can try these out
>>>>>>> in our test environment.
>>>>>>>
>>>>>>> Also, we use AWS RDS as the metadata engine.
>>>>>>>
>>>>>>> Harini
>>>>>>> Software Engineer, Observability
>>>>>>> +1 412 708 3872
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 31, 2021 at 3:22 PM Jason Koch <jk...@netflix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Harini,
>>>>>>>>
>>>>>>>> I had a chance to look at the checkpoint behaviour you mentioned in
>>>>>>>> more detail, and found two code paths where the RunNotice code ends
>>>>>>>> up in the TaskQueue and hits the same locks. I'd be interested to
>>>>>>>> hear the results if you want to try the three related PRs I have
>>>>>>>> submitted. (I added more detail to the issue
>>>>>>>> https://github.com/apache/druid/issues/11414.)
>>>>>>>>
>>>>>>>> Failing that, I think the best next step would be some stack traces
>>>>>>>> and/or profiler output from the supervisor at the time of rollover.
>>>>>>>> It would also be useful to know which metadata storage solution you
>>>>>>>> are using, as the RunNotice interacts with the metadata engine too.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Jason
>>>>>>>>
>>>>>>>> On Fri, Dec 3, 2021 at 2:43 PM Jason Koch <jk...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Gian,
>>>>>>>>>
>>>>>>>>> I've submitted a PR to gianm/tq-scale-test that provides a
>>>>>>>>> concurrent test (and fixes a concurrency bug I found along the way).
>>>>>>>>> The change uses an 8 ms response time for shutdown acknowledgment,
>>>>>>>>> and a 2 second time for shutdown completion/notification.
>>>>>>>>>
>>>>>>>>> Based on this test:
>>>>>>>>> - the serial TaskQueue times out after 60 sec on each test, and
>>>>>>>>> - the concurrent TaskQueue passes in ~10 sec per test.
>>>>>>>>>
>>>>>>>>> https://github.com/gianm/druid/pull/3/files
>>>>>>>>>
>>>>>>>>> Let me know your thoughts.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> On Fri, Dec 3, 2021 at 11:41 AM Jason Koch <jk...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Gian
>>>>>>>>>>
>>>>>>>>>> > Jason, also interesting findings! I took a crack at rebasing
>>>>>>>>>> > your patch on master and adding a scale test for the TaskQueue
>>>>>>>>>> > with 1000 simulated tasks:
>>>>>>>>>> > https://github.com/apache/druid/compare/master...gianm:tq-scale-test.
>>>>>>>>>> > When I run the scale test, "doMassLaunchAndExit" passes quickly
>>>>>>>>>> > but "doMassLaunchAndShutdown" times out. I suppose shutting down
>>>>>>>>>> > lots of tasks is still a bottleneck.
>>>>>>>>>>
>>>>>>>>>> Looks good, I'll come back to the test below.
>>>>>>>>>>
>>>>>>>>>> > Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should
>>>>>>>>>> > be pretty straightforward to make the shutdown API asynchronous,
>>>>>>>>>> > which would help speed up anything that is shutting down lots of
>>>>>>>>>> > tasks all at once. Would that be helpful in your environments? Or
>>>>>>>>>> > are the changes to move shutdown out of critical sections going
>>>>>>>>>> > to be enough?
>>>>>>>>>>
>>>>>>>>>> This would be the logical next step for some environments, but I
>>>>>>>>>> didn't need to go that far. For this particular cluster we are
>>>>>>>>>> reading from Kafka, which uses a SeekableStreamIndexTaskRunner; the
>>>>>>>>>> /stop call does not stop the task directly, it only sets a flag, so
>>>>>>>>>> the response back to the overlord comes back in single-digit
>>>>>>>>>> milliseconds.
>>>>>>>>>>
>>>>>>>>>> Some extra detail on the specific interaction as related to Kafka
>>>>>>>>>> might make
>>>>>>>>>> the problem/fix more clear ...
>>>>>>>>>>
>>>>>>>>>> ==
>>>>>>>>>>
>>>>>>>>>> Suppose for simplicity we have 500 tasks to roll, and each takes
>>>>>>>>>> 2ms to
>>>>>>>>>> acknowledge a stop request. TaskQueue#L322-L332 is going to issue
>>>>>>>>>> 500x2ms
>>>>>>>>>> requests to stop all of them, which will take approx 1 second to
>>>>>>>>>> complete.
>>>>>>>>>>
>>>>>>>>>> Notice it is doing this whilst holding the lock.
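>>>>>>>>>>
>>>>>>>>>> A minimal sketch of that shape (illustrative names only, not the
>>>>>>>>>> real TaskQueue code): blocking per-task stop calls issued while the
>>>>>>>>>> single lock is held.
>>>>>>>>>>
>>>>>>>>>> import java.util.List;
>>>>>>>>>>
>>>>>>>>>> // Illustrative sketch only (hypothetical names). The point is the
>>>>>>>>>> // shape of the problem: blocking I/O inside the critical section.
>>>>>>>>>> class QueueSketch {
>>>>>>>>>>   interface TaskClient { void shutdown(String taskId); }  // blocking HTTP call
>>>>>>>>>>
>>>>>>>>>>   private final Object giant = new Object();
>>>>>>>>>>
>>>>>>>>>>   void manage(List<String> tasksToShutDown, TaskClient client) {
>>>>>>>>>>     synchronized (giant) {                     // same lock notifyStatus needs
>>>>>>>>>>       for (String taskId : tasksToShutDown) {  // ~500 tasks
>>>>>>>>>>         client.shutdown(taskId);               // ~2 ms each
>>>>>>>>>>       }                                        // lock held ~1 second in total
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }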
>>>>>>>>>>
>>>>>>>>>> After cleanup, those tasks will issue a status update via ZK ->
>>>>>>>>>> RemoteTaskRunner::taskComplete to say they have completed. That
>>>>>>>>>> taskComplete fires the future completion, which lands back in
>>>>>>>>>> TaskQueue::notifyStatus, where the TaskQueue can now update its
>>>>>>>>>> state to record that the task has finished.
>>>>>>>>>>
>>>>>>>>>> But notifyStatus can *only* proceed once the lock has been
>>>>>>>>>> released; then it claims the lock and calls removeTaskInternal. At
>>>>>>>>>> that point the lock is released again, and maybe a few more
>>>>>>>>>> concurrent ZK->notifyStatus() calls proceed. Let's suppose we got
>>>>>>>>>> lucky and processed 10 requests, which have now been removed from
>>>>>>>>>> the TaskQueue.
>>>>>>>>>>
>>>>>>>>>> At some point, though, the TaskQueue manage loop is going to get
>>>>>>>>>> that lock, and we are now at 490 tasks that the queue believes are
>>>>>>>>>> running but which we expected to be stopped, so we issue another
>>>>>>>>>> 490*2ms=980ms of HTTP stop requests. And then, maybe, we get
>>>>>>>>>> another 10 notifyStatus completions... and we issue 480, and so on.
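>>>>>>>>>>
>>>>>>>>>> As a rough worked example using the numbers above (2ms per stop
>>>>>>>>>> request, ~10 notifications processed per pass): the manage loop
>>>>>>>>>> re-issues stops to 500, then ~490, then ~480 tasks, and so on, so
>>>>>>>>>> the stop-request time alone is roughly
>>>>>>>>>> 2ms x (500 + 490 + ... + 10) = 2ms x 12,750 ~= 25 seconds,
>>>>>>>>>> before counting lock-wait time or the tasks that get marked failed
>>>>>>>>>> and relaunched along the way.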
>>>>>>>>>>
>>>>>>>>>> Evidently, this is going to take a long time for the TaskQueue to
>>>>>>>>>> quiesce to the correct state, and things are a little confused
>>>>>>>>>> across the cluster until that happens. To top it off, the tasks and
>>>>>>>>>> the supervisor get confused about progress, so tasks are marked
>>>>>>>>>> failed and put back in the queue to restart, which makes it take
>>>>>>>>>> even longer.
>>>>>>>>>>
>>>>>>>>>> The fix is basically to make sure that TaskQueue::notifyStatus can
>>>>>>>>>> proceed to update the task state without blocking. Following that
>>>>>>>>>> we get a flood of ZK updates in short order, so making the logging
>>>>>>>>>> & ZK processing more efficient significantly reduces the time it
>>>>>>>>>> takes to quiesce.
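>>>>>>>>>>
>>>>>>>>>> One possible shape for that kind of change, as a hedged sketch only
>>>>>>>>>> (hypothetical names, not the actual patch): hand completed statuses
>>>>>>>>>> to the queue via a non-blocking handoff and let the manage loop
>>>>>>>>>> drain them while it already holds the lock.
>>>>>>>>>>
>>>>>>>>>> import java.util.ArrayList;
>>>>>>>>>> import java.util.List;
>>>>>>>>>> import java.util.concurrent.BlockingQueue;
>>>>>>>>>> import java.util.concurrent.LinkedBlockingQueue;
>>>>>>>>>>
>>>>>>>>>> // Hypothetical sketch: the notifier never waits for the giant lock.
>>>>>>>>>> class NotifySketch {
>>>>>>>>>>   private final BlockingQueue<String> completedTaskIds = new LinkedBlockingQueue<>();
>>>>>>>>>>
>>>>>>>>>>   // Called from the status-callback thread; returns immediately.
>>>>>>>>>>   void notifyStatus(String taskId) {
>>>>>>>>>>     completedTaskIds.add(taskId);
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   // Called from the manage loop while it holds the lock; drains all
>>>>>>>>>>   // completions since the last pass in one cheap in-memory batch.
>>>>>>>>>>   void drainCompleted() {
>>>>>>>>>>     List<String> batch = new ArrayList<>();
>>>>>>>>>>     completedTaskIds.drainTo(batch);
>>>>>>>>>>     for (String taskId : batch) {
>>>>>>>>>>       removeTaskInternal(taskId);
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   private void removeTaskInternal(String taskId) { /* update in-memory state */ }
>>>>>>>>>> }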
>>>>>>>>>>
>>>>>>>>>> ==
>>>>>>>>>>
>>>>>>>>>> So back to the test: it looks good, and I think some tweaks need
>>>>>>>>>> to happen to replicate the above:
>>>>>>>>>> (1) the taskQueue.add() and shutdown() calls should be run
>>>>>>>>>> concurrently from lots of threads
>>>>>>>>>> (2) the TestTaskRunner::shutdown(taskId) call should respond in
>>>>>>>>>> ~5ms, and concurrently sleep some time, followed by a
>>>>>>>>>> knownTasks.get(taskId).setResult(...), which I think will trigger
>>>>>>>>>> the notification loop (a rough sketch of this follows below).
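>>>>>>>>>>
>>>>>>>>>> A hedged sketch of (2), with assumed names (TestTaskRunnerSketch, a
>>>>>>>>>> generic result type) rather than the real test harness classes:
>>>>>>>>>> acknowledge the shutdown in ~5ms, then complete the task's result
>>>>>>>>>> future ~2s later on another thread, which should fire the same
>>>>>>>>>> completion path a real Kafka task roll does.
>>>>>>>>>>
>>>>>>>>>> import java.util.concurrent.CompletableFuture;
>>>>>>>>>> import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>> import java.util.concurrent.ConcurrentMap;
>>>>>>>>>> import java.util.concurrent.Executors;
>>>>>>>>>> import java.util.concurrent.ScheduledExecutorService;
>>>>>>>>>> import java.util.concurrent.TimeUnit;
>>>>>>>>>>
>>>>>>>>>> // Hypothetical sketch, not the actual TestTaskRunner.
>>>>>>>>>> class TestTaskRunnerSketch<R> {
>>>>>>>>>>   private final ScheduledExecutorService completer = Executors.newScheduledThreadPool(4);
>>>>>>>>>>   private final ConcurrentMap<String, CompletableFuture<R>> knownTasks = new ConcurrentHashMap<>();
>>>>>>>>>>   private final R successResult;
>>>>>>>>>>
>>>>>>>>>>   TestTaskRunnerSketch(R successResult) { this.successResult = successResult; }
>>>>>>>>>>
>>>>>>>>>>   // "Run" a task: the returned future completes when the task finishes.
>>>>>>>>>>   CompletableFuture<R> run(String taskId) {
>>>>>>>>>>     return knownTasks.computeIfAbsent(taskId, id -> new CompletableFuture<>());
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   // Acknowledge in ~5ms, finish ~2s later, like a /stop that only sets a flag.
>>>>>>>>>>   void shutdown(String taskId) throws InterruptedException {
>>>>>>>>>>     Thread.sleep(5);
>>>>>>>>>>     completer.schedule(
>>>>>>>>>>         () -> { knownTasks.get(taskId).complete(successResult); },
>>>>>>>>>>         2, TimeUnit.SECONDS);
>>>>>>>>>>   }
>>>>>>>>>> }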
>>>>>>>>>>
>>>>>>>>>> I'll take a shot at this after lunch today.
>>>>>>>>>>
>>>>>>>>>> Sample logfile of a quick clean shutdown of a kafka task, from
>>>>>>>>>> overlord view:
>>>>>>>>>>
>>>>>>>>>> https://gist.githubusercontent.com/jasonk000/40c11dce3faed44d3a89c27e0227b982/raw/a456a4cddd31f508e618321e03ef8b5241931904/druid.log
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Jason
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 2, 2021 at 4:58 AM Gian Merlino <g...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Harini, those are interesting findings. I'm not sure if the two
>>>>>>>>>>> pauses are
>>>>>>>>>>> necessary, but my thought is that it ideally shouldn't matter
>>>>>>>>>>> because the
>>>>>>>>>>> supervisor shouldn't be taking that long to handle its notices.
>>>>>>>>>>> A couple
>>>>>>>>>>> things come to mind about that:
>>>>>>>>>>>
>>>>>>>>>>> 1) Did you see what specifically the supervisor is doing when
>>>>>>>>>>> it's handling
>>>>>>>>>>> the notices? Maybe from a stack trace? We should look into
>>>>>>>>>>> optimizing it,
>>>>>>>>>>> or making it asynchronous or something, depending on what it is.
>>>>>>>>>>> 2) Although, there isn't really a need to trigger a run for
>>>>>>>>>>> every single
>>>>>>>>>>> task status change anyway; I think it's ok to coalesce them.
>>>>>>>>>>> This patch
>>>>>>>>>>> would do it: https://github.com/apache/druid/pull/12018
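>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch of the coalescing idea (illustrative code, not
>>>>>>>>>>> the actual #12018 patch): many task-status changes collapse into a
>>>>>>>>>>> single pending "run", so the supervisor does one pass instead of N.
>>>>>>>>>>>
>>>>>>>>>>> import java.util.concurrent.atomic.AtomicBoolean;
>>>>>>>>>>>
>>>>>>>>>>> // Illustrative coalescing sketch with hypothetical names.
>>>>>>>>>>> class RunNoticeCoalescer {
>>>>>>>>>>>   private final AtomicBoolean runPending = new AtomicBoolean(false);
>>>>>>>>>>>
>>>>>>>>>>>   // Called on every task status change; only the first call since
>>>>>>>>>>>   // the last run actually enqueues a notice.
>>>>>>>>>>>   void requestRun(Runnable enqueueRunNotice) {
>>>>>>>>>>>     if (runPending.compareAndSet(false, true)) {
>>>>>>>>>>>       enqueueRunNotice.run();
>>>>>>>>>>>     }
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>>   // Called by the supervisor when it starts handling the notice,
>>>>>>>>>>>   // so later changes can enqueue a fresh one.
>>>>>>>>>>>   void onRunStarting() {
>>>>>>>>>>>     runPending.set(false);
>>>>>>>>>>>   }
>>>>>>>>>>> }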
>>>>>>>>>>>
>>>>>>>>>>> Jason, also interesting findings! I took a crack at rebasing
>>>>>>>>>>> your patch on
>>>>>>>>>>> master and adding a scale test for the TaskQueue with 1000
>>>>>>>>>>> simulated tasks:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/druid/compare/master...gianm:tq-scale-test.
>>>>>>>>>>> When
>>>>>>>>>>> I run the scale test, "doMassLaunchAndExit" passes quickly but
>>>>>>>>>>> "doMassLaunchAndShutdown" times out. I suppose shutting down
>>>>>>>>>>> lots of tasks
>>>>>>>>>>> is still a bottleneck.
>>>>>>>>>>>
>>>>>>>>>>> Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should
>>>>>>>>>>> be pretty
>>>>>>>>>>> straightforward to make the shutdown API asynchronous, which
>>>>>>>>>>> would help
>>>>>>>>>>> speed up anything that is shutting down lots of tasks all at
>>>>>>>>>>> once. Would
>>>>>>>>>>> that be helpful in your environments? Or are the changes to move
>>>>>>>>>>> shutdown
>>>>>>>>>>> out of critical sections going to be enough?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 1, 2021 at 1:27 PM Jason Koch
>>>>>>>>>>> <jk...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Hi Harini,
>>>>>>>>>>> >
>>>>>>>>>>> > We have seen issues like this around task roll time, related to
>>>>>>>>>>> > task queue notifications on overlord instances; I have a patch
>>>>>>>>>>> > running internally that resolves this.
>>>>>>>>>>> >
>>>>>>>>>>> > These are my internal triage notes:
>>>>>>>>>>> > ======
>>>>>>>>>>> > - Whenever task scheduling is happening (startup, ingest segment
>>>>>>>>>>> > task rollover, redeployment of a datasource), the Overlord takes
>>>>>>>>>>> > a long time to assign workers. This compounds because tasks sit
>>>>>>>>>>> > so long before deployment that the Overlord starts failing tasks
>>>>>>>>>>> > and having to relaunch them.
>>>>>>>>>>> >
>>>>>>>>>>> >    - TaskQueue: notifyStatus(), which receives updates from the
>>>>>>>>>>> >    MiddleManagers, and the manage() loop, which controls
>>>>>>>>>>> >    services, both run through a single lock. For example, the
>>>>>>>>>>> >    shutdown request involves submitting downstream HTTP requests
>>>>>>>>>>> >    synchronously (while holding the lock).
>>>>>>>>>>> >    - This means that, for a cluster with ~700 tasks, the lock is
>>>>>>>>>>> >    held for nearly 1 second at a time, and only after each
>>>>>>>>>>> >    1-second pass around the manage loop can 1-2 notifications be
>>>>>>>>>>> >    processed. For a fresh startup, with 700 tasks and a 1 sec
>>>>>>>>>>> >    delay, that is 300-600-or-more seconds for the overlord to
>>>>>>>>>>> >    realise all the tasks are started by the middle manager.
>>>>>>>>>>> >    - Similar delays happen for any other operation.
>>>>>>>>>>> >    - Sub-optimal logging code path (double-concatenating very
>>>>>>>>>>> >    long log entries).
>>>>>>>>>>> >    - ZkWorker: the worker fully deserializes all ZK payload data
>>>>>>>>>>> >    every time it looks up task IDs, rather than only reading the
>>>>>>>>>>> >    ID fields (a rough sketch of the ID-only read follows below).
>>>>>>>>>>> >    Similarly, it repeatedly re-fetches data on task assignment.
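>>>>>>>>>>> >
>>>>>>>>>>> > To illustrate that last point, a hedged sketch (the "id" field
>>>>>>>>>>> > name and helper are assumptions, not the real ZkWorker code) of
>>>>>>>>>>> > pulling just the task id out of a payload instead of binding the
>>>>>>>>>>> > whole announcement object on every lookup:
>>>>>>>>>>> >
>>>>>>>>>>> > import com.fasterxml.jackson.databind.JsonNode;
>>>>>>>>>>> > import com.fasterxml.jackson.databind.ObjectMapper;
>>>>>>>>>>> > import java.io.IOException;
>>>>>>>>>>> >
>>>>>>>>>>> > // Hypothetical helper: read only the id, not the full object.
>>>>>>>>>>> > class TaskIdOnly {
>>>>>>>>>>> >   private static final ObjectMapper MAPPER = new ObjectMapper();
>>>>>>>>>>> >
>>>>>>>>>>> >   static String taskIdOf(byte[] zkPayload) throws IOException {
>>>>>>>>>>> >     JsonNode node = MAPPER.readTree(zkPayload);
>>>>>>>>>>> >     return node.path("id").asText();  // "id" is an assumed field name
>>>>>>>>>>> >   }
>>>>>>>>>>> > }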
>>>>>>>>>>> >
>>>>>>>>>>> > =====
>>>>>>>>>>> >
>>>>>>>>>>> > The patch I have is here:
>>>>>>>>>>> > https://github.com/jasonk000/druid/pull/7/files
>>>>>>>>>>> >
>>>>>>>>>>> > It fixes a couple of things, most importantly the task queue
>>>>>>>>>>> notification
>>>>>>>>>>> > system. The system is much more stable with high task counts
>>>>>>>>>>> and will
>>>>>>>>>>> > easily restart many tasks concurrently.
>>>>>>>>>>> >
>>>>>>>>>>> > I have other perf issues I want to look at first before I can
>>>>>>>>>>> document it
>>>>>>>>>>> > fully, build a test case, rebase it on apache/master, etc. If
>>>>>>>>>>> you test it
>>>>>>>>>>> > out, and it works, we could submit a PR that would resolve it.
>>>>>>>>>>> >
>>>>>>>>>>> > PS - I have a queue of similar fixes I'd like to submit, but I
>>>>>>>>>>> > need some time to do the documentation, build test cases, do the
>>>>>>>>>>> > upstreaming, etc. If anyone wants to collaborate, I could open
>>>>>>>>>>> > some issues and share my partial notes.
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks
>>>>>>>>>>> > Jason
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran
>>>>>>>>>>> > <hrajend...@confluent.io.invalid> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > > Hi all,
>>>>>>>>>>> > >
>>>>>>>>>>> > > I have been investigating this in the background for a few
>>>>>>>>>>> days now and
>>>>>>>>>>> > > need some help from the community.
>>>>>>>>>>> > >
>>>>>>>>>>> > > We noticed that every hour, when the tasks roll, we see a
>>>>>>>>>>> spike in the
>>>>>>>>>>> > > ingestion lag for about 2-4 minutes. We have 180 tasks
>>>>>>>>>>> running on this
>>>>>>>>>>> > > datasource.
>>>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.14.23 AM.png]
>>>>>>>>>>> > >
>>>>>>>>>>> > > On further debugging of the task logs, we found out that
>>>>>>>>>>> > > around the time the ingestion lag spikes up, *the gap between
>>>>>>>>>>> > > the pause and resume commands in the task logs during
>>>>>>>>>>> > > checkpointing is wide, ranging from a few seconds to a couple
>>>>>>>>>>> > > of minutes*. For example, in the following task logs you can
>>>>>>>>>>> > > see that it was about 1.5 minutes.
>>>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:06:58.513Z*",
>>>>>>>>>>> "log.level":"DEBUG",
>>>>>>>>>>> > > "message":"Received pause command, *pausing* ingestion until
>>>>>>>>>>> resumed.", "
>>>>>>>>>>> > > service.name
>>>>>>>>>>> > >
>>>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
>>>>>>>>>>> > > process.thread.name
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
>>>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.326Z*",
>>>>>>>>>>> "log.level":"DEBUG",
>>>>>>>>>>> > > "message":"Received pause command, *pausing* ingestion until
>>>>>>>>>>> resumed.", "
>>>>>>>>>>> > > service.name
>>>>>>>>>>> > >
>>>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
>>>>>>>>>>> > > process.thread.name
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
>>>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.329Z*",
>>>>>>>>>>> "log.level":"DEBUG",
>>>>>>>>>>> > > "message":"Received resume command, *resuming* ingestion.", "
>>>>>>>>>>> > service.name
>>>>>>>>>>> > >
>>>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
>>>>>>>>>>> > > process.thread.name
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
>>>>>>>>>>> > > So this explains why ingestion is lagging as the *tasks are
>>>>>>>>>>> paused for a
>>>>>>>>>>> > > long time*.
>>>>>>>>>>> > >
>>>>>>>>>>> > > *Why are there 2 pauses during checkpointing and why such a
>>>>>>>>>>> > > huge gap?*
>>>>>>>>>>> > > As a next step, I wanted to see why there is such a wide gap.
>>>>>>>>>>> > > Then we realized that the first pause is when the task pauses
>>>>>>>>>>> > > itself here
>>>>>>>>>>> > > <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728>
>>>>>>>>>>> > > while requesting the supervisor for a checkpoint, and the
>>>>>>>>>>> > > second pause is when the supervisor actually handles the
>>>>>>>>>>> > > checkpoint notice here
>>>>>>>>>>> > > <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548>.
>>>>>>>>>>> > > And since the supervisor thread for this data source takes
>>>>>>>>>>> > > such a long time to process all the notices in the queue
>>>>>>>>>>> > > before it reaches this checkpoint notice, the ingestion task
>>>>>>>>>>> > > ends up being in the paused state for a long time.
>>>>>>>>>>> > >
>>>>>>>>>>> > > *Why does the supervisor thread take such a long time to get
>>>>>>>>>>> to this
>>>>>>>>>>> > > checkpoint notice?*
>>>>>>>>>>> > > That was my next step in debugging.
>>>>>>>>>>> > > Proving our theory, we noticed that the *noticesQueue in the
>>>>>>>>>>> supervisor
>>>>>>>>>>> > > does get backed up with 100s of notices every hour when
>>>>>>>>>>> tasks roll*.
>>>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.32.59 AM.png]
>>>>>>>>>>> > > And we saw that *run_notice takes between 5s and 7s during
>>>>>>>>>>> > > task rolls*. This backs up the noticesQueue, leaving the
>>>>>>>>>>> > > checkpoint notice in the queue for a long time and causing the
>>>>>>>>>>> > > ingestion lag spike whenever tasks roll.
>>>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.34.29 AM.png]
>>>>>>>>>>> > >
>>>>>>>>>>> > > *Why does run_notice take 5-7s to finish?*
>>>>>>>>>>> > > When the task starts, it takes about 5s for its HTTP server to
>>>>>>>>>>> > > come up. Until then, the supervisor thread is in a loop trying
>>>>>>>>>>> > > to get the task status, and this causes run_notice to take
>>>>>>>>>>> > > about 5-7s to finish.
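>>>>>>>>>>> > >
>>>>>>>>>>> > > A hedged sketch of the kind of wait that would produce that
>>>>>>>>>>> > > cost (names and intervals are illustrative, not the actual
>>>>>>>>>>> > > supervisor code): the caller retries the status call until the
>>>>>>>>>>> > > freshly launched task's HTTP server starts answering.
>>>>>>>>>>> > >
>>>>>>>>>>> > > // Illustrative only: poll until the task's HTTP server is up.
>>>>>>>>>>> > > class StatusPoller {
>>>>>>>>>>> > >   interface TaskClient { String getStatus(String taskId) throws Exception; }
>>>>>>>>>>> > >
>>>>>>>>>>> > >   static String waitForStatus(TaskClient client, String taskId) throws InterruptedException {
>>>>>>>>>>> > >     while (true) {
>>>>>>>>>>> > >       try {
>>>>>>>>>>> > >         return client.getStatus(taskId);  // succeeds once the server is listening
>>>>>>>>>>> > >       } catch (Exception serverNotUpYet) {
>>>>>>>>>>> > >         Thread.sleep(500);                // the task takes ~5s to start listening
>>>>>>>>>>> > >       }
>>>>>>>>>>> > >     }
>>>>>>>>>>> > >   }
>>>>>>>>>>> > > }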
>>>>>>>>>>> > >
>>>>>>>>>>> > > *Questions to the community*
>>>>>>>>>>> > > Do we need 2 pauses during checkpointing? Should the task
>>>>>>>>>>> > > pause itself before requesting a checkpoint notice, given that
>>>>>>>>>>> > > the supervisor is going to pause the task anyway while
>>>>>>>>>>> > > handling the notice? Or is it okay to remove the pause in the
>>>>>>>>>>> > > TaskRunner before it sends a checkpoint notice request to the
>>>>>>>>>>> > > supervisor? This would immediately solve the ingestion lag
>>>>>>>>>>> > > issue completely, as there wouldn't be two pauses with such a
>>>>>>>>>>> > > huge gap in between.
>>>>>>>>>>> > >
>>>>>>>>>>> > > Harini
>>>>>>>>>>> > > Software Engineer, Confluent
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>
