Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-31 Thread Harini Rajendran
Hi Jason and Gian, I found time to collect some more data points on this issue and have added them as a comment here. Can you take a look at it when you get a chance and let me know what you think? Harini Software Engineer, Obs

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-11 Thread Jason Koch
> Thanks for the update, Jason. We shall wait for the builds to pass.

It seems to me the builds are passing now. The failures present are related to the docker image build and seem to be issues with the test or build environment itself, but an expert opinion can perhaps correct me.

> Also, a

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-06 Thread Harini Rajendran
Thanks for the update, Jason. We shall wait for the builds to pass. Also, are you planning to get #12099 PR reviewed by the community anytime soon? Harini Software Engineer, Observability +1 412 708 3872 On Wed, Jan 5, 2022 at 8:25 PM Jason Koch wrote: > Harini - these are as far as I can pro

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-05 Thread Jason Koch
Harini - this is as far as I can progress these: 12096 is good, 12097 is good except for what seems to me to be a CI issue, and 12099 is "done" but has a CI issue as well; I'll ask for help below. Gian / Frank - looking for some help on these: - On #12099, it is rejecting because of an IntelliJ

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-05 Thread Harini Rajendran
That's great! Thanks. I'll keep an eye out on those PRs. Harini Software Engineer, Observability +1 412 708 3872 On Wed, Jan 5, 2022 at 11:00 AM Jason Koch wrote: > Thanks for the prompt - yes I'll get these fixed. They are code coverage / > linter fixes, I had mistakenly assumed they were fl

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-05 Thread Jason Koch
Thanks for the prompt - yes I'll get these fixed. They are code coverage / linter fixes; I had mistakenly assumed they were flaky tests. I'll aim to fix these today. On Wed, Jan 5, 2022 at 7:25 AM Harini Rajendran wrote: > Hi Jason, > > I was taking a look at your PRs and see that CI build is

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-05 Thread Harini Rajendran
Hi Jason, I was taking a look at your PRs and see that CI build is failing for 2 of them. Do you know why those are failing and are you planning to fix them? Harini Software Engineer, Observability +1 412 708 3872 On Mon, Jan 3, 2022 at 7:59 PM Harini Rajendran wrote: > Hi Jason, > > I shall

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2022-01-03 Thread Harini Rajendran
Hi Jason, I shall take a look at these 3 PRs and see if we can try these out in our test environment. Also, we use AWS RDS as the metadata engine. Harini Software Engineer, Observability +1 412 708 3872 On Fri, Dec 31, 2021 at 3:22 PM Jason Koch wrote: > Hi Harini, > > I had a chance to loo

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-31 Thread Jason Koch
Hi Harini, I had a chance to look at the checkpoint behaviour you mentioned in more detail, and found two codepaths where the RunNotice code ends up in the TaskQueue, and hits the same locks. I'd be interested if you want to try the three related PRs I have submitted. (I added more detail to the i

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-03 Thread Jason Koch
Gian, I've submitted a PR to gianm/tq-scale-test that provides a concurrent test (and fixes a concurrency bug I found along the way). The change uses an 8 ms response time for shutdown acknowledgment and a 2 second time for shutdown completion/notification. Based on this test, - serial TaskQ

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-03 Thread Jason Koch
Hi Gian,

> Jason, also interesting findings! I took a crack at rebasing your patch on master and adding a scale test for the TaskQueue with 1000 simulated tasks: https://github.com/apache/druid/compare/master...gianm:tq-scale-test

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-02 Thread Gian Merlino
Harini, those are interesting findings. I'm not sure if the two pauses are necessary, but my thought is that it ideally shouldn't matter because the supervisor shouldn't be taking that long to handle its notices. A couple things come to mind about that: 1) Did you see what specifically the supervi

Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-01 Thread Jason Koch
Hi Harini, We have seen issues like this at task roll time, related to task queue notifications on Overlord instances; I have a patch running internally that resolves this. These are my internal triage notes: == - Whenever task scheduling is happening (startup, ingest segment task rol

Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-01 Thread Harini Rajendran
Hi all, I have been investigating this in the background for a few days now and need some help from the community. We noticed that every hour, when the tasks roll, we see a spike in the ingestion lag for about 2-4 minutes. We have 180 tasks running on this datasource. [image: Screen Shot 2021-12-