Hi Bogdan,

A quick fix here would be to add a new tm param to control the use of
the lock. By default, we should keep the existing behaviour: serial tm
timers processing.
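To make this concrete, here is a rough sketch of what I have in mind
(the param name "timers_sync", the handler signature and the exact
wiring are just placeholders, not a final implementation); the idea is
simply to make the existing timertable[(long)set].ex_lock optional,
with the default keeping today's serialized processing:

    /* hypothetical new tm module parameter:
     * 1 (default) = current serialized behaviour,
     * 0 = let tm-timer and tm-utimer run in parallel */
    static int tm_timers_sync = 1;

    static param_export_t params[] = {
        /* ... existing tm params ... */
        {"timers_sync", INT_PARAM, &tm_timers_sync},
        {0, 0, 0}
    };

    void timer_routine(unsigned int ticks, void *set)
    {
        /* take the shared per-set lock only when serialized
         * processing is requested (lock expression shown loosely,
         * as quoted in this thread) */
        if (tm_timers_sync)
            lock_get(timertable[(long)set].ex_lock);

        /* ... existing check_and_split_time_list() +
         * run_handler_for_each() processing stays unchanged ... */

        if (tm_timers_sync)
            lock_release(timertable[(long)set].ex_lock);
    }

utimer_routine() would get the same treatment, and something like
modparam("tm", "timers_sync", 0) in the script would then opt in to the
parallel behaviour.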
-ovidiu

On Tue, Apr 5, 2022 at 10:25 AM Bogdan-Andrei Iancu <bog...@opensips.org> wrote:
>
> Hi Ovidiu,
>
> Thanks for the joint debugging session here, very very useful. The
> bottleneck was identified by Ovidiu as some locking added about 6 years
> ago [1] in order to fix some race conditions.
>
> I guess that story needs to be re-evaluated ;)
>
>
> [1]
> https://github.com/OpenSIPS/opensips/commit/ea6e1fa7fed29efed6057894a8cc4054308160b3
>
> Regards,
>
> Bogdan-Andrei Iancu
>
> OpenSIPS Founder and Developer
> https://www.opensips-solutions.com
> OpenSIPS eBootcamp 23rd May - 3rd June 2022
> https://opensips.org/training/OpenSIPS_eBootcamp_2022/
>
> On 4/5/22 4:26 PM, Ovidiu Sas wrote:
> > Hello Bogdan,
> >
> > No typos here :)
> >
> > Let me rephrase with some examples:
> > - at time n, both tm-timer and tm-utimer are scheduled
> > - tm-timer is scheduled first, followed by tm-utimer
> > - tm-timer starts running (grabbing the .ex_lock) while tm-utimer is
> > waiting for the .ex_lock to be released by tm-timer
> > - tm-timer takes 150ms to complete its task
> > - at time n + 100ms, tm-utimer is scheduled again, but because the
> > initial run of the tm-utimer is not completed (still waiting for the
> > .ex_lock), the warning is issued
> > - after 50ms (the time is now n + 150ms) the tm-timer job is completed,
> > then the tm-utimer (that was scheduled at time n) takes its turn and
> > finishes in 10ms
> > - now the time is n + 160ms and both tm-timer and tm-utimer jobs are done
> > - at time n + 200ms, tm-utimer is scheduled again and all is back to
> > normal
> >
> > Hope this clarifies how tm-timer is interfering with tm-utimer.
> > The timer that is taking a long time to be processed is the WT_TIMER
> > inside tm-timer and this is what triggers the warning on the tm-utimer.
> > The tm-utimer works fine, but we get this misleading warning on it
> > because of the tm-timer.
> >
> > -ovidiu
> >
> > On Tue, Apr 5, 2022 at 3:59 AM Bogdan-Andrei Iancu <bog...@opensips.org>
> > wrote:
> >> Hi Ovidiu,
> >>
> >> A first quick note :). You mentioned the tm_utimer as the problematic
> >> one - this is the 100ms-based timer, used ONLY for the outbound
> >> retransmissions. This conflicts with your later finding on WT_TIMER,
> >> which actually is on tm_timer, not tm_utimer.
> >>
> >> So, just some typos here, or? :)
> >>
> >> Best regards,
> >>
> >> Bogdan-Andrei Iancu
> >>
> >> OpenSIPS Founder and Developer
> >> https://www.opensips-solutions.com
> >> OpenSIPS eBootcamp 23rd May - 3rd June 2022
> >> https://opensips.org/training/OpenSIPS_eBootcamp_2022/
> >>
> >> On 4/4/22 8:16 AM, Ovidiu Sas wrote:
> >>> Hello all,
> >>>
> >>> The tm module handles all its internal timers via two handlers:
> >>> - timer_routine (second-based timers)
> >>> - utimer_routine (100ms-based timers)
> >>> Each of these routines handles 4 different timers.
> >>> Both routines are very similar in functionality and there is no timer
> >>> that is handled by both routines.
> >>> Because both routines are protected by the same lock
> >>> (timertable[(long)set].ex_lock), these two routines cannot run in
> >>> parallel (assuming that we have only one set, i.e. a single
> >>> timer_partition).
> >>>
> >>> In my testing, I noticed that the tm_utimer routine has difficulties
> >>> running smoothly.
> >>> After doing more testing and some profiling, it looks like the culprit
> >>> is the WT_TIMER.
> >>> For around 10-15K records in the WT_TIMER detached timer list, we
> >>> spend around 3ms to create the list and 200-300ms in
> >>> run_handler_for_each. Because of this, the tm_utimer (which is
> >>> scheduled to run every 100ms) is blocked by the lock on the first run,
> >>> and on the second run the scheduler detects that the previous run is
> >>> still running (waiting for the lock) and therefore issues the famous
> >>> "already scheduled" warning.
> >>>
> >>> The check_and_split_time_list function has its own locks and then each
> >>> handler operates on its own list (with locks for dealing with cells),
> >>> so why do we have the timertable[(long)set].ex_lock?
> >>>
> >>> I removed the lock, tested with a single timer_partition, then with
> >>> two timer_partitions, and the performance increased dramatically. Is
> >>> there a reason for keeping this lock, or is it something that was
> >>> inherited and nobody bothered to check why and remove it?
> >>>
> >>> Thanks,
> >>> Ovidiu
> >>>

--
VoIP Embedded, Inc.
http://www.voipembedded.com

_______________________________________________
Devel mailing list
Devel@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel