Re: [lttng-dev] [PATCH] coredump debugging: add a tracepoint to report the coredumping

2024-02-23 Thread Karim Yaghmour via lttng-dev



On 2/23/24 11:54, Mathieu Desnoyers wrote:
...

So this sched_process_exit placement was actually decided
by Karim Yaghmour back in the LTT days (2001). I don't think
he will mind us moving it around some 23 years later. ;)


Gee ... the shadows are longer than I thought :)

It's all yours. You guys have been doing a fantastic job and I'm happy 
to be on the consumer side of it all these days :D


Cheers,

--
Karim Yaghmour
CEO - Opersys inc. / www.opersys.com
http://twitter.com/karimyaghmour



Re: [lttng-dev] [PATCH] coredump debugging: add a tracepoint to report the coredumping

2024-02-23 Thread Steven Rostedt via lttng-dev
On Fri, 23 Feb 2024 11:54:30 -0500
Mathieu Desnoyers  wrote:

> So this sched_process_exit placement was actually decided
> by Karim Yaghmour back in the LTT days (2001). I don't think
> he will mind us moving it around some 23 years later. ;)

And I wonder how many of the people involved in this are younger than
that change :-p

I'm starting to feel really old.

-- Steve


Re: [lttng-dev] [PATCH] coredump debugging: add a tracepoint to report the coredumping

2024-02-23 Thread Mathieu Desnoyers via lttng-dev

On 2024-02-23 09:26, Steven Rostedt wrote:

On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers  wrote:


Between "sched_process_exit" and "sched_process_free", the task can still be
observed by a trace analysis looking at sched and signal events: it's a zombie 
at
that stage.


Looking at the history of this tracepoint, it was added in 2008 by commit
0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
Hmm, LLTng? I wonder who the author was?


[ common typo: LLTng -> LTTng ;-) ]



   Author: Mathieu Desnoyers 

  :-D

Mathieu, I would say it's your call on where the tracepoint can be located.
You added it, you own it!


Wow! That's now 16 years ago :)

I've checked with Matthew Khouzam (maintainer of Trace Compass, which cares
about this tracepoint), and we have not identified any significant impact of
moving it on its model of the scheduler, other than slightly changing its
timing.

I've also checked quickly in lttng-analyses and have not found
any code that cares about its specific placement.

So I would say go ahead and move it earlier in do_exit(), it's
fine by me.

If you are interested in a bit of archeology, "sched_process_free"
originated from my ltt-experimental 0.1.99.13 kernel patch against
2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was
a precursor to the LTTng 0.x kernel patchset.

https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz

Index: kernel/exit.c
===================================================================
--- a/kernel/exit.c (.../trunk/kernel/linux-2.6.12-rc4-mm2) (revision 41)
+++ b/kernel/exit.c (.../branches/mathieu/linux-2.6.12-rc4-mm2) (revision 41)
@@ -4,6 +4,7 @@
  *  Copyright (C) 1991, 1992  Linus Torvalds
  */

+#include 
 #include 
 #include 
 #include 
@@ -55,6 +56,7 @@ static void __unhash_process(struct task
 	}

 	REMOVE_LINKS(p);
+	trace_process_free(p->pid);
 }

 void release_task(struct task_struct * p)

@@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co
 	}
 	exit_mm(tsk);

+	trace_process_exit(tsk->pid);
+
 	exit_sem(tsk);
 	__exit_files(tsk);
 	__exit_fs(tsk);
This was a significant improvement over the prior LTT, which only had the
equivalent of "sched_process_exit"; that caused issues with the Linux
scheduler model in LTTV due to zombie processes.

Here is where it appeared in LTT back in 1999:

http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz

patch-ltt-2.2.13-991118

diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c
--- linux/kernel/exit.c Tue Oct 19 20:14:02 1999
+++ linux-2.2.13/kernel/exit.c  Sun Nov  7 23:49:17 1999
@@ -14,6 +14,8 @@
 #include 
 #endif

+#include 
+
 #include 
 #include 
 #include 
@@ -386,6 +388,8 @@
 	del_timer(>real_timer);
 	end_bh_atomic();

+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
 	lock_kernel();
 fake_volatile:
 #ifdef CONFIG_BSD_PROCESS_ACCT

And it was moved to its current location (after exit_mm()) a bit
later (2001):

http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz

Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10

diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c
--- linux/kernel/exit.c Fri May  4 17:44:06 2001
+++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c   Wed Jun 20 12:39:24 2001
@@ -14,6 +14,8 @@
 #include 
 #endif

+#include 
+
 #include 
 #include 
 #include 
@@ -439,6 +441,8 @@
 #endif
 	__exit_mm(tsk);

+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
 	lock_kernel();
 	sem_exit();
 	__exit_files(tsk);

So this sched_process_exit placement was actually decided
by Karim Yaghmour back in the LTT days (2001). I don't think
he will mind us moving it around some 23 years later. ;)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [lttng-dev] Crash in application due to watchdog timeout with python3 lttng

2024-02-23 Thread Kienan Stewart via lttng-dev

Hi Lakshmi,

On 2/20/24 10:24 AM, Lakshmi Deverkonda wrote:

Hi,

Sorry, I could not reply on this thread. We tried to handle these issues
in our application code.
One query: what is the right way to delete the lttng logger to stop
logging the events (from the code)?



I think that the design of LTTng-UST agents is such that tracing is not 
meant to be tightly controlled by the traced applications themselves.


To stop emitting events:

 * disable the python events for the session(s): `lttng disable-event --python --all-events --session= --channel=`
 * stop the tracing session(s): `lttng stop `
 * quit the `sessiond`

Note: the agent thread(s) in the python application will continue to 
run, but not emit events.


If you want to do things from within the application, there are some
options (a minimal sketch follows this list):

 * Implement your own logic to stop calling `self.lttng_logger.log()`
 * Add a filter to the `lttngust.loghandler._Handler` objects on the root logger
 * Forcibly remove the `lttngust.loghandler._Handler` handler(s) from the root logger (this really isn't meant to be done)
 * Use liblttng-ctl to perform one of the above actions (disabling the python event(s), stopping the tracing session(s), ...)
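
As a rough, untested sketch of the filter/removal options above: the
`lttngust.loghandler._Handler` class is the one named in the list; the
`_DropAll` filter and the two helper functions are hypothetical names
invented for this illustration, not part of the LTTng-UST agent API.

```
# Illustrative sketch only; not official lttng-ust agent API usage.
import logging

import lttngust  # importing the agent attaches its handler(s) to the root logger
import lttngust.loghandler


class _DropAll(logging.Filter):
    """Reject every record, silencing whichever handler it is attached to."""
    def filter(self, record):
        return False


def mute_lttng_handlers():
    # Filter option: keep the handler(s) installed, but drop all records.
    for handler in logging.getLogger().handlers:
        if isinstance(handler, lttngust.loghandler._Handler):
            handler.addFilter(_DropAll())


def remove_lttng_handlers():
    # Removal option: forcibly detach the handler(s) (really not meant to be done).
    root = logging.getLogger()
    for handler in list(root.handlers):
        if isinstance(handler, lttngust.loghandler._Handler):
            root.removeHandler(handler)
```

Either way the agent thread(s) keep running as noted above; they simply
stop receiving records.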


thanks,
kienan




Regards,
Lakshmi

*From:* Kienan Stewart 
*Sent:* 16 February 2024 22:11
*To:* Lakshmi Deverkonda ; lttng-dev@lists.lttng.org 

*Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout 
with python3 lttng



Hi Lakshmi,

On 2/16/24 09:33, Lakshmi Deverkonda wrote:

This is how we have created the logger. The first logger is for file
logging, whereas the second one is for lttng.

  self.logger = logging.getLogger('cd')
  self.lttng_logger = logging.getLogger('cd-lttng')

It seems that when lttng is logging some data on a particular thread and,
at the same instant, the application receives SIGTERM, we are unable to
join that particular thread. Can you please help?



This doesn't constitute a usable reproducer for us. You are also
omitting information on the setup and usage of lttng on your system.

I get the impression you are not in a position to share your code.
EfficiOS offers support contracts with NDAs that could allow us to work
with you to analyze and improve your use of LTTng. For more info, please
feel free to contact sa...@efficios.com.


Also, we see that the performance of lttng is not that good for python3. My
application has around 24 threads, and when logging is enabled for each of
the threads, there is a delay of up to 24 s in processing the external events.
Please suggest how to proceed further on these issues.


Could you describe what you mean by 'processing external events'?

Which system(s) are involved in processing the events?

Are the 'external events' the events emitted by invoking
`self.lttng_logger.info('...')`, for example?

What versions of lttng-tools, lttng-ust, urcu, babeltrace, and python3
are you using? Are you using a relay-daemon at any point?

How are your lttng sessions configured? Eg. memory allocation, blocking
settings, behaviour on full buffers, etc. The commands you use to create
the session, enable the channels, and activate the events would be great
information to have.

While performing the logging is the system under heavy load from other
sources? What resources on the system face the most contention (CPU, IO,
memory, ...)?

We'd be more than happy to analyze the performance of python-lttngust
and work to make improvements so it can meet your needs under a
development contract. For more information, please reach out to
sa...@efficios.com.



Regards,
Lakshmi


I have taken the time to invent a fictitious example based on the few
details you have given:
https://gist.github.com/kienanstewart/879bd3bf19d852653b70a3c42caef361 

which spawns a number of python threads using the threading module.
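
A stripped-down sketch of that kind of reproducer might look like the
following (this is an illustration written here, not the exact contents of
the gist; the 'tp' logger name matches the event enabled in the commands
below, and the thread count of 24 mirrors your application):

```
# Sketch of a reproducer: N threads log through the LTTng Python agent
# until SIGTERM is received, then the main thread joins them.
import logging
import signal
import threading

import lttngust  # importing the agent attaches its handler to the root logger

stop = threading.Event()
logger = logging.getLogger('tp')  # logger name == event name enabled in lttng
logger.setLevel(logging.INFO)     # let INFO records pass the level check


def worker(thread_id):
    iteration = 0
    while not stop.is_set():
        logger.info('thread %d iteration %d', thread_id, iteration)
        iteration += 1


def handle_sigterm(signum, frame):
    stop.set()


def main():
    signal.signal(signal.SIGTERM, handle_sigterm)
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(24)]
    for t in threads:
        t.start()
    while not stop.wait(timeout=0.5):
        pass
    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
```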

I am using lttng-tools master, lttng-ust master at
47bc09f338f3c1199a878f77b7b18be8d2a224f6, urcu master at
81270292c23ff28aba1abd9a65f0624b657de82b, and babeltrace2 master at
b93af5a2d22e36cf547da1739d60e19791daccbd. My system is running Debian
sid with python 3.11.8.

To set up a recording session I do the following:

```
lttng create
lttng enable-event --python 'tp'
lttng start
```

To run the application, I do the following:

```
time python3 ./main.py
```

To quit the application, I send sigterm using the following command

```
kill $(pgrep -f 'python3 ./main.py')
```

After the application terminates, I stop the session and view the events

```
lttng stop
lttng view
lttng view | wc -l
```

In a 25s run of the application on my 4-thread laptop, I recorded
1010748 events.

thanks,
kienan




*From:* Lakshmi Deverkonda 
*Sent:* 13 February 2024 21:05
*To:* Kienan