Hi Stephane,
You might want to add a separate section that details the thinking about
why you don't want to use a single, multiplexed syscall. If you add this,
it could go after you've detailed the session breakdown, and before you've
described the current syscalls. I know this has been an area some of the
LKML folks have picked at before.
I will read this in more detail later.
- Corey
[EMAIL PROTECTED] wrote on 07/01/2008 09:41:36 AM:
> Hello everyone,
>
> I intend to send the following description to LKML and a few LKML
> developers to try and explain the reasoning behind the current syscall
> interface for perfmon2.
>
> I know there have been a lot of doubts and misunderstandings as to why
> we need so many syscalls and how they could be extended. I tried to
> address those concerns here.
>
> Please feel free to comment on it or add to it.
>
> Thanks.
>
>
> -----------------------------------------------------------------------
>
> 1) monitoring session breakdown
>
> A monitoring session can be decomposed into a sequence of fundamental
> actions which
> are as follows:
> - create the session
> - program registers
> - attach to target thread or CPU
> - start monitoring
> - stop monitoring
> - read results
> - detach from thread or CPU
> - terminate session
>
> The actions need not occur in the order shown. For instance, the
> registers may be programmed after the session has been attached.
> Likewise, the start/stop operations may be repeated before results are
> read, and results can be read multiple times.
>
> In the next sections, we examine each action separately.
>
> 2) session creation
>
> Perfmon2 supports two types of sessions: per-thread and per-CPU
> (so-called system-wide).
>
> During the creation of the session, certain attributes are set; they
> remain fixed until the session is terminated. For instance, the
> per-CPU attribute cannot be changed.
>
> During creation, the kernel state to support the session is
> allocated and initialized.
> No PMU hardware is actually accessed. Permissions to create a
> session may be checked.
> Resource limits are also validated and memory consumption is accounted
> for.
>
> The software state of the PMU is initialized, i.e., all
> configuration registers are
> set to a quiescent value. Data registers are initialized to zero
> whenever possible.
>
> On success, the kernel returns a unique identifier, which is used for
> all subsequent actions on the session.
>
> 3) programming the registers
>
> Programming of the PMU registers can occur at any time during the
> lifetime of a session;
> the session does not need to be attached to a thread or CPU.
>
> It may be necessary to change the settings, e.g., monitor another
> event or reset the counts
> when sampling at the user level. Thus, the writing of the registers
> MUST be decoupled from
> the creation of the session.
>
> Similarly, writing of configuration and data registers must also be
> decoupled, as data
> registers may be reprogrammed independently of their configuration
> registers, like when
> sampling for instance.
>
> The number of registers varies a lot from one PMU to the other. The
> relationships between configuration and data registers can be more
> complex than just one-to-one. On most PMUs, writing the PMU registers
> requires running at the most privileged level, i.e., in the kernel.
> To amortize the cost of a system call, it is useful to be able to
> program multiple registers in one call. Thus, it must be possible to
> pass vector arguments. Of course, for security reasons, the system
> administrator may impose a limit on how big vectors can actually be.
> The advantage is that vectors can vary in size, and thus the amount of
> data passed between application and kernel can be kept to the minimum
> needed. This matters because system call data needs to be copied into
> the kernel memory space before it can be used.
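
To make the vector argument idea concrete, here is a minimal user-space sketch. It is not perfmon2 code: the names model_write_regs, model_reg_arg and model_pmu_state, and both limits, are hypothetical and only illustrate the pattern of bounding the vector, copying exactly n elements, and validating everything before applying anything.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NUM_REGS   16  /* hypothetical number of PMU registers */
#define MODEL_MAX_VECTOR  8  /* hypothetical administrator-imposed limit */

/* One vector element: which register, and the value to write. */
struct model_reg_arg {
    uint16_t reg_num;
    uint64_t reg_value;
};

/* Software copy of the PMU registers kept on the "kernel" side. */
struct model_pmu_state {
    uint64_t regs[MODEL_NUM_REGS];
};

/*
 * Write n registers in one call. Returns 0 on success, -1 on error.
 * The whole vector is validated before any register is touched, so a
 * bad element cannot leave the registers half-programmed.
 */
int model_write_regs(struct model_pmu_state *pmu,
                     const struct model_reg_arg *user_vec, int n)
{
    struct model_reg_arg kvec[MODEL_MAX_VECTOR];
    int i;

    assert(pmu != NULL);
    if (user_vec == NULL || n < 1 || n > MODEL_MAX_VECTOR)
        return -1;

    /* Stand-in for copying the caller's data into kernel memory:
     * only n elements are copied, not a fixed-size block. */
    memcpy(kvec, user_vec, (size_t)n * sizeof(kvec[0]));

    for (i = 0; i < n; i++)
        if (kvec[i].reg_num >= MODEL_NUM_REGS)
            return -1;

    for (i = 0; i < n; i++)
        pmu->regs[kvec[i].reg_num] = kvec[i].reg_value;

    return 0;
}
```

A caller that only needs to reprogram two registers passes n = 2, so only two elements cross the boundary, whatever the total number of registers on the PMU.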
>
> 4) attachment and detachment
>
> A session can be attached to a kernel-visible thread or a CPU. If
> there is attachment,
> then it must be possible to detach the session to possibly re-attach
> it to another thread
> or CPU. Detachment should not require destroying the session.
>
> There are 3 possibilities for attachment:
> - when the session is created
> - when the monitoring is activated
> - with a dedicated call
>
> If the attachment is done during the creation of the session, then it
> means the target (thread or CPU)
> needs to exist at that time. For a cpu-wide session, this means that
> the session must be created while
> executing on that CPU. This does not seem unreasonable especially on
> NUMA systems.
>
> For a per-thread session however, this is a bit more problematic as
> this means it is not possible
> to prepare the session and the PMU registers before the thread
> exists. When monitoring across fork
> and pthread_create, it is important to minimize overhead. Creation of
> a session can trigger complex
> memory allocations in the kernel. Thus, it may be useful to
> prepare a batch of ready-to-go sessions,
> which just need to be attached when the fork or pthread_create
> notification arrives.
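
The batch of ready-to-go sessions could look roughly like the sketch below. This is a toy model under stated assumptions: model_pool_prepare and model_pool_attach are hypothetical names, the pool size is arbitrary, and a session is reduced to a single target field. The point is only that the expensive work happens ahead of time and the notification path does a cheap attach.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_POOL_SIZE 4      /* hypothetical batch size */
#define MODEL_NO_TARGET (-1)   /* marks a session not yet attached */

struct model_pooled_session {
    int target;  /* thread id once attached, MODEL_NO_TARGET before */
};

struct model_session_pool {
    struct model_pooled_session slots[MODEL_POOL_SIZE];
};

/* Done ahead of time: this is where the costly session creation and
 * register programming would take place. */
void model_pool_prepare(struct model_session_pool *pool)
{
    int i;

    assert(pool != NULL);
    for (i = 0; i < MODEL_POOL_SIZE; i++)
        pool->slots[i].target = MODEL_NO_TARGET;
}

/* Called when a fork or pthread_create notification arrives. Only the
 * attach is left to do. Returns the slot used, or -1 if the pool is
 * exhausted or the thread id is invalid. */
int model_pool_attach(struct model_session_pool *pool, int new_tid)
{
    int i;

    assert(pool != NULL);
    if (new_tid < 0)
        return -1;
    for (i = 0; i < MODEL_POOL_SIZE; i++) {
        if (pool->slots[i].target == MODEL_NO_TARGET) {
            pool->slots[i].target = new_tid;
            return i;
        }
    }
    return -1;
}
```

This only works if attachment is decoupled from creation, which is the argument being made here.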
>
> If the attachment is coupled with the creation of the session, it
> implies that the detachment is coupled
> with its destruction, by symmetry. Coupling of detachment with
> termination is problematic for both per-thread
> and CPU-wide mode. With the former, the termination of a thread is
> usually totally asynchronous with the
> termination of the session by the monitoring tool. The only case
> where they are synchronized is for
> self-monitored threads. When a tool is monitoring a thread in another
> process, the termination of that thread
> will cause the kernel to detach the session. But the session must not
> be closed because the tool likely wants
> to read the results and also because the session still exists for the
> tool. For CPU-wide, there is also an issue
> when a monitored CPU is put off-line dynamically. The session would
> be detached by the kernel, yet the session would
> still be live in the tool whose controlling thread would have been
> migrated off of that CPU.
>
> If the attachment is done when monitoring is activated, then the
> detachment is done when monitoring
> is deactivated. The following relationships are therefore enforced:
>
> attached => activated
> stopped => detached
>
> It is expected that start/stop operations could be very frequent for
> self-monitored workloads. When used
> to monitor small sections of critical code, e.g., loop kernels, it is
> important to minimize overhead, thus
> the start/stop should be as simple as possible.
>
> Attaching requires loading the PMU machine state onto the PMU
> hardware. Conversely, detaching implies flushing
> the PMU state to memory so results can be read even after the
> termination of a thread, for instance. Both
> operations are expensive due to the high cost of accessing the PMU
> registers.
>
> Furthermore, there are certain PMU models, e.g., Intel Itanium, where
> it is possible to let user level code
> start/stop monitoring with a single instruction. To minimize
> overhead, it is very important to allow this
> mechanism for self-monitored programs. Yet the session would have to
> be attached/detached somehow. With
> dedicated attach/detach calls, this can be supported transparently.
> One possible work-around with the coupled
> calls would be to require a system call to attach the session and do
> the initial activation; subsequent
> start/stop could then use the lightweight instruction. The session would
> be stopped and detached with a system call.
>
> The dedicated attach/detach calls offer a maximum level of
> flexibility. They let applications create sessions in advance or
> on demand. The actions on the session, start/stop and attach/detach,
> are perfectly symmetrical. The termination of the monitored target can
> cause its detachment, but the session remains accessible. Issuing the
> detach call on a session already detached by the kernel is harmless.
>
> The cost of start/stop is not impacted.
>
> The following properties are enforced:
> upon attachment => monitoring stopped
> during detachment => monitoring stopped
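
The two properties above can be modeled as a small state machine. What follows is a sketch and not the perfmon2 implementation: the model_* names are hypothetical, and the rule that start fails on a detached session is my assumption, not something stated in the description.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DETACHED (-1)

struct model_session {
    int target;   /* thread or CPU id, or MODEL_DETACHED */
    int running;  /* 1 while monitoring is active */
};

void model_session_init(struct model_session *s)
{
    assert(s != NULL);
    s->target = MODEL_DETACHED;
    s->running = 0;
}

/* Attach to a target. Fails if already attached or target is invalid. */
int model_attach(struct model_session *s, int target)
{
    assert(s != NULL);
    if (target < 0 || s->target != MODEL_DETACHED)
        return -1;
    s->target = target;
    s->running = 0;  /* upon attachment => monitoring stopped */
    return 0;
}

/* Detach. Calling this on an already detached session is harmless. */
int model_detach(struct model_session *s)
{
    assert(s != NULL);
    s->running = 0;  /* during detachment => monitoring stopped */
    s->target = MODEL_DETACHED;
    return 0;
}

/* Start and stop only flip one flag, so they stay cheap. */
int model_start(struct model_session *s)
{
    assert(s != NULL);
    if (s->target == MODEL_DETACHED)
        return -1;   /* assumption: nothing to monitor when detached */
    s->running = 1;
    return 0;
}

int model_stop(struct model_session *s)
{
    assert(s != NULL);
    s->running = 0;
    return 0;
}
```

Note that detaching does not destroy anything: the same session can be attached to another target afterwards.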
>
> 5) start and stop
>
> It must be possible for an application to start and stop monitoring
> at will and at any moment.
> Start and stop can be called very frequently and not just at the
> beginning and end of a session.
> This is especially likely for self-monitored threads where it is
> customary to monitor execution of
> only one function or loop. Thus those operations can be on the
> critical path and they must therefore
> be as lightweight as possible. See the discussion in the section
> about attachment and detachment.
>
>
> 6) reading the results
>
> The results are extracted by reading the PMU registers containing
> data (as opposed to configuration).
> The number of registers of interest can vary based on the PMU model,
> the type of measurement, and the events measured.
>
> Reading can occur at regular intervals, e.g., time-based user level
> sampling, and can therefore be on the critical path. Thus it must be
> as lightweight as possible. Given that the cost is dominated by the
> latency of accessing the PMU registers, it is important to only read
> the registers that are used. Thus, the call must provide vector
> arguments just like the calls to program the PMU.
>
> It must be possible to read the registers while the session is
> detached but also when it is attached to a
> thread or CPU.
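
A rough model of such a read call is sketched below. All names (model_read_pmds, model_counters, the hw_reads counter) are hypothetical. The hw array stands in for the hardware registers and saved for the state flushed to memory at detach time. The sketch shows that the cost scales with the registers requested and that reading works whether or not the session is attached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_PMDS 8  /* hypothetical number of data registers */

struct model_pmd_arg {
    uint16_t reg_num;    /* in: register to read */
    uint64_t reg_value;  /* out: value read */
};

struct model_counters {
    uint64_t hw[MODEL_NUM_PMDS];     /* stand-in for hardware registers */
    uint64_t saved[MODEL_NUM_PMDS];  /* state flushed to memory on detach */
    int attached;                    /* 1 while attached to a target */
    unsigned hw_reads;               /* number of simulated hardware accesses */
};

/* Read n data registers. Returns 0 on success, -1 on a bad argument. */
int model_read_pmds(struct model_counters *c,
                    struct model_pmd_arg *vec, int n)
{
    int i;

    assert(c != NULL);
    if (vec == NULL || n < 1)
        return -1;
    for (i = 0; i < n; i++)
        if (vec[i].reg_num >= MODEL_NUM_PMDS)
            return -1;

    for (i = 0; i < n; i++) {
        uint16_t r = vec[i].reg_num;

        if (c->attached) {
            vec[i].reg_value = c->hw[r];  /* expensive path */
            c->hw_reads++;
        } else {
            vec[i].reg_value = c->saved[r];
        }
    }
    return 0;
}
```

Reading two registers out of eight while attached costs two simulated hardware accesses, not eight.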
>
> 7) termination
>
> Termination of a session means all the associated resources are
> either released to the free pool or destroyed.
> After termination, no state remains. Termination implies stopping
> monitoring and detaching the session if
> necessary.
>
> For the purpose of termination, one has to differentiate between the
> monitored entity and the controlling entity.
> When a tool monitors a thread in another process, all the threads
> from the tool are controlling entities, and the
> monitored thread is the monitored entity. Any entity can vanish at any
> time.
>
> If the monitored entity terminates voluntarily, i.e., normal exit, or
> involuntarily, e.g., core dump, the kernel
> simply detaches the session; it does not destroy it.
>
> Until the last controlling entity disappears, the session remains
> accessible.
>
> There are situations where all the controlling entities disappear
> before the monitored entity. In this case, the session becomes useless
> because results can no longer be extracted, so it enters the zombie
> state. It will eventually be detached and its resources will be
> reclaimed by the kernel, i.e., the session will be terminated.
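
The termination rules can be summarized in a small model. This is a sketch with hypothetical names (model_owned_session and friends). One assumption to flag loudly: the description only says a zombie session is "eventually" reclaimed, and the model picks the exit of the monitored entity as the reclaim point purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum model_life { MODEL_LIVE, MODEL_ZOMBIE, MODEL_TERMINATED };

struct model_owned_session {
    int controllers;       /* controlling references still open */
    int attached;          /* 1 while attached to the monitored entity */
    enum model_life life;
};

/* Create a session that is attached and has the given number of
 * controlling references. */
void model_owned_init(struct model_owned_session *s, int controllers)
{
    assert(s != NULL && controllers > 0);
    s->controllers = controllers;
    s->attached = 1;
    s->life = MODEL_LIVE;
}

/* The monitored entity exits: the session is detached, not destroyed,
 * unless it was already a zombie. */
void model_monitored_exit(struct model_owned_session *s)
{
    assert(s != NULL);
    s->attached = 0;
    if (s->life == MODEL_ZOMBIE)
        s->life = MODEL_TERMINATED;  /* assumed reclaim point */
}

/* One controlling reference goes away. When the last one is gone, the
 * session is terminated, or becomes a zombie if still attached. */
void model_controller_exit(struct model_owned_session *s)
{
    assert(s != NULL);
    if (s->life != MODEL_LIVE)
        return;
    if (--s->controllers > 0)
        return;
    s->life = s->attached ? MODEL_ZOMBIE : MODEL_TERMINATED;
}
```

The two orderings differ only in the intermediate state: if the monitored entity goes first the session stays live and readable, and if the controllers go first it passes through the zombie state.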
>
> 8) extensibility
>
> There is already a vast diversity among existing PMU models, and this
> is unlikely to change. On the contrary, it is envisioned that the PMU
> will become a true value-add and that vendors will therefore try to
> differentiate themselves from one another. Moreover, the PMU will
> remain closely tied to the underlying micro-architecture. Therefore,
> it is very important to ensure that the monitoring interface will be
> able to adapt easily to future PMU models and their extended features,
> i.e., what is offered beyond counting events.
>
> It is important to realize that extensibility is not limited to
> supporting more PMU registers. It also includes
> supporting advanced sampling features or socket-level PMUs as
> opposed to just core-level PMUs.
>
> It may be necessary to extend the system calls with new generic or
> architecture-specific parameters, and to do so
> without simply adding new system calls.
>
> 9) current perfmon2 interface
>
> The perfmon2 interface design is guided by the principles described
> in the previous sections.
> We now explain each call in detail.
>
>
> a) session creation
>
> int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name,
> void *smpl_arg, size_t arg_size);
>
> The function creates the perfmon session and returns a file
> descriptor used to manipulate the session
> thereafter.
>
> The call takes several parameters, which are as follows:
> - ctx: pointer to a pfarg_ctx structure, which encapsulates all session parameters (see below)
> - smpl_name: used when sampling to designate which format to use
> - smpl_arg: points to format-specific arguments
> - arg_size: size of the structure passed in smpl_arg
>
> The pfarg_ctx structure is defined as follows:
> - flags: generic and arch-specific flags for the session
> - reserved: reserved for future extensions
>
> To provide for future extensions, the pfarg_ctx structure
> contains reserved fields. Reserved fields
> must be zeroed.
>
> To create a per-cpu session, the value PFM_CTX_SYSTEM_WIDE must
> be passed in flags.
>
> When in-kernel sampling is not used, smpl_name, smpl_arg, and arg_size
> must be 0.
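
The reason reserved fields must be zeroed is that the kernel can then reject anything it does not understand, which keeps those bits free for later extensions. A minimal sketch of such a check follows. The layout of model_ctx, the size of its reserved array, and the value of MODEL_CTX_SYSTEM_WIDE are all hypothetical stand-ins, not the real pfarg_ctx or PFM_CTX_SYSTEM_WIDE definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PFM_CTX_SYSTEM_WIDE; the real value is
 * not given in this description. */
#define MODEL_CTX_SYSTEM_WIDE 0x1u
#define MODEL_CTX_KNOWN_FLAGS MODEL_CTX_SYSTEM_WIDE

/* Hypothetical layout modeled on the flags + reserved description. */
struct model_ctx {
    uint32_t flags;
    uint32_t reserved[7];
};

/*
 * Returns 0 if the argument is acceptable, -1 otherwise. Unknown flag
 * bits and non-zero reserved fields are both rejected, so that a
 * future version can give them a meaning without breaking callers.
 */
int model_ctx_check(const struct model_ctx *ctx)
{
    size_t i;

    assert(ctx != NULL);
    if (ctx->flags & ~MODEL_CTX_KNOWN_FLAGS)
        return -1;
    for (i = 0; i < sizeof(ctx->reserved) / sizeof(ctx->reserved[0]); i++)
        if (ctx->reserved[i] != 0)
            return -1;
    return 0;
}
```

On the caller side, the usual habit is to memset() the whole structure to zero first and then set only the fields that are needed.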
>
> b) programming the registers
>
> int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
> int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);
>
> The calls are provided to program the configuration and data
> registers respectively. The parameters are
> as follows:
> - fd: file descriptor identifying the session
> - pmcs: pointer to pfarg_pmc structures
> - pmds: pointer to pfarg_pmd structures
> - n: number of elements in the pmcs or pmds vector
>
> It is possible to pass a vector of pfarg_pmc or pfarg_pmd structures.
> The minimum size is 1; the maximum size is
> determined by the system administrator.
>
> The pfarg_pmc structure is defined as follows:
> struct pfarg_pmc {
> u16 reg_num;
> u64 reg_value;
> u64 reserved[];
> };
>
> The pfarg_pmd structure is defined as follows:
> struct pfarg_pmd {
> u16 reg_num;
> u64 reg_value;
> u64 reserved[];
> };
>
> Although both structures are currently identical, they will
> differ as more functionality is added, so it is better
> to create two versions from the start.
>
> Provisions for extensions are provided by the reserved field in
> each structure.
>
>
> c) attachment and detachment
>
> int pfm_load_context(int fd, struct pfarg_load *ld);
> int pfm_unload_context(int fd);
>
>
> The session is identified by the file descriptor, fd.
>
> To attach, the targeted thread or CPU must be provided. For
> extensibility purposes, the target is passed in
> a structure which is defined as follows:
> struct pfarg_load {
> u32 target;
> u64 reserved[];
> };
> In per-thread mode, the target field must be set to the kernel
> thread identification (gettid()).
>
> In per-cpu mode, the target field must be set to the logical CPU
> identification as seen by the kernel.
> Furthermore, the caller must be running on the CPU to monitor
> otherwise the call fails.
>
> Extensions can be implemented using the reserved field.
>
>
> d) start and stop
>
> int pfm_start(int fd);
> int pfm_stop(int fd);
>
> The session is identified by the file descriptor fd.
>
> Currently no other parameters are supported for those calls.
>
>
> e) reading results
>
> int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);
>
>
> The session is identified by the file descriptor fd.
>
> Just like for programming the registers, it is possible to pass
> vectors of structures in pmds. The number
> of elements is passed in n.
>
>
> f) termination
>
> int close(int fd);
>
> To terminate a session, the file descriptor has to be closed. The
> semantics of file descriptor sharing
> apply, so if another reference to the session, i.e., another
> file descriptor, exists, the session will
> only be effectively destroyed once that reference disappears.
>
> Of course, the kernel closes all file descriptors on process
> termination, thus the associated sessions
> will eventually be destroyed.
>
> In per-cpu mode, it is not necessary, though recommended, to be
> on the monitored CPU to issue this call.
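
The descriptor sharing semantics are the ordinary POSIX ones and can be seen with any file descriptor. The sketch below uses a pipe as a stand-in for a session descriptor, since the perfmon2 calls are not generally available; demo_shared_descriptor and fd_is_open are hypothetical helper names, while pipe(), dup(), close() and fcntl() are the standard calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is an open descriptor in this process, 0 otherwise. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Duplicate a descriptor, close the original, and check that the
 * underlying object is still reachable through the copy.
 * Returns 1 if it is, 0 if not, -1 on a setup error.
 */
int demo_shared_descriptor(void)
{
    int p[2];
    int fd, copy, survived;

    if (pipe(p) != 0)
        return -1;
    close(p[1]);      /* only the read end is needed here */
    fd = p[0];

    copy = dup(fd);   /* second reference to the same object */
    if (copy < 0) {
        close(fd);
        return -1;
    }

    close(fd);        /* first reference goes away */
    survived = fd_is_open(copy) && !fd_is_open(fd);

    close(copy);      /* last reference: now the object is destroyed */
    return survived;
}
```

By analogy, a tool that passes a session descriptor to a helper via dup() or fork() keeps the session alive until every copy is closed.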
>
>
-------------------------------------------------------------------------
> Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
> Studies have shown that voting for your favorite open source project,
> along with a healthy diet, reduces your potential for chronic lameness
> and boredom. Vote Now at http://www.sourceforge.net/community/cca08
> _______________________________________________
> perfmon2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel