Re: [PATCH] perf record: mmap output file - RFC
On 10/9/13 7:33 AM, Jiri Olsa wrote: + + if (size > remaining) { + memcpy(rec->mmap_addr + rec->mmap_offset, buf, remaining); + rec->bytes_written += remaining; + + size -= remaining; + buf += remaining; + + msync(rec->mmap_addr, rec->mmap_size, MS_ASYNC); + munmap(rec->mmap_addr, rec->mmap_size); do we need msync call if munmap is called right away? I was being overly cautious. I will remove it in v2. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: SNIP > + > + if (size > remaining) { > + memcpy(rec->mmap_addr + rec->mmap_offset, buf, remaining); > + rec->bytes_written += remaining; > + > + size -= remaining; > + buf += remaining; > + > + msync(rec->mmap_addr, rec->mmap_size, MS_ASYNC); > + munmap(rec->mmap_addr, rec->mmap_size); do we need msync call if munmap is called right away? > + goto do_mmap; SNIP > + if (rec->use_mmap) { > + off_t len = rec->bytes_at_mmap_start + rec->bytes_written; > + > + rec->use_mmap = false; > + msync(rec->mmap_addr, rec->mmap_size, MS_ASYNC); > + munmap(rec->mmap_addr, rec->mmap_size); ditto thanks, jirka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: SNIP + + if (size remaining) { + memcpy(rec-mmap_addr + rec-mmap_offset, buf, remaining); + rec-bytes_written += remaining; + + size -= remaining; + buf += remaining; + + msync(rec-mmap_addr, rec-mmap_size, MS_ASYNC); + munmap(rec-mmap_addr, rec-mmap_size); do we need msync call if munmap is called right away? + goto do_mmap; SNIP + if (rec-use_mmap) { + off_t len = rec-bytes_at_mmap_start + rec-bytes_written; + + rec-use_mmap = false; + msync(rec-mmap_addr, rec-mmap_size, MS_ASYNC); + munmap(rec-mmap_addr, rec-mmap_size); ditto thanks, jirka -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On 10/9/13 7:33 AM, Jiri Olsa wrote: + + if (size remaining) { + memcpy(rec-mmap_addr + rec-mmap_offset, buf, remaining); + rec-bytes_written += remaining; + + size -= remaining; + buf += remaining; + + msync(rec-mmap_addr, rec-mmap_size, MS_ASYNC); + munmap(rec-mmap_addr, rec-mmap_size); do we need msync call if munmap is called right away? I was being overly cautious. I will remove it in v2. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
* David Ahern wrote: > On 9/26/13 11:51 AM, Jiri Olsa wrote: > >but it's still faster, since we finally get perf a chance to sleep ;-) > > > >new time: > > real0m30.392s > > user0m0.041s > > sys 0m0.389s > > > >old time: > > real0m32.235s > > user0m3.080s > > sys 0m14.444s > > > > Another data point on the performance improvement of perf itself. > Using openssl speed as a workload and perf-stat to collect > information about the perf-record process only: > > perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes > > With write(): >158.606380 task-clock > 72 context-switches > 34 cpu-migrations > 5,400 page-faults >336,054,007 cycles >137,804,036 stalled-cycles-frontend > 74,505,914 stalled-cycles-backend >474,401,639 instructions > 91,246,072 branches > 1,968,289 branch-misses > > With mmap(): > 50.314270 task-clock > 61 context-switches > 7 cpu-migrations > 3,958 page-faults > 93,585,618 cycles > 64,878,225 stalled-cycles-frontend > 41,680,427 stalled-cycles-backend > 81,552,219 instructions > 15,301,389 branches >387,230 branch-misses > > So time, CPU cycles, instructions all drop by more than a factor of 3. Impressive! Btw., a perf stat bugreport: it's not clear at all that 'task-clock' is in units of milliseconds. Would be nice to print the unit as '(ms)' or so ... For the other entries the unit is obvious (a count) - with the exception of the task-clock. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
* David Ahern dsah...@gmail.com wrote: On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s Another data point on the performance improvement of perf itself. Using openssl speed as a workload and perf-stat to collect information about the perf-record process only: perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes With write(): 158.606380 task-clock 72 context-switches 34 cpu-migrations 5,400 page-faults 336,054,007 cycles 137,804,036 stalled-cycles-frontend 74,505,914 stalled-cycles-backend 474,401,639 instructions 91,246,072 branches 1,968,289 branch-misses With mmap(): 50.314270 task-clock 61 context-switches 7 cpu-migrations 3,958 page-faults 93,585,618 cycles 64,878,225 stalled-cycles-frontend 41,680,427 stalled-cycles-backend 81,552,219 instructions 15,301,389 branches 387,230 branch-misses So time, CPU cycles, instructions all drop by more than a factor of 3. Impressive! Btw., a perf stat bugreport: it's not clear at all that 'task-clock' is in units of milliseconds. Would be nice to print the unit as '(ms)' or so ... For the other entries the unit is obvious (a count) - with the exception of the task-clock. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
Hi Jiri and David, On Thu, 26 Sep 2013 19:51:05 +0200, Jiri Olsa wrote: > On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: >> When recording raw_syscalls for the entire system, e.g., >> perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 >> >> you end up with a negative feedback loop as perf itself calls >> write() fairly often. This patch mmap's the file in chunks of 64M >> at a time and copies events from the event buffers to the file >> avoiding write system calls. > > moved processing into userspace: > > 17.24% -17.10% libpthread-2.15.so [.] __write_nocancel > > ... > 0.07% +0.64% libc-2.15.so[.] __memcpy_sse2 > > 0.02% +51.84% libc-2.15.so[.] __memcpy_ssse3_back > > 0.01% +0.34% libc-2.15.so[.] __mempcpy_sse2 > > ... >> >> Before (with write syscall): >> >> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- >> sleep 1 >> [ perf record: Woken up 0 times to write data ] >> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 >> samples) ] >> >> After (using mmap): >> >> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- >> sleep 1 >> [ perf record: Woken up 31 times to write data ] > > > but it's still faster, since we finally get perf a chance to sleep ;-) > > new time: > real0m30.392s > user0m0.041s > sys 0m0.389s > > old time: > real0m32.235s > user0m3.080s > sys 0m14.444s But why the new user time took so short? I guess it should take at least 10 seconds or so. Any ideas? > > >> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] >> >> Before I get too far down this path I wanted to get comments on the approach. > > I think it's worthwhile doing this Indeed! It looks like a nice improvement. Thanks, Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s Another data point on the performance improvement of perf itself. Using openssl speed as a workload and perf-stat to collect information about the perf-record process only: perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes With write(): 158.606380 task-clock 72 context-switches 34 cpu-migrations 5,400 page-faults 336,054,007 cycles 137,804,036 stalled-cycles-frontend 74,505,914 stalled-cycles-backend 474,401,639 instructions 91,246,072 branches 1,968,289 branch-misses With mmap(): 50.314270 task-clock 61 context-switches 7 cpu-migrations 3,958 page-faults 93,585,618 cycles 64,878,225 stalled-cycles-frontend 41,680,427 stalled-cycles-backend 81,552,219 instructions 15,301,389 branches 387,230 branch-misses So time, CPU cycles, instructions all drop by more than a factor of 3. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Thu, Sep 26, 2013 at 12:12:02PM -0600, David Ahern wrote: > On 9/26/13 11:51 AM, Jiri Olsa wrote: > >but it's still faster, since we finally get perf a chance to sleep ;-) > > > >new time: > > real0m30.392s > > user0m0.041s > > sys 0m0.389s > > > >old time: > > real0m32.235s > > user0m3.080s > > sys 0m14.444s > > excellent. > > > > > > >>[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) > >>] > >> > >>Before I get too far down this path I wanted to get comments on the > >>approach. > > > >I think it's worthwhile doing this > > If you can toss it onto your queue to run through it other use cases > -- like the 'make -j' on kernel builds. I am doing the same here and > if all goes well I submit with some other patches next week. ok, np jirka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s excellent. [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this If you can toss it onto your queue to run through it other use cases -- like the 'make -j' on kernel builds. I am doing the same here and if all goes well I submit with some other patches next week. Thanks, David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: > When recording raw_syscalls for the entire system, e.g., > perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 > > you end up with a negative feedback loop as perf itself calls > write() fairly often. This patch mmap's the file in chunks of 64M > at a time and copies events from the event buffers to the file > avoiding write system calls. moved processing into userspace: 17.24% -17.10% libpthread-2.15.so [.] __write_nocancel ... 0.07% +0.64% libc-2.15.so[.] __memcpy_sse2 0.02% +51.84% libc-2.15.so[.] __memcpy_ssse3_back 0.01% +0.34% libc-2.15.so[.] __mempcpy_sse2 ... > > Before (with write syscall): > > perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- > sleep 1 > [ perf record: Woken up 0 times to write data ] > [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) > ] > > After (using mmap): > > perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- > sleep 1 > [ perf record: Woken up 31 times to write data ] but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s > [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] > > Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this jirka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: When recording raw_syscalls for the entire system, e.g., perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 you end up with a negative feedback loop as perf itself calls write() fairly often. This patch mmap's the file in chunks of 64M at a time and copies events from the event buffers to the file avoiding write system calls. moved processing into userspace: 17.24% -17.10% libpthread-2.15.so [.] __write_nocancel ... 0.07% +0.64% libc-2.15.so[.] __memcpy_sse2 0.02% +51.84% libc-2.15.so[.] __memcpy_ssse3_back 0.01% +0.34% libc-2.15.so[.] __mempcpy_sse2 ... Before (with write syscall): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ] After (using mmap): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 31 times to write data ] but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this jirka -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s excellent. [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this If you can toss it onto your queue to run through it other use cases -- like the 'make -j' on kernel builds. I am doing the same here and if all goes well I submit with some other patches next week. Thanks, David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On Thu, Sep 26, 2013 at 12:12:02PM -0600, David Ahern wrote: On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s excellent. [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this If you can toss it onto your queue to run through it other use cases -- like the 'make -j' on kernel builds. I am doing the same here and if all goes well I submit with some other patches next week. ok, np jirka -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
On 9/26/13 11:51 AM, Jiri Olsa wrote: but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s Another data point on the performance improvement of perf itself. Using openssl speed as a workload and perf-stat to collect information about the perf-record process only: perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes With write(): 158.606380 task-clock 72 context-switches 34 cpu-migrations 5,400 page-faults 336,054,007 cycles 137,804,036 stalled-cycles-frontend 74,505,914 stalled-cycles-backend 474,401,639 instructions 91,246,072 branches 1,968,289 branch-misses With mmap(): 50.314270 task-clock 61 context-switches 7 cpu-migrations 3,958 page-faults 93,585,618 cycles 64,878,225 stalled-cycles-frontend 41,680,427 stalled-cycles-backend 81,552,219 instructions 15,301,389 branches 387,230 branch-misses So time, CPU cycles, instructions all drop by more than a factor of 3. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf record: mmap output file - RFC
Hi Jiri and David, On Thu, 26 Sep 2013 19:51:05 +0200, Jiri Olsa wrote: On Sun, Sep 22, 2013 at 08:05:59PM -0600, David Ahern wrote: When recording raw_syscalls for the entire system, e.g., perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 you end up with a negative feedback loop as perf itself calls write() fairly often. This patch mmap's the file in chunks of 64M at a time and copies events from the event buffers to the file avoiding write system calls. moved processing into userspace: 17.24% -17.10% libpthread-2.15.so [.] __write_nocancel ... 0.07% +0.64% libc-2.15.so[.] __memcpy_sse2 0.02% +51.84% libc-2.15.so[.] __memcpy_ssse3_back 0.01% +0.34% libc-2.15.so[.] __mempcpy_sse2 ... Before (with write syscall): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ] After (using mmap): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 31 times to write data ] but it's still faster, since we finally get perf a chance to sleep ;-) new time: real0m30.392s user0m0.041s sys 0m0.389s old time: real0m32.235s user0m3.080s sys 0m14.444s But why the new user time took so short? I guess it should take at least 10 seconds or so. Any ideas? [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. I think it's worthwhile doing this Indeed! It looks like a nice improvement. Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] perf record: mmap output file - RFC
When recording raw_syscalls for the entire system, e.g., perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 you end up with a negative feedback loop as perf itself calls write() fairly often. This patch mmap's the file in chunks of 64M at a time and copies events from the event buffers to the file avoiding write system calls. Before (with write syscall): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ] After (using mmap): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 31 times to write data ] [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. Signed-off-by: David Ahern Cc: Ingo Molnar Cc: Frederic Weisbecker Cc: Peter Zijlstra Cc: Jiri Olsa Cc: Namhyung Kim Cc: Mike Galbraith Cc: Stephane Eranian --- tools/perf/builtin-record.c | 87 +++ 1 file changed, 87 insertions(+) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index da13840..45bb565 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -29,6 +29,9 @@ #include #include +/* mmap file big chunks at a time */ +#define MMAP_OUTPUT_SIZE (64*1024*1024) + #ifndef HAVE_ON_EXIT #ifndef ATEXIT_MAX #define ATEXIT_MAX 32 @@ -64,6 +67,14 @@ static void __handle_on_exit_funcs(void) struct perf_record { struct perf_tooltool; struct perf_record_opts opts; + + /* for MMAP based file writes */ + void*mmap_addr; + u64 bytes_at_mmap_start; /* bytes in file when mmap use starts */ + u64 mmap_offset;/* current location within mmap */ + size_t mmap_size; /* size of mmap segments */ + booluse_mmap; + u64 bytes_written; const char *output_name; struct perf_evlist *evlist; @@ -82,8 +93,66 @@ static void advance_output(struct perf_record *rec, size_t size) rec->bytes_written += size; } +static int do_mmap_output(struct perf_record *rec, void *buf, size_t size) +{ + u64 remaining; + off_t offset; + + if (rec->mmap_addr == NULL) { +do_mmap: + offset = rec->bytes_at_mmap_start + rec->bytes_written; + if (offset < (ssize_t) rec->mmap_size) { + rec->mmap_offset = offset; + offset = 0; + } else + rec->mmap_offset = 0; + + rec->mmap_addr = mmap(NULL, rec->mmap_size, +PROT_WRITE | PROT_READ, +MAP_SHARED, +rec->output, +offset); + + if (rec->mmap_addr == MAP_FAILED) { + pr_err("mmap failed: %d: %s\n", errno, strerror(errno)); + return -1; + } + + /* expand file to include this mmap segment */ + if (ftruncate(rec->output, offset + rec->mmap_size) != 0) { + pr_err("ftruncate failed\n"); + return -1; + } + } + + remaining = rec->mmap_size - rec->mmap_offset; + + if (size > remaining) { + memcpy(rec->mmap_addr + rec->mmap_offset, buf, remaining); + rec->bytes_written += remaining; + + size -= remaining; + buf += remaining; + + msync(rec->mmap_addr, rec->mmap_size, MS_ASYNC); + munmap(rec->mmap_addr, rec->mmap_size); + goto do_mmap; + } + + if (size) { + memcpy(rec->mmap_addr + rec->mmap_offset, buf, size); + rec->bytes_written += size; + rec->mmap_offset += size; + } + + return 0; +} + static int write_output(struct perf_record *rec, void *buf, size_t size) { + if (rec->use_mmap) + return do_mmap_output(rec, buf, size); + while (size) { int ret = write(rec->output, buf, size); @@ -546,6 +615,11 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv) if (forks) perf_evlist__start_workload(evsel_list); + if (!rec->opts.pipe_output && stat(output_name, ) == 0) { + rec->use_mmap = true; + rec->bytes_at_mmap_start = st.st_size - rec->bytes_written; + } + for (;;) { int hits = rec->samples; @@ -572,6 +646,18 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv) } } + if
[PATCH] perf record: mmap output file - RFC
When recording raw_syscalls for the entire system, e.g., perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 you end up with a negative feedback loop as perf itself calls write() fairly often. This patch mmap's the file in chunks of 64M at a time and copies events from the event buffers to the file avoiding write system calls. Before (with write syscall): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ] After (using mmap): perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1 [ perf record: Woken up 31 times to write data ] [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ] Before I get too far down this path I wanted to get comments on the approach. Signed-off-by: David Ahern dsah...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Jiri Olsa jo...@redhat.com Cc: Namhyung Kim namhy...@kernel.org Cc: Mike Galbraith efa...@gmx.de Cc: Stephane Eranian eran...@google.com --- tools/perf/builtin-record.c | 87 +++ 1 file changed, 87 insertions(+) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index da13840..45bb565 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -29,6 +29,9 @@ #include sched.h #include sys/mman.h +/* mmap file big chunks at a time */ +#define MMAP_OUTPUT_SIZE (64*1024*1024) + #ifndef HAVE_ON_EXIT #ifndef ATEXIT_MAX #define ATEXIT_MAX 32 @@ -64,6 +67,14 @@ static void __handle_on_exit_funcs(void) struct perf_record { struct perf_tooltool; struct perf_record_opts opts; + + /* for MMAP based file writes */ + void*mmap_addr; + u64 bytes_at_mmap_start; /* bytes in file when mmap use starts */ + u64 mmap_offset;/* current location within mmap */ + size_t mmap_size; /* size of mmap segments */ + booluse_mmap; + u64 bytes_written; const char *output_name; struct perf_evlist *evlist; @@ -82,8 +93,66 @@ static void advance_output(struct perf_record *rec, size_t size) rec-bytes_written += size; } +static int do_mmap_output(struct perf_record *rec, void *buf, size_t size) +{ + u64 remaining; + off_t offset; + + if (rec-mmap_addr == NULL) { +do_mmap: + offset = rec-bytes_at_mmap_start + rec-bytes_written; + if (offset (ssize_t) rec-mmap_size) { + rec-mmap_offset = offset; + offset = 0; + } else + rec-mmap_offset = 0; + + rec-mmap_addr = mmap(NULL, rec-mmap_size, +PROT_WRITE | PROT_READ, +MAP_SHARED, +rec-output, +offset); + + if (rec-mmap_addr == MAP_FAILED) { + pr_err(mmap failed: %d: %s\n, errno, strerror(errno)); + return -1; + } + + /* expand file to include this mmap segment */ + if (ftruncate(rec-output, offset + rec-mmap_size) != 0) { + pr_err(ftruncate failed\n); + return -1; + } + } + + remaining = rec-mmap_size - rec-mmap_offset; + + if (size remaining) { + memcpy(rec-mmap_addr + rec-mmap_offset, buf, remaining); + rec-bytes_written += remaining; + + size -= remaining; + buf += remaining; + + msync(rec-mmap_addr, rec-mmap_size, MS_ASYNC); + munmap(rec-mmap_addr, rec-mmap_size); + goto do_mmap; + } + + if (size) { + memcpy(rec-mmap_addr + rec-mmap_offset, buf, size); + rec-bytes_written += size; + rec-mmap_offset += size; + } + + return 0; +} + static int write_output(struct perf_record *rec, void *buf, size_t size) { + if (rec-use_mmap) + return do_mmap_output(rec, buf, size); + while (size) { int ret = write(rec-output, buf, size); @@ -546,6 +615,11 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv) if (forks) perf_evlist__start_workload(evsel_list); + if (!rec-opts.pipe_output stat(output_name, st) == 0) { + rec-use_mmap = true; + rec-bytes_at_mmap_start = st.st_size - rec-bytes_written; + } + for (;;) { int hits = rec-samples; @@ -572,6 +646,18 @@ static