Re: [PATCHSET 00/17] perf tools: Introduce new 'ftrace' command (v4)
On 130813 11:20:52, Namhyung Kim wrote:

> 8<-
> diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
> index 9e78ec19caeb..10590b794cae 100644
> --- a/tools/perf/builtin-ftrace.c
> +++ b/tools/perf/builtin-ftrace.c
> @@ -555,17 +555,25 @@ sleep:
>  	while (true) {
>  		int n = read(trace_fd, buf, sizeof(buf));
>
> -		if (n < 0)
> -			goto out_close;
> -		if (n == 0)
> +		if (n < 0) {
> +			if (errno == EINTR || errno == EAGAIN)
> +				break;
> +			perror("flush read");
> +			goto out_close2;
> +		} else if (n == 0)
>  			break;
> -		if (write(output_fd, buf, n) != n)
> -			goto out_close;
> +
> +		if (write(output_fd, buf, n) != n) {
> +			perror("flush write");
> +			goto out_close2;
> +		}
>
>  		byte_written += n;
>  	}
>  	fra->state = RECORD_STATE__DONE;
>
> +out_close2:
> +	close(output_fd);
>  out_close:
>  	close(trace_fd);
>  out:
> @@ -579,6 +587,8 @@ out:
>  		pthread_cond_signal(&recorder_ready_cond);
>  		pthread_mutex_unlock(&recorder_mutex);
>  	}
> +
> +	pr_debug2("done with %ld bytes\n", (long)byte_written);
>  	return fra;
>  }

Hmm, I already had hunk #3 in your git tree v4.

> @@ -1139,12 +1149,12 @@ retry:
>  		return record;
>  	}
>
> -	munmap(fra->map, pevent_get_page_size(ftrace->pevent));
> -	fra->map = NULL;
> -
>  	if (fra->done)
>  		return NULL;
>
> +	munmap(fra->map, pevent_get_page_size(ftrace->pevent));
> +	fra->map = NULL;
> +
>  	fra->offset += pevent_get_page_size(ftrace->pevent);
>  	if (fra->offset >= fra->size) {
>  		/* EOF */

After patching your tree with just the first 2 hunks, I'm able to get
ftrace-style function graphing out of perf.

# ./perf ftrace record df
Filesystem     1K-blocks     Used Available Use% Mounted on
<snip...>
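For reference, the flush-loop pattern those first two hunks introduce can be
sketched as a standalone helper. This is a hedged re-creation, not the perf
code itself: copy_trace(), the fd names, and the buffer size are illustrative
only. The point is the error split — EINTR/EAGAIN mean "stop flushing
quietly", anything else is a real failure.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Drain trace_fd into output_fd.  Returns bytes copied, or -1 on a real
 * error.  EINTR/EAGAIN terminate the loop without being treated as errors,
 * mirroring the intent of the patch above. */
static ssize_t copy_trace(int trace_fd, int output_fd)
{
	char buf[4096];
	ssize_t total = 0;

	while (1) {
		ssize_t n = read(trace_fd, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR || errno == EAGAIN)
				break;		/* interrupted or drained: stop quietly */
			perror("flush read");
			return -1;
		}
		if (n == 0)
			break;			/* EOF */

		if (write(output_fd, buf, n) != n) {
			perror("flush write");
			return -1;
		}
		total += n;
	}
	return total;
}
```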
# ./perf --no-pager ftrace show | head -20
overriding event (11) ftrace:funcgraph_entry with new print handler
overriding event (10) ftrace:funcgraph_exit with new print handler
 2)   0.686 us   |  finish_task_switch();
 2)   0.260 us   |  finish_wait();
 2)              |  mutex_lock() {
 2)   0.211 us   |    _cond_resched();
 2)   1.170 us   |  }
 2)   0.319 us   |  generic_pipe_buf_confirm();
 2)   0.261 us   |  generic_pipe_buf_map();
 2)   0.129 us   |  generic_pipe_buf_unmap();
 2)   0.747 us   |  anon_pipe_buf_release();
 2)   0.138 us   |  mutex_unlock();
 2)              |  __wake_up_sync_key() {
 2)   0.279 us   |    _raw_spin_lock_irqsave();
 2)   0.135 us   |    __wake_up_common();
 2)   0.133 us   |    __lock_text_start();
 2)   3.386 us   |  }
 2)              |  kill_fasync() {
 2)              |    smp_reschedule_interrupt() {
 2)   0.130 us   |      kvm_guest_apic_eoi_write();

Nice. Not sure if you intend to move all ftrace functionality over to perf
ftrace, but the function graph timings are a great start and something
sorely missing.

Do you intend to add -e event support or -l function-specific options? In
the real world, without filtering on events or functions, I've had systems
hang, plus the performance impact is too great.

A common invocation of ftrace via trace-cmd is:

# trace-cmd record -p function_graph -e irq:* -l do_IRQ ping -c1 www.redhat.com

So, a possible perf equivalent?

# ./perf ftrace record -e irq:* -e do_IRQ ping -c1 www.redhat.com

Thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
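Until perf ftrace grows -e/-l options, the same filtering can be done by
poking tracefs directly, which is roughly what trace-cmd does underneath. A
minimal C sketch follows; write_knob() is a hypothetical helper, and the
tracefs mount point and file names are the conventional ones but are
assumptions here (and writing them requires root):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TRACEFS "/sys/kernel/debug/tracing"

/* Write a value into a tracefs control file, e.g.:
 *   write_knob(TRACEFS "/set_graph_function", "do_IRQ");  // like -l do_IRQ
 *   write_knob(TRACEFS "/events/irq/enable", "1");        // like -e irq:*
 * Returns 0 on success, -1 on error. */
static int write_knob(const char *path, const char *val)
{
	ssize_t len = (ssize_t)strlen(val);
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	n = write(fd, val, len);
	close(fd);
	return n == len ? 0 : -1;
}
```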
Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea
On 130729 12:59:47, Jeremy Eder wrote:
> On 130729 23:57:31, Youquan Song wrote:
> > Hi Jeremy,
> >
> > I tried to reproduce your result and then fix the issue, but I have not
> > reproduced it yet.
> >
> > I run netperf-2.6.0 with one machine as the server (netserver) and the
> > other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
> > machine is used as both client and server. I do not reproduce the
> > performance drop issue. I also notice the result is not stable; sometimes
> > it is high, sometimes low. In summary, it is hard to get a definite
> > result.
> >
> > Can you tell me how to reproduce the issue? How do you get the C0 data?
> >
> > What's your config for the kernel? Do you enable CONFIG_NO_HZ_FULL=y or
> > only CONFIG_NO_HZ=y?
> >
> > Thanks
> > -Youquan
>
> Hi,
>
> To answer both your and Daniel's questions, those results used only
> CONFIG_NO_HZ=y.
>
> These network latency benchmarks are fickle creatures and need careful
> tuning to become reproducible. Plus there are BIOS implications, and
> tuning varies by vendor.
>
> Anyway, for the most part it's probably not stable because, in order to
> get any sort of reproducibility between runs, you need to do at least
> these steps:
>
> - ensure as little is running in userspace as possible
> - determine PCI affinity for the NIC
> - on both machines, isolate the socket connected to the NIC from
>   userspace tasks
> - turn off irqbalance and bind all IRQs for that NIC to a single core on
>   the same socket as the NIC
> - run netperf with -TX,Y where X,Y are the core numbers that you wish
>   netperf/netserver to run on, respectively
>
> For example, if your NIC is attached to socket 0 and socket 0 cores are
> enumerated 0-7, then:
>
> - set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
>   NIC
> - nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
>
> That should get you most of the way there. The -s 2 connects and waits 2
> seconds; I found this to help with the first few seconds' worth of data.
> Or you could just toss the first 2 seconds' worth; it seems to take that
> long to stabilize. What I mean is, if you're not using the -D1,1 option
> to netperf, you might not have seen that netperf tests seem to take a few
> seconds to stabilize even when properly tuned.
>
> I got the C0 data by running turbostat in parallel with each benchmark
> run, then grabbing the C-state data for the cores relevant to the test.
> In my case that was cores 4 and 6, where core 4 was where I put
> netperf/netserver and core 6 was where I put the NIC IRQs. Then I parsed
> that output into a format that this could interpret:
>
> https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
>
> I'm building a kernel from Rafael's tree and will try to confirm what
> Len already sent. Thanks everyone for looking into it.

Hi, sorry for the delay. In addition to the results I initially posted,
the below results confirm my initial data, plus what Len sent:

3.11-rc2 w/reverts             TCP_RR trans/s  54454.13
3.11-rc2 w/reverts + c0 lock   TCP_RR trans/s  55088.11
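The turbostat-to-histogram.py step described above is essentially just
bucketing C0 residency percentages. A hedged sketch of that bucketing in C —
the turbostat column parsing is omitted, and samples are assumed to already
arrive as doubles:

```c
#include <stddef.h>

#define NBUCKETS 10

/* Count per-sample C0 residency percentages into a 10-bin histogram:
 * counts[i] receives samples with i*10 <= pct < (i+1)*10, and 100%
 * lands in the last bucket. */
static void c0_histogram(const double *pct, size_t n,
			 unsigned counts[NBUCKETS])
{
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		counts[i] = 0;

	for (i = 0; i < n; i++) {
		int b = (int)(pct[i] / 10.0);

		if (b < 0)
			b = 0;
		if (b >= NBUCKETS)
			b = NBUCKETS - 1;	/* clamp 100% into the top bin */
		counts[b]++;
	}
}
```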
Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea
On 130729 23:57:31, Youquan Song wrote:
> Hi Jeremy,
>
> I tried to reproduce your result and then fix the issue, but I have not
> reproduced it yet.
>
> I run netperf-2.6.0 with one machine as the server (netserver) and the
> other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
> machine is used as both client and server. I do not reproduce the
> performance drop issue. I also notice the result is not stable; sometimes
> it is high, sometimes low. In summary, it is hard to get a definite
> result.
>
> Can you tell me how to reproduce the issue? How do you get the C0 data?
>
> What's your config for the kernel? Do you enable CONFIG_NO_HZ_FULL=y or
> only CONFIG_NO_HZ=y?
>
> Thanks
> -Youquan

Hi,

To answer both your and Daniel's questions, those results used only
CONFIG_NO_HZ=y.

These network latency benchmarks are fickle creatures and need careful
tuning to become reproducible. Plus there are BIOS implications, and
tuning varies by vendor.

Anyway, for the most part it's probably not stable because, in order to
get any sort of reproducibility between runs, you need to do at least
these steps:

- ensure as little is running in userspace as possible
- determine PCI affinity for the NIC
- on both machines, isolate the socket connected to the NIC from
  userspace tasks
- turn off irqbalance and bind all IRQs for that NIC to a single core on
  the same socket as the NIC
- run netperf with -TX,Y where X,Y are the core numbers that you wish
  netperf/netserver to run on, respectively

For example, if your NIC is attached to socket 0 and socket 0 cores are
enumerated 0-7, then:

- set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
  NIC
- nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2

That should get you most of the way there. The -s 2 connects and waits 2
seconds; I found this to help with the first few seconds' worth of data.
Or you could just toss the first 2 seconds' worth; it seems to take that
long to stabilize. What I mean is, if you're not using the -D1,1 option
to netperf, you might not have seen that netperf tests seem to take a few
seconds to stabilize even when properly tuned.

I got the C0 data by running turbostat in parallel with each benchmark
run, then grabbing the C-state data for the cores relevant to the test.
In my case that was cores 4 and 6, where core 4 was where I put
netperf/netserver and core 6 was where I put the NIC IRQs. Then I parsed
that output into a format that this could interpret:

https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py

I'm building a kernel from Rafael's tree and will try to confirm what Len
already sent. Thanks everyone for looking into it.
RFC: revert request for cpuidle patches e11538d1 and 69a37bea
better performance), by reverting commit
69a37beabf1f0a6705c08e879bdd5d82ff6486c4.

While taking into account the changing landscape with regards to CPU
governors, and both P- and C-states, we think that a single thread should
still be able to achieve maximum performance. With the current upstream
code base, workloads with a low number of "hot" threads are not able to
achieve maximum performance "out of the box".

Also recently, Intel's LAD has posted upstream performance results that
include an interesting column in their table of results. See upstream
commit 0a4db187a999, column #3 within the "Performance numbers" table. It
seems known, even within Intel, that the deeper C-states incur a cost too
high to bear, as they've explicitly tested restricting the CPU to the
shallow C-states C0/C1.

--
Jeremy Eder
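For anyone wanting to run the same restrict-to-shallow-C-states experiment,
one common mechanism is the PM QoS interface /dev/cpu_dma_latency: write a
32-bit latency bound and hold the fd open for the duration of the benchmark
(the request is dropped when the fd closes). A hedged sketch follows;
hold_cpu_latency() is illustrative, and the path is taken as a parameter so
the helper can be exercised against any file:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Open the PM QoS device and register a maximum-latency request, e.g.
 *   int fd = hold_cpu_latency("/dev/cpu_dma_latency", 0);
 * Returns the open fd (which the caller must keep open while the bound
 * should apply), or -1 on error. */
static int hold_cpu_latency(const char *path, int32_t max_latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, &max_latency_us, sizeof(max_latency_us))
			!= sizeof(max_latency_us)) {
		perror("write latency");
		close(fd);
		return -1;
	}
	return fd;
}
```

A latency bound of 0 effectively keeps CPUs out of the deep idle states, which
may be what the "c0 lock" variant in the results earlier in this thread refers
to (an assumption on my part).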