Re: [PATCHSET 00/17] perf tools: Introduce new 'ftrace' command (v4)

2013-08-28 Thread Jeremy Eder
On 130813 11:20:52, Namhyung Kim wrote:
> 8<-
> diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
> index 9e78ec19caeb..10590b794cae 100644
> --- a/tools/perf/builtin-ftrace.c
> +++ b/tools/perf/builtin-ftrace.c
> @@ -555,17 +555,25 @@ sleep:
>   while (true) {
>   int n = read(trace_fd, buf, sizeof(buf));
>  
> - if (n < 0)
> - goto out_close;
> - if (n == 0)
> + if (n < 0) {
> + if (errno == EINTR || errno == EAGAIN)
> + break;
> + perror("flush read");
> + goto out_close2;
> + } else if (n == 0)
>   break;
> - if (write(output_fd, buf, n) != n)
> - goto out_close;
> +
> + if (write(output_fd, buf, n) != n) {
> + perror("flush write");
> + goto out_close2;
> + }
>  
>   byte_written += n;
>   }
>   fra->state = RECORD_STATE__DONE;
>  
> +out_close2:
> + close(output_fd);
>  out_close:
>   close(trace_fd);
>  out:
> @@ -579,6 +587,8 @@ out:
>   pthread_cond_signal(&recorder_ready_cond);
>   pthread_mutex_unlock(&recorder_mutex);
>   }
> +
> + pr_debug2("done with %ld bytes\n", (long)byte_written);
>   return fra;
>  }
>  
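
(For anyone reading along, the flush loop after this change boils down to
the pattern below; a minimal standalone sketch, not the exact perf code,
with the buffer size and descriptor handling as placeholders.)

/* Drain whatever is still queued on trace_fd into output_fd: stop
 * cleanly on EINTR/EAGAIN or EOF, bail out on real errors, and report
 * how many bytes were copied. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t drain(int trace_fd, int output_fd)
{
    char buf[4096];            /* placeholder size */
    ssize_t byte_written = 0;

    while (1) {
        ssize_t n = read(trace_fd, buf, sizeof(buf));

        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN)
                break;         /* nothing more to flush right now */
            perror("flush read");
            return -1;
        }
        if (n == 0)
            break;             /* EOF */

        if (write(output_fd, buf, n) != n) {
            perror("flush write");
            return -1;
        }
        byte_written += n;
    }
    return byte_written;
}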

Hmm, I already had hunk #3 (the one below) in your git tree v4.

> @@ -1139,12 +1149,12 @@ retry:
>   return record;
>   }
>  
> - munmap(fra->map, pevent_get_page_size(ftrace->pevent));
> - fra->map = NULL;
> -
>   if (fra->done)
>   return NULL;
>  
> + munmap(fra->map, pevent_get_page_size(ftrace->pevent));
> + fra->map = NULL;
> +
>   fra->offset += pevent_get_page_size(ftrace->pevent);
>   if (fra->offset >= fra->size) {
>   /* EOF */


After patching your tree with just the first 2 hunks, I'm able to
get ftrace-style function graphing out of perf.

# ./perf ftrace record df
Filesystem   1K-blocks Used Available Use% Mounted on
...

# ./perf --no-pager ftrace show | head -20
overriding event (11) ftrace:funcgraph_entry with new print handler
overriding event (10) ftrace:funcgraph_exit with new print handler
  2)   0.686 us |  finish_task_switch();
  2)   0.260 us |  finish_wait();
  2)|  mutex_lock() {
  2)   0.211 us |_cond_resched();
  2)   1.170 us |  }
  2)   0.319 us |  generic_pipe_buf_confirm();
  2)   0.261 us |  generic_pipe_buf_map();
  2)   0.129 us |  generic_pipe_buf_unmap();
  2)   0.747 us |  anon_pipe_buf_release();
  2)   0.138 us |  mutex_unlock();
  2)|  __wake_up_sync_key() {
  2)   0.279 us |_raw_spin_lock_irqsave();
  2)   0.135 us |__wake_up_common();
  2)   0.133 us |__lock_text_start();
  2)   3.386 us |  }
  2)|  kill_fasync() {
  2)|  smp_reschedule_interrupt() {
  2)   0.130 us |kvm_guest_apic_eoi_write();

Nice.

Not sure if you intend to move all ftrace functionality over to
perf ftrace, but the function graph timings are a great start and
something sorely missing.

Do you intend to add -e event support or -l function-specific options?
In the real world, without filtering on events or functions, I've had
systems hang, and the performance impact is too great.

A common invocation of ftrace via trace-cmd is:
# trace-cmd record -p function_graph -e irq:* -l do_IRQ ping -c1 www.redhat.com

So, a possible perf equivalent:
# ./perf ftrace record -e irq:* -e do_IRQ ping -c1 www.redhat.com
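
For context, options like those would presumably end up driving the same
tracefs knobs that trace-cmd pokes today.  A rough C sketch of that mapping
(the do_IRQ / irq choices just mirror the example above; assumes debugfs is
mounted at /sys/kernel/debug):

/* Rough sketch: flip the stock ftrace knobs that a -p/-l/-e style
 * interface would presumably drive.  Needs root; not perf code. */
#include <stdio.h>

#define TRACEFS "/sys/kernel/debug/tracing/"

static int put(const char *file, const char *val)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), TRACEFS "%s", file);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    put("current_tracer", "function_graph");  /* -p function_graph */
    put("set_ftrace_filter", "do_IRQ");       /* -l do_IRQ */
    put("events/irq/enable", "1");            /* -e irq:* */
    put("tracing_on", "1");
    return 0;
}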

Thanks!


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-08-02 Thread Jeremy Eder
On 130729 12:59:47, Jeremy Eder wrote:
> On 130729 23:57:31, Youquan Song wrote:
> > Hi Jeremy,
> > 
> > I tried to reproduce your result and then fix the issue, but I have not
> > reproduced it yet.
> > 
> > I run netperf-2.6.0 on one machine as the server (netserver), and on the
> > other machine run: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
> > machine is used as both client and server. I cannot reproduce the
> > performance drop, and I also notice the result is not stable: sometimes
> > it is high, sometimes it is low. In summary, it is hard to get a
> > definitive result.
> > 
> > Can you tell me how to reproduce the issue, and how you get the C0
> > data?
> > 
> > What's your kernel config?  Do you enable CONFIG_NO_HZ_FULL=y or
> > only CONFIG_NO_HZ=y?
> > 
> > 
> > Thanks
> > -Youquan 
> 
> Hi,
> 
> To answer both your and Daniel's question, those results used only
> CONFIG_NO_HZ=y.
> 
> These network latency benchmarks are fickle creatures, and need careful
> tuning to become reproducible.  Plus there are BIOS implications and tuning
> varies by vendor.
> 
> Anyway, it's probably not stable because to get any sort of
> reproducibility between runs you need to do at least these steps:
> 
> - ensure as little is running in userspace as possible
> - determine PCI affinity for the NIC
> - on both machines, isolate the socket connected to the NIC from userspace
>   tasks
> - turn off irqbalance and bind all IRQs for that NIC to a single core on
>   the same socket as the NIC
> - run netperf with -TX,Y where X,Y are core numbers that you wish
>   netperf/netserver to run on, respectively.
> 
> For example, if your NIC is attached to socket 0 and socket 0 cores are
> enumerated 0-7, then:
> 
> - set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
>   NIC.
> - nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
> 
> That should get you most of the way there.  The -s 2 connects and then
> waits 2 seconds; I found this helps with the first few seconds' worth of
> data.  Or you could just toss the first 2 seconds' worth, since it seems
> to take that long to stabilize.  What I mean is, if you're not using the
> -D1,1 option to netperf, you might not have noticed that netperf tests
> seem to take a few seconds to stabilize even when properly tuned.
> 
> I got the C0 data by running turbostat in parallel with each benchmark run,
> then grabbing the C-state data for the cores relevant to the test.  In my
> case that was cores 4 and 6, where core 4 was where I put netperf/netserver
> and core 6 was where I put the NIC IRQs.  Then I parsed that output into
> a format that this script could interpret:
> 
> https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
> 
> I'm building a kernel from Rafael's tree and will try to confirm what Len
> already sent.  Thanks everyone for looking into it.


Hi, sorry for the delay.  Following up on the results I initially posted,
the numbers below confirm my initial data as well as what Len sent:

3.11-rc2 w/reverts
TCP_RR trans/s 54454.13

3.11-rc2 w/reverts + c0 lock
TCP_RR trans/s 55088.11
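
For reference, one common way to keep cores pinned in C0 from userspace
(whether or not that is exactly what the "c0 lock" above was) is to hold a
PM QoS request open via /dev/cpu_dma_latency for the duration of the run.
A minimal sketch of that technique:

/* Hold a PM QoS cpu_dma_latency request of 0 us until killed; the
 * kernel honours the request only while the fd stays open. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t latency_us = 0;    /* 0 effectively keeps cores in C0/C1 */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0 || write(fd, &latency_us, sizeof(latency_us)) < 0) {
        perror("/dev/cpu_dma_latency");
        return 1;
    }
    pause();                   /* hold the request until killed */
    close(fd);
    return 0;
}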


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Jeremy Eder
On 130729 23:57:31, Youquan Song wrote:
> Hi Jeremy,
> 
> I tried to reproduce your result and then fix the issue, but I have not
> reproduced it yet.
> 
> I run netperf-2.6.0 on one machine as the server (netserver), and on the
> other machine run: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
> machine is used as both client and server. I cannot reproduce the
> performance drop, and I also notice the result is not stable: sometimes
> it is high, sometimes it is low. In summary, it is hard to get a
> definitive result.
> 
> Can you tell me how to reproduce the issue, and how you get the C0
> data?
> 
> What's your kernel config?  Do you enable CONFIG_NO_HZ_FULL=y or
> only CONFIG_NO_HZ=y?
> 
> 
> Thanks
> -Youquan 

Hi,

To answer both your and Daniel's question, those results used only
CONFIG_NO_HZ=y.

These network latency benchmarks are fickle creatures, and need careful
tuning to become reproducible.  Plus there are BIOS implications and tuning
varies by vendor.

Anyway, it's probably not stable because to get any sort of
reproducibility between runs you need to do at least these steps:

- ensure as little is running in userspace as possible
- determine PCI affinity for the NIC
- on both machines, isolate the socket connected to the NIC from userspace
  tasks
- turn off irqbalance and bind all IRQs for that NIC to a single core on
  the same socket as the NIC
- run netperf with -TX,Y where X,Y are core numbers that you wish
  netperf/netserver to run on, respectively.

For example, if your NIC is attached to socket 0 and socket 0 cores are
enumerated 0-7, then:

- set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
  NIC.
- nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
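
If it helps, here is a minimal C sketch of that affinity step; it just
writes a CPU id into /proc/irq/<N>/smp_affinity_list for each IRQ given on
the command line (finding the NIC's vector numbers in /proc/interrupts is
left out):

/* Pin the given IRQs to one CPU, e.g.:  ./pin_irqs 6 120 121 122
 * (CPU first, then IRQ numbers).  Needs root; stop irqbalance first. */
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <cpu> <irq>...\n", argv[0]);
        return 1;
    }
    for (i = 2; i < argc; i++) {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity_list", argv[i]);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            continue;
        }
        fputs(argv[1], f);     /* CPU list, e.g. "6" */
        fclose(f);
    }
    return 0;
}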

That should get you most of the way there.  The -s 2 connects and then
waits 2 seconds; I found this helps with the first few seconds' worth of
data.  Or you could just toss the first 2 seconds' worth, since it seems
to take that long to stabilize.  What I mean is, if you're not using the
-D1,1 option to netperf, you might not have noticed that netperf tests
seem to take a few seconds to stabilize even when properly tuned.

I got the C0 data by running turbostat in parallel with each benchmark run,
then grabbing the C-state data for the cores relevant to the test.  In my
case that was cores 4 and 6, where core 4 was where I put netperf/netserver
and core 6 was where I put the NIC IRQs.  Then I parsed that output into
a format that this script could interpret:

https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py

I'm building a kernel from Rafael's tree and will try to confirm what Len
already sent.  Thanks everyone for looking into it.


RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Jeremy Eder
 better
performance), by reverting commit 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.

While taking into account the changing landscape with regard to CPU
governors and both P- and C-states, we think that a single thread should
still be able to achieve maximum performance.  With the current upstream
code base, workloads with a low number of "hot" threads are not able to
achieve maximum performance "out of the box".

Also recently, Intel's LAD has posted upstream performance results that
include an interesting column in their table of results.  See upstream
commit 0a4db187a999, column #3 within the "Performance numbers" table.  It
seems known, even within Intel, that the deeper C-states incur a cost too
high to bear, as they've explicitly tested restricting the CPU to
C-states C0/C1.

-- Jeremy Eder