Re: Some tips on profiling

2010-10-24 Thread Stefan Fuhrmann

On 04.10.2010 16:58, Ramkumar Ramachandra wrote:

Hi Stefan,

Sorry for the late reply. This has been the first weekend
that I didn't spend in office for some time.

Stefan Fuhrmann writes:

I enabled it, but there's still some issue:
subversion/svnadmin/main.c:1892: undefined reference to 
`svn_fs_get_cache_config'


It builds here. Did you run autogen.sh before ./configure?

Yep, I did. I tried it several times again; same issue. Is the
build.conf broken?

Hm. Works here (Linux GCC 4.3.3 and Win32 MS Visual Studio 2010).



As soon as a larger number of patches got reviewed and merged,
I will prepare further changes for integration. So far, nobody had
free cycles to do the reviews.

I'm being stretched really thin myself- I sometimes have to sacrifice
several hours of sleep to keep up :| I'll try my best but I can't
promise. Also, there's the additional overhead of having to wait for
approvals- if I can't pull it off, I'll request a full committer to
take over.


I had the chance to check them out and apply them just now. They work
as expected. I'll submit some (uneducated) patch reviews to the list
and request a merge. I don't have access to the areas your patches
touch.

I really appreciate that. It would be great if someone had the time
to review the 3 commits to the membuffer cache integration branch.
The review should not require too much context knowledge. An
in-depth review will take a full day or so (like for an average sized
C++ class).

Thanks for the estimate- Instead of jumping between classes and
attempting to review it bit-by-bit, I'll try to allocate a Saturday or
Sunday to this task.

That would be awesome! Any weekend would do ;)

-- Stefan^2.


Re: Some tips on profiling

2010-10-24 Thread Daniel Shahaf
Stefan Fuhrmann wrote on Sun, Oct 24, 2010 at 20:55:04 +0200:
 On 04.10.2010 16:58, Ramkumar Ramachandra wrote:
 Hi Stefan,
 Sorry for the late reply. This has been the first weekend
 that I didn't spend in office for some time.
 Stefan Fuhrmann writes:
 I enabled it, but there's still some issue:
 subversion/svnadmin/main.c:1892: undefined reference to 
 `svn_fs_get_cache_config'

 It builds here. Did you run autogen.sh before ./configure?
 Yep, I did. I tried it several times again; same issue. Is the
 build.conf broken?
 Hm. Works here (Linux GCC 4.3.3 and Win32 MS Visual Studio 2010).

Is it linking against the installed libraries by accident?  This can
happen on Debian if an older Subversion (one that doesn't have that
function) is on the search path (e.g., by virtue of being in the
--prefix directory).


Re: Performance optimization - svn_stringbuf_appendbyte()

2010-10-24 Thread Stefan Fuhrmann

On 11.10.2010 17:07, Julian Foad wrote:

On Sun, 2010-10-10, Stefan Fuhrmann wrote:

On 07.10.2010 16:07, Julian Foad wrote:

New Revision: 997203

URL: http://svn.apache.org/viewvc?rev=997203&view=rev
Log:
Merge r985037, r985046, r995507 and r995603 from the performance branch.

These changes introduce the svn_stringbuf_appendbyte() function, which has
significantly less overhead than svn_stringbuf_appendbytes(), and can be
used in a number of places within our codebase.

Here are the results (see full patch attached):

Times:  appendbytes appendbyte0 appendbyte  (in ms)
run:  89  31  34
run:  88  30  35
run:  88  31  34
run:  88  30  34
run:  88  31  34
min:  88  30  34

This tells me that the hand-optimization is actually harmful and the
compiler does a 10% better job by itself.

Have I made a mistake?

My guess is that you might have made two mistakes actually.

Heh, OK.


First, the number of operations was relatively low - everything
in the low ms range could be due to secondary effects (and
still be repeatable).

I can't think of any reason why.  I ran the whole benchmark lots of
times.  Occasionally some of the times were a big chunk longer due to
other system activity.  Normally they were stable.  I also ran it
several times with 10 million ops instead of 1 million, and the results
were exactly 10x longer with the same degree of variability.


OK. It is just that sometimes, process startup or caching
artifacts (especially running tests in a particular order)
result in reproducible effects of that magnitude. But obviously
that was not the case here.

The most important problem would be compiler optimizations
or lack thereof. Since the numbers are very large (50..100
ticks per byte, depending on your CPU clock), I would assume
that you used a normal debug build for testing. In that case,

You're right.  I'll try again, with --disable-debug -O2.

Gotcha! The optimizer's impact is massive on that kind
of code.


the actual number of C statements has a large impact on
the execution time. See my results below for details.

What are the results for your system?

(I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)


Test code used (10^10 calls, never re-allocate):

  int i;
  unsigned char c;

  svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);

OK so you're avoiding any calls to the re-alloc.  My tests included
re-allocs, but were long enough strings that this wasn't much overhead.
Nevertheless I'll avoid re-allocs so we can compare results fairly.


  for (i = 0; i < 5000; ++i)
    {
      s->len = 0;
      for (c = 0; c < 200; ++c)
        svn_stringbuf_appendbyte (s, c);
    }


XEON 55xx / Core i7, hyper-threading on, 3GHz peak
64-bit Linux, GCC 4.3.3; ./configure --disable-debug

(1)  10,7s; IPC = 2.1
(2)  8,11s; IPC = 1.4
(3)  2,64s; IPC = 2.4
(4)  2,43s; IPC = 2.3

Grml. After normalizing the numbers to 10^9 iterations,
I forgot to adjust the example code. Sorry!

(1) use appendbytes gives 9 instructions in main, 59 in appbytes
(2) handle count==1 directly in appendbytes; 9 inst. in main, 26 in
appbytes
(3) appendbyte0 (same compiler output as the non-handtuned code);
  13 inst. in appbyte, 6 in main
(4) tr...@head appendbyte; 11 inst. in appbyte, 6 in main

Core2 2.4GHz, Win32, VS2010
(not using valgrind to count instructions here; also not testing (2))

(1)  17,0s release, 20,2s debug
(3)  10,6s release, 12,2s debug
(4)  9,7s release, 13,0s debug

With a test harness more similar to yours, and built with
--disable-debug -O2, here are my relative numbers (but only 1 million
outer loops, compared to your 50 million):

$ svn-c-test subr string 24
Times for 100 x 200 bytes, in seconds
       (1)appendbytes  (3)appendbyte0  (4)tr...@head  (5)macro
run:        7.03            2.06           1.37         0.53
run:        6.62            1.76           1.55         0.53
run:        6.53            1.67           1.44         0.53
run:        6.54            1.60           1.44         0.53
run:        6.52            1.84           1.37         0.53
min:        6.52            1.60           1.37         0.53

This agrees with what you found: the hand-tuned code gives a 10%
improvement.

This also shows another variant, writing the function as a macro,
reported in the column (5)macro.  This gives a further two-and-a-half
times speed-up in this test.


The macro variant (basically inlining) would take 2.65s for 10^9
total iterations. That is more or less what I got on Core i7 for the
non-inlined variant. My theory:

* function call & return are cheap jumps
* the code is small enough for the loop stream decoder to kick in
* your platform ABI uses stack frames and probably passes parameters
  on the stack (instead of registers).

The latter is also the case for 32 bit Windows and gives similar
results (9.7s vs. your 

Re: svn commit: r984984 - /subversion/branches/performance/subversion/libsvn_repos/reporter.c

2010-10-24 Thread Stefan Fuhrmann

On 14.10.2010 21:45, Hyrum K. Wright wrote:

On Thu, Aug 12, 2010 at 4:25 PM, stef...@apache.org wrote:

Author: stefan2
Date: Thu Aug 12 21:25:11 2010
New Revision: 984984

URL: http://svn.apache.org/viewvc?rev=984984&view=rev
Log:
Eliminate redundant revprop lookups: Exports / checkouts often
contain multiple nodes from the same revision. Therefore, we
cache essential revision info in the report baton for as long as
the report is running.

As a neat side-effect, this will also fix inconsistencies created by
changing revprops (in a parallel request) while a report is running.

A nice change.  In reviewing it for merge to trunk, I noticed that
cached information is never evicted.  Do we assume that the number of
revisions touched won't be significant, so we don't need to worry
about the size of the cache (and hence memory usage) blowing up?


For all real-world reports, that is not a problem because
they hardly span more than a few thousand revisions
(which is the whole point of the optimization). Even when
exporting *all* of apache.org (all tags, branches etc.) in
a single request, the memory went up by only a few MB.

If we want to be very paranoid, we could clear the cache before
adding the Nth entry, for some fixed threshold N. To do that, we
just need another pool in the baton to hold the cache and its
entries in a disposable chunk of memory.

-- Stefan^2.