Re: My take on the diff-optimizations-bytes branch

2011-01-26 Thread Stefan Fuhrmann

On 26.01.2011 03:09, Johan Corveleyn wrote:

On Tue, Jan 25, 2011 at 1:37 AM, Stefan Fuhrmann eq...@web.de wrote:
[ ... snip ...]

And, as promised, here some ideas how to get more
speed from the generic code. Your latest commit:

+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+      /* Skip quickly over the stuff between EOLs. */
+      for (i = 0, can_read_word = TRUE; i < file_len; i++)
+        can_read_word = can_read_word
+          && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+      while (can_read_word)
+        {
+          for (i = 1, is_match = TRUE; i < file_len; i++)
+            is_match = is_match
+              && (   *(const apr_size_t *)file[0].curp
+                  == *(const apr_size_t *)file[i].curp);
+
+          if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
+            break;
+
+          for (i = 0; i < file_len; i++)
+            file[i].curp += sizeof(apr_size_t);
+          for (i = 0, can_read_word = TRUE; i < file_len; i++)
+            can_read_word = can_read_word
+              && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+        }
+
+#endif

could be changed to something like the following.
Please note that I haven't tested any of this:

Thanks. There was one error in your suggestion, which I found out
after testing. See below.

I was afraid so. I just hacked the code directly
into Thunderbird and was unsure whether the
exit behavior was correct.

/* Determine how far we may advance with chunky ops without reaching
 * endp for any of the files.
 * Signedness is important here if curp gets close to endp.
 */
apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
for (i = 1; i < file_len; i++)
{
    apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
    if (delta < max_delta)
        max_delta = delta;
}

/* the former while() loop */
is_match = TRUE;
for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
{
    apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
    if (contains_eol(chunk))
        break;

    for (i = 1; i < file_len; i++)
        if (chunk != *(const apr_size_t *)(file[i].curp + delta))
        {
            is_match = FALSE;

Here, I inserted:

 delta -= sizeof(apr_size_t);

because otherwise delta would be incremented once too often: the
counting expression of the outer for-loop still runs before the loop
terminates on !is_match. Maybe there is a cleaner/clearer way to break
out of the outer for-loop without incrementing delta again, but for
now I've committed it with this change (r1063565).

Yeah, I really wished for a goto there. The clean
solution, I guess, would be moving that section to
some sub-routine returning the final value of delta.

            break;
        }
}

/* We either found a mismatch or an EOL at or shortly behind curp+delta
  * or we cannot proceed with chunky ops without exceeding endp.
  * In any way, everything up to curp + delta is equal and not an EOL.
  */
for (i = 0; i < file_len; i++)
file[i].curp += delta;

Thanks. This gives on my machine/example another 15-20% performance
increase (datasources_open time going down from ~21 s to ~17 s). We
should probably do the same for suffix scanning, but I'm too tired
right now :-) (and suffix scanning is more difficult to grok, so not a
good idea to do at 3 am).

Don't get yourself burned out ;)

-- Stefan^2.



Re: My take on the diff-optimizations-bytes branch

2011-01-25 Thread Johan Corveleyn
On Tue, Jan 25, 2011 at 1:37 AM, Stefan Fuhrmann eq...@web.de wrote:
[ ... snip ...]
 And, as promised, here some ideas how to get more
 speed from the generic code. Your latest commit:

 +#if SVN_UNALIGNED_ACCESS_IS_OK
 +
 +      /* Skip quickly over the stuff between EOLs. */
 +      for (i = 0, can_read_word = TRUE; i < file_len; i++)
 +        can_read_word = can_read_word
 +          && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
 +      while (can_read_word)
 +        {
 +          for (i = 1, is_match = TRUE; i < file_len; i++)
 +            is_match = is_match
 +              && (   *(const apr_size_t *)file[0].curp
 +                  == *(const apr_size_t *)file[i].curp);
 +
 +          if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
 +            break;
 +
 +          for (i = 0; i < file_len; i++)
 +            file[i].curp += sizeof(apr_size_t);
 +          for (i = 0, can_read_word = TRUE; i < file_len; i++)
 +            can_read_word = can_read_word
 +              && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
 +        }
 +
 +#endif

 could be changed to something like the following.
 Please note that I haven't tested any of this:

Thanks. There was one error in your suggestion, which I found out
after testing. See below.

 /* Determine how far we may advance with chunky ops without reaching
  * endp for any of the files.
  * Signedness is important here if curp gets close to endp.
  */
 apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
 for (i = 1; i < file_len; i++)
 {
    apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
    if (delta < max_delta)
        max_delta = delta;
 }

 /* the former while() loop */
 is_match = TRUE;
 for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
 {
    apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
    if (contains_eol(chunk))
        break;

    for (i = 1; i < file_len; i++)
        if (chunk != *(const apr_size_t *)(file[i].curp + delta))
        {
            is_match = FALSE;

Here, I inserted:

delta -= sizeof(apr_size_t);

because otherwise delta would be incremented once too often: the
counting expression of the outer for-loop still runs before the loop
terminates on !is_match. Maybe there is a cleaner/clearer way to break
out of the outer for-loop without incrementing delta again, but for
now I've committed it with this change (r1063565).

            break;
        }
 }

 /* We either found a mismatch or an EOL at or shortly behind curp+delta
  * or we cannot proceed with chunky ops without exceeding endp.
  * In any way, everything up to curp + delta is equal and not an EOL.
  */
 for (i = 0; i < file_len; i++)
    file[i].curp += delta;

Thanks. This gives on my machine/example another 15-20% performance
increase (datasources_open time going down from ~21 s to ~17 s). We
should probably do the same for suffix scanning, but I'm too tired
right now :-) (and suffix scanning is more difficult to grok, so not a
good idea to do at 3 am).

Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-24 Thread Stefan Fuhrmann

On 23.01.2011 22:46, Johan Corveleyn wrote:

On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de  wrote:

On 03.01.2011 02:14, Johan Corveleyn wrote:

It would be interesting to see where the biggest gains
are coming from (I'm guessing from the per-machine-word
reading/comparing; I'd like to try that first, maybe together with the
allocating of the file[] on the stack; I'd like to avoid
special-casing file_len==2, splitting up functions in *_slow and
*_fast, because it makes it a lot more difficult imho (so I'd only
like to do that if it's a significant contributor)). But if you want
to leave that as an exercise for the reader, that's fine as well :-).

Exercise is certainly not a bad thing ;)

But I think, the stack variable is certainly helpful
and easy to do. While chunky operation gives
a *big* boost, it is much more difficult to code if
you need to compare multiple sources. It should
be possible, though.

Ok, I finished the exercise :-).

- File info structures on stack instead of on heap: committed in
r1060625. Gives 10% boost.

- chunky operation for multiple sources: committed in r1062532. Gives
~250% boost on my machine.

On 64 bit machines, it should give another 50 to 100% throughput.
I think that kind of speed-up in an important function justifies the
somewhat unfavorable #if sections, mask voodoo etc.

For suffix scanning, I made a couple of changes to the logic you sent
in your patch (as you remember, there was a problem with the
implementation of suffix scanning [1]). It was (in
scan_backward_fast):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
+      && (file[1].curp - sizeof(apr_size_t) < file[1].curp))

Args! Copy'n'pasto.

+{
+  do
+{
+  file[0].curp -= sizeof(apr_size_t);
+  file[1].curp -= sizeof(apr_size_t);
+}
+  while (   (file[0].curp >= min_curp0)
+         && (file[1].curp >= min_curp1)
+         && (   *(const apr_size_t*)(file[0].curp + 1)
+             == *(const apr_size_t*)(file[1].curp + 1)));
+
+  file[0].curp += sizeof(apr_size_t);
+  file[1].curp += sizeof(apr_size_t);
+}
]]]

snip

Now, you suggested in a previous post in this thread that hardcoding
file_len to 2 should speed it up with up to 50%. And you are correct
(on my machine, it's sped up by ~40% if I replace all file_len's in
find_identical_* with 2).

So, that leads me to conclude that I should probably throw away the
generic code, and replace it with 3 variants: find_identical_*2,
find_identical_*3 and find_identical_*4 (once I write *2, the rest
should be mostly copy-n-paste). As you said, this should also vastly
simplify the code (drop the for loops, simpler conditionals, ...).

As discussed on IRC yesterday, we should first exploit
further optimization potential in the general code. And
only after that is maxed out, use that as a template for
the specific implementations.

And, as promised, here some ideas how to get more
speed from the generic code. Your latest commit:

+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  /* Skip quickly over the stuff between EOLs. */
+  for (i = 0, can_read_word = TRUE; i < file_len; i++)
+    can_read_word = can_read_word
+      && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+  while (can_read_word)
+    {
+      for (i = 1, is_match = TRUE; i < file_len; i++)
+        is_match = is_match
+          && (   *(const apr_size_t *)file[0].curp
+              == *(const apr_size_t *)file[i].curp);
+
+      if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
+        break;
+
+      for (i = 0; i < file_len; i++)
+        file[i].curp += sizeof(apr_size_t);
+      for (i = 0, can_read_word = TRUE; i < file_len; i++)
+        can_read_word = can_read_word
+          && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+    }
+
+#endif

could be changed to something like the following.
Please note that I haven't tested any of this:

/* Determine how far we may advance with chunky ops without reaching
 * endp for any of the files.
 * Signedness is important here if curp gets close to endp.
 */
apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
for (i = 1; i < file_len; i++)
{
apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
if (delta < max_delta)
max_delta = delta;
}

/* the former while() loop */
is_match = TRUE;
for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
{
apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
if (contains_eol(chunk))
break;

for (i = 1; i < file_len; i++)
if (chunk != *(const apr_size_t *)(file[i].curp + delta))
{
is_match = FALSE;
break;
}
}

/* We either found a mismatch or an EOL at or shortly behind curp+delta
 * or we cannot proceed with chunky ops without exceeding endp.
 * In any way, everything up 

Re: My take on the diff-optimizations-bytes branch

2011-01-23 Thread Johan Corveleyn
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:
 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

Ok, I finished the exercise :-).

- File info structures on stack instead of on heap: committed in
r1060625. Gives 10% boost.

- chunky operation for multiple sources: committed in r1062532. Gives
~250% boost on my machine.

For suffix scanning, I made a couple of changes to the logic you sent
in your patch (as you remember, there was a problem with the
implementation of suffix scanning [1]). It was (in
scan_backward_fast):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
+      && (file[1].curp - sizeof(apr_size_t) < file[1].curp))
+{
+  do
+{
+  file[0].curp -= sizeof(apr_size_t);
+  file[1].curp -= sizeof(apr_size_t);
+}
+  while (   (file[0].curp >= min_curp0)
+         && (file[1].curp >= min_curp1)
+         && (   *(const apr_size_t*)(file[0].curp + 1)
+             == *(const apr_size_t*)(file[1].curp + 1)));
+
+  file[0].curp += sizeof(apr_size_t);
+  file[1].curp += sizeof(apr_size_t);
+}
]]]

I changed this to (the N-file equivalent of):
[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+      && (file[1].curp - sizeof(apr_size_t) >= min_curp1))
+{
+  do
+{
+  file[0].curp -= sizeof(apr_size_t);
+  file[1].curp -= sizeof(apr_size_t);
+}
+  while (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+         && (file[1].curp - sizeof(apr_size_t) >= min_curp1)
+         && (   *(const apr_size_t*)(file[0].curp + 1)
+             == *(const apr_size_t*)(file[1].curp + 1)));
+
+  file[0].curp += sizeof(apr_size_t);
+  file[1].curp += sizeof(apr_size_t);
+}
]]]

(So I changed the if-comparisons before the do-while-loop to make sure
there is room to subtract a sizeof(apr_size_t) (the original
comparison didn't really make sense to me), and changed the
comparisons with min_curpX in the while condition to likewise check
whether there is still room to subtract a sizeof(apr_size_t).)

After that, all tests went fine.


Now, you suggested in a previous post in this thread that hardcoding
file_len to 2 should speed it up with up to 50%. And you are correct
(on my machine, it's sped up by ~40% if I replace all file_len's in
find_identical_* with 2).

So, that leads me to conclude that I should probably throw away the
generic code, and replace it with 3 variants: find_identical_*2,
find_identical_*3 and find_identical_*4 (once I write *2, the rest
should be mostly copy-n-paste). As you said, this should also vastly
simplify the code (drop the for loops, simpler conditionals, ...).

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0022.shtml


Re: My take on the diff-optimizations-bytes branch

2011-01-23 Thread Johan Corveleyn
On Sun, Jan 23, 2011 at 10:46 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:
 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

 Ok, I finished the exercise :-).

 - File info structures on stack instead of on heap: committed in
 r1060625. Gives 10% boost.

 - chunky operation for multiple sources: committed in r1062532. Gives
 ~250% boost on my machine.

Oops, as some people pointed out on IRC, that should be ~60% (I meant
2.5 times faster, going from ~50s to ~20s, so that's about ~60% less
time spent in datasources_open).

-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-22 Thread Johan Corveleyn
On Wed, Jan 19, 2011 at 1:19 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 18.01.2011 12:56, Johan Corveleyn wrote:

 On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn jcor...@gmail.com
  wrote:

 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de  wrote:

 [ ... snip ... ]

 But I think, the stack variable is certainly helpful
 and easy to do.

 Ok, I've done this (locally, still have to clean up a little and then
 commit). It gives a nice 10% speedup of prefix/suffix scanning, which
 is great.

 I have a couple of small questions though, about other small
 optimizations you did:

 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c  (working copy)

 ...

 @@ -258,10 +259,10 @@
      for (i = 0; i < files_len; i++)                     \
      {                                                   \
 -      if (all_files[i]->curp < all_files[i]->endp - 1)  \
 -        all_files[i]->curp++;                           \
 +      if (all_files[i].curp + 1 < all_files[i].endp)    \
 +        all_files[i].curp++;                            \

 Just curious: why did you change that condition (from 'X < Y - 1' to
 'X + 1 < Y')?

 You mention in your log message: "minor re-arrangements in arithmetic
 expressions to maximize reuse of results (e.g. in
 INCREMENT_POINTERS)."

 Why does this help, and what's the potential impact?

 There is a hidden common sub-expression.
 Variant 1 in pseudo-code:

 lhs = all_files[i].curp;
 temp1 = all_files[i].endp;
 rhs = temp1 - 1;
 if (lhs < rhs)
 {
  temp2 = lhs + 1;
  all_files[i].curp = temp2;
 }

 Variant 2:

 lhs = all_files[i].curp;
 temp = lhs + 1;
 rhs = all_files[i].endp;
 if (lhs < rhs)
  all_files[i].curp = temp;

 The problem is that the compiler cannot figure that out
 on its own because (x+1) and (y-1) don't overflow / wrap
 around for the same values. In particular 0 < (0-1) is true
 and (0+1) < 0 is false in unsigned arithmetic.

Thanks for the explanation. So, I understand why this should be more
efficient, except ... that it's not, for some reason. At least not
after testing it on my x86 machine.

It's about 1% slower, tested with two test files generated with the
gen-big-files.py script (sent in the other thread), with 1,000,000
prefix and suffix lines, and 500 mismatching lines in the middle:

$ ./gen-big-files.py -1 big1.txt -2 big2.txt -p 100 -s 100

So for now, I didn't commit this change. Maybe it's better on other
platforms, so it may still make sense to do it, but then again, at
around 1% it's probably not that important ...


 Second question:

 +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
 +                           apr_off_t *prefix_lines,
 +                           struct file_info file[], apr_size_t file_len,
 +                           apr_pool_t *pool)
 +{
 +  svn_boolean_t had_cr = FALSE;
    svn_boolean_t is_match, reached_all_eof;
    apr_size_t i;
 +  apr_off_t lines = 0;

    *prefix_lines = 0;
     for (i = 1, is_match = TRUE; i < file_len; i++)
  -    is_match = is_match && *file[0]->curp == *file[i]->curp;
  +    is_match = is_match && *file[0].curp == *file[i].curp;
 +
    while (is_match)
      {
        /* ### TODO: see if we can take advantage of
           diff options like ignore_eol_style or ignore_space. */
        /* check for eol, and count */
 -      if (*file[0]->curp == '\r')
 +      if (*file[0].curp == '\r')
          {
 -          (*prefix_lines)++;
 +          lines++;

 Here you added a local variable 'lines', which is incremented in the
 function, and copied to *prefix_lines at the end (you do the same in
 fine_identical_prefix_fast). The original code just sets and
 increments *prefix_lines directly. So, out of curiousness: why does
 this help, and how much?

 Incrementing the temporary variable requires one
 indirection less than the original code and uses only
 one instead of two registers (in typical cases).

 The 'how much?' part depends on compiler / optimizer
 and even more on the target platform (x86 has very
 few registers to keep data in flight; x64 has about twice
 as many).

Ok, this apparently sped it up for around 1% on my machine. So I
committed this in r1062275.

Thanks.
Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-18 Thread Johan Corveleyn
On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:

[ ... snip ... ]

 But I think, the stack variable is certainly helpful
 and easy to do.

Ok, I've done this (locally, still have to clean up a little and then
commit). It gives a nice 10% speedup of prefix/suffix scanning, which
is great.

I have a couple of small questions though, about other small
optimizations you did:

 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c(revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c(working copy)
...
 @@ -258,10 +259,10 @@
                                                          \
      for (i = 0; i < files_len; i++)                     \
      {                                                   \
 -      if (all_files[i]->curp < all_files[i]->endp - 1)  \
 -        all_files[i]->curp++;                           \
 +      if (all_files[i].curp + 1 < all_files[i].endp)    \
 +        all_files[i].curp++;                            \

Just curious: why did you change that condition (from 'X < Y - 1' to
'X + 1 < Y')?

You mention in your log message: "minor re-arrangements in arithmetic
expressions to maximize reuse of results (e.g. in
INCREMENT_POINTERS)."

Why does this help, and what's the potential impact?


Second question:

 +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
 +   apr_off_t *prefix_lines,
 +   struct file_info file[], apr_size_t file_len,
 +   apr_pool_t *pool)
 +{
 +  svn_boolean_t had_cr = FALSE;
svn_boolean_t is_match, reached_all_eof;
apr_size_t i;
 +  apr_off_t lines = 0;

*prefix_lines = 0;
 for (i = 1, is_match = TRUE; i < file_len; i++)
  -    is_match = is_match && *file[0]->curp == *file[i]->curp;
  +    is_match = is_match && *file[0].curp == *file[i].curp;
 +
while (is_match)
  {
/* ### TODO: see if we can take advantage of
   diff options like ignore_eol_style or ignore_space. */
/* check for eol, and count */
  -  if (*file[0]->curp == '\r')
 +  if (*file[0].curp == '\r')
  {
 -  (*prefix_lines)++;
 +  lines++;

Here you added a local variable 'lines', which is incremented in the
function, and copied to *prefix_lines at the end (you do the same in
find_identical_prefix_fast). The original code just sets and
increments *prefix_lines directly. So, out of curiosity: why does
this help, and how much?

Thanks,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-18 Thread Stefan Fuhrmann

On 18.01.2011 12:56, Johan Corveleyn wrote:

On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn jcor...@gmail.com wrote:

On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de  wrote:

[ ... snip ... ]


But I think, the stack variable is certainly helpful
and easy to do.

Ok, I've done this (locally, still have to clean up a little and then
commit). It gives a nice 10% speedup of prefix/suffix scanning, which
is great.

I have a couple of small questions though, about other small
optimizations you did:


Index: subversion/libsvn_diff/diff_file.c
===
--- subversion/libsvn_diff/diff_file.c  (revision 1054044)
+++ subversion/libsvn_diff/diff_file.c  (working copy)

...

@@ -258,10 +259,10 @@
                                                         \
     for (i = 0; i < files_len; i++)                     \
     {                                                   \
-      if (all_files[i]->curp < all_files[i]->endp - 1)  \
-        all_files[i]->curp++;                           \
+      if (all_files[i].curp + 1 < all_files[i].endp)    \
+        all_files[i].curp++;                            \

Just curious: why did you change that condition (from 'X < Y - 1' to
'X + 1 < Y')?

You mention in your log message: "minor re-arrangements in arithmetic
expressions to maximize reuse of results (e.g. in
INCREMENT_POINTERS)."

Why does this help, and what's the potential impact?

There is a hidden common sub-expression.
Variant 1 in pseudo-code:

lhs = all_files[i].curp;
temp1 = all_files[i].endp;
rhs = temp1 - 1;
if (lhs < rhs)
{
  temp2 = lhs + 1;
  all_files[i].curp = temp2;
}

Variant 2:

lhs = all_files[i].curp;
temp = lhs + 1;
rhs = all_files[i].endp;
if (lhs < rhs)
  all_files[i].curp = temp;

The problem is that the compiler cannot figure that out
on its own because (x+1) and (y-1) don't overflow / wrap
around for the same values. In particular 0 < (0-1) is true
and (0+1) < 0 is false in unsigned arithmetic.



Second question:


+find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
+   apr_off_t *prefix_lines,
+   struct file_info file[], apr_size_t file_len,
+   apr_pool_t *pool)
+{
+  svn_boolean_t had_cr = FALSE;
svn_boolean_t is_match, reached_all_eof;
apr_size_t i;
+  apr_off_t lines = 0;

*prefix_lines = 0;
for (i = 1, is_match = TRUE; i < file_len; i++)
-    is_match = is_match && *file[0]->curp == *file[i]->curp;
+    is_match = is_match && *file[0].curp == *file[i].curp;
+
while (is_match)
  {
/* ### TODO: see if we can take advantage of
   diff options like ignore_eol_style or ignore_space. */
/* check for eol, and count */
-  if (*file[0]->curp == '\r')
+  if (*file[0].curp == '\r')
  {
-  (*prefix_lines)++;
+  lines++;

Here you added a local variable 'lines', which is incremented in the
function, and copied to *prefix_lines at the end (you do the same in
find_identical_prefix_fast). The original code just sets and
increments *prefix_lines directly. So, out of curiosity: why does
this help, and how much?

Incrementing the temporary variable requires one
indirection less than the original code and uses only
one instead of two registers (in typical cases).

The 'how much?' part depends on compiler / optimizer
and even more on the target platform (x86 has very
few registers to keep data in flight; x64 has about twice
as many).

So, both changes are more or less general coding
patterns that will improve performance somewhat
without compromising readability.

-- Stefan^2.




Re: My take on the diff-optimizations-bytes branch

2011-01-16 Thread Johan Corveleyn
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:

 On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com
  wrote:

 On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com
  wrote:
 For now, some feedback on the rest of the patch:

 [[[

 -          DECREMENT_POINTERS(file, file_len, pool);
 -        }
 +        DECREMENT_POINTERS(file, file_len, pool);
      }

 +  return SVN_NO_ERROR;
 +}
 +
 +/* Find the prefix which is identical between all elements of the FILE
 array.
 + * Return the number of prefix lines in PREFIX_LINES.  REACHED_ONE_EOF
 will be
 + * set to TRUE if one of the FILEs reached its end while scanning
 prefix,
 + * i.e. at least one file consisted entirely of prefix.  Otherwise,
 + * REACHED_ONE_EOF is set to FALSE.
 + *
 + * After this function is finished, the buffers, chunks, curp's and
 endp's
 + * of the FILEs are set to point at the first byte after the prefix. */
 +static svn_error_t *
 +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t
 *prefix_lines,
 +                      struct file_info file[], apr_size_t file_len,
 +                      apr_pool_t *pool)
 +{
 +  if (file_len == 2)
 +    SVN_ERR(find_identical_prefix_fast(reached_one_eof, prefix_lines,
 +                                       file, pool));
 +  else
 +    SVN_ERR(find_identical_prefix_slow(reached_one_eof, prefix_lines,
 +                                       file, file_len, pool));
 +

 Small problem here (not really visible in practice, but potential
 change in behavior none the less): if either of the above function
 calls exited early because of reached_all_eof, this function should
 also exit early.

 Yep, that is just an artifact caused by moving code
 into a sub-function.

Ok, that's no problem anymore, since I removed the early-exit in
r1057435. It was actually not correct, and was the main reason why
some tests kept failing when I ported it to diff3 (see that revisions
commit message for more details).

[ ... snip ... ]

 Also, it would really help if you could split up this patch (though
 this was certainly a good way to try it out and give me an idea of the
 possibilities).

 Giving you some ideas was the main point of it.

 If I find some time, I will try to reduce the patch
 (no fast / slow separation). But since the heap
 to stack conversion spreads across all of the
 code, it is hard to split the patch.

 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

Ok, I'll give it a go one of the coming days, now that I've got
diff3/diff4 working, and I've got a suitable
big-files-generating-script for creating reproducible test-files.

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

 The great advantage of the file_len==2 case is
 that this is the only time-critical operation (a
 true merge over long files is rare). Also, many
 constructs collapse if the count is fixed to 2:

 for (i = 1, is_match = TRUE; i < file_len; i++)
   is_match = is_match && *file[0].curp == *file[i].curp;
 while(is_match)
 {
 ...
   for (i = 1, is_match = TRUE; i < file_len; i++)
     is_match = is_match && *file[0].curp == *file[i].curp;
 }

 becomes

 while(*file[0].curp == *file[1].curp)
 {
 }

Hm, that's true, but I'm not sure that will have big impact (but like
you say, we'll need to do more tests). And now that I've read about a
really bad merge-case on the users-list [1], I'm thinking that diff3
should probably also get all the perf-improvement that we can give it.

Actually, what's a little bit troubling is that there are currently
only 3 possible file_len's, of which only 2 are used in practice:
diff2 and diff3 (diff4 is not used in svn core, only in
tools/diff/diff4). So, if we make a slow and a fast version, we could
just as well make a XXX2 and an XXX3 version explicitly, and dispense
with the need to pass arrays and array lengths. Hmmm... generic (but
more complex) code vs. a couple of specialized versions.


 But basically, we will need to run some more tests.

 -- Stefan^2.



Cheers,
-- 
Johan

[1] http://svn.haxx.se/users/archive-2011-01/0245.shtml


Re: My take on the diff-optimizations-bytes branch

2011-01-07 Thread Stefan Fuhrmann

On 03.01.2011 13:16, Johan Corveleyn wrote:

On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de  wrote:

On 03.01.2011 02:14, Johan Corveleyn wrote:

On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com
  wrote:

[... snip ...]


And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.

On x64, datasource_open() got about 10x faster
but my test-case was the TSVN changelog where
we (mainly) insert at the beginning instead of
close to end of the file. So, one broken scan
direction seems completely plausible ;)

Strange, if you mainly insert at the beginning, it should be mostly
broken, because then it's mostly identical suffix (the part which is
broken). But maybe it's only broken under certain edge conditions, so
that could still explain it...

It has also been a relatively short file (100k)
and therefore, maybe, only a single hunk.

Maybe it's best to refer to datasource*s*_open(), as opposed to the
original datasource_open(). In the original diff algorithm,
datasources were opened each one by one (and then extracting tokens,
inserting them into token tree, and then move on to the next
datasource). In diff-optimizations-bytes branch, I introduced
datasource*s*_open, to open them together, in order to chop off the
identical prefix/suffix. It's probably not a good function name,
because it's so similar to the old one (maybe datasource_open_all or
something would be better). But OTOH: when I started with this, I
thought I could remove/deprecate the datasource_open, once all diffN
algorithms would use datasource*s*_open. Oh well ...

I didn't even notice that there were two such
functions and simply worked on the one that
consumed a lot of CPU time ;)

The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...

Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.

One way to get closer to the problem:

* create a backup of the current rename scan_backward_fast code
* copy scan_backward_slow to scan_backward_fast
* replace all file_len with a hard-coded 2
  ->  can give a 50% boost depending on optimizer cleverness
* step-by-step apply changes from the old scan_backward_fast code

Ok, if I have some time tonight, I will try to get closer ...


For now, some feedback on the rest of the patch:

[[[

[... snip ...]


+  if (had_cr)
+    {
+      /* Check if we ended in the middle of a \r\n for one file, but \r for
+         another. If so, back up one byte, so the next loop will back up
+         the entire line. Also decrement *prefix_lines, since we counted one
+         too many for the \r. */
+      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))

Here (or below DECREMENT, but within this same if condition) you need:
  (*prefix_lines)--;

just like in the original (and *_slow) case. As is explained in the
comment above: if we are here, we actually counted a full prefix line
too much, because we already counted one for a matching '\r', but the
character after that was '\n' in one file, but a different character
in the other. So the eol-termination is different, so this is a
non-matching line which needs to be fully rolled back (hence also the
extra DECREMENT here).

You are right. I had some intermediate code in
place where the increment would happen later
(thus sparing the extra decrement).

I'm not even sure the original code is correct at all.
For instance, it does not handle '\r' (Mac style) EOLs
properly. Not really sure how to fix that.

Are you sure it's not correct for Mac style EOLs? Can you give an example?

You are correct, the extra check for '\n' before decrementing
the line count fixes up the line-count for WIN EOLs only.

find_prefix (a\rb, a\rc) -> prefix_lines = 1, had_cr = TRUE
find_prefix (a\nb, a\nc) -> prefix_lines = 1, had_cr = FALSE

The other case that I was thinking about seems to be unsupported
by SVN anyways:

find_prefix (a\r\nb, a\rc) -> prefix_lines = 0, had_cr = TRUE
find_prefix (a\n\rb, a\nc) -> prefix_lines = 1, had_cr = FALSE

So, you are right, I'm wrong ;)


I thought I made it work correctly with '\r' EOLs (the very first
iterations of this (when I was still sending it as patches to the
dev-list) only worked for \r\n and \n, but somewhere along the way, I
made it handle \r EOLs as well). That's what the entire block below
/* check for eol, and count */ is for:
- if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if had_cr == FALSE,
because otherwise it's the second half of a \r\n EOL, 

Re: My take on the diff-optimizations-bytes branch

2011-01-07 Thread Stefan Fuhrmann

On 03.01.2011 02:14, Johan Corveleyn wrote:

On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com  wrote:

On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleynjcor...@gmail.com  wrote:

On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de  wrote:

Hi Johan,

Thursday night I did something stupid and had a look at  how
svn blame could be made faster based on the HEAD code in
your branch.

One night and most of the following day later, I think I made it
a few percent faster now. Some of the code I committed directly
to /trunk and you need to pull these changes into your branch
to compile the attached patch.

Feel free to test it, take it apart and morph it any way you like --
I know the patch isn't very pretty in the first place. I tested the
patch on x64 LINUX and would like to hear whether it at least
somewhat improved performance on your system for your
svn blame config.xml use-case.

-- Stefan^2.

[[[
Speed-up of datasource_open() and its sub-routines
by a series of optimizations:

* allocate the file[] array on stack instead of heap
  (all members become directly addressable through
  the stack pointer because all static sub-functions
  will actually be in-lined)
* minor re-arrangements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
* move hot spots to separate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
* use separate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS & friends
  that frequently
* scan / compare strings in machine-word granularity
  instead of bytes
]]]

Hi Stefan,

Thanks for taking a look. I really appreciate it.

When I saw your first couple of commits, to the adler32 stuff, I
already thought: hmmm, he's up to something :-). And after I saw your
change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
was thinking: hey, I could really use something like that in the
prefix/suffix scanning. Nice ... :-)

(I had some trouble applying your patch. It doesn't apply cleanly
anymore to HEAD of the branch (but most hunks were applied correctly,
and I could manually change the rest, so no problem)).

However, first tests are not so great. In fact, it's about 25% slower
(when blaming my settings.xml file with 2200 revisions, it's spending
about 90 seconds in diff, vs. around 72 with HEAD of
diff-optimizations-bytes).

Analyzing further, I can see it's spending significantly less time in
prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
(which is where the original algorithm gets the tokens/lines and
inserts them into the token tree). This tells me it's not working
correctly: either prefix/suffix scanning fails too soon, so there's
much more left for the regular algorithm. Or it's just not functioning
correctly.

Looking at the result of the blame operation, it seems it's the
latter. The final result of the blame is not correct anymore.

I'll try to look some more into it, to see what's going wrong. Maybe
you can also see it with a simple diff of a large file ... (for a good
example in svn's own codebase, you could try
subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
revisions)).

Ok, by eliminating parts of the new stuff, I found out the problem is
in scan_backward_fast (part of suffix scanning). If I bypass that, and
always take scan_backward_slow, it works.

And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.

The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...

Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.

For now, some feedback on the rest of the patch:

[[[


-  DECREMENT_POINTERS(file, file_len, pool);
-}
+DECREMENT_POINTERS(file, file_len, pool);
  }

+  return SVN_NO_ERROR;
+}
+
+/* Find the prefix which is identical between all elements of the FILE array.
+ * Return the number of prefix lines in PREFIX_LINES.  REACHED_ONE_EOF will be
+ * set to TRUE if one of the FILEs reached its end while scanning prefix,
+ * i.e. at least one file consisted entirely of prefix.  Otherwise,
+ * REACHED_ONE_EOF is set to FALSE.
+ *
+ * After this function is finished, the buffers, chunks, curp's and endp's
+ * of the FILEs are set to point at the first byte after the prefix. */
+static svn_error_t *
+find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines,
+  struct 

Re: My take on the diff-optimizations-bytes branch

2011-01-03 Thread Stefan Fuhrmann

On 03.01.2011 02:14, Johan Corveleyn wrote:

On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com  wrote:

On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleynjcor...@gmail.com  wrote:

On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de  wrote:

Hi Johan,

Thursday night I did something stupid and had a look at  how
svn blame could be made faster based on the HEAD code in
your branch.

One night and most of the following day later, I think I made it
a few percent faster now. Some of the code I committed directly
to /trunk and you need to pull these changes into your branch
to compile the attached patch.

Feel free to test it, take it apart and morph it any way you like --
I know the patch isn't very pretty in the first place. I tested the
patch on x64 LINUX and would like to hear whether it at least
somewhat improved performance on your system for your
svn blame config.xml use-case.

-- Stefan^2.

[[[
Speed-up of datasource_open() and its sub-routines
by a series of optimizations:

* allocate the file[] array on stack instead of heap
  (all members become directly addressable through
  the stack pointer because all static sub-functions
  will actually be in-lined)
* minor re-arrangements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
* move hot spots to separate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
* use separate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS & friends
  that frequently
* scan / compare strings in machine-word granularity
  instead of bytes
]]]

Hi Stefan,

Thanks for taking a look. I really appreciate it.

When I saw your first couple of commits, to the adler32 stuff, I
already thought: hmmm, he's up to something :-). And after I saw your
change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
was thinking: hey, I could really use something like that in the
prefix/suffix scanning. Nice ... :-)

(I had some trouble applying your patch. It doesn't apply cleanly
anymore to HEAD of the branch (but most hunks were applied correctly,
and I could manually change the rest, so no problem)).

However, first tests are not so great. In fact, it's about 25% slower
(when blaming my settings.xml file with 2200 revisions, it's spending
about 90 seconds in diff, vs. around 72 with HEAD of
diff-optimizations-bytes).

Analyzing further, I can see it's spending significantly less time in
prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
(which is where the original algorithm gets the tokens/lines and
inserts them into the token tree). This tells me it's not working
correctly: either prefix/suffix scanning fails too soon, so there's
much more left for the regular algorithm. Or it's just not functioning
correctly.

Looking at the result of the blame operation, it seems it's the
latter. The final result of the blame is not correct anymore.

I'll try to look some more into it, to see what's going wrong. Maybe
you can also see it with a simple diff of a large file ... (for a good
example in svn's own codebase, you could try
subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
revisions)).

Ok, by eliminating parts of the new stuff, I found out the problem is
in scan_backward_fast (part of suffix scanning). If I bypass that, and
always take scan_backward_slow, it works.

And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.

On x64, datasource_open() got about 10x faster
but my test-case was the TSVN changelog where
we (mainly) insert at the beginning instead of
close to end of the file. So, one broken scan
direction seems completely plausible ;)

The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...

Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.

One way to get closer to the problem:

* create a backup of the current rename scan_backward_fast code
* copy scan_backward_slow to scan_backward_fast
* replace all file_len with a hard-coded 2
  -> can give a 50% boost depending on optimizer cleverness
* step-by-step apply changes from the old scan_backward_fast code


For now, some feedback on the rest of the patch:

[[[

Index: subversion/libsvn_diff/diff_file.c
===
--- subversion/libsvn_diff/diff_file.c  (revision 1054044)
+++ subversion/libsvn_diff/diff_file.c  (working copy)

  ... 

Re: My take on the diff-optimizations-bytes branch

2011-01-03 Thread Johan Corveleyn
On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:

 On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com
  wrote:

[... snip ...]

 And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
 the normal version. Splitting that up further, it's taking only 28
 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
 some reason, the lcs processing took a little longer, but that's
 probably unrelated (only did a single run)). And that's with
 scan_backward_slow. That's pretty amazing.

 On x64, datasource_open() got about 10x faster
 but my test-case was the TSVN changelog where
 we (mainly) insert at the beginning instead of
 close to end of the file. So, one broken scan
 direction seems completely plausible ;)

Strange, if you mainly insert at the beginning, it should be mostly
broken, because then it's mostly identical suffix (the part which is
broken). But maybe it's only broken under certain edge conditions, so
that could still explain it...

Maybe it's best to refer to datasource*s*_open(), as opposed to the
original datasource_open(). In the original diff algorithm,
datasources were opened each one by one (and then extracting tokens,
inserting them into token tree, and then move on to the next
datasource). In diff-optimizations-bytes branch, I introduced
datasource*s*_open, to open them together, in order to chop off the
identical prefix/suffix. It's probably not a good function name,
because it's so similar to the old one (maybe datasource_open_all or
something would be better). But OTOH: when I started with this, I
thought I could remove/deprecate the datasource_open, once all diffN
algorithms would use datasource*s*_open. Oh well ...

 The code of scan_backward_fast is pretty difficult to understand for
 me. So I'm not sure if I can spot the problem. I'll continue staring
 at it for a little while ...

 Ok, I can't find it right now. There must be some functional
 difference between scan_backward_fast and scan_backward_slow.

 One way to get closer to the problem:

 * create a backup of the current rename scan_backward_fast code
 * copy scan_backward_slow to scan_backward_fast
 * replace all file_len with a hard-coded 2
  -> can give a 50% boost depending on optimizer cleverness
 * step-by-step apply changes from the old scan_backward_fast code

Ok, if I have some time tonight, I will try to get closer ...

 For now, some feedback on the rest of the patch:

 [[[

[... snip ...]

 +  if (had_cr)
 +    {
 +      /* Check if we ended in the middle of a \r\n for one file, but \r for
 +         another. If so, back up one byte, so the next loop will back up
 +         the entire line. Also decrement *prefix_lines, since we counted one
 +         too many for the \r. */
 +      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))

 Here (or below DECREMENT, but within this same if condition) you need:
          (*prefix_lines)--;

 just like in the original (and *_slow) case. As is explained in the
 comment above: if we are here, we actually counted a full prefix line
 too much, because we already counted one for a matching '\r', but the
 character after that was '\n' in one file, but a different character
 in the other. So the eol-termination is different, so this is a
 non-matching line which needs to be fully rolled back (hence also the
 extra DECREMENT here).

 I'm not even sure the original code is correct at all.
 For instance, it does not handle '\r' (Mac style) EOLs
 properly. Not really sure how to fix that.

Are you sure it's not correct for Mac style EOLs? Can you give an example?

I thought I made it work correctly with '\r' EOLs (the very first
iterations of this (when I was still sending it as patches to the
dev-list) only worked for \r\n and \n, but somewhere along the way, I
made it handle \r EOLs as well). That's what the entire block below
/* check for eol, and count */ is for:
- if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if had_cr == FALSE,
because otherwise it's the second half of a \r\n EOL, and we already
counted it for the \r.
- all the rest: don't care, just move on (setting had_cr = FALSE).
- In the special case where prefix scanning ended with a mismatch
between a '\r\n' and '\r<some other character>', decrement prefix_line
and go back one byte, because this line wasn't identical after all.

(my older versions just counted the number of \n's, which is ok for
both \n and \r\n).

The only reason to look at EOLs is to count the number of prefix_lines
(and to rollback the last non-matching line until the previous EOL,
but that's done by going back until finding either a \r or an \n). All
the rest is just comparing bytes.

(note that it doesn't support ignoring EOL-style differences (see the
comment /* ### TODO: see if we can take advantage of diff options
like ignore_eol_style or ignore_space. */). Maybe 

Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressable through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arrangements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to separate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use separate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS & friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

Hi Stefan,

Thanks for taking a look. I really appreciate it.

When I saw your first couple of commits, to the adler32 stuff, I
already thought: hmmm, he's up to something :-). And after I saw your
change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
was thinking: hey, I could really use something like that in the
prefix/suffix scanning. Nice ... :-)

(I had some trouble applying your patch. It doesn't apply cleanly
anymore to HEAD of the branch (but most hunks were applied correctly,
and I could manually change the rest, so no problem)).

However, first tests are not so great. In fact, it's about 25% slower
(when blaming my settings.xml file with 2200 revisions, it's spending
about 90 seconds in diff, vs. around 72 with HEAD of
diff-optimizations-bytes).

Analyzing further, I can see it's spending significantly less time in
prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
(which is where the original algorithm gets the tokens/lines and
inserts them into the token tree). This tells me it's not working
correctly: either prefix/suffix scanning fails too soon, so there's
much more left for the regular algorithm. Or it's just not functioning
correctly.

Looking at the result of the blame operation, it seems it's the
latter. The final result of the blame is not correct anymore.

I'll try to look some more into it, to see what's going wrong. Maybe
you can also see it with a simple diff of a large file ... (for a good
example in svn's own codebase, you could try
subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
revisions)).

Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressable through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arrangements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to separate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use separate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS & friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

 Hi Stefan,

 Thanks for taking a look. I really appreciate it.

 When I saw your first couple of commits, to the adler32 stuff, I
 already thought: hmmm, he's up to something :-). And after I saw your
 change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
 was thinking: hey, I could really use something like that in the
 prefix/suffix scanning. Nice ... :-)

 (I had some trouble applying your patch. It doesn't apply cleanly
 anymore to HEAD of the branch (but most hunks were applied correctly,
 and I could manually change the rest, so no problem)).

 However, first tests are not so great. In fact, it's about 25% slower
 (when blaming my settings.xml file with 2200 revisions, it's spending
 about 90 seconds in diff, vs. around 72 with HEAD of
 diff-optimizations-bytes).

 Analyzing further, I can see it's spending significantly less time in
 prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
 (which is where the original algorithm gets the tokens/lines and
 inserts them into the token tree). This tells me it's not working
 correctly: either prefix/suffix scanning fails too soon, so there's
 much more left for the regular algorithm. Or it's just not functioning
 correctly.

 Looking at the result of the blame operation, it seems it's the
 latter. The final result of the blame is not correct anymore.

 I'll try to look some more into it, to see what's going wrong. Maybe
 you can also see it with a simple diff of a large file ... (for a good
 example in svn's own codebase, you could try
 subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
 revisions)).

Ok, by eliminating parts of the new stuff, I found out the problem is
in scan_backward_fast (part of suffix scanning). If I bypass that, and
always take scan_backward_slow, it works.

And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.

The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...

Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressable through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arrangements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to separate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use separate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS & friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

 Hi Stefan,

 Thanks for taking a look. I really appreciate it.

 When I saw your first couple of commits, to the adler32 stuff, I
 already thought: hmmm, he's up to something :-). And after I saw your
 change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
 was thinking: hey, I could really use something like that in the
 prefix/suffix scanning. Nice ... :-)

 (I had some trouble applying your patch. It doesn't apply cleanly
 anymore to HEAD of the branch (but most hunks were applied correctly,
 and I could manually change the rest, so no problem)).

 However, first tests are not so great. In fact, it's about 25% slower
 (when blaming my settings.xml file with 2200 revisions, it's spending
 about 90 seconds in diff, vs. around 72 with HEAD of
 diff-optimizations-bytes).

 Analyzing further, I can see it's spending significantly less time in
 prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
 (which is where the original algorithm gets the tokens/lines and
 inserts them into the token tree). This tells me it's not working
 correctly: either prefix/suffix scanning fails too soon, so there's
 much more left for the regular algorithm. Or it's just not functioning
 correctly.

 Looking at the result of the blame operation, it seems it's the
 latter. The final result of the blame is not correct anymore.

 I'll try to look some more into it, to see what's going wrong. Maybe
 you can also see it with a simple diff of a large file ... (for a good
 example in svn's own codebase, you could try
 subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
 revisions)).

 Ok, by eliminating parts of the new stuff, I found out the problem is
 in scan_backward_fast (part of suffix scanning). If I bypass that, and
 always take scan_backward_slow, it works.

 And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
 the normal version. Splitting that up further, it's taking only 28
 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
 some reason, the lcs processing took a little longer, but that's
 probably unrelated (only did a single run)). And that's with
 scan_backward_slow. That's pretty amazing.

 The code of scan_backward_fast is pretty difficult to understand for
 me. So I'm not sure if I can spot the problem. I'll continue staring
 at it for a little while ...

Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.

For now, some feedback on the rest of the patch:

[[[
 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c  (working copy)
 ... snip ...
 @@ -363,53 +364,152 @@
  /* Check whether one of the FILEs has its pointers at EOF (this is the case if
   * one of them has curp == endp (this can only happen at the last chunk)) */
  static svn_boolean_t
 -is_one_at_eof(struct file_info *file[], apr_size_t file_len)
 +is_one_at_eof(struct file_info file[], apr_size_t file_len)
  {
    apr_size_t i;

    for (i = 0; i < file_len; i++)
 -    if (file[i]->curp == file[i]->endp)
 +    if (file[i].curp == file[i].endp)
        return TRUE;

    return FALSE;