On 23.01.2011 22:46, Johan Corveleyn wrote:
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann <stefanfuhrm...@alice-dsl.de> wrote:
On 03.01.2011 02:14, Johan Corveleyn wrote:
It would be interesting to see where the biggest gains
are coming from (I'm guessing from the "per-machine-word"
reading/comparing; I'd like to try that first, maybe together with
allocating the file[] array on the stack; I'd like to avoid
special-casing file_len==2 and splitting up functions into *_slow and
*_fast, because it makes the code a lot more difficult imho (so I'd only
like to do that if it's a significant contributor)). But if you want
to leave that as an exercise for the reader, that's fine as well :-).
Exercise is certainly not a bad thing ;)

But I think the stack variable is certainly helpful
and easy to do. While "chunky" operation gives
a *big* boost, it is much more difficult to code if
you need to compare multiple sources. It should
be possible, though.
Ok, I finished the exercise :-).

- File info structures on stack instead of on heap: committed in
r1060625. Gives 10% boost.
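
(Roughly, that change amounts to something like the following untested
sketch; the names and the memset are illustrative only:)

[[[
  /* Before (sketch): per-call allocation from the pool. */
  struct file_info *file = apr_pcalloc(pool, file_len * sizeof(*file));

  /* After (sketch): a small automatic array on the stack; file_len is
     at most 4 (diff4), so the fixed size always suffices. */
  struct file_info file[4];
  memset(file, 0, sizeof(file));
]]]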

- chunky operation for multiple sources: committed in r1062532. Gives
~250% boost on my machine.
On 64 bit machines, it should give another 50 to 100% throughput.
I think that kind of speed-up in an important function justifies the
somewhat unfavorable #if sections, mask voodoo etc.
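
(For readers wondering about the "mask voodoo": it is the classic trick for
testing all bytes of a machine word at once. A contains_eol() check could
look roughly like the following untested sketch; the mask names are
illustrative and the committed code may differ in details:)

[[[
/* 0x7f7f7f..., 0x808080..., and words with '\r' resp. '\n' in every byte. */
#define LOWER_7BITS_SET ((apr_size_t)~0 / 255 * 0x7f)
#define BIT_7_SET       ((apr_size_t)~0 / 255 * 0x80)
#define R_MASK          ((apr_size_t)~0 / 255 * '\r')
#define N_MASK          ((apr_size_t)~0 / 255 * '\n')

/* Return TRUE iff any byte of CHUNK is '\r' or '\n'. */
static svn_boolean_t
contains_eol(apr_size_t chunk)
{
  apr_size_t r_test = chunk ^ R_MASK;   /* zero byte where CHUNK had '\r' */
  apr_size_t n_test = chunk ^ N_MASK;   /* zero byte where CHUNK had '\n' */

  /* After this, bit 7 of a byte is clear iff that byte of the test was 0. */
  r_test |= (r_test & LOWER_7BITS_SET) + LOWER_7BITS_SET;
  n_test |= (n_test & LOWER_7BITS_SET) + LOWER_7BITS_SET;

  return (r_test & n_test & BIT_7_SET) != BIT_7_SET;
}
]]]
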
For suffix scanning, I made a couple of changes to the logic you sent
in your patch (as you remember, there was a problem with the
implementation of suffix scanning [1]). It was (in
scan_backward_fast):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+      if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
+          && (file[1].curp - sizeof(apr_size_t) < file[1].curp))

Args! Copy'n'pasto.

+        {
+          do
+            {
+              file[0].curp -= sizeof(apr_size_t);
+              file[1].curp -= sizeof(apr_size_t);
+            }
+          while (   (file[0].curp >= min_curp0)
+                 && (file[1].curp >= min_curp1)
+                 && (   *(const apr_size_t*)(file[0].curp + 1)
+                     == *(const apr_size_t*)(file[1].curp + 1)));
+
+          file[0].curp += sizeof(apr_size_t);
+          file[1].curp += sizeof(apr_size_t);
+        }
]]]
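
(The "copy'n'pasto" refers to the if condition above: as written it compares
curp - sizeof(apr_size_t) against curp itself, which is always true short of
pointer wrap-around. Presumably min_curp0 / min_curp1 were intended, roughly
along these lines; untested sketch:)

[[[
+      if (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+          && (file[1].curp - sizeof(apr_size_t) >= min_curp1))
]]]
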
<snip>
Now, you suggested in a previous post in this thread that hardcoding
file_len to 2 should speed it up by up to 50%. And you are correct
(on my machine, it's sped up by ~40% if I replace all file_len's in
find_identical_* with 2).

So, that leads me to conclude that I should probably throw away the
generic code, and replace it with 3 variants: find_identical_*2,
find_identical_*3 and find_identical_*4 (once I write *2, the rest
should be mostly copy-n-paste). As you said, this should also vastly
simplify the code (drop the for loops, simpler conditionals, ...).
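
(For illustration only: with file_len fixed at 2, the inner comparison in the
chunky prefix scan collapses from a loop into a single test. Untested sketch,
using the variable names from the snippets in this thread:)

[[[
          /* generic, file_len sources */
          for (i = 1, is_match = TRUE; i < file_len; i++)
            is_match = is_match
                       && (   *(const apr_size_t *)file[0].curp
                           == *(const apr_size_t *)file[i].curp);

          /* hardcoded for file_len == 2 */
          is_match = (   *(const apr_size_t *)file[0].curp
                      == *(const apr_size_t *)file[1].curp);
]]]
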
As discussed on IRC yesterday, we should first exploit
further optimization potential in the "general" code, and
only once that is maxed out, use it as a template for
the specific implementations.

And, as promised, here are some ideas on how to get more
speed out of the generic code. Your latest commit:

+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+      /* Skip quickly over the stuff between EOLs. */
+      for (i = 0, can_read_word = TRUE; i < file_len; i++)
+        can_read_word = can_read_word
+                        && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+      while (can_read_word)
+        {
+          for (i = 1, is_match = TRUE; i < file_len; i++)
+            is_match = is_match
+                       && (   *(const apr_size_t *)file[0].curp
+                           == *(const apr_size_t *)file[i].curp);
+
+          if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
+            break;
+
+          for (i = 0; i < file_len; i++)
+            file[i].curp += sizeof(apr_size_t);
+          for (i = 0, can_read_word = TRUE; i < file_len; i++)
+            can_read_word = can_read_word
+                            && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+        }
+
+#endif

could be changed to something like the following. The idea is to compute a
common bound (max_delta) once, up front, so the per-iteration endp checks
disappear, and to load file[0]'s chunk only once per iteration.
Please note that I haven't tested any of this:

/* Determine how far we may advance with chunky ops without reaching
 * endp for any of the files.
 * Signedness is important here if curp gets close to endp.
 */
apr_ssize_t delta;
apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
for (i = 1; i < file_len; i++)
{
    delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
    if (delta < max_delta)
        max_delta = delta;
}

/* the former while() loop */
is_match = TRUE;
for (delta = 0; delta < max_delta; delta += sizeof(apr_size_t))
{
    apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
    if (contains_eol(chunk))
        break;

    for (i = 1; i < file_len; i++)
        if (chunk != *(const apr_size_t *)(file[i].curp + delta))
        {
            is_match = FALSE;
            break;
        }

    /* Break before the loop increment, so delta still refers to the
     * mismatching chunk rather than one chunk past it. */
    if (!is_match)
        break;
}

/* We either found a mismatch or an EOL at or shortly behind curp + delta,
 * or we cannot proceed with chunky ops without exceeding endp.
 * Either way, everything up to curp + delta is equal and not an EOL.
 */
for (i = 0; i < file_len; i++)
    file[i].curp += delta;


-- Stefan^2.
