On 03.01.2011 13:16, Johan Corveleyn wrote:
On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
<stefanfuhrm...@alice-dsl.de>  wrote:
On 03.01.2011 02:14, Johan Corveleyn wrote:
On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn<jcor...@gmail.com>
  wrote:
[... snip ...]

And it's fast too! It's taking only 58 seconds in "diff", vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.
On x64, datasource_open() got about 10x faster
but my test-case was the TSVN changelog where
we (mainly) insert at the beginning instead of
close to end of the file. So, one broken scan
direction seems completely plausible ;)
Strange: if you mainly insert at the beginning, the optimization should be
"mostly broken" for that file, because then most of it is identical suffix
(and the suffix scan is the part that's broken). But maybe it's only broken
under certain edge conditions, so that could still explain it...
It was also a relatively short file (<100k) and therefore, maybe,
contained only a single hunk.
Maybe it's best to refer to "datasource*s*_open()", as opposed to the
original "datasource_open()". In the original diff algorithm, the
datasources were opened one by one: a datasource was opened, its tokens
were extracted and inserted into the token tree, and only then did the
algorithm move on to the next datasource. In the diff-optimizations-bytes
branch, I introduced datasource*s*_open to open them all together, in
order to chop off the identical prefix/suffix. It's probably not a good
function name, because it's so similar to the old one (maybe
"datasource_open_all" or something would be better). But OTOH: when I
started with this, I thought I could remove/deprecate datasource_open
once all diffN algorithms would use datasource*s*_open. Oh well ...
I didn't even notice that there were two such
functions and simply worked on the one that
consumed a lot of CPU time ;)
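
(For illustration only, a minimal standalone sketch of the idea, with
made-up names rather than the actual svn_diff API: the identical prefix
can only be chopped off when both byte streams are open at the same time,
before any tokens are extracted.)

[[[
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the branch's datasources_open(): measure the
   identical prefix byte-by-byte, which is only possible when both
   "datasources" are available at once. */
static size_t
identical_prefix_bytes(const char *a, size_t alen,
                       const char *b, size_t blen)
{
  size_t i = 0;
  size_t limit = alen < blen ? alen : blen;

  while (i < limit && a[i] == b[i])
    i++;
  return i;
}

int main(void)
{
  const char *original = "line1\nline2\nline3\n";
  const char *modified = "line1\nline2 changed\nline3\n";

  /* Only the bytes after the common prefix (rolled back to the last EOL
     in the real code) would still be split into tokens and inserted
     into the token tree. */
  printf("identical prefix: %u bytes\n",
         (unsigned) identical_prefix_bytes(original, strlen(original),
                                           modified, strlen(modified)));
  return 0;
}
]]]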
The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...
Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.
One way to get closer to the problem:

* create a backup of (i.e. rename) the current scan_backward_fast code
* copy scan_backward_slow to scan_backward_fast
* replace all file_len with a hard-coded "2" (see the sketch below)
  ->  can give a 50% boost depending on optimizer cleverness
* step-by-step apply changes from the old scan_backward_fast code
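
(To illustrate the hard-coded "2" step: a sketch with made-up names, not
the actual scan_backward_* code, of the kind of loop where a compile-time
constant bound lets the optimizer unroll the comparison completely.)

[[[
#include <stdio.h>

struct file_info { const char *curp; };

/* Hypothetical helper: with the loop bound hard-coded to 2 instead of
   the runtime file_len, the compiler can unroll the loop entirely. */
static int
all_bytes_match(const struct file_info file[], int file_len)
{
  int i;

  (void) file_len;              /* hard-coded to 2 for the experiment */

  for (i = 1; i < 2 /* was: file_len */; i++)
    if (*file[i].curp != *file[0].curp)
      return 0;
  return 1;
}

int main(void)
{
  const char a[] = "abc", b[] = "abd";
  struct file_info file[2];

  file[0].curp = a + 2;
  file[1].curp = b + 2;
  printf("%d\n", all_bytes_match(file, 2));   /* prints 0: 'c' != 'd' */
  return 0;
}
]]]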
Ok, if I have some time tonight, I will try to get closer ...

For now, some feedback on the rest of the patch:

[[[
[... snip ...]

+  if (had_cr)
+    {
+      /* Check if we ended in the middle of a \r\n for one file, but \r for
+         another. If so, back up one byte, so the next loop will back up
+         the entire line. Also decrement *prefix_lines, since we counted one
+         too many for the \r. */
+      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
Here (or below DECREMENT, but within this same if condition) you need:
          (*prefix_lines)--;

just like in the original (and *_slow) case. As is explained in the
comment above: if we are here, we actually counted a full prefix line
too much, because we already counted one for a matching '\r', but the
character after that was '\n' in one file, but a different character
in the other. So the eol-termination is different, so this is a
non-matching line which needs to be fully rolled back (hence also the
extra DECREMENT here).
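
For illustration, the fixed-up block could then look roughly like this
(just a sketch, not the actual branch code; the byte-decrement itself is
only indicated by a comment because that part of the hunk is snipped
above):

[[[
  if (had_cr)
    {
      /* We ended in the middle of a "\r\n" for one file, but on a '\r'
         followed by some other character for the other, so this line was
         not identical after all: undo the prefix_lines increment we did
         for the '\r', and back up one byte so the loop below rolls back
         the entire line. */
      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
        {
          (*prefix_lines)--;
          /* ... back up one byte in both files here (the DECREMENT step
             mentioned above). */
        }
    }
]]]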
You are right. I had some intermediate code in
place where the increment would happen later
(thus sparing the extra decrement).
I'm not even sure the original code is correct at all.
For instance, it does not handle '\r' (Mac style) EOLs
properly. Not really sure how to fix that.
Are you sure it's not correct for Mac style EOLs? Can you give an example?
You are correct, the extra check for '\n' before decrementing
the line count fixes up the line-count for WIN EOLs only.

find_prefix ("a\rb", "a\rc") -> prefix_lines = 1, had_cr = TRUE
find_prefix ("a\nb", "a\nc") -> prefix_lines = 1, had_cr = FALSE

The other case that I was thinking about seems to be unsupported
by SVN anyways:

find_prefix ("a\r\nb", "a\rc") -> prefix_lines = 0, had_cr = TRUE
find_prefix ("a\n\rb", "a\nc") -> prefix_lines = 1, had_cr = FALSE

So, you are right, I'm wrong ;)

I thought I made it work correctly with '\r' EOLs (the very first
iterations of this (when I was still sending it as patches to the
dev-list) only worked for \r\n and \n, but somewhere along the way, I
made it handle \r EOLs as well). That's what the entire block below
"/* check for eol, and count */" is for:
- if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if the had_cr == FALSE,
because otherwise it's the second half of a \r\n EOL, and we already
counted it for the \r.
- all the rest: don't care, just move on (setting had_cr = FALSE).
- In the special case where prefix scanning ended with a mismatch
between a '\r\n' and '\r<some other character>', decrement prefix_line
and go back one byte, because this line wasn't identical after all.

(my older versions just counted the number of \n's, which is ok for
both \n and \r\n).

The only reason to look at EOLs is to count the number of prefix_lines
(and to rollback the last non-matching line until the previous EOL,
but that's done by going back until finding either a \r or an \n). All
the rest is just comparing bytes.
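
(A standalone sketch of that counting rule, in made-up code rather than
the branch's prefix scan: '\r' always counts as an EOL, '\n' only counts
when it is not the second half of a "\r\n" pair, so \n, \r\n and \r files
all come out right.)

[[[
#include <stdio.h>

static long
count_eols(const char *buf, long len)
{
  long lines = 0;
  long i;
  int had_cr = 0;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\r')
        {
          lines++;              /* Mac EOL, or first half of "\r\n" */
          had_cr = 1;
        }
      else
        {
          if (buf[i] == '\n' && !had_cr)
            lines++;            /* Unix EOL; "\r\n" was already counted */
          had_cr = 0;
        }
    }
  return lines;
}

int main(void)
{
  /* One Mac, one Windows and one Unix EOL: 3 lines. */
  printf("%ld\n", count_eols("a\rb\r\nc\n", 7));
  return 0;
}
]]]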

(note that it doesn't support ignoring EOL-style differences (see the
comment "/* ### TODO: see if we can take advantage of diff options
like ignore_eol_style or ignore_space. */"). Maybe you're referring to
that? If you encounter comparison between different EOL-styles, you
just lose the prefix/suffix optimization, and fall back to the
original algorithm, which is ok for now I think, because it's not that
common.).

Cheers,
-- Stefan^2.
