Tim Peters added the comment:

The text/binary distinction you have in mind doesn't particularly apply to 
difflib:  it compares sequences of hashable objects.  "Text files" are 
typically converted by front ends to lists of strings, but, e.g., the engine is 
just as happy comparing tuples of floats.
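A minimal sketch of that point: SequenceMatcher only requires hashable elements, so lists of float tuples work exactly like lists of strings (the data here is made up for illustration):

```python
import difflib

# The engine only requires hashable elements - strings are not special.
a = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
b = [(0.0, 0.0), (3.0, 4.0), (5.0, 6.0)]

sm = difflib.SequenceMatcher(None, a, b)
print(sm.ratio())                 # 2 shared elements, 6 total -> 2*2/6 ~ 0.667
print(sm.get_matching_blocks())   # the run a[1:3] == b[1:3]
```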

File comparison interfaces typically do this at _two_ levels:  first, viewing 
files as lists of strings (one string per file line).  Then, when two blocks of 
mismatching lines are encountered, viewing the lines as sequences of 
characters.  The only role "line endings" play in any of this is in how the 
_input_ to the difference engine is created:  all decisions about how a file is 
broken into strings are made before the difference engine is consulted.  This 
preprocessing can choose to normalize line endings, leave them exactly as-is 
(typical), or remove them entirely from the strings it presents to the 
difference engine - or anything else it feels like doing.  The engine itself 
has no concept of "line termination sequences" - if there happen to be \r\n, 
\n, \r, or \0 substrings in strings passed to it, they're treated exactly the 
same as any other characters.

If the input processing creates lists of lines A and B for two files, where the 
files have different line-end terminators which are left in the strings, then 
no exact match whatsoever is possible between any line of A and a line in B.  
You suggest just skipping over both in that case, but the main text-file-comparison 
"front end" in difflib works hard to try to do better than that.  That's "a 
feature", albeit a potentially expensive one.  Viewing the file lines as 
sequences of characters, it computes a "similarity score" for _every_ line in A 
compared to _every_ line in B.  So len(A)*len(B) such scores are computed.  The 
pair with the highest score (assuming it exceeds a cutoff value) is taken as 
being the synch point, and then it can go on to show the _intra_line 
differences between those two lines.
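That per-pair score can be reproduced by hand: it's SequenceMatcher.ratio() applied to the two lines viewed as character sequences (in CPython's implementation, Differ only accepts a pair as the synch point if the score clears a 0.75 cutoff). A sketch of the full len(A)*len(B) scoring:

```python
import difflib

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

# Score every line of A against every line of B, as the text-file
# front end does internally when no exact line matches exist.
for a_line in A:
    for b_line in B:
        score = difflib.SequenceMatcher(None, a_line, b_line).ratio()
        print(f"{a_line!r:12} vs {b_line!r:12} -> {score:.2f}")

# "eggrolls" vs "egg rolls" shares 8 characters out of 17 total,
# scoring 2*8/17 ~ 0.94 - the clear winner, so it becomes the synch point.
```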

That's why, e.g., given the lists of "lines":

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

it can (and does) tell you that the `egg rolls` in B was plausibly obtained 
from the `eggrolls` in A by inserting a blank.  This is often much more helpful 
than just giving up, saying "well, no line in A matched any line in B, so we'll 
just say A was entirely replaced by B".  That would be "correct" too - and much 
faster - but not really helpful.
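Running the text-file front end over those lists shows this in action; the "? " guide line marks the single blank insertion (output shape per CPython's ndiff):

```python
import difflib

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

# ndiff pairs "eggrolls" with "egg rolls" despite no exact line match,
# and emits a "? " guide line pointing at the inserted blank.
for line in difflib.ndiff(A, B):
    print(line.rstrip("\n"))
```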

Of course there's nothing special about the blank character in that.  Exactly 
the same applies if the line terminators differ between the files, and input 
processing leaves them in the strings.  difflib doesn't give up just because 
there are no exact line-level matches, and the same expensive "similarity 
score" algorithm kicks in to find the "most similar" lines despite the lack of 
exact matches.

Since that's a feature (albeit potentially expensive), I agree with Raymond 
closing this.  You can, of course, avoid the expense by ensuring your files all 
use the same line terminator sequence to begin with.  Which is the one obvious 
& thoroughly sane approach ;-)  Alternatively, talk to the `icdiff` author(s).  
I noticed it opens files for reading in binary mode, guaranteeing that 
different line-end conventions will be visible.  It's possible they could be 
talked into opening text files (or add an option to do so) using Python's 
"universal newline" mode, which converts all instances of \n, \r\n, and \r to 
\n on input.  Then lines that are identical except for line-end convention 
would in fact appear identical to difflib, and so skip the expensive similarity 
computations whenever that's so.
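A sketch of the difference that normalization makes, using in-memory strings in place of files (universal-newline text mode - plain open(path) - gives the same effect, translating \r\n and \r to \n on input):

```python
import difflib

crlf = "alpha\r\nbeta\r\n"   # Windows-style file contents
lf = "alpha\nbeta\n"         # Unix-style file contents

# Terminators left in: no line of one list equals any line of the other,
# so there are no line-level matches at all.
raw_ratio = difflib.SequenceMatcher(
    None, crlf.splitlines(keepends=True), lf.splitlines(keepends=True)
).ratio()

# Terminators normalized away: every line matches exactly, and the
# expensive intraline similarity machinery never needs to run.
clean_ratio = difflib.SequenceMatcher(
    None, crlf.splitlines(), lf.splitlines()
).ratio()

print(raw_ratio, clean_ratio)  # 0.0 vs 1.0
```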

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31561>
_______________________________________