Tim Peters <t...@python.org> added the comment:

difflib generally synchs on the longest contiguous matching subsequence that 
doesn't contain a "junk" element.  By default, `ndiff()`'s optional `charjunk` 
argument considers blanks and tabs to be junk characters.
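
For reference, that default is `difflib.IS_CHARACTER_JUNK`.  A quick check of 
what it treats as junk (the `junk_flags` name is just for illustration):

```python
import difflib

# ndiff()'s default charjunk is difflib.IS_CHARACTER_JUNK, which
# returns True only for a blank or a tab.
junk_flags = {ch: difflib.IS_CHARACTER_JUNK(ch) for ch in (" ", "\t", "\n", "x")}
print(junk_flags)
```

Note that a newline is not junk under this default, only blanks and tabs.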

In the strings:

"drwxrwxr-x 2 2000  2000\n"
"drwxr-xr-x 2 2000  2000\n"

the longest matching substring not containing whitespace is "rwxr-x", of length 
6, starting at index 4 in the first string and at index 1 in the second.  So 
it's aligning the strings like so:

"drwxrwxr-x 2 2000  2000\n"
   "drwxr-xr-x 2 2000  2000\n"
     123456

That's why it wants to delete the 1:4 slice ("rwx") in the first string and 
insert "r-x" after the longest matching substring.

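The alignment above can be reproduced with a rough sketch that drives 
`SequenceMatcher` directly with the same junk predicate.  This only 
approximates what `ndiff()` does internally for a pair of similar lines; the 
variable names are mine:

```python
from difflib import SequenceMatcher, IS_CHARACTER_JUNK

a = "drwxrwxr-x 2 2000  2000\n"
b = "drwxr-xr-x 2 2000  2000\n"

# Same junk predicate that ndiff() uses by default for charjunk.
sm = SequenceMatcher(IS_CHARACTER_JUNK, a, b)

# Longest matching block that contains no junk element.
match = sm.find_longest_match(0, len(a), 0, len(b))
print(match, repr(a[match.a:match.a + match.size]))

# The edit script built around that block.
opcodes = sm.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(a[i1:i2]), repr(b[j1:j2]))
```

The opcodes show "rwx" deleted from the first string and "r-x" inserted from 
the second, on either side of the "rwxr-x" match.
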
The default is aimed at improving results for human-readable text, like prose 
and Python code, where stuff between whitespace is often read "as a whole" 
(words, keywords, identifiers, ...).

For cases like this one, where character-by-character differences are 
important, it's often better to pass `charjunk=None`.  Then the longest 
matching substring is "xr-x 2 2000  2000\n", of length 18, at the tail end of 
both strings, and you get the output you're expecting.
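
A minimal sketch of that suggestion, again using `SequenceMatcher` directly 
to show the match (with no junk predicate this time) and then calling 
`ndiff()` with `charjunk=None`.  Variable names are illustrative only:

```python
from difflib import SequenceMatcher, ndiff

a = "drwxrwxr-x 2 2000  2000\n"
b = "drwxr-xr-x 2 2000  2000\n"

# With no junk predicate, whitespace may be part of a match.
sm = SequenceMatcher(None, a, b)
match = sm.find_longest_match(0, len(a), 0, len(b))
print(match, repr(a[match.a:match.a + match.size]))

# Only the single differing character is reported as a replacement.
opcodes = sm.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(a[i1:i2]), repr(b[j1:j2]))

# The same setting passed to ndiff() itself.
print("".join(ndiff([a], [b], charjunk=None)), end="")
```

Here the only non-equal opcode is a one-character replace at index 5 ("w" 
becomes "-").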

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35955>
_______________________________________