Git has a diff.wordRegex config that allows the user to specify a
regex that defines a word. Setting diff.wordRegex to "." works well
for a char-level diff for ASCII chars, but not for UTF-8 chars.

For example, if a UTF-8 encoded file containing the text "一人" is
changed to "丁人", "git diff --word-diff=color" prints
"<E4><B8><80><81>人" (where "<80>" is red and "<81>" is green) instead
of the desired "一丁人" (where "一" is red and "丁" is green). The regex
is apparently matched byte by byte, so only the one differing byte is
highlighted. This is very annoying when diffing files that contain CJK
chars.
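
To illustrate where the stray "<80>" and "<81>" come from, here is a
small Python sketch. It is only an illustration of the UTF-8 byte
layout, not Git's code, and the helper name hex_bytes is made up for
this example.

```python
# Illustration only: the UTF-8 bytes of the old and new text.
old = "一人".encode("utf-8")
new = "丁人".encode("utf-8")


def hex_bytes(data):
    """Return each byte as two upper-case hex digits."""
    return [f"{b:02X}" for b in data]


old_hex = hex_bytes(old)
new_hex = hex_bytes(new)
print(old_hex)
print(new_hex)

# Byte positions where the two encodings differ.
differing = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
print(differing)
```

The two encodings share the leading bytes E4 B8 and differ only in
the third byte (80 versus 81), which is the byte the diff colors.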

Git's diff.wordRegex seems to use a very basic regex engine that
doesn't support matching a char range by byte value, such as "\x41"
for "A". Is there a way to make the char-level diff work correctly for
UTF-8? If not, maybe we should implement a way to allow it.
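
For reference, the kind of pattern I have in mind is "one
non-continuation byte followed by any continuation bytes (0x80-0xBF)".
The Python sketch below only demonstrates that idea on raw bytes. I
have not confirmed that Git's regex engine accepts such byte ranges,
and the names UTF8_CHAR and split_chars are hypothetical.

```python
import re

# Assumption: a UTF-8 character is one lead (non-continuation) byte
# followed by zero or more continuation bytes in the range 0x80-0xBF.
UTF8_CHAR = re.compile(rb"[^\x80-\xBF][\x80-\xBF]*")


def split_chars(data):
    """Split UTF-8 encoded bytes into one token per character."""
    return UTF8_CHAR.findall(data)


old_tokens = split_chars("一人".encode("utf-8"))
new_tokens = split_chars("丁人".encode("utf-8"))

# Each token is a complete character, so it decodes cleanly.
print([t.decode("utf-8") for t in old_tokens])
print([t.decode("utf-8") for t in new_tokens])
```

With tokens like these, a word diff would compare "一" against "丁" as
whole units instead of splitting them at the last byte.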
