Re: Git grep does not support multi-byte characters (like UTF-8)

Plamen Totev Tue, 07 Jul 2015 01:59:35 -0700

Nguyen, thanks for the help and the patch. The escaping suggested by 
Scharfe also seems like a good choice. But I dug some more into the problem and 
found a few other things. That's why I replied on the main thread and not on 
the patch. I hope you'll excuse me if this is bad practice.


git grep -i -P also does not work, because the PCRE_UTF8 flag is not set and 
the pcre library does not treat the string as UTF-8.
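
A small sketch of the effect, using Python's re module as a stand-in for pcre 
(an analogy only, not git's code; the sample strings are made up). Matching on 
raw bytes folds case for ASCII letters only, while matching on decoded text 
folds Unicode letters too:

```python
import re

haystack = "Това е ТЕСТ"   # contains an upper-case Cyrillic word
needle = "тест"            # the same word in lower case

# Byte-oriented matching: the engine sees raw UTF-8 bytes and folds case
# only for ASCII letters, so the Cyrillic word is not found.
byte_match = re.search(needle.encode("utf-8"),
                       haystack.encode("utf-8"),
                       re.IGNORECASE)

# Character-oriented matching: the engine works on code points, so
# Unicode case folding applies and the word is found.
text_match = re.search(needle, haystack, re.IGNORECASE)

print("bytes:", byte_match is not None)
print("text: ", text_match is not None)
```

Setting PCRE_UTF8 would play the role of the second call here: it tells the 
library to interpret the subject as UTF-8 characters instead of single bytes.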

Pickaxe search also uses kwsearch, so case-insensitive search does not work 
with it either (e.g. git log -i -S). Maybe this is less of a problem here, as 
one is expected to search for an exact string (and hence knows the case).
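
To illustrate why a byte-oriented keyword matcher cannot fold non-ASCII case, 
here is a sketch that assumes the matcher normalises case through a 256-entry 
per-byte table. The table FOLD and the helper fold_bytes are hypothetical 
names for this example, not git's:

```python
# Hypothetical per-byte folding table: maps 'A'-'Z' to 'a'-'z' and leaves
# every other byte value unchanged.
FOLD = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in range(256))


def fold_bytes(data: bytes) -> bytes:
    """Apply the per-byte table to a byte string."""
    return data.translate(FOLD)


upper = "ТЕСТ".encode("utf-8")
lower = "тест".encode("utf-8")

# ASCII folds as expected.
ascii_ok = fold_bytes(b"TEST") == fold_bytes(b"test")

# Each Cyrillic letter is two bytes >= 0x80 that the table leaves
# untouched, so the two spellings never compare equal.
cyrillic_ok = fold_bytes(upper) == fold_bytes(lower)

print(ascii_ok, cyrillic_ok)
```

A single-byte table has no way to express a mapping between two-byte 
sequences, which is why the fix has to happen at the character level.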

There is an interesting corner case. is_fixed treats all patterns containing 
NUL bytes as fixed. So what if the string contains non-ASCII symbols as well 
as NULs and the search is case insensitive :) I have to admit that my 
knowledge of UTF-8 is not enough to say whether this could occur during normal 
usage, for example if the second byte of a multi-byte symbol were NUL. I would 
guess it cannot, as it would break a lot of programs that depend on 
NUL-delimited strings, but it would be good if somebody could confirm.
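
One way to check this guess is a brute-force pass over every code point. The 
sketch below relies only on Python's built-in UTF-8 encoder; the function name 
is made up for the example:

```python
def utf8_multibyte_sequences():
    """Yield the UTF-8 encoding of every code point that needs 2+ bytes."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        yield chr(cp).encode("utf-8")


# In UTF-8 every byte of a multi-byte sequence has its high bit set
# (lead bytes are >= 0xC0, continuation bytes are 0x80-0xBF), so a
# 0x00 byte can never appear inside one.
all_high = all(min(seq) >= 0x80 for seq in utf8_multibyte_sequences())

# Among the single-byte (ASCII) code points, only U+0000 encodes to NUL.
nul_sources = [cp for cp in range(0x80) if 0 in chr(cp).encode("utf-8")]

print(all_high, nul_sources)
```

So in UTF-8 a NUL byte in a pattern can only come from an actual U+0000 
character, never from part of a multi-byte symbol.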

GNU grep indeed uses escaped regular expressions when the string uses a 
multi-byte encoding and the search is case insensitive. If the encoding is 
UTF-8 then this strategy could be used in git too, especially since git already 
has support and helper functions for working with UTF-8. As for the other 
multi-byte encodings, I think things would become more complicated. As far as 
I know, in UTF-16 the '{' character, for example, is two bytes, not one. Maybe 
support could really be added only for UTF-8, with a warning issued if the 
string is not UTF-8.
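
For reference, the byte form of '{' and of other regex special characters can 
be checked per encoding. The sketch uses Python's built-in codecs; the choice 
of UTF-16 and Shift_JIS as comparison encodings is only for illustration:

```python
# '{' is plain ASCII, so UTF-8 stores it as the single byte 0x7B.
brace_utf8 = "{".encode("utf-8")

# UTF-16 uses two bytes per ASCII character, and one of them is NUL.
brace_utf16 = "{".encode("utf-16-le")

# In Shift_JIS the trailing byte of a two-byte character may fall in the
# ASCII range: the katakana 'ソ' ends in 0x5C, the byte used for '\'.
so_sjis = "ソ".encode("shift_jis")

print(brace_utf8, brace_utf16, so_sjis)
```

In UTF-8 a byte-wise scan for ASCII special characters is therefore safe, 
while in encodings like the other two it can hit a NUL or a byte that only 
looks like a special character.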

So maybe the following makes sense when a grep search is performed:
* check whether a multi-byte encoding is used. If it is, the search is case 
insensitive, and the encoding is not UTF-8, give a warning;
* if pcre is used and the string is UTF-8 encoded, set the PCRE_UTF8 flag;
* if the search is case insensitive, the string is fixed, and the encoding used 
is UTF-8, use regcomp instead of kwsearch and escape any regex special 
characters in the pattern.
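
The last step could be sketched roughly as follows. This is Python on 
already-decoded text, not git's C code; is_fixed here is a simplified 
stand-in for the real check, and compile_search is a hypothetical name. The 
encoding warning from the first step is left out because the sketch never 
sees raw bytes:

```python
import re

# Characters treated as regex-special in this simplified check.
REGEX_SPECIAL = set(r"\.^$*+?()[]{}|")


def is_fixed(pattern: str) -> bool:
    """True if the pattern contains no regex special characters."""
    return not any(ch in REGEX_SPECIAL for ch in pattern)


def compile_search(pattern: str, ignore_case: bool, fixed_string: bool = False):
    """Return a function text -> bool that searches for the pattern."""
    fixed = fixed_string or is_fixed(pattern)
    needs_regex = ignore_case and not pattern.isascii()

    if fixed and not needs_regex:
        # Fast path: plain substring search is enough.
        if ignore_case:
            needle = pattern.lower()
            return lambda text: needle in text.lower()
        return lambda text: pattern in text

    # Slow path: use the regex engine. A fixed string is escaped first so
    # that its special characters are matched literally.
    source = re.escape(pattern) if fixed else pattern
    rx = re.compile(source, re.IGNORECASE if ignore_case else 0)
    return lambda text: rx.search(text) is not None


find = compile_search("тест{1}", ignore_case=True, fixed_string=True)
print(find("Това е ТЕСТ{1}"), find("Това е ТЕСТ"))
```

Without the escaping, "{1}" would be read as a repetition count and the 
second call would match as well.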

And the question of the pickaxe search behavior remains open. Using kwset does 
not work for case-insensitive non-ASCII searches. Instead of fixing grep.c, 
maybe it's better to introduce a new function that performs keyword searches, 
so it could be used by grep, diffcore-pickaxe, and any other code that may 
need such functionality in the future. Or maybe diffcore-pickaxe should use 
grep instead of using kwset/regcomp directly.
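
As a rough sketch of the shared-helper idea (Python on decoded text; all three 
function names are hypothetical and none of this mirrors git's actual 
interfaces), one counting routine could serve both callers. The pickaxe part 
assumes the documented -S behaviour of reporting a change when the number of 
occurrences differs between the two sides:

```python
import re


def count_keyword(text: str, keyword: str, ignore_case: bool = False) -> int:
    """Count non-overlapping literal occurrences of keyword in text."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(re.escape(keyword), text, flags))


def grep_hit(line: str, keyword: str, ignore_case: bool = False) -> bool:
    """grep-style use: does the line contain the keyword at all?"""
    return count_keyword(line, keyword, ignore_case) > 0


def pickaxe_hit(before: str, after: str, keyword: str,
                ignore_case: bool = False) -> bool:
    """pickaxe-style use: did the number of occurrences change?"""
    return (count_keyword(before, keyword, ignore_case)
            != count_keyword(after, keyword, ignore_case))


old = "ТЕСТ\n"
new = "ТЕСТ\nтест\n"
print(grep_hit(old, "тест", ignore_case=True),
      pickaxe_hit(old, new, "Тест", ignore_case=True))
```

Both callers then get the same case-insensitive behaviour for non-ASCII 
keywords, and only one place needs fixing.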

Regards,
Plamen Totev



>-------- Original message -------- 
>From: Duy Nguyen [email protected] 
>Subject: Re: Git grep does not support multi-byte characters (like UTF-8) 
>To: Plamen Totev <[email protected]> 
>Sent: 06.07.2015 15:23 

> I think we over-optimized a bit. If your system provides regex 
> with locale support (e.g. Linux) and you don't explicitly use the fallback 
> regex implementation, it should work. I suppose your test patterns 
> look "fixed" (i.e. no regex special characters)? Can you try just adding 
> "." and see if case insensitive matching works? 

This is indeed the problem. When I added the "." the matching works just fine.
