Nguyen, thanks for the help and the patch. The escaping suggested by
Scharfe also seems like a good choice. But I dug some more into the problem and
found a few other things. That's why I replied on the main thread and not on
the patch. I hope you'll excuse me if this is bad practice.
git grep -i -P also does not work, because PCRE_UTF8 is not set and the pcre
library does not treat the string as UTF-8.
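To illustrate the difference between matching bytes and matching UTF-8
characters, here is a small sketch. It uses Python's re module rather than
PCRE, so it is only an analogy: a bytes pattern stands in for PCRE without
PCRE_UTF8, and a str pattern stands in for PCRE with the flag set. The sample
words are made up.

```python
import re

needle = "привет"      # lower-case Cyrillic, sample data
haystack = "ПРИВЕТ"    # the same word in upper case

# Byte-oriented matching: case folding only covers ASCII letters,
# so the UTF-8 bytes of the two words never compare equal.
bytes_match = re.search(needle.encode("utf-8"),
                        haystack.encode("utf-8"),
                        re.IGNORECASE)

# Character-oriented matching: the engine sees letters, not bytes,
# and can fold their case.
text_match = re.search(needle, haystack, re.IGNORECASE)

print(bytes_match is not None)   # False
print(text_match is not None)    # True
```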
Pickaxe search also uses kwsearch, so case-insensitive search does not work
with it either (e.g. git log -i -S). Maybe this is less of a problem here, as
one is expected to search for an exact string (and hence knows the case).
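A rough sketch of why a byte-wise keyword search cannot fold non-ASCII case.
It assumes the matcher lower-cases its input through a 256-entry byte
translation table; that is an assumption about how kwset is used, not
something taken from the code. fold_table and byte_search are made-up names.

```python
# Hypothetical stand-in for a per-byte "tolower" table: only A-Z are remapped.
fold_table = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))

def byte_search(needle: str, haystack: str) -> bool:
    """Case-insensitive substring test done purely on folded UTF-8 bytes."""
    n = needle.encode("utf-8").translate(fold_table)
    h = haystack.encode("utf-8").translate(fold_table)
    return n in h

print(byte_search("hello", "Say HELLO"))   # True: ASCII letters fold
print(byte_search("привет", "ПРИВЕТ"))     # False: Cyrillic bytes are untouched
```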
There is an interesting corner case. is_fixed treats all patterns containing
NULs as fixed. So what if the string contains non-ASCII symbols as well as NULs
and the search is case insensitive? :) I have to admit that my knowledge of
UTF-8 is not enough to say whether this could occur during normal usage, for
example if the second byte of a multi-byte symbol could be NUL. I would guess
it cannot, as that would break a lot of programs that depend on NUL-terminated
strings, but it would be good if somebody could confirm.
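The question can be checked by brute force. The sketch below encodes every
Unicode code point (surrogates excluded, since they are not valid in UTF-8)
and collects those whose UTF-8 form contains a zero byte. It relies on
Python's built-in UTF-8 encoder; code_points_with_nul_byte is a made-up name.

```python
def code_points_with_nul_byte():
    """Return the code points whose UTF-8 encoding contains a 0x00 byte."""
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if 0 in chr(cp).encode("utf-8"):
            found.append(cp)
    return found

offenders = code_points_with_nul_byte()
print(offenders)   # [0]
```

Only U+0000 itself shows up. This matches the design of UTF-8: every byte of
a multi-byte sequence has its high bit set, so a NUL byte can only be the NUL
character.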
GNU grep indeed uses escaped regular expressions when the string uses a
multi-byte encoding and the search is case insensitive. If the encoding is
UTF-8 then this strategy could be used in git too, especially since git already
has support and helper functions for working with UTF-8. As for the other
multi-byte encodings, I think things would become more complicated. As far as
I know, in UTF-16 the '{' character for example is two bytes, not one. Maybe
support really could be added only for UTF-8, with a warning issued if the
string is not UTF-8.
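A sketch of the escaping strategy, using Python's re module as a stand-in for
regcomp (so the exact escaping rules differ from POSIX regex). The fixed
string is escaped so that its special characters match literally, then it is
compiled case-insensitively. to_icase_regex is a made-up name.

```python
import re

def to_icase_regex(fixed: str) -> re.Pattern:
    """Compile a fixed string as a literal, case-insensitive regex."""
    return re.compile(re.escape(fixed), re.IGNORECASE)

pattern = to_icase_regex("цена{1}.")

# Same text in a different case: matches.
print(pattern.search("ЦЕНА{1}.") is not None)   # True

# Without escaping, "{1}" and "." would act as regex operators and this
# would match; with escaping it does not.
print(pattern.search("ЦЕНАx") is not None)      # False
```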
So maybe the following makes sense when a grep search is performed:
* check if a multi-byte encoding is used. If it is, the search is case
  insensitive and the encoding is not UTF-8, give a warning;
* if pcre is used and the string is UTF-8 encoded, set the PCRE_UTF8 flag;
* if the search is case insensitive, the string is fixed and the encoding used
  is UTF-8, use regcomp instead of kwsearch and escape any regex special
  characters in the pattern.
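The steps above could be sketched as a small decision function. This is only
an outline of the control flow: plan_grep, its parameters and the returned
labels are all invented for illustration and do not correspond to anything in
grep.c.

```python
def plan_grep(ignore_case, fixed, use_pcre, multibyte, utf8):
    """Return (matcher, flags, warnings) for the proposed strategy."""
    flags, warnings = [], []

    # Step 1: case-insensitive search in a non-UTF-8 multi-byte encoding
    # is not handled, so warn about it.
    if multibyte and ignore_case and not utf8:
        warnings.append("case-insensitive search only supports UTF-8")

    # Step 2: let PCRE treat the input as UTF-8.
    if use_pcre:
        if utf8:
            flags.append("PCRE_UTF8")
        return "pcre", flags, warnings

    # Step 3: fixed + case-insensitive + UTF-8 goes to an escaped regex
    # instead of kwsearch.
    if fixed and ignore_case and utf8:
        return "regcomp-escaped", flags, warnings

    return ("kwsearch" if fixed else "regcomp"), flags, warnings

print(plan_grep(ignore_case=True, fixed=True, use_pcre=False,
                multibyte=True, utf8=True))
print(plan_grep(ignore_case=True, fixed=False, use_pcre=True,
                multibyte=True, utf8=True))
```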
And the question of the behavior of pickaxe search remains open. Using kwset
does not work with case-insensitive non-ASCII searches. Instead of fixing
grep.c, maybe it's better to introduce a new function that performs keyword
searches, so it could be used by grep, diffcore-pickaxe and any other code
that may require such functionality in the future. Or maybe diffcore-pickaxe
should use grep instead of using kwset/regcomp directly.
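As a rough illustration of the shared-function idea, the sketch below has one
helper that finds a fixed string with optional case folding, and two callers:
one grep-like (does a line match?) and one pickaxe-like (did the number of
occurrences change between two versions?). It is Python, not git code, and
find_fixed, grep_lines and pickaxe_hit are hypothetical names.

```python
import re

def find_fixed(needle, text, ignore_case=False):
    """Return the start offsets of every occurrence of a fixed string."""
    flags = re.IGNORECASE if ignore_case else 0
    return [m.start() for m in re.finditer(re.escape(needle), text, flags)]

def grep_lines(needle, text, ignore_case=False):
    """grep-like caller: keep the lines that contain the needle."""
    return [ln for ln in text.splitlines()
            if find_fixed(needle, ln, ignore_case)]

def pickaxe_hit(needle, old, new, ignore_case=False):
    """pickaxe-like caller: true if the occurrence count differs."""
    return (len(find_fixed(needle, old, ignore_case))
            != len(find_fixed(needle, new, ignore_case)))

old = "Привет\nмир\n"
new = "Привет\nмир\nПРИВЕТ\n"

print(grep_lines("привет", new, ignore_case=True))
print(pickaxe_hit("привет", old, new, ignore_case=True))
print(pickaxe_hit("привет", old, new))
```

With case folding the pickaxe-like caller sees one occurrence become two; the
case-sensitive call finds no lower-case occurrence on either side and reports
no change.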
Regards,
Plamen Totev
>-------- Original message --------
>From: Duy Nguyen [email protected]
>Subject: Re: Git grep does not support multi-byte characters (like UTF-8)
>To: Plamen Totev <[email protected]>
>Sent: 06.07.2015 15:23
> I think we over-optimized a bit. If your system provides regex
> with locale support (e.g. Linux) and you don't explicitly use fallback
> regex implementation, it should work. I suppose your test patterns
> look "fixed" (i.e. no regex special characters)? Can you try just adding
> "." and see if case insensitive matching works?
This is indeed the problem. When I added the "." the matching worked just fine.