Re: [PATCH v2 0/9] icase match on non-ascii

2015-07-08 Thread Torsten Bögershausen
On 2015-07-08 12.38, Nguyễn Thái Ngọc Duy wrote:
 Side note, I almost added the third has_non_ascii() function. Maybe we
 should consider merging the two existing has_non_ascii() functions
 back, or rename one to something else.

Side question:

has_non_ascii can mean different things:
 UTF-8, ISO-8859-1, ISO-8859-x...

In short: everything.
Should we be more critical here ?

May be this can be used from commit.c:
static int verify_utf8(struct strbuf *buf)

Other question:
Should the non-ascii characters in the test scripts be octal-escaped ?

Third question:
What happens on systems, which don't have gettext, (for whatever reasons)
--- a/gettext.c
+++ b/gettext.c
@@ -166,12 +166,17 @@ void git_setup_gettext(void)
textdomain(git);
}

+int is_utf8_locale(void)
+{
+ return !strcmp(charset, UTF-8);
+}


4th question:
What happens on systems which don't have locale support at all ?


As one may suspect, I'm not a friend of being dependent on gettext and/or
locale, at least not for this kind of business.

Would it make more sense to have a command line option ?


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] icase match on non-ascii

2015-07-08 Thread Duy Nguyen
On Wed, Jul 8, 2015 at 6:32 PM, Torsten Bögershausen tbo...@web.de wrote:
 On 2015-07-08 12.38, Nguyễn Thái Ngọc Duy wrote:
 Side note, I almost added the third has_non_ascii() function. Maybe we
 should consider merging the two existing has_non_ascii() functions
 back, or rename one to something else.

 Side question:

 has_non_ascii can mean different things:
  UTF-8, ISO-8859-1, ISO-8859-x...

 In short: everything.
 Should we be more critical here ?

For the purpose of this series, no. What we need to ask is can we
optimize this? and that leads to ok we can safely optimize for
ascii-only patterns, don't optimize otherwise.

 May be this can be used from commit.c:
 static int verify_utf8(struct strbuf *buf)

Except the pcre/utf8 patch, the rest should work for all single-byte encodings..

 Other question:
 Should the non-ascii characters in the test scripts be octal-escaped ?

I would prefer something readable from a text editor. Even though I
don't speak Icelandic (the strings were copied from gettext test
script) I can see uppercase/lowercase of letters. But I don't know how
many people have fonts covering more than just ascii..

 Third question:
 What happens on systems, which don't have gettext, (for whatever reasons)
 --- a/gettext.c
 +++ b/gettext.c
 @@ -166,12 +166,17 @@ void git_setup_gettext(void)
 textdomain(git);
 }

 +int is_utf8_locale(void)
 +{
 + return !strcmp(charset, UTF-8);
 +}

Hm.. pcre on utf-8 is screwed. I really don't want to go through
nl_langinfo, $LC_ALL, $LC_CTYPE and $LANG to detect utf8 like what's
done in compat/regex/regcomp.c.. But I guess there's no other way,
people can disable gettext and expect git to work properly with utf-8
if their pcre library supports it.

 4th question:
 What happens on systems which don't have locale support at all ?

I suppose by locale here you do not mean gettext any more, then
icase does not work. We simply delegate the work to system regex/pcre.
If they don't support locale, nothing we can do. Git itself does not
know how to fold case outside ascii (except maybe utf8, I don't know
how smart our utf-8 impl is). gettext support does not matter here,
except pcre/utf8 case above.

 As one may suspect, I'm not a friend of being dependent on gettext and/or
 locale, at least not for this kind of business.

 Would it make more sense to have a command line option ?

I assume you only care about utf-8 here (and utf8.c knows how to fold
case), you'll need to improve compat regex to take advantage of it
first, maybe make sure it can live side by side with system regex,
also make sure pcre is compiled with utf-8 support if you use pcre.
And if setting LANG to let git know you want to use utf-8 is too
subtle, then yes a command line option makes sense :)
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] icase match on non-ascii

2015-07-08 Thread Junio C Hamano
 Patch 5 is funny. The patch itself is in iso-8859-1, but my name in
 the commit message is in utf-8.

As an e-mail message is a single file, by definition that is not merely
funny but just broken, no matter what encoding your MUA
declares the contents are in, no?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] icase match on non-ascii

2015-07-08 Thread Duy Nguyen
On Wed, Jul 8, 2015 at 10:36 PM, Junio C Hamano gits...@pobox.com wrote:
 Patch 5 is funny. The patch itself is in iso-8859-1, but my name in
 the commit message is in utf-8.

 As an e-mail message is a single file, by definition that is not merely
 funny but just broken, no matter what encoding your MUA
 declares the contents are in, no?

Yes. But if your MUA is not so strict and does not reject invalid byte
sequences, git-am might be able to process it. What are the options we
have? Teach git-format-patch to generate patches with non-ascii chars
as binary (utf-8 may be treated specially and kept as-is after
validation)?
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html