[PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread Nguyễn Thái Ngọc Duy
Git's ispace does not include 11 and 12. Git's isprint includes
control space characters (10-13). According to glibc-2.14.1 on C
locale on Linux, this is wrong. This patch fixes it.

Signed-off-by: Nguyễn Thái Ngọc Duy pclo...@gmail.com
---
 I wrote a small C program to compare the result of all is* functions
 that Git replaces against the libc version. These are the only ones that
 differ. Which matches what Jan Schönherr commented.

 ctype.c   |  6 +++---
 git-compat-util.h | 11 ++-
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/ctype.c b/ctype.c
index 0bfebb4..71311a3 100644
--- a/ctype.c
+++ b/ctype.c
@@ -14,11 +14,11 @@ enum {
P = GIT_PATHSPEC_MAGIC, /* other non-alnum, except for ] and } */
X = GIT_CNTRL,
U = GIT_PUNCT,
-   Z = GIT_CNTRL | GIT_SPACE
+   Z = GIT_CNTRL_SPACE
 };
 
-const unsigned char sane_ctype[256] = {
-   X, X, X, X, X, X, X, X, X, Z, Z, X, X, Z, X, X, /*   0.. 15 */
+const unsigned int sane_ctype[256] = {
+   X, X, X, X, X, X, X, X, X, Z, Z, Z, Z, Z, X, X, /*   0.. 15 */
X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, /*  16.. 31 */
S, P, P, P, R, P, P, P, R, R, G, R, P, P, R, P, /*  32.. 47 */
D, D, D, D, D, D, D, D, D, D, P, P, P, P, P, G, /*  48.. 63 */
diff --git a/git-compat-util.h b/git-compat-util.h
index 02f48f6..4ed3f94 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -474,8 +474,8 @@ extern const char tolower_trans_tbl[256];
 #undef ispunct
 #undef isxdigit
 #undef isprint
-extern const unsigned char sane_ctype[256];
-#define GIT_SPACE 0x01
+extern const unsigned int sane_ctype[256];
+#define GIT_CNTRL_SPACE 0x01
 #define GIT_DIGIT 0x02
 #define GIT_ALPHA 0x04
 #define GIT_GLOB_SPECIAL 0x08
@@ -483,9 +483,10 @@ extern const unsigned char sane_ctype[256];
 #define GIT_PATHSPEC_MAGIC 0x20
 #define GIT_CNTRL 0x40
 #define GIT_PUNCT 0x80
-#define sane_istest(x,mask) ((sane_ctype[(unsigned char)(x)]  (mask)) != 0)
+#define GIT_SPACE 0x100
+#define sane_istest(x,mask) ((sane_ctype[(unsigned int)(x)]  (mask)) != 0)
 #define isascii(x) (((x)  ~0x7f) == 0)
-#define isspace(x) sane_istest(x,GIT_SPACE)
+#define isspace(x) sane_istest(x,GIT_SPACE | GIT_CNTRL_SPACE)
 #define isdigit(x) sane_istest(x,GIT_DIGIT)
 #define isalpha(x) sane_istest(x,GIT_ALPHA)
 #define isalnum(x) sane_istest(x,GIT_ALPHA | GIT_DIGIT)
@@ -493,7 +494,7 @@ extern const unsigned char sane_ctype[256];
 #define isupper(x) sane_iscase(x, 0)
 #define is_glob_special(x) sane_istest(x,GIT_GLOB_SPECIAL)
 #define is_regex_special(x) sane_istest(x,GIT_GLOB_SPECIAL | GIT_REGEX_SPECIAL)
-#define iscntrl(x) (sane_istest(x,GIT_CNTRL))
+#define iscntrl(x) (sane_istest(x,GIT_CNTRL | GIT_CNTRL_SPACE))
 #define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
 #define isxdigit(x) (hexval_table[x] != -1)
-- 
1.8.0.rc2.23.g1fb49df

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread Jan H. Schönherr
Hi.

Am 13.11.2012 11:46, schrieb Nguyễn Thái Ngọc Duy:
 Git's ispace does not include 11 and 12. Git's isprint includes
 control space characters (10-13). According to glibc-2.14.1 on C
 locale on Linux, this is wrong. This patch fixes it.
 
 Signed-off-by: Nguyễn Thái Ngọc Duy pclo...@gmail.com
 ---
  I wrote a small C program to compare the result of all is* functions
  that Git replaces against the libc version. These are the only ones that
  differ. Which matches what Jan Schönherr commented.
 
  ctype.c   |  6 +++---
  git-compat-util.h | 11 ++-
  2 files changed, 9 insertions(+), 8 deletions(-)
 
 diff --git a/ctype.c b/ctype.c
 index 0bfebb4..71311a3 100644
 --- a/ctype.c
 +++ b/ctype.c
 @@ -14,11 +14,11 @@ enum {
   P = GIT_PATHSPEC_MAGIC, /* other non-alnum, except for ] and } */
   X = GIT_CNTRL,
   U = GIT_PUNCT,
 - Z = GIT_CNTRL | GIT_SPACE
 + Z = GIT_CNTRL_SPACE
  };
  
 -const unsigned char sane_ctype[256] = {
 - X, X, X, X, X, X, X, X, X, Z, Z, X, X, Z, X, X, /*   0.. 15 */
 +const unsigned int sane_ctype[256] = {
 + X, X, X, X, X, X, X, X, X, Z, Z, Z, Z, Z, X, X, /*   0.. 15 */
   X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, /*  16.. 31 */
   S, P, P, P, R, P, P, P, R, R, G, R, P, P, R, P, /*  32.. 47 */
   D, D, D, D, D, D, D, D, D, D, P, P, P, P, P, G, /*  48.. 63 */

An alternative to switching from 1-byte to 4-byte values (don't we have
a 2-byte datatype?), would be to free up GIT_CNTRL and simply do:

#define iscntrl(x) ((x)  0x20)


 diff --git a/git-compat-util.h b/git-compat-util.h
 index 02f48f6..4ed3f94 100644
 --- a/git-compat-util.h
 +++ b/git-compat-util.h
[...]
 @@ -483,9 +483,10 @@ extern const unsigned char sane_ctype[256];
  #define GIT_PATHSPEC_MAGIC 0x20
  #define GIT_CNTRL 0x40
  #define GIT_PUNCT 0x80
 -#define sane_istest(x,mask) ((sane_ctype[(unsigned char)(x)]  (mask)) != 0)
 +#define GIT_SPACE 0x100
 +#define sane_istest(x,mask) ((sane_ctype[(unsigned int)(x)]  (mask)) != 0)

That should better be left (unsigned char)? We might access values after the
array otherwise.

(That said, it wasn't really correct before either, when there really is a
possibility that x = 0x100.)

Regards
Jan

PS: It looks like my isprint() version was given precedence over your
isprint() version during the merge into next. That should also be sorted out,
but I've no idea which one is actually better: two comparisons versus one
cache lookup and a bitop... (though my guess is that comparisons are cheaper,
but then we should also convert isdigit()...)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread René Scharfe

Am 13.11.2012 11:46, schrieb Nguyễn Thái Ngọc Duy:

Git's isprint includes
control space characters (10-13). According to glibc-2.14.1 on C
locale on Linux, this is wrong. This patch fixes it.


isprint() is not in master, yet.  Can we perhaps still introduce it in 
such a way that we never have an incorrect version in master's history?


And could you please update test-ctype.c to match the change to 
isspace()?  The tests there just documented the status quo before I made 
changes to ctype.c long ago, so it's definitions are just as correct (or 
wrong) as the original implementation.


Thanks,
René

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread René Scharfe

Am 13.11.2012 11:46, schrieb Nguyễn Thái Ngọc Duy:

Git's ispace does not include 11 and 12.  [...]

 According to glibc-2.14.1 on C locale on Linux, this is wrong.

11 and 12 being vertical tab (\v) and form-feed (\f).  This lack goes 
back to the introduction of git's own character classifier macros seven 
years ago in 4546738b (Unlocalized isspace and friends).


Linus, do you remember if you left them out on purpose?

Thanks,
René

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread Johannes Sixt
Am 13.11.2012 11:46, schrieb Nguyễn Thái Ngọc Duy:
 @@ -14,11 +14,11 @@ enum {
   P = GIT_PATHSPEC_MAGIC, /* other non-alnum, except for ] and } */
   X = GIT_CNTRL,
   U = GIT_PUNCT,
 - Z = GIT_CNTRL | GIT_SPACE
 + Z = GIT_CNTRL_SPACE
  };
  
 -const unsigned char sane_ctype[256] = {
 - X, X, X, X, X, X, X, X, X, Z, Z, X, X, Z, X, X, /*   0.. 15 */
 +const unsigned int sane_ctype[256] = {
 + X, X, X, X, X, X, X, X, X, Z, Z, Z, Z, Z, X, X, /*   0.. 15 */
   X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, /*  16.. 31 */
   S, P, P, P, R, P, P, P, R, R, G, R, P, P, R, P, /*  32.. 47 */
   D, D, D, D, D, D, D, D, D, D, P, P, P, P, P, G, /*  48.. 63 */
 diff --git a/git-compat-util.h b/git-compat-util.h
 index 02f48f6..4ed3f94 100644
 --- a/git-compat-util.h
 +++ b/git-compat-util.h
 @@ -474,8 +474,8 @@ extern const char tolower_trans_tbl[256];
  #undef ispunct
  #undef isxdigit
  #undef isprint
 -extern const unsigned char sane_ctype[256];
 -#define GIT_SPACE 0x01
 +extern const unsigned int sane_ctype[256];
 +#define GIT_CNTRL_SPACE 0x01
  #define GIT_DIGIT 0x02
  #define GIT_ALPHA 0x04
  #define GIT_GLOB_SPECIAL 0x08
 @@ -483,9 +483,10 @@ extern const unsigned char sane_ctype[256];
  #define GIT_PATHSPEC_MAGIC 0x20
  #define GIT_CNTRL 0x40
  #define GIT_PUNCT 0x80
 -#define sane_istest(x,mask) ((sane_ctype[(unsigned char)(x)]  (mask)) != 0)
 +#define GIT_SPACE 0x100
 +#define sane_istest(x,mask) ((sane_ctype[(unsigned int)(x)]  (mask)) != 0)
  #define isascii(x) (((x)  ~0x7f) == 0)
 -#define isspace(x) sane_istest(x,GIT_SPACE)
 +#define isspace(x) sane_istest(x,GIT_SPACE | GIT_CNTRL_SPACE)
  #define isdigit(x) sane_istest(x,GIT_DIGIT)
  #define isalpha(x) sane_istest(x,GIT_ALPHA)
  #define isalnum(x) sane_istest(x,GIT_ALPHA | GIT_DIGIT)
 @@ -493,7 +494,7 @@ extern const unsigned char sane_ctype[256];
  #define isupper(x) sane_iscase(x, 0)
  #define is_glob_special(x) sane_istest(x,GIT_GLOB_SPECIAL)
  #define is_regex_special(x) sane_istest(x,GIT_GLOB_SPECIAL | 
 GIT_REGEX_SPECIAL)
 -#define iscntrl(x) (sane_istest(x,GIT_CNTRL))
 +#define iscntrl(x) (sane_istest(x,GIT_CNTRL | GIT_CNTRL_SPACE))
  #define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
   GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
  #define isxdigit(x) (hexval_table[x] != -1)

So we have two properties that overlap:

  SS
   

You seem to generate partions:

   XXXYZ

then assign individual bits to each partition. Now each entry in the
lookup table has only one bit set. Then you define isxxx() to check for
one of the two possible bits:

   iscntrl is X or Y
   isspace is Y or Z

But shouldn't you just assign one bit for S and another one for C, have
entries in the lookup table with more than one bit set, and check for
only one bit in the isxxx macro?

That way you don't run out of bits as easily as you do with this patch.

-- Hannes

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread Linus Torvalds
On Tue, Nov 13, 2012 at 11:15 AM, René Scharfe
rene.scha...@lsrfire.ath.cx wrote:

 Linus, do you remember if you left them out on purpose?

Umm, no.

I have to wonder why you care? As far as I'm concerned, the only valid
space is space, TAB and CR/LF.

Anything else is *noise*, not space. What's the reason for even caring?

  Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nd/wildmatch] Correct Git's version of isprint and isspace

2012-11-13 Thread Linus Torvalds
On Tue, Nov 13, 2012 at 11:40 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I have to wonder why you care? As far as I'm concerned, the only valid
 space is space, TAB and CR/LF.

 Anything else is *noise*, not space. What's the reason for even caring?

Btw, expanding the whitespace selection may actually be very
counter-productive. It is used primarily for things like removing
extraneous space at the end of lines etc, and for that, the current
selection of SPACE, TAB and LF/CR is the right thing to do.

Adding things like FF etc - that are *technically* whitespace, but
aren't the normal kind of silent whitespace - is potentially going to
change things too much. People might *want* a form-feed in their
messages, for all we know.

So I really object to changing things just because. There's a reason
we do our own ctype.c: it avoids the crazy crap. It avoids the idiotic
localization issues, and it avoids the ambiguous cases.

So just let it be, unless you have some major real reason to actually
care about a real-world case. And if you do, please explain it. Don't
change things just because.

   Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html