Re: [PATCH] Unicode: update of combining code points
Torsten Bögershausen:

> Some of the code points which have 0 length on the display are called combining, others are called vowels or accents. E.g. 5BF is not marked any of them, but if you look at the glyph, it should be combining (please correct me if that is wrong).

All combining characters have a non-zero combining class in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (fourth field, called Canonical_Combining_Class in http://www.unicode.org/reports/tr44/ ). For instance, the aforementioned U+05BF is defined as follows:

  05BF;HEBREW POINT RAFE;Mn;23;NSM;N;

The combining class is 23, so this is a combining character.

There is a difference between non-spacing combining marks (Mn in the third column (General_Category)) and others (Mc for spacing marks and Me for enclosing marks), so they might need special handling.

Additionally, you have the zero-width characters, such as U+200B Zero Width Space. These have the Cf class, although it also contains visible characters IIRC.

-- 
\\// Peter - http://www.softwolves.pp.se/
--
To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
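[Editor's sketch] The field layout described above can be checked with a few lines of Python. This is only an illustration: `parse_record` is a hypothetical helper, the record is the one quoted in the message, and the cross-check relies on whatever Unicode version Python's `unicodedata` module ships with. Note that some marks (for example U+0488, category Me) carry a combining class of 0, so the General_Category field may be the more dependable signal.

```python
import unicodedata

# The record quoted in the message, in UnicodeData.txt's ';'-separated layout.
RECORD = "05BF;HEBREW POINT RAFE;Mn;23;NSM;N;"


def parse_record(line):
    """Hypothetical helper: return (code point, name, category, combining class)."""
    fields = line.split(";")
    # Field 1: code point (hex), 2: name, 3: General_Category,
    # 4: Canonical_Combining_Class.
    return int(fields[0], 16), fields[1], fields[2], int(fields[3])


code, name, category, ccc = parse_record(RECORD)
print(f"U+{code:04X} {name}: category={category}, combining class={ccc}")

# Cross-check against the interpreter's bundled copy of the Unicode database.
ch = chr(code)
print(unicodedata.category(ch), unicodedata.combining(ch))

# An enclosing mark with combining class 0, and a Cf format character.
print(unicodedata.category("\u0488"), unicodedata.combining("\u0488"))
print(unicodedata.category("\u200b"))
```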
Re: [PATCH] Unicode: update of combining code points
On 16/04/2014 22:58, Torsten Bögershausen wrote:

> Excellent, thanks for the pointers. Running the script below shows that 0X00AD SOFT HYPHEN should have zero length (and some others too). I wonder if that is really the case, and which one of the last 2 lines in the script is the right one. What does this mean for us:
>
>   Cf  Format  a format control character

Maybe dig back through the Git logs to check the original logic, but the comments suggest that Cf characters have been viewed as zero-width. That makes sense - they're usually markers indicating things like bidirectional text flow, so won't be taking space. (Although they may be causing even more extreme layout effects...)

Soft-hyphen is noted as an explicit exception to the rule in the utf8.c comments. As of Unicode 4.0, it's supposed to be a character indicating a point where a hyphen could be placed if a line-wrap occurs, and if that wrap happens, then it can actually take up 1 space, otherwise not. So its width could be either 0 or 1, depending. Or, quite likely, the terminal doesn't treat it specially, and it always just looks like a hyphen... Thus we err on the safe side and give it width 1. See http://en.wikipedia.org/wiki/Soft_hyphen for background.

The comments suggest adding "-00AD +1160-11FF" to the uniset command line for that tweak and for composing Hangul.

(The "+200B" tweak isn't necessary any more - Zero-Width Space U+200B became Cf officially in Unicode 4.0.1:

  http://en.wikipedia.org/wiki/Zero-width_space
  http://www.unicode.org/review/resolved-pri.html#pri21 )

All of this is only really an approximation - a best-effort attempt to figure out the width of a string without any actual communication with the display device. So it'll never be perfect. The choice between double and single width in particular will often be unpredictable, unless you have deeper locale knowledge.

Actually, while doing this, I've realised that this was originally Markus Kuhn's implementation, and that is acknowledged at the top of the file:

  http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Good, because he knows what he's doing.

Kevin
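[Editor's sketch] The rules discussed in this message (Mn, Me and Cf count as zero-width, soft hyphen is the exception with width 1, U+1160..U+11FF is zero-width for composing Hangul, East Asian Width W and F count double) can be written down roughly as follows. This is not git_wcwidth() and not Markus Kuhn's code; `approx_width` and `approx_string_width` are hypothetical names, control characters are not treated specially, and the category data comes from Python's `unicodedata` rather than from a generated table.

```python
import unicodedata

# General categories treated as taking no column, per the discussion above.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def approx_width(ch):
    """Hypothetical helper: best-effort column width of a single character."""
    cp = ord(ch)
    if cp == 0x00AD:
        # Soft hyphen is Cf, but is given width 1 to err on the safe side.
        return 1
    if 0x1160 <= cp <= 0x11FF:
        # Hangul Jamo that compose with a preceding leading consonant.
        return 0
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        # Wide and Fullwidth characters occupy two columns.
        return 2
    return 1


def approx_string_width(text):
    """Sum of the per-character approximations; no terminal is consulted."""
    return sum(approx_width(ch) for ch in text)


for sample in ["o\u0308", "\u00ad", "\u200b", "\u4e2d"]:
    print([f"U+{ord(c):04X}" for c in sample], approx_string_width(sample))
```

As the message says, this can only ever be an approximation, since the display device is never asked.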
Re: [PATCH] Unicode: update of combining code points
On 16/04/2014 07:48, Torsten Bögershausen wrote:

> On 15.04.14 21:10, Peter Krefting wrote:
>
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>>
>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>
> Some of the code points which have 0 length on the display are called combining, others are called vowels or accents. E.g. 5BF is not marked any of them, but if you look at the glyph, it should be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category Nonspacing_Mark = Mn).

> If I could have found a file which indicates for each code point, what it is, I could write a script.

The most complete and machine-readable data are in these files:

  http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
  http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

  http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

  http://www.unicode.org/reports/tr44/
  http://www.unicode.org/reports/tr11/
  http://www.unicode.org/ucd/

The existing utf8.c comments describe the attributes being selected from the tables (general categories Cf, Mn, Me, East Asian Width W, F). And they suggest that the combining character table was originally auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?

  https://github.com/depp/uniset

The fullwidth-checking code looks like it was done by hand, although apparently uniset can process EastAsianWidth.txt.

Kevin
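[Editor's sketch] To illustrate how a table of zero-width ranges could be derived from UnicodeData.txt, here is a small Python sketch. It is not the uniset tool and makes no claim about uniset's output format. `zero_width_ranges` is a hypothetical helper, and `SAMPLE` holds a handful of records truncated to their first five fields so the example runs without downloading anything.

```python
# A few UnicodeData.txt records, truncated to their first five fields
# (code point; name; General_Category; combining class; bidi class).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L
0300;COMBINING GRAVE ACCENT;Mn;230;NSM
0301;COMBINING ACUTE ACCENT;Mn;230;NSM
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM
0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me;0;NSM
200B;ZERO WIDTH SPACE;Cf;0;BN
"""


def zero_width_ranges(lines, categories=("Mn", "Me", "Cf")):
    """Hypothetical helper: merge selected code points into (first, last) ranges."""
    points = sorted(
        int(fields[0], 16)
        for fields in (line.split(";") for line in lines if line.strip())
        if fields[2] in categories
    )
    ranges = []
    for cp in points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run of adjacent code points
        else:
            ranges.append([cp, cp])  # start a new run
    return [(first, last) for first, last in ranges]


# Print the result in a form resembling a C interval table.
for first, last in zero_width_ranges(SAMPLE.splitlines()):
    print(f"{{ 0x{first:04X}, 0x{last:04X} }},")
```

Run against the full UnicodeData.txt, the same loop would cover every assigned code point listed individually in that file; the `<..., First>`/`<..., Last>` range records there would need separate handling if their categories were ever selected.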
Re: [PATCH] Unicode: update of combining code points
On 2014-04-16 12.51, Kevin Bracey wrote:

> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>
>> On 15.04.14 21:10, Peter Krefting wrote:
>>
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>>
>>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>
>> Some of the code points which have 0 length on the display are called combining, others are called vowels or accents. E.g. 5BF is not marked any of them, but if you look at the glyph, it should be combining (please correct me if that is wrong).
>
> Indeed it is combining (more specifically it has General Category Nonspacing_Mark = Mn).
>
>> If I could have found a file which indicates for each code point, what it is, I could write a script.
>
> The most complete and machine-readable data are in these files:
>
>   http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
>   http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
>
> The general categories can also be seen more legibly in:
>
>   http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
>
> For docs, see:
>
>   http://www.unicode.org/reports/tr44/
>   http://www.unicode.org/reports/tr11/
>   http://www.unicode.org/ucd/
>
> The existing utf8.c comments describe the attributes being selected from the tables (general categories Cf, Mn, Me, East Asian Width W, F). And they suggest that the combining character table was originally auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?
>
>   https://github.com/depp/uniset
>
> The fullwidth-checking code looks like it was done by hand, although apparently uniset can process EastAsianWidth.txt.
>
> Kevin

Excellent, thanks for the pointers. Running the script below shows that 0X00AD SOFT HYPHEN should have zero length (and some others too). I wonder if that is really the case, and which one of the last 2 lines in the script is the right one. What does this mean for us:

  Cf  Format  a format control character

#!/bin/sh
# Download the Unicode data files unless they are already present.
if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi
# Fetch the uniset sources unless they are already present.
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi
# Configure (first time only) and build uniset in a subshell.
(
  cd uniset || exit 1
  if ! test -x uniset; then
    autoreconf -i
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi
  make
)
# Print the ranges for the selected general categories.
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn
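[Editor's sketch] Once a sorted list of ranges has been generated, a width function only needs to ask whether a code point falls inside one of them, which a binary search answers quickly. The Python sketch below shows the idea; `INTERVALS` is a short illustrative excerpt chosen for this example (not the output of the script above), and `in_table` is a hypothetical name.

```python
from bisect import bisect_right

# Illustrative excerpt of a sorted, non-overlapping (first, last) table.
INTERVALS = [
    (0x0300, 0x036F),
    (0x0483, 0x0489),
    (0x0591, 0x05BD),
    (0x05BF, 0x05BF),
]
# Range starts, kept separately so bisect can search them directly.
STARTS = [first for first, _ in INTERVALS]


def in_table(cp):
    """Hypothetical helper: True if cp lies inside one of the intervals."""
    # Index of the last interval whose start is <= cp, or -1 if none.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= INTERVALS[i][1]


for cp in (0x0041, 0x0308, 0x05BE, 0x05BF):
    print(f"U+{cp:04X}", in_table(cp))
```

The table must stay sorted and free of overlaps for the search to be valid, which is one reason generating it with a script is attractive compared with editing it by hand.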
Re: [PATCH] Unicode: update of combining code points
Torsten Bögershausen:

> diff --git a/utf8.c b/utf8.c
> index a831d50..77c28d4 100644
> --- a/utf8.c
> +++ b/utf8.c

Is there a script that generates this code from the Unicode database files, or did you hand-update it?

-- 
\\// Peter - http://www.softwolves.pp.se/
Re: [PATCH] Unicode: update of combining code points
On 15.04.14 21:10, Peter Krefting wrote:

> Torsten Bögershausen:
>
>> diff --git a/utf8.c b/utf8.c
>> index a831d50..77c28d4 100644
>> --- a/utf8.c
>> +++ b/utf8.c
>
> Is there a script that generates this code from the Unicode database files, or did you hand-update it?

Some of the code points which have 0 length on the display are called combining, others are called vowels or accents. E.g. 5BF is not marked any of them, but if you look at the glyph, it should be combining (please correct me if that is wrong).

If I could have found a file which indicates for each code point, what it is, I could write a script.

So yes, it is updated by hand.
Re: [PATCH] Unicode: update of combining code points
On 04/09/2014 12:37 AM, Junio C Hamano wrote:

> Jonathan Nieder <jrnie...@gmail.com> writes:
>
>> Torsten Bögershausen wrote:
>>
>>> Unicode 6.3 defines the following code as combining or accents, git_wcwidth() should return 0. Earlier unicode standards had defined these code point as reserved:
>>
>> Thanks for the update. Could the commit message also explain how this was noticed and what the user-visible effect is? For example:
>>
>>   Unicode just announced that
>>
>>   That means we should mark the relevant code points as combining characters so git knows they are zero-width and doesn't screw up the alignment when presenting branch names in columns with 'git branch --column'
>>
>> or something like that.
>
> Perhaps (the original read clearly enough for me, though).
>
>> [...]
>>> 358 COMBINING DOT ABOVE RIGHT
>>> 359 COMBINING ASTERISK BELOW
>>
>> I'm not sure this list is needed --- the code + the reference to the Unicode 6.3 standard seems like enough (but if you think otherwise, I don't really mind).
>
> I can go either way.
>
>>> This commit touches only the range 300-6FF, there may be more to be updated.
>>
>> The "there may be more" here sounds ominous.
>
> Indeed it does ;-)
>
>> Does that mean Unicode 6.3 also added some zero-width characters in other ranges that should be dealt with in the future? How many such ranges? How do we know when we're done?
>>
>> Just biting off the most important characters first and putting off the rest for later sounds fine to me --- my complaint is that the above comment doesn't make clear what the to-do list is for finishing the update later.
>
> I'll queue this at the tip of 'pu', not to forget about it while waiting for a clarification. Thanks.

Thanks for the comments, here comes the long version of the story:

I recently fooled myself by running git config --global user.name with a decomposed ö on a new Mac OS X machine. While there were few problems on Mac OS, all Windows and Linux machines stumbled over the decomposed ö, to be more exact over 0x308, COMBINING DIAERESIS (the 2 dots), giving all kinds of weird output in git log.

Looking into commit.c and utf8.c to see how to improve the situation, I made these observations:

- Some code from commit.c can possibly be moved into utf8.c, so that we only have 1 utf8 code parser.
- A solution would be to run precompose_string() under Mac OS (which is a nop otherwise). This could have saved my day. Probably I will make a patch some day.
- Some of the combining code points exist in Unicode 6.3, but not in utf8.c (which seems to be based on Unicode 2.0 6.3).

I found some in the 0x300 area, and looked at the neighbors, and had enough time to read all code pages up to 0x7FF.

So if somebody knows how to find out which code points are combining, accents, ..., or in other words should return 0 in git_wcwidth(), please let me know.

How about this as a commit message:

  Unicode: partially update to version 6.3

  Unicode 6.3 defines the following code points as combining or accents,
  git_wcwidth() should return 0.
  Earlier unicode standards had defined these code points as reserved:
  358--35C
  487
  5A2, 5BA, 5C5, 5C7
  604, 616--61A, 659--65F

  Note: for this commit only the range 0..7FF has been checked,
  more updates may be needed.

  Signed-off-by: Torsten Bögershausen <tbo...@web.de>
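[Editor's sketch] The difference between the precomposed and the decomposed ö described above can be reproduced with Python's `unicodedata.normalize`. This only illustrates the two Unicode forms; it is not git's precompose_string(), and NFC is used here merely as a stand-in for "precomposing".

```python
import unicodedata

composed = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS, one code point

# NFD splits it into 'o' followed by U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", composed)

# NFC puts the pair back together into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

for label, text in [("composed", composed),
                    ("decomposed", decomposed),
                    ("recomposed", recomposed)]:
    print(label, [f"U+{ord(c):04X}" for c in text])

# Both forms look the same on screen, but they compare unequal as strings
# and a naive "one code point = one column" count differs between them.
print(composed == decomposed, len(composed), len(decomposed))
```

A width function that returns 0 for U+0308 makes the decomposed form occupy one column, the same as the composed one.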
Re: [PATCH] Unicode: update of combining code points
Torsten Bögershausen <tbo...@web.de> writes:

> How about this as a commit message:
>
>   Unicode: partially update to version 6.3
>
>   Unicode 6.3 defines the following code points as combining or accents,
>   git_wcwidth() should return 0.
>   Earlier unicode standards had defined these code points as reserved:
>   358--35C
>   487
>   5A2, 5BA, 5C5, 5C7
>   604, 616--61A, 659--65F
>
>   Note: for this commit only the range 0..7FF has been checked,
>   more updates may be needed.
>
>   Signed-off-by: Torsten Bögershausen <tbo...@web.de>

Thanks.

I do not think you meant to say that the listed codepoints above are the only ones that were reserved. Rather, the codepoints listed are what are affected by this change, and these were all reserved. Also it may help to describe the end-user visible effect, like Jonathan asked in his earlier message.

How about extending it like this?

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

utf8.c: partially update to version 6.3

Unicode 6.3 defines more code points as combining or accents. For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above). We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them. Affected codepoints are:

    U+0358..U+035C
    U+0487
    U+05A2, U+05BA, U+05C5, U+05C7
    U+0604, U+0616..U+061A, U+0659..U+065F

Earlier unicode standards had defined these as "reserved".

Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed.

Signed-off-by: Torsten Bögershausen <tbo...@web.de>
Signed-off-by: Junio C Hamano <gits...@pobox.com>

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
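[Editor's sketch] The list of affected codepoints in the proposed commit message can be sanity-checked against a Unicode database. The Python sketch below uses the interpreter's bundled `unicodedata` (whose Unicode version depends on the Python release, so this is a cross-check rather than a statement about Unicode 6.3 specifically). `AFFECTED` transcribes the ranges from the message; `categories_of` is a hypothetical helper.

```python
import unicodedata

# (first, last) ranges transcribed from the proposed commit message.
AFFECTED = [
    (0x0358, 0x035C),
    (0x0487, 0x0487),
    (0x05A2, 0x05A2), (0x05BA, 0x05BA), (0x05C5, 0x05C5), (0x05C7, 0x05C7),
    (0x0604, 0x0604), (0x0616, 0x061A), (0x0659, 0x065F),
]


def categories_of(ranges):
    """Hypothetical helper: map each code point in the ranges to its category."""
    return {
        cp: unicodedata.category(chr(cp))
        for first, last in ranges
        for cp in range(first, last + 1)
    }


for cp, category in categories_of(AFFECTED).items():
    # unicodedata.name() may raise for unassigned code points, hence the default.
    print(f"U+{cp:04X} {category} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Every listed code point should come out as Mn, Me or Cf, the categories the utf8.c comments select for width 0.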
Re: [PATCH] Unicode: update of combining code points
Excellent, thanks.
Re: [PATCH] Unicode: update of combining code points
Jonathan Nieder <jrnie...@gmail.com> writes:

> Torsten Bögershausen wrote:
>
>> Unicode 6.3 defines the following code as combining or accents, git_wcwidth() should return 0. Earlier unicode standards had defined these code point as reserved:
>
> Thanks for the update. Could the commit message also explain how this was noticed and what the user-visible effect is? For example:
>
>   Unicode just announced that
>
>   That means we should mark the relevant code points as combining characters so git knows they are zero-width and doesn't screw up the alignment when presenting branch names in columns with 'git branch --column'
>
> or something like that.

Perhaps (the original read clearly enough for me, though).

> [...]
>> 358 COMBINING DOT ABOVE RIGHT
>> 359 COMBINING ASTERISK BELOW
>
> I'm not sure this list is needed --- the code + the reference to the Unicode 6.3 standard seems like enough (but if you think otherwise, I don't really mind).

I can go either way.

>> This commit touches only the range 300-6FF, there may be more to be updated.
>
> The "there may be more" here sounds ominous.

Indeed it does ;-)

> Does that mean Unicode 6.3 also added some zero-width characters in other ranges that should be dealt with in the future? How many such ranges? How do we know when we're done?
>
> Just biting off the most important characters first and putting off the rest for later sounds fine to me --- my complaint is that the above comment doesn't make clear what the to-do list is for finishing the update later.

I'll queue this at the tip of 'pu', not to forget about it while waiting for a clarification.

Thanks.
Re: [PATCH] Unicode: update of combining code points
Hi,

Torsten Bögershausen wrote:

> Unicode 6.3 defines the following code as combining or accents, git_wcwidth() should return 0. Earlier unicode standards had defined these code point as reserved:

Thanks for the update. Could the commit message also explain how this was noticed and what the user-visible effect is? For example:

  Unicode just announced that

  That means we should mark the relevant code points as combining characters so git knows they are zero-width and doesn't screw up the alignment when presenting branch names in columns with 'git branch --column'

or something like that.

[...]
> 358 COMBINING DOT ABOVE RIGHT
> 359 COMBINING ASTERISK BELOW

I'm not sure this list is needed --- the code + the reference to the Unicode 6.3 standard seems like enough (but if you think otherwise, I don't really mind).

> This commit touches only the range 300-6FF, there may be more to be updated.

The "there may be more" here sounds ominous. Does that mean Unicode 6.3 also added some zero-width characters in other ranges that should be dealt with in the future? How many such ranges? How do we know when we're done?

Just biting off the most important characters first and putting off the rest for later sounds fine to me --- my complaint is that the above comment doesn't make clear what the to-do list is for finishing the update later.

Thanks and hope that helps,
Jonathan