Re: [PATCH] Unicode: update of combining code points

2014-04-24 Thread Peter Krefting

Torsten Bögershausen:

> Some of the code points which have 0 length on the display are called
> combining, others are called vowels or accents.
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).


All combining characters have a non-zero combining class in
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (fourth field,
called Canonical_Combining_Class in
http://www.unicode.org/reports/tr44/ ). For instance, the aforementioned
U+05BF is defined as follows:


  05BF;HEBREW POINT RAFE;Mn;23;NSM;N;

The combining class is 23, so this is a combining character.
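
As a minimal sketch (not part of the original message), a record like the one quoted above can be split on semicolons to read the two fields in question. The function name parse_record is made up for illustration, and the sample line is the one quoted in this message.

```python
def parse_record(line):
    """Split one UnicodeData.txt-style record into the fields discussed here."""
    fields = line.strip().split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],      # third field
        "combining_class": int(fields[3]),  # fourth field
    }


record = parse_record("05BF;HEBREW POINT RAFE;Mn;23;NSM;N;")
print(record)
```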

There is a difference between non-spacing combining marks (Mn in the 
third column (General_Category)) and others (Mc for spacing marks 
and Me for enclosing marks), so they might need special handling.
Additionally, you have the zero-width characters, such as U+200B 
Zero Width Space. These have the Cf class, although it also contains 
visible characters IIRC.
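
For illustration only, the categories mentioned above can be inspected with Python's standard unicodedata module instead of reading UnicodeData.txt directly. The sample code points are my own choice, and the results depend on the Unicode version bundled with the interpreter. In this sample the spacing and enclosing marks report class 0, so the General_Category is what identifies them.

```python
import unicodedata

# Hand-picked sample code points; the comments give the standard names.
samples = [
    0x05BF,  # HEBREW POINT RAFE
    0x0903,  # DEVANAGARI SIGN VISARGA
    0x20DD,  # COMBINING ENCLOSING CIRCLE
    0x200B,  # ZERO WIDTH SPACE
    0x00AD,  # SOFT HYPHEN
]


def describe(cp):
    """Return (General_Category, Canonical_Combining_Class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)


for cp in samples:
    category, ccc = describe(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: category={category}, class={ccc}")
```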


--
\\// Peter - http://www.softwolves.pp.se/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Unicode: update of combining code points

2014-04-17 Thread Kevin Bracey

On 16/04/2014 22:58, Torsten Bögershausen wrote:

> Excellent, thanks for the pointers.
> Running the script below shows that
> 0X00AD SOFT HYPHEN should have zero length (and some others too).
> I wonder if that is really the case, and which one of the last 2 lines
> in the script is the right one.
>
> What does this mean for us:
> Cf  Format  a format control character

Maybe dig back through the Git logs to check the original logic, but the 
comments suggest that Cf characters have been viewed as zero-width. 
That makes sense - they're usually markers indicating things like 
bidirectional text flow, so won't be taking space. (Although they may be 
causing even more extreme layout effects...)


Soft-hyphen is noted as an explicit exception to the rule in the utf8.c 
comments. As of Unicode 4.0, it's supposed to be a character indicating 
a point where a hyphen could be placed if a line-wrap occurs, and if 
that wrap happens, then it can actually take up 1 space, otherwise not. 
So its width could be either 0 or 1, depending. Or, quite likely, the 
terminal doesn't treat it specially, and it always just looks like a 
hyphen... Thus we err on the safe side and give it width 1.


See http://en.wikipedia.org/wiki/Soft_hyphen for background.

The comments suggest adding -00AD +1160-11FF to the uniset command 
line for that tweak and for composing Hangul. (The +200B tweak isn't 
necessary any more - Zero-Width Space U+200B became Cf officially in 
Unicode 4.0.1:


http://en.wikipedia.org/wiki/Zero-width_space
http://www.unicode.org/review/resolved-pri.html#pri21
)
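
A rough sketch of what that selection amounts to, written with Python's unicodedata module rather than uniset (so this is not uniset's actual output, and the result depends on the Unicode version shipped with the interpreter). The function name zero_width_ranges is hypothetical.

```python
import unicodedata


def zero_width_ranges(max_cp=0x10FFFF):
    """Return sorted (first, last) ranges of code points treated as zero-width."""

    def is_zero_width(cp):
        if cp == 0x00AD:            # -00AD: soft hyphen keeps width 1
            return False
        if 0x1160 <= cp <= 0x11FF:  # +1160-11FF: composing Hangul jamo
            return True
        return unicodedata.category(chr(cp)) in ("Me", "Mn", "Cf")

    ranges = []
    for cp in range(max_cp + 1):
        if not is_zero_width(cp):
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges


# Print the ranges up to U+07FF in a C-initializer-like layout.
for first, last in zero_width_ranges(0x07FF):
    print(f"{{ 0x{first:04X}, 0x{last:04X} }},")
```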

All of this is only really an approximation - a best-effort attempt to 
figure out the width of a string without any actual communication with 
the display device. So it'll never be perfect. The choice between double
and single width in particular will often be unpredictable, unless you
have deeper locale knowledge.
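
To make the approximation concrete, here is a small sketch of a table-driven width lookup using binary search over sorted intervals. The two tables are tiny hand-picked subsets chosen for illustration, not the tables in utf8.c, and control characters are not handled.

```python
from bisect import bisect_right

# Illustrative subsets only; a real table covers far more ranges.
ZERO_WIDTH = [
    (0x0300, 0x036F),  # combining diacritical marks
    (0x0483, 0x0489),  # combining Cyrillic marks
    (0x0591, 0x05BD),  # Hebrew accents and points
    (0x05BF, 0x05BF),  # HEBREW POINT RAFE
    (0x200B, 0x200F),  # zero width space/joiners, direction marks
]
WIDE = [
    (0x1100, 0x115F),  # Hangul leading consonants
    (0x4E00, 0x9FFF),  # CJK unified ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
    (0xFF01, 0xFF60),  # fullwidth forms
]


def in_table(cp, table):
    """Binary-search a sorted list of (first, last) intervals."""
    i = bisect_right(table, (cp, 0x10FFFF)) - 1
    return i >= 0 and table[i][0] <= cp <= table[i][1]


def char_width(cp):
    """Return 0, 1 or 2 columns for a single code point."""
    if in_table(cp, ZERO_WIDTH):
        return 0
    if in_table(cp, WIDE):
        return 2
    return 1


def string_width(text):
    """Sum the per-character widths of a string."""
    return sum(char_width(ord(ch)) for ch in text)


print(string_width("o\u0308"), string_width("\u65e5\u672c"))
```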


Actually, while doing this, I've realised that this was originally 
Markus Kuhn's implementation, and that is acknowledged at the top of the 
file:


http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Good, because he knows what he's doing.

Kevin






Re: [PATCH] Unicode: update of combining code points

2014-04-16 Thread Kevin Bracey

On 16/04/2014 07:48, Torsten Bögershausen wrote:
> On 15.04.14 21:10, Peter Krefting wrote:
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>>
>> Is there a script that generates this code from the Unicode database files, or
>> did you hand-update it?
>
> Some of the code points which have 0 length on the display are called
> combining, others are called vowels or accents.
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category
Nonspacing_Mark = Mn).

> If I could have found a file which indicates for each code point, what it
> is, I could write a script.



The most complete and machine-readable data are in these files:

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/

The existing utf8.c comments describe the attributes being selected from
the tables (general categories Cf, Mn, Me; East Asian Width W, F).
And they suggest that the combining character table was originally
auto-generated from UnicodeData.txt with a uniset tool. Presumably this?


https://github.com/depp/uniset

The fullwidth-checking code looks like it was done by hand, although 
apparently uniset can process EastAsianWidth.txt.
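
As a hedged sketch of how the fullwidth ranges could be derived by script, the following parses lines in the EastAsianWidth.txt layout (code point or range, a semicolon, the width class, then a comment). The sample lines are written in that layout for illustration, and the function name parse_east_asian_width is made up.

```python
def parse_east_asian_width(lines):
    """Yield (first, last, width_class) from EastAsianWidth.txt-style lines."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code_range, width_class = (part.strip() for part in data.split(";"))
        first, _, last = code_range.partition("..")
        yield int(first, 16), int(last or first, 16), width_class


# Sample lines in the file's layout (illustrative, not the complete file).
SAMPLE = [
    "# comment line",
    "0020;Na          # Zs         SPACE",
    "1100..115F;W     # Lo    [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
    "FF01..FF03;F     # Po     [3] FULLWIDTH EXCLAMATION MARK..FULLWIDTH NUMBER SIGN",
]

# Keep only the Wide (W) and Fullwidth (F) ranges.
wide = [(a, b) for a, b, w in parse_east_asian_width(SAMPLE) if w in ("W", "F")]
print(wide)
```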


Kevin



Re: [PATCH] Unicode: update of combining code points

2014-04-16 Thread Torsten Bögershausen
On 2014-04-16 12.51, Kevin Bracey wrote:
 On 16/04/2014 07:48, Torsten Bögershausen wrote:
 On 15.04.14 21:10, Peter Krefting wrote:
 Torsten Bögershausen:

 diff --git a/utf8.c b/utf8.c
 index a831d50..77c28d4 100644
 --- a/utf8.c
 +++ b/utf8.c
 Is there a script that generates this code from the Unicode database files, 
 or did you hand-update it?

 Some of the code points which have 0 length on the display are called
 combining, others are called vowels or accents.
 E.g. 5BF is not marked any of them, but if you look at the glyph, it should
 be combining (please correct me if that is wrong).
 
 Indeed it is combining (more specifically it has General Category 
 Nonspacing_Mark = Mn).
 

 If I could have found a file which indicates for each code point, what it
 is, I could write a script.

 
 The most complete and machine-readable data are in these files:
 
 http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
 http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
 
 The general categories can also be seen more legibly in:
 
 http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
 
 For docs, see:
 
 http://www.unicode.org/reports/tr44/
 http://www.unicode.org/reports/tr11/
 http://www.unicode.org/ucd/
 
 The existing utf8.c comments describe the attributes being selected from the 
 tables (general categories Cf,Mn,Me, East Asian Width W, F). And 
 they suggest that the combining character table was originally auto-generated 
 from UnicodeData.txt with a uniset tool. Presumably this?
 
 https://github.com/depp/uniset
 
 The fullwidth-checking code looks like it was done by hand, although 
 apparently uniset can process EastAsianWidth.txt.
 
 Kevin
Excellent, thanks for the pointers.
Running the script below shows that 
0X00AD SOFT HYPHEN should have zero length (and some others too).
I wonder if that is really the case, and which one of the last 2 lines 
in the script is the right one.

What does this mean for us:
Cf Format  a format control character


#!/bin/sh

# Fetch the Unicode data files if they are not already present
if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi
# Clone and build uniset on first use
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi
(
  cd uniset
  if ! test -x uniset; then
    autoreconf -i
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi
  make
)
# Candidate zero-width sets: with Cf (active) or without Cf (commented out)
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn



Re: [PATCH] Unicode: update of combining code points

2014-04-15 Thread Peter Krefting

Torsten Bögershausen:


diff --git a/utf8.c b/utf8.c
index a831d50..77c28d4 100644
--- a/utf8.c
+++ b/utf8.c


Is there a script that generates this code from the Unicode database 
files, or did you hand-update it?


--
\\// Peter - http://www.softwolves.pp.se/


Re: [PATCH] Unicode: update of combining code points

2014-04-15 Thread Torsten Bögershausen
On 15.04.14 21:10, Peter Krefting wrote:
 Torsten Bögershausen:
 
 diff --git a/utf8.c b/utf8.c
 index a831d50..77c28d4 100644
 --- a/utf8.c
 +++ b/utf8.c
 
 Is there a script that generates this code from the Unicode database files, 
 or did you hand-update it?
 
Some of the code points which have 0 length on the display are called
combining, others are called vowels or accents.
E.g. 5BF is not marked any of them, but if you look at the glyph, it should
be combining (please correct me if that is wrong).

If I could have found a file which indicates for each code point, what it
is, I could write a script.

So yes, it is updated by hand.





Re: [PATCH] Unicode: update of combining code points

2014-04-09 Thread Torsten Bögershausen
On 04/09/2014 12:37 AM, Junio C Hamano wrote:
> Jonathan Nieder jrnie...@gmail.com writes:
>
>> Torsten Bögershausen wrote:
>>
>>> Unicode 6.3 defines the following code as combining or accents,
>>> git_wcwidth() should return 0.
>>>
>>> Earlier unicode standards had defined these code point as reserved:
>>
>> Thanks for the update.  Could the commit message also explain how this
>> was noticed and what the user-visible effect is?
>>
>> For example:
>>
>>     Unicode just announced that   That means we should mark the
>>     relevant code points as combining characters so git knows they are
>>     zero-width and doesn't screw up the alignment when presenting branch
>>     names in columns with 'git branch --column'
>>
>> or something like that.
>
> Perhaps (the original read clearly enough for me, though).
>
>> [...]
>>> 358 COMBINING DOT ABOVE RIGHT
>>> 359 COMBINING ASTERISK BELOW
>>
>> I'm not sure this list is needed --- the code + the reference to the
>> Unicode 6.3 standard seems like enough (but if you think otherwise,
>> I don't really mind).
>
> I can go either way.
>
>>> This commit touches only the range 300-6FF, there may be more to be updated.
>>
>> The "there may be more" here sounds ominous.
>
> Indeed it does ;-)
>
>> Does that mean Unicode
>> 6.3 also added some zero-width characters in other ranges that should
>> be dealt with in the future?  How many such ranges?  How do we know
>> when we're done?
>>
>> Just biting off the most important characters first and putting off
>> the rest for later sounds fine to me --- my complaint is that the
>> above comment doesn't make clear what the to-do list is for finishing
>> the update later.
>
> I'll queue this at the tip of 'pu', not to forget about it while
> waiting for a clarification.
>
> Thanks.

Thanks for the comments, here comes the long version of the story:
I recently fooled myself by running
git config --global user.name with a decomposed ö on a new Mac OS X machine.

While there were few problems on Mac OS, all Windows and Linux machines
stumbled over the decomposed ö, or to be more exact over 0x308, COMBINING
DIAERESIS (the 2 dots), giving all kinds of weird output in git log.

Looking into commit.c and utf8.c for how to improve the situation, I made
these observations:
- Some code from commit.c can possibly be moved into utf8.c, so that we only
  have 1 utf8 code parser.
- A solution would be to run precompose_string() under Mac OS (which is a nop
  otherwise).
  This could have saved my day. Probably I will make a patch some day.
- Some of the combining code points exist in Unicode 6.3, but not in utf8.c
  (which seems to be based on Unicode 2.0 6.3)
  I found some in the 0x300 area, looked at the neighbors, and had enough
  time to read all code pages up to 0x7FF.

So if somebody knows how to find out which code points are combining
characters or accents, or in other words should return 0 in git_wcwidth(),
please let me know.
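
To illustrate the decomposed versus precomposed forms described above, here is a small sketch using Python's unicodedata.normalize. It only demonstrates the idea of precomposition (NFC); it is not git's precompose code.

```python
import unicodedata

decomposed = "o\u0308"  # LATIN SMALL LETTER O + COMBINING DIAERESIS
precomposed = unicodedata.normalize("NFC", decomposed)

# Two code points collapse into one after NFC normalization.
print(len(decomposed), len(precomposed))
print([f"U+{ord(ch):04X}" for ch in decomposed])
print([f"U+{ord(ch):04X}" for ch in precomposed])
```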

How about this as a commit message:

Unicode: partially update to version 6.3

Unicode 6.3 defines the following code points as combining or accents,
git_wcwidth() should return 0.

Earlier unicode standards had defined these code points as reserved:
358--35C
487
5A2, 5BA, 5C5, 5C7
604, 616--61A, 659--65F

Note: for this commit only the range 0..7FF has been checked,
more updates may be needed.

Signed-off-by: Torsten Bögershausen tbo...@web.de




Re: [PATCH] Unicode: update of combining code points

2014-04-09 Thread Junio C Hamano
Torsten Bögershausen tbo...@web.de writes:

 How about this as a commit message:

 Unicode: partially update to version 6.3

 Unicode 6.3 defines the following code points as combining or accents,
 git_wcwidth() should return 0.

 Earlier unicode standards had defined these code points as reserved:
 358--35C
 487
 5A2, 5BA, 5C5, 5C7
 604, 616--61A, 659--65F

 Note: for this commit only the range 0..7FF has been checked,
 more updates may be needed.

 Signed-off-by: Torsten Bögershausen tbo...@web.de

Thanks.

I do not think you meant to say that the codepoints listed above are
the only ones that were reserved.  Rather, the codepoints listed
are the ones affected by this change, and these were all reserved.

Also, it may help to describe the end-user visible effect, as Jonathan
asked in his earlier message.  How about extending it like this?

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
utf8.c: partially update to version 6.3

Unicode 6.3 defines more code points as combining or accents.  For
example, the character ö could be expressed as an o followed by
U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above).  We should
consider that such a sequence of two codepoints occupies one display
column for the alignment purposes, and for that, git_wcwidth()
should return 0 for them.  Affected codepoints are:

U+0358..U+035C
U+0487
U+05A2, U+05BA, U+05C5, U+05C7
U+0604, U+0616..U+061A, U+0659..U+065F

Earlier unicode standards had defined these as reserved.

Only the range 0..U+07FF has been checked to see which codepoints
need to be marked as 0-width while preparing for this commit; more
updates may be needed.

Signed-off-by: Torsten Bögershausen tbo...@web.de
Signed-off-by: Junio C Hamano gits...@pobox.com
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
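
As a quick cross-check of the list in the proposed message (a sketch, not part of the patch), the affected code points can be looked up with Python's unicodedata module. The result depends on the Unicode version bundled with the interpreter, which needs to be recent enough to know these characters.

```python
import unicodedata

# Code points listed as affected in the proposed commit message.
AFFECTED = (
    list(range(0x0358, 0x035C + 1))
    + [0x0487]
    + [0x05A2, 0x05BA, 0x05C5, 0x05C7]
    + [0x0604]
    + list(range(0x0616, 0x061A + 1))
    + list(range(0x0659, 0x065F + 1))
)


def categories(code_points):
    """Map each code point to its General_Category."""
    return {cp: unicodedata.category(chr(cp)) for cp in code_points}


for cp, cat in categories(AFFECTED).items():
    print(f"U+{cp:04X} {cat}")
```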


Re: [PATCH] Unicode: update of combining code points

2014-04-09 Thread Torsten Bögershausen

Excellent, thanks.


Re: [PATCH] Unicode: update of combining code points

2014-04-08 Thread Junio C Hamano
Jonathan Nieder jrnie...@gmail.com writes:

 Torsten Bögershausen wrote:

 Unicode 6.3 defines the following code as combining or accents,
 git_wcwidth() should return 0.

 Earlier unicode standards had defined these code point as reserved:

 Thanks for the update.  Could the commit message also explain how this
 was noticed and what the user-visible effect is?

 For example:

  Unicode just announced that   That means we should mark the
   relevant code points as combining characters so git knows they are
   zero-width and doesn't screw up the alignment when presenting branch
   names in columns with 'git branch --column'

 or something like that.

Perhaps (the original read clearly enough for me, though).

 [...]
 358 COMBINING DOT ABOVE RIGHT
 359 COMBINING ASTERISK BELOW

 I'm not sure this list is needed --- the code + the reference to the
 Unicode 6.3 standard seems like enough (but if you think otherwise,
 I don't really mind).

I can go either way.

 This commit touches only the range 300-6FF, there may be more to be updated.

 The "there may be more" here sounds ominous.

Indeed it does ;-)

 Does that mean Unicode
 6.3 also added some zero-width characters in other ranges that should
 be dealt with in the future?  How many such ranges?  How do we know
 when we're done?

 Just biting off the most important characters first and putting off
 the rest for later sounds fine to me --- my complaint is that the
 above comment doesn't make clear what the to-do list is for finishing
 the update later.

I'll queue this at the tip of 'pu', not to forget about it while
waiting for a clarification.

Thanks.


Re: [PATCH] Unicode: update of combining code points

2014-04-07 Thread Jonathan Nieder
Hi,

Torsten Bögershausen wrote:

 Unicode 6.3 defines the following code as combining or accents,
 git_wcwidth() should return 0.

 Earlier unicode standards had defined these code point as reserved:

Thanks for the update.  Could the commit message also explain how this
was noticed and what the user-visible effect is?

For example:

 Unicode just announced that   That means we should mark the
  relevant code points as combining characters so git knows they are
  zero-width and doesn't screw up the alignment when presenting branch
  names in columns with 'git branch --column'

or something like that.
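
A small sketch of the alignment effect, assuming a simplified width rule (zero columns for Mn/Me/Cf characters, one otherwise, ignoring wide characters and the soft-hyphen exception). The branch names and helper names are hypothetical.

```python
import unicodedata


def display_width(text):
    """Approximate column count: zero for Mn/Me/Cf characters, one otherwise."""
    return sum(
        0 if unicodedata.category(ch) in ("Mn", "Me", "Cf") else 1
        for ch in text
    )


def pad(text, columns):
    """Left-justify text to a display width rather than a code point count."""
    return text + " " * max(columns - display_width(text), 0)


names = ["master", "bo\u0308ger/topic", "next"]  # hypothetical branch names
for name in names:
    # First column pads by display width, second by code point count.
    print(pad(name, 16) + "|" + name.ljust(16) + "|")
```

Padding by code point count leaves the name containing the combining character one column short on screen, which is the misalignment described above.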

[...]
 358 COMBINING DOT ABOVE RIGHT
 359 COMBINING ASTERISK BELOW

I'm not sure this list is needed --- the code + the reference to the
Unicode 6.3 standard seems like enough (but if you think otherwise,
I don't really mind).

 This commit touches only the range 300-6FF, there may be more to be updated.

The "there may be more" here sounds ominous.  Does that mean Unicode
6.3 also added some zero-width characters in other ranges that should
be dealt with in the future?  How many such ranges?  How do we know
when we're done?

Just biting off the most important characters first and putting off
the rest for later sounds fine to me --- my complaint is that the
above comment doesn't make clear what the to-do list is for finishing
the update later.

Thanks and hope that helps,
Jonathan