Emoji map of Colorado

2020-04-01 Thread Karl Williamson via Unicode

https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08&mc_eid=0700c8706b


EGYPTIAN HIEROGLYPH MAN WITH A ROLL OF TOILET PAPER

2020-03-11 Thread Karl Williamson via Unicode

On 2/12/20 11:12 AM, Frédéric Grosshans via Unicode wrote:

Dear Unicode list members (CC Michel Suignard),

   the Unicode proposal L2/20-068, “Revised draft for the encoding of 
an extended Egyptian Hieroglyphs repertoire, Groups A to N” 
(https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf) by 
Michel Suignard contains a very interesting hieroglyph at position 
U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man 
with a laptop, as can be seen in the attached image.




Someone suggested today that this would be the more up-to-date character



Re: Call for feedback on UTS #18: Unicode Regular Expressions

2020-01-02 Thread Karl Williamson via Unicode
One thing I noticed in reviewing this is the removal of text about loose 
matching of the name property.  But I didn't see an explanation for this 
removal.  Please point me to the explanation, or tell me what it is.


Specifically these lines were removed:

As with other property values, names should use a loose match, 
disregarding case, spaces and hyphen (the underbar character "_" cannot 
occur in Unicode character names). An implementation may also choose to 
allow namespaces, where some prefix like "LATIN LETTER" is set globally 
and used if there is no match otherwise.


There are, however, three instances that require special-casing with 
loose matching, where an extra test shall be made for the presence or 
absence of a hyphen.


U+0F68 TIBETAN LETTER A and
U+0F60 TIBETAN LETTER -A
U+0FB8 TIBETAN SUBJOINED LETTER A and
U+0FB0 TIBETAN SUBJOINED LETTER -A
U+116C HANGUL JUNGSEONG OE and
U+1180 HANGUL JUNGSEONG O-E
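
For concreteness, here is a minimal sketch of that kind of loose name 
matching (along the lines of UAX #44's loose-matching rule UAX44-LM2); 
the helper name and the exact key form are illustrative, not from any 
real library:

    import re

    def loose_name_key(name):
        """Reduce a character name to a loose-matching key."""
        s = name.upper()
        # U+1180 HANGUL JUNGSEONG O-E keeps its medial hyphen so that it
        # does not collide with U+116C HANGUL JUNGSEONG OE.
        if s != "HANGUL JUNGSEONG O-E":
            # Drop medial hyphens (a hyphen with a letter or digit on
            # both sides).
            s = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", s)
        # Ignore case differences, spaces, and underscores.
        return s.replace(" ", "").replace("_", "")

    # The hyphen in TIBETAN LETTER -A follows a space, so it is not
    # medial and survives, keeping it distinct from TIBETAN LETTER A.
    assert loose_name_key("Tibetan Letter -A") != loose_name_key("TIBETAN LETTER A")
    assert loose_name_key("hangul jungseong o-e") != loose_name_key("HANGUL JUNGSEONG OE")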




Re: Missing UAX#31 tests?

2018-07-14 Thread Karl Williamson via Unicode

On 07/09/2018 02:11 PM, Karl Williamson via Unicode wrote:

On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote:
I'm surprised that the tests for 11.0 passed for a 10.0 
implementation, because the following should have triggered a 
difference for WB. Can you check on this particular case?


÷ 0020 × 0020 ÷  #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]
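
(For reference, that line is in the UCD break-test notation: code 
points in hex, separated by ÷ where a break is expected and × where it 
is not, with the governing rule numbers repeated in the trailing 
comment.  A minimal sketch of reading the part before the comment:)

    def parse_break_test(line):
        """Parse e.g. '÷ 0020 × 0020 ÷' into code points and break offsets."""
        cps, breaks, offset = [], [], 0
        for tok in line.split():
            if tok == "÷":
                breaks.append(offset)      # a boundary is expected here
            elif tok == "×":
                pass                       # no boundary between neighbours
            else:
                cps.append(int(tok, 16))   # a code point in hex
                offset += 1
        return cps, breaks

    # The quoted line parses to code points [0x20, 0x20] with boundaries
    # only at offsets 0 and 2: no break between the two spaces.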


I'm one of the people who advocated for this change, and I had already 
tailored our implementation of 10.0 to not break between horizontal 
white space, so it's actually not surprising that this rule change 
didn't break anything for me.




It turns out that the fault was all mine; the Unicode 11.0 tests were 
failing on a 10.0 implementation.  I'm sorry for starting this red 
herring thread.


If you care to know the details, read on.

The code that runs the tests knows what version of the UCD it is using, 
and it knows what version of the UAX boundary algorithms it is using. 
If these differ, it emits a warning about the discrepancy and expects 
that there are going to be many test failures, so it marks all failing 
ones as 'To do', which suppresses their output so as not to distract 
from any other failures that have been introduced by using the new UCD 
version.  (Updating the algorithm comes last.)
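
In code terms, that guard amounts to something like the following (a 
minimal sketch with made-up version strings and trivial bookkeeping; 
the real harness is different code):

    import warnings

    UCD_VERSION = "11.0"        # version of the UCD data files in use
    ALGORITHM_VERSION = "10.0"  # version of the boundary algorithms implemented

    if UCD_VERSION != ALGORITHM_VERSION:
        warnings.warn(
            f"UCD {UCD_VERSION} with UAX algorithms {ALGORITHM_VERSION}: "
            "failing boundary tests will be marked 'To do' and suppressed")

    def classify_result(passed):
        """Decide how a boundary-test result should be reported."""
        if passed:
            return "ok"
        if UCD_VERSION != ALGORITHM_VERSION:
            # Expected failure while the algorithms lag the data files:
            # suppress it, which is exactly what hid the real failures.
            return "todo"
        return "fail"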


The solution for the future is to change the warning about the 
discrepancy to note that the failing boundary algorithm tests are 
suppressed.  This will clue me (or whoever) in that all is not 
necessarily well.




About the testing:

The tests are generated so that they go through all the combinations of 
pairs, and some combinations of triples. The generated test cases use a 
sample from each partition of characters, to cut down the file size to 
a reasonable level. That also means that some changes in the rules 
don't cause changes in the test results. Because it is not possible to 
test every combination, there is also provision for additional test 
cases, such as those at the end of the files, e.g.:


https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html

We should extend those each time to make sure we cover combinations 
that aren't covered by pairs. There were some additions to that end; 
if they didn't cover enough cases, then we can look at your experience 
to add more.
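
A minimal sketch of the pair generation described above (the partition 
samples here are invented placeholders, not the real ones):

    from itertools import product

    # One representative character per word-break partition (placeholders).
    samples = {
        "ALetter": "\u0041",    # 'A'
        "Numeric": "\u0030",    # '0'
        "WSegSpace": "\u0020",  # SPACE
    }

    # Every ordered pair of partition samples becomes one test case; a
    # real generator also adds some triples plus the hand-written cases
    # at the end of the file.
    pair_cases = [a + b for a, b in product(samples.values(), repeat=2)]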


I can suggest a few strategies for further testing:

1. To do a full test, for each row check every combination obtained 
by replacing each sample character by every other character in its 
partition. E.g. for the above line that would mean testing every 
<WSegSpace, WSegSpace> sequence.


2. Use a monkey test against ICU. That is, generate random 
combinations of characters from different partitions and check that 
ICU and your implementation are in sync (a minimal sketch follows 
this list).


3. During the beta period, test your previous version with the new 
test files. If there are no failures, yet there are changes in the 
rules, then raise that issue during the beta period so we can add tests.


I actually did this, and as I recall, did find some test failures.  In 
retrospect, I must have screwed up somehow back then.  I was under tight 
deadline pressure, and as a result, did more cursory beta testing than 
normal.


4. If possible, during the beta period upgrade your implementation and 
test against the new and old test files.
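
A minimal sketch of the monkey test in item 2 (the sample characters 
and the two boundary functions are placeholders for the real 
implementations being compared):

    import random

    def monkey_test(samples, my_breaks, icu_breaks, trials=10_000, max_len=8):
        """Hunt for a random string on which two word-boundary
        implementations disagree.  samples is a list of characters drawn
        from the different partitions; my_breaks and icu_breaks map a
        string to its set of boundary offsets."""
        for _ in range(trials):
            s = "".join(random.choice(samples)
                        for _ in range(random.randint(1, max_len)))
            if my_breaks(s) != icu_breaks(s):
                return s      # a disagreement worth investigating
        return None           # no disagreement found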




Anyone else have other suggestions for testing?

Mark



As an aside, a release or two ago, I implemented SB, and someone 
immediately found a bug, and accused me of releasing software that had 
not been tested at all.  He had looked through the test suite and not 
found anything that looked like it was testing that.  But he failed to 
find the test file, which bundled up all your tests in a manner he was 
not accustomed to, so it was easy for him to overlook.  The bug only 
manifested itself in longer runs of characters than your pairs and 
triples tested.  I looked at it, and your SB tests still seemed 
reasonable; I should not have expected a more complete series than you 
furnished.



Mark

On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode 
<unicode@unicode.org> wrote:


    I am working on upgrading from Unicode 10 to Unicode 11.

    I used all the new files.

    The algorithms for some of the boundaries, like GCB and WB, have
    changed so that some of the property values no longer have code
    points associated with them.

    I ran the tests furnished in 11.0 for these boundaries, without
    having changed the algorithms from earlier releases.  All passed 
100%.


    Unless I'm missing something, that indicates that the tests
    furnished in 11.0 do not contain instances that exercise these
    changes.  My guess is that the 10.0 tests were also deficient.

    I have been relying on the UCD to furnish tests that have enough
    coverage to sufficiently exercise the algorithms that are specified
    in UAX 31, but that appears to have been naive on my part











Re: Missing UAX#31 tests?

2018-07-09 Thread Karl Williamson via Unicode

On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote:
I'm surprised that the tests for 11.0 passed for a 10.0 implementation, 
because the following should have triggered a difference for WB. Can you 
check on this particular case?


÷ 0020 × 0020 ÷  #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]


I'm one of the people who advocated for this change, and I had already 
tailored our implementation of 10.0 to not break between horizontal 
white space, so it's actually not surprising that this rule change 
didn't break anything for me.



About the testing:

The tests are generated so that they go through all the combinations of 
pairs, and some combinations of triples. The generated test cases use a 
sample from each partition of characters, to cut down the file size to 
a reasonable level. That also means that some changes in the rules 
don't cause changes in the test results. Because it is not possible to 
test every combination, there is also provision for additional test 
cases, such as those at the end of the files, e.g.:


https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html

We should extend those each time to make sure we cover combinations that 
aren't covered by pairs. There were some additions to that end; if they 
didn't cover enough cases, then we can look at your experience to add more.


I can suggest a few strategies for further testing:

1. To do a full test, for each row check every combination obtained by 
replacing each sample character by every other character in its 
partition. E.g. for the above line that would mean testing every 
<WSegSpace, WSegSpace> sequence.


2. Use a monkey test against ICU. That is, generate random combinations 
of characters from different partitions and check that ICU and your 
implementation are in sync.


3. During the beta period, test your previous version with the new test 
files. If there are no failures, yet there are changes in the rules, 
then raise that issue during the beta period so we can add tests.


I actually did this, and as I recall, did find some test failures.  In 
retrospect, I must have screwed up somehow back then.  I was under tight 
deadline pressure, and as a result, did more cursory beta testing than 
normal.


4. If possible, during the beta period upgrade your implementation and 
test against the new and old test files.




Anyone else have other suggestions for testing?

Mark



As an aside, a release or two ago, I implemented SB, and someone 
immediately found a bug, and accused me of releasing software that had 
not been tested at all.  He had looked through the test suite and not 
found anything that looked like it was testing that.  But he failed to 
find the test file, which bundled up all your tests in a manner he was 
not accustomed to, so it was easy for him to overlook.  The bug only 
manifested itself in longer runs of characters than your pairs and 
triples tested.  I looked at it, and your SB tests still seemed 
reasonable; I should not have expected a more complete series than you 
furnished.



Mark

On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode 
<unicode@unicode.org> wrote:


I am working on upgrading from Unicode 10 to Unicode 11.

I used all the new files.

The algorithms for some of the boundaries, like GCB and WB, have
changed so that some of the property values no longer have code
points associated with them.

I ran the tests furnished in 11.0 for these boundaries, without
having changed the algorithms from earlier releases.  All passed 100%.

Unless I'm missing something, that indicates that the tests
furnished in 11.0 do not contain instances that exercise these
changes.  My guess is that the 10.0 tests were also deficient.

I have been relying on the UCD to furnish tests that have enough
coverage to sufficiently exercise the algorithms that are specified
in UAX 31, but that appears to have been naive on my part







Re: Missing UAX#31 tests?

2018-07-08 Thread Karl Williamson via Unicode

On 07/08/2018 03:23 AM, Mark Davis ☕️ wrote:
PS, although the title was "Missing UAX#31 tests?", I assumed you were 
talking about http://unicode.org/reports/tr29/




Yes, sorry.



Missing UAX#31 tests?

2018-07-07 Thread Karl Williamson via Unicode

I am working on upgrading from Unicode 10 to Unicode 11.

I used all the new files.

The algorithms for some of the boundaries, like GCB and WB, have changed 
so that some of the property values no longer have code points 
associated with them.


I ran the tests furnished in 11.0 for these boundaries, without having 
changed the algorithms from earlier releases.  All passed 100%.


Unless I'm missing something, that indicates that the tests furnished in 
11.0 do not contain instances that exercise these changes.  My guess is 
that the 10.0 tests were also deficient.


I have been relying on the UCD to furnish tests that have enough 
coverage to sufficiently exercise the algorithms that are specified in 
UAX 31, but that appears to have been naive on my part


Traditional and Simplified Han in UTS 39

2017-12-27 Thread Karl Williamson via Unicode

In UTS 39, it says that, optionally, one may:

"Mark Chinese strings as “mixed script” if they contain both simplified 
(S) and traditional (T) Chinese characters, using the Unihan data in the 
Unicode Character Database [UCD].


"The criterion can only be applied if the language of the string is 
known to be Chinese."
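
For concreteness, a minimal sketch of what that optional check might 
look like, under one plausible reading of the Unihan data: treat a 
character as simplified-only if it has a kTraditionalVariant but no 
kSimplifiedVariant, and vice versa (the two dicts are assumed to have 
been loaded from Unihan; the helper name is made up):

    def is_mixed_simplified_traditional(text, k_simplified_variant,
                                        k_traditional_variant):
        """Flag a Chinese string containing both S-only and T-only characters."""
        has_s_only = has_t_only = False
        for ch in text:
            has_trad_var = ch in k_traditional_variant  # maps to a traditional form
            has_simp_var = ch in k_simplified_variant   # maps to a simplified form
            if has_trad_var and not has_simp_var:
                has_s_only = True    # looks like a simplified-only character
            elif has_simp_var and not has_trad_var:
                has_t_only = True    # looks like a traditional-only character
        return has_s_only and has_t_only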


What does it mean for the language to "be known to be Chinese"?  Is this 
something algorithmically determinable, or does it come from information 
about the input text that comes from outside the UCD?


The example given shows some Hiragana in the text.  That clearly 
indicates the language isn't Chinese.  So in this example we can 
algorithmically rule out that it's Chinese.


And what does Chinese really mean here?



Inconsistency between UTS 39 and 24

2017-12-21 Thread Karl Williamson via Unicode

In http://unicode.org/reports/tr39/#Mixed_Script_Detection
it says, "For more information on the Script_Extensions property and 
Jpan, Kore, and Hanb, see UAX #24"


In http://www.unicode.org/reports/tr24/, there certainly is more 
information on scx; however, none of the terms Jpan, Kore, or Hanb is 
mentioned.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the 
sequence  generate?  0, 1, 2, 3, 4 ?


In practice, how many do parsers generate?
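
As one data point: CPython's decoder with errors="replace" produces two 
REPLACEMENT CHARACTERs for the two-byte overlong <C0 AF> that comes up 
later in this thread (this is only an illustration; it is not the 
elided sequence asked about above):

    # C0 is rejected as an invalid lead byte and AF as a stray
    # continuation byte, so each gets its own U+FFFD.
    assert b"\xc0\xaf".decode("utf-8", errors="replace") == "\ufffd\ufffd"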


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode

On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:

L2/17-168 says:

"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF. For example: <C0 AF> is a single maximal
subsequence because C0 was originally a lead byte for two-byte
sequences."

When was it ever true that C0 was a valid lead byte? And what does that
have to do with (not) restricting trail bytes?


Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence 
<C0 AF> as U+002F.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode

On 05/26/2017 12:22 PM, Ken Whistler wrote:


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken




The reason this discussion got started was that in December, someone 
came to me and said the code I support does not follow Unicode best 
practices, and suggested I needed to change it, though no ticket has 
(yet) been filed.  I was surprised, and posted a query to this list 
about what the advantages of the new approach are.  There were a number 
of replies, but I did not see anything that seemed definitive.  After a 
month, I created a ticket with Unicode; Markus was assigned to research 
it, and came up with the proposal currently being debated.


Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by Markus, even with his stake in ICU, 
supporting option #2.


Looking at the comments, I don't see any discussion of the effect of 
this on overlong treatments.  My guess is that this effect of the 
change was unintentional.


So I have code that handled overlongs in the only correct way possible 
when they were acceptable, and in the obvious way after they became 
illegal, and now without apparent discussion (which is very much akin to 
"flimsy reasons"), it suddenly was no longer "best practice".  And that 
change came "rather late in the game".  That this escaped notice for 
years indicates that the specifics of REPLACEMENT CHAR handling don't 
matter all that much.


To cut to the chase, I think Unicode should issue a Corrigendum to the 
effect that it was never the intent of this change to say that treating 
overlongs as a single unit isn't best practice.  I'm not sure this 
warrants a full-fledged Corrigendum, though.  But I believe the text of 
the best practices should indicate that treating overlongs as a single 
unit is just as acceptable as Martin's interpretation.


I believe this is pretty much in line with Shawn's position.  Certainly, 
a discussion of the reasons one might choose one interpretation over 
another should be included in TUS.  That would likely have satisfied my 
original query, which hence would never have been posted.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode

On 05/26/2017 04:28 AM, Martin J. Dürst wrote:
It may be worth thinking about whether the Unicode standard should 
mention implementations like yours. But there should be no doubt about 
the fact that the PRI and Unicode 5.2 (and the current version of 
Unicode) are clear about what they recommend, and that that 
recommendation is based on the definition of UTF-8 at that time (and 
still in force), and not based on a historical definition of UTF-8.


The link provided about the PRI doesn't lead to the comments.

Is there any evidence that there was a realization that the language 
being adopted would lead to overlongs being split into multiple subparts?




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:



Adding a "recommendation" this late in the game is just bad standards
policy.



Unless I misunderstand, you are missing the point.  There is already a
recommendation listed in TUS,


That's indeed correct.



and that recommendation appears to have
been added without much thought.


That's wrong. There was a public review issue with various options and 
with feedback, and the recommendation has been implemented and in wide 
use (among other things, in major programming languages and browsers) 
without problems for quite some time.


Could you supply a reference to the PRI and its feedback?

The recommendation in TUS 5.2 is "Replace each maximal subpart of an 
ill-formed subsequence by a single U+FFFD."


And I agree with that.  And I view an overlong sequence as a maximal 
ill-formed subsequence that should be replaced by a single FFFD. 
There's nothing in the text of 5.2 that immediately follows that 
recommendation that indicates to me that my view is incorrect.


Perhaps my view is colored by the fact that I now maintain code that was 
written to parse UTF-8 back when overlongs were still considered legal 
input.  An overlong was a single unit.  When they became illegal, the 
code still considered them a single unit.


I can understand how someone who comes along later could say that C0 
can't be followed by any continuation byte without yielding an 
overlong, and that therefore C0 by itself is a maximal subsequence.
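
To make the two readings concrete, take the overlong sequence <C0 AF> 
used as an example elsewhere in this thread:

    <C0 AF>  ->  U+FFFD          (overlong treated as one maximal unit)
    <C0 AF>  ->  U+FFFD U+FFFD   (C0 alone is maximal under the current definition)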


But I assert that my interpretation is just as valid as that one.  And 
perhaps more so, because of historical precedent.


It appears to me that little thought was given to the fact that these 
changes would cause overlongs to now be at least two units instead of 
one, making long-existing code no longer follow best practice.  You are 
effectively saying I'm wrong about this.  I thought I had been paying 
attention to PRIs since the 5.x series, and I don't remember anything 
about this.  If you have evidence to the contrary, please give it. 
However, I would have thought Markus would have dug any up and given it 
in his proposal.






There is no proposal to add a
recommendation "this late in the game".


True. The proposal isn't for an addition, it's for a change. The "late 
in the game" however, still applies.


Regards,   Martin.






Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode

On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:

On 5/23/2017 10:45 AM, Markus Scherer wrote:
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode 
<unicode@unicode.org> wrote:


So, if the proposal for Unicode really was more of a "feels right"
and not a "deviate at your peril" situation (or necessary escape
hatch), then we are better off not making a RECOMMENDATION that
goes against collective practice.


I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for
ill-formed subsequences, such a process is not otherwise
constrained in how it deals with any ill-formed subsequence
itself. An ill-formed subsequence consisting of more than one code
unit could be treated as a single error or as multiple errors.


And why add a recommendation that changes that from completely up to the 
implementation (or groups of implementations) to something where one way 
of doing it now has to justify itself?


If the thread has made one thing clear, it is that there's no consensus 
in the wider community that one approach is obviously better. When it 
comes to ill-formed sequences, all bets are off. Simple as that.


Adding a "recommendation" this late in the game is just bad standards 
policy.


A./




Unless I misunderstand, you are missing the point.  There is already a 
recommendation listed in TUS, and that recommendation appears to have 
been added without much thought.  There is no proposal to add a 
recommendation "this late in the game".


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode

On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.



Henri's claim that "The proposal is to make ICU's spec violation 
conforming" is a false statement, and hence all further commentary based 
on this false premise is irrelevant.


I believe that ICU is actually currently conforming to TUS.

The proposal reads:

"For UTF-8, recommend evaluating maximal subsequences based on the 
original structural definition of UTF-8..."


There is nothing here that requires any implementation to be changed.  
The word "recommend" does not mean the same as "require".  Have you 
guys been so caught up in the current international political situation 
that you have lost the ability to read straight?


TUS has certain requirements for UTF-8 handling, and it has certain 
other "Best Practices" as detailed in 3.9.  The proposal involves 
changing those recommendations.  It does not involve changing any 
requirements.