Emoji map of Colorado

2020-04-01 Thread Karl Williamson via Unicode
https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08&mc_eid=0700c8706b

EGYPTIAN HIEROGLYPH MAN WITH A ROLL OF TOILET PAPER

2020-03-11 Thread Karl Williamson via Unicode
On 2/12/20 11:12 AM, Frédéric Grosshans via Unicode wrote: Dear Unicode list members (CC Michel Suignard), the Unicode proposal L2/20-068, “Revised draft for the encoding of an extended Egyptian Hieroglyphs repertoire, Group

Re: Call for feedback on UTS #18: Unicode Regular Expressions

2020-01-02 Thread Karl Williamson via Unicode
One thing I noticed in reviewing this is the removal of text about loose matching of the name property. But I didn't see an explanation for this removal. Please point me to the explanation, or tell me what it is. Specifically these lines were removed: As with other property values, names sho

Re: Missing UAX#31 tests?

2018-07-14 Thread Karl Williamson via Unicode
On 07/09/2018 02:11 PM, Karl Williamson via Unicode wrote: On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote: I'm surprised that the tests for 11.0 passed for a 10.0 implementation, because the following should have triggered a difference for WB. Can you check on this particular case? ÷

Re: Missing UAX#31 tests?

2018-07-09 Thread Karl Williamson via Unicode
r SB tests still seemed reasonable, and I should not expect a more complete series than you furnished. Mark // On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <unicode@unicode.org> wrote: I am working on upgrading from Unicode 10 to Unicode 11. I used a

Re: Missing UAX#31 tests?

2018-07-08 Thread Karl Williamson via Unicode
On 07/08/2018 03:23 AM, Mark Davis ☕️ wrote: PS, although the title was "Missing UAX#31 tests?", I assumed you were talking about http://unicode.org/reports/tr29/ Yes, sorry.

Missing UAX#31 tests?

2018-07-07 Thread Karl Williamson via Unicode
I am working on upgrading from Unicode 10 to Unicode 11. I used all the new files. The algorithms for some of the boundaries, like GCB and WB, have changed so that some of the property values no longer have code points associated with them. I ran the tests furnished in 11.0 for these boundar

Traditional and Simplified Han in UTS 39

2017-12-27 Thread Karl Williamson via Unicode
In UTS 39, it says that, optionally, "Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD]. "The criterion can only be applied if the language of the string is known to

Inconsistency between UTS 39 and 24

2017-12-21 Thread Karl Williamson via Unicode
In http://unicode.org/reports/tr39/#Mixed_Script_Detection it says, "For more information on the Script_Extensions property and Jpan, Kore, and Hanb, see UAX #24" In http://www.unicode.org/reports/tr24/, there certainly is more information on scx; however, none of the terms Jpan, Kore, nor Hanb

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the sequence generate? 0, 1, 2, 3, 4 ? In practice, how many do parsers generate?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 12:22 PM, Ken Whistler wrote: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a num

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 04:28 AM, Martin J. Dürst wrote: It may be worth to think about whether the Unicode standard should mention implementations like yours. But there should be no doubt about the fact that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they recommend

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode
On 05/24/2017 12:46 AM, Martin J. Dürst wrote: On 2017/05/24 05:57, Karl Williamson via Unicode wrote: On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: Adding a "recommendation" this late in the game is just bad standards policy. Unless I misunderstand, you are m

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode
On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: On 5/23/2017 10:45 AM, Markus Scherer wrote: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode <unicode@unicode.org> wrote: So, if the proposal for Unicode really was more of a "feels right" and not a "deviate

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode
On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political re

Re: a character for an unknown character

2016-12-21 Thread Karl Williamson
On 12/21/2016 08:45 AM, David Corbett wrote: One Unicode character specifically for this purpose is U+3013 GETA MARK. It is a Japanese symbol used to replace characters that cannot be read during transcription of manuscripts (source: Japanese Wikipedia). It looks like a bold equals sign: 〓. Othe

Best practices for replacing UTF-8 overlongs

2016-12-19 Thread Karl Williamson
It seems counterintuitive to me that the two byte sequence C0 80 should be replaced by 2 replacement characters under best practices, or that E0 80 80 should also be replaced by 2. Each sequence was legal in early Unicode versions, and it seems that it would be best to treat them as each a sin
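As a rough illustration of the question raised here, the following Python sketch counts how many U+FFFD characters one particular decoder emits for an ill-formed sequence. It uses CPython's built-in UTF-8 codec with errors="replace"; the helper name count_replacements is made up for this example, and other decoders may well produce different counts.

```python
def count_replacements(raw: bytes) -> int:
    """Count the U+FFFD characters produced when decoding raw as UTF-8."""
    decoded = raw.decode("utf-8", errors="replace")
    return decoded.count("\ufffd")


if __name__ == "__main__":
    # The two overlong sequences discussed in the message above.
    for seq in (b"\xc0\x80", b"\xe0\x80\x80"):
        print(seq.hex(), "->", count_replacements(seq))
```

Running this against several decoders is one way to see how far practice varies from whatever the best-practice text recommends.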

Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Karl Williamson
These are currently GCB Other, but when assigned, don't we know that they will be Extended? So this could be done now.

I'm excited about the proposal to add a brontosaurus emoji codepoint

2016-08-29 Thread Karl Williamson
"I'm excited about the proposal to add a brontosaurus emoji codepoint because it has the potential to bring together a half-dozen different groups of pedantic people together" From http://xkcd.com/1726/ I don't know if this is new, or I just never saw it before.

Re: Where are the tools to generate posix and json from cldr?

2016-08-11 Thread Karl Williamson
data, and still could not find the tools in it. One would think that the tools directory contains it, and I did not look in every sub-directory in it, but none looked likely. I then tried transforms, but came up empty there too. // On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson

Where are the tools to generate posix and json from cldr?

2016-08-11 Thread Karl Williamson
I can't find these that are mentioned in http://cldr.unicode.org/ "For those interested in the source CLDR data, it is available for each release in the XML format specified by LDML. There are also tools that will convert to JSON and POSIX format. For more information, see CLDR Releases/Downlo

Re: Release date?

2016-06-21 Thread Karl Williamson
On 06/21/2016 08:43 AM, Doug Ewell wrote: http://opiniojuris.org/2016/06/20/emojis-and-international-law "And tomorrow, June 21, we will have 71 new emojis to play with." Do only bloggers and the press get notified in advance of the release date of Unicode 9.0? http://www.unicode.org/versions

Re: 9.0.0 segmentation and line breaks on the empty string

2016-06-19 Thread Karl Williamson
On 06/19/2016 07:25 AM, Daniel Bünzli wrote: On Sunday, 12 June 2016 at 14:26, Daniel Bünzli wrote: Hello, I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on the empty string while UAX14 still does report a hard line break on it. Is this intended? and what is the

Re: Adopting ZWJ

2016-06-07 Thread Karl Williamson
On 06/07/2016 06:25 PM, Marcel Schneider wrote: On Tue, 7 Jun 2016 14:52:36 -0600, Karl Williamson wrote: On 06/07/2016 02:48 PM, Karl Williamson wrote: I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a

Re: Adopting ZWJ

2016-06-07 Thread Karl Williamson
On 06/07/2016 02:48 PM, Karl Williamson wrote: I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a clear list of criteria. The page that allows one to adopt said that it wasn't available, but that

Adopting ZWJ

2016-06-07 Thread Karl Williamson
I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a clear list of criteria. The page that allows one to adopt said that it wasn't available, but that page really doesn't make it clear how one can test for t

Re: Emoji for subdivision flags

2016-05-25 Thread Karl Williamson
On 05/25/2016 09:27 AM, Doug Ewell wrote: Now that UTR #52 has been suspended, are any *specific* alternative plans for representing subdivision flags being bandied about? -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 What I'd like to know is how does one find out about such decision

Re: UTC makes the Colbert show

2016-03-30 Thread Karl Williamson
On 03/30/2016 11:54 AM, Mark Davis ☕️ wrote: On Wed, Mar 30, 2016 at 7:42 PM, Jennifer 8. Lee <je...@jennifer8lee.com> wrote: I thought his "elf exposing self in park" was an amazing (and accurate) facial expression. Right! How does he make his cheeks do that!?! Botox?

Girl, 12, charged for threatening her school with emojis

2016-02-28 Thread Karl Williamson
http://abc27.com/2016/02/27/girl-12-charged-for-threatening-emojis/

Re: Redundancy in TR14

2016-01-11 Thread Karl Williamson
On 01/11/2016 10:55 PM, Mark Davis ☕️ wrote: Looks that way to me too. Can you submit this as feedback? will do {phone} On Jan 12, 2016 00:39, "Karl Williamson" <pub...@khwilliamson.com> wrote: Example 7 in http://www.unicode.org/reports/tr14/#Examples has

Re: Trying to understand Line_Break property apparent discrepancy

2016-01-11 Thread Karl Williamson
On 01/11/2016 03:42 PM, Karl Williamson wrote: It appears that http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is testing a tailoring rather than the default line break algorithm, contrary to its heading "# Default Line Break Test". And http://www.unicode.org/

Redundancy in TR14

2016-01-11 Thread Karl Williamson
Example 7 in http://www.unicode.org/reports/tr14/#Examples has these two rules NU × (NU | SY | IS) NU (NU | SY | IS)* × (NU | SY | IS | CL | CP ) It appears to me that the first rule generates a subset of what the 2nd rule generates, and so is useless. It could be hence removed for simplici
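The subset claim can be checked mechanically with a small sketch. It assumes each line-break class is abbreviated to one letter (NU=N, SY=S, IS=I, CL=C, CP=P) and that a rule is modelled as a regular expression over the context up to and including the class just after the × sign. The names RULE_1, RULE_2 and rule1_is_subset are illustrative only and do not come from UAX #14.

```python
import re
from itertools import product

# One-letter stand-ins for the line-break classes used in the two rules.
CLASSES = {"NU": "N", "SY": "S", "IS": "I", "CL": "C", "CP": "P"}

# NU x (NU | SY | IS)
RULE_1 = re.compile(r"N[NSI]")
# NU (NU | SY | IS)* x (NU | SY | IS | CL | CP)
RULE_2 = re.compile(r"N[NSI]*[NSICP]")


def rule1_is_subset(max_len: int = 4) -> bool:
    """Return True if every class string matched by rule 1 is matched by rule 2."""
    for length in range(1, max_len + 1):
        for combo in product(CLASSES.values(), repeat=length):
            text = "".join(combo)
            if RULE_1.fullmatch(text) and not RULE_2.fullmatch(text):
                return False
    return True


if __name__ == "__main__":
    print(rule1_is_subset())
```

Under this simplified model the function returns True, which agrees with the observation that the first rule adds nothing beyond the second.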

Trying to understand Line_Break property apparent discrepancy

2016-01-11 Thread Karl Williamson
It appears that http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is testing a tailoring rather than the default line break algorithm, contrary to its heading "# Default Line Break Test". And http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows a

Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Karl Williamson
On 11/06/2015 01:32 PM, Richard Wordingham wrote: On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote: Richard Wordingham wrote: No-one's claiming it is for a Unicode Transformation Format (UTF). Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8, or anything of the

Question about Perl5 extended UTF-8 design

2015-11-05 Thread Karl Williamson
that these extra bits are "reserved". So we're wondering what potential use you had thought of for these bits. Thanks Karl Williamson

Re: Dark beer emoji

2015-09-01 Thread Karl Williamson
On 09/01/2015 10:37 AM, Doug Ewell wrote: I have no idea whether my proposal is more or less serious, or more or less likely to be adopted, than the original. When I read this, I wondered if it was April 1 instead of September 1.

\b{wb}

2015-08-22 Thread Karl Williamson
The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that c
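For readers less familiar with the traditional \b, here is a small sketch using Python's re module, which follows the same word definition (alphanumerics plus underscore). It only illustrates the classic behaviour, not the \b{wb} boundary this thread goes on to discuss, and the sample text is invented.

```python
import re

text = "foo_bar don't stop"

# \b matches between a word character (\w) and a non-word character,
# so the apostrophe splits "don't" into two separate words.
words = re.findall(r"\b\w+\b", text)

if __name__ == "__main__":
    print(words)  # ['foo_bar', 'don', 't', 'stop']
```

The underscore keeps foo_bar together, which suits identifiers, while the split at the apostrophe shows why this definition fits natural-language text less well.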

Re: a mug

2015-07-11 Thread Karl Williamson
On 07/11/2015 10:36 AM, Johannes Bergerhausen wrote: Yes, the mug is funny. It shows not a Unicode problem, it points at a general font problem of operating systems. Dear Apple, Dear Google, Dear Microsoft: please give us *all* missing Unicode glyphs right inside your operating systems!

Re: trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-24 Thread Karl Williamson
t why the following policy exists and is *strictly* enforced: http://www.unicode.org/policies/stability_policy.html#Encoding or why the applicable version for that stability policy is 2.0+, the answer is that it was a direct reaction to "The Korean Mess". --Ken On 6/19/2015 1:29 PM, Karl

Re: Why aren't the emoji modifiers GCB=Extend?

2015-06-21 Thread Karl Williamson
On 06/20/2015 03:02 AM, Mark Davis ☕️ wrote: On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler <kenwhist...@att.net> wrote: This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. Tha

Why aren't the emoji modifiers GCB=Extend?

2015-06-19 Thread Karl Williamson
Someone writing code using Unicode 8 found that the FITZPATRICK modifiers are considered separate graphemes from what they modify. This is surprising, and seems contrary to not only the concept of a grapheme cluster, but the spirit of tr51 2.2.3 "A supported emoji modifier sequence should be
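To make the situation concrete, the sketch below lists the code points in one emoji modifier sequence using only the Python standard library. The sample sequence is chosen for illustration, and the character names depend on the Unicode version bundled with the interpreter (Unicode 8 or later is assumed). The standard library does not implement UAX #29 segmentation, so this shows only that the modifier is a separate code point, not how any given implementation clusters it.

```python
import unicodedata

# A base emoji followed by an emoji modifier (hypothetical sample input).
sequence = "\U0001F44D\U0001F3FD"

if __name__ == "__main__":
    for ch in sequence:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
    print("code points:", len(sequence))
```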

trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-19 Thread Karl Williamson
I haven't found any information on this. It can't just be a transliteration difference, because the number of code points is vastly different between them. Is it the case that the version 1 syllables is a failed abstraction that was replaced by the later versions? Thanks

The Oral History Of The Poop Emoji

2015-06-01 Thread Karl Williamson
https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

Re: FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread Karl Williamson
On 05/12/2015 03:05 PM, Mark Davis ☕️ wrote: http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/ // And a critique: http://languagelog.ldc.upenn.edu/nll/?p=18844

Re: Meroitic cursive fractions numerical values

2015-04-01 Thread Karl Williamson
On 03/31/2015 11:30 AM, Doug Ewell wrote: Karl Williamson wrote: It's a small matter to add code to reduce the UCD-specified rational numbers, but it's just one more complication to have to deal with along with the many that the UCD already presents, and if there is not a good reaso

Re: Meroitic cursive fractions numerical values

2015-03-30 Thread Karl Williamson
On 03/29/2015 03:41 AM, Andrew West wrote: On 28 March 2015 at 20:05, Karl Williamson wrote: Existing software that looks at the numeric values of characters is written expecting that rational numbers will have been reduced to their lowest form. That seems to be a rather rash statement. I

Re: Meroitic cursive fractions numerical values

2015-03-28 Thread Karl Williamson
numerator and denominator Or is that written down somewhere already? A./ On 3/28/2015 1:05 PM, Karl Williamson wrote: In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R6/12

Meroitic cursive fractions numerical values

2015-03-28 Thread Karl Williamson
In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R6/12;N; is not written as 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R1/2;N; given that there is als
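The reduction mentioned in this thread is small in practice. A minimal sketch, assuming the numeric field has already been extracted from the data file as a string such as "6/12" (the helper name reduced is invented for this example):

```python
from fractions import Fraction


def reduced(value: str) -> str:
    """Return a rational value such as '6/12' in lowest terms, e.g. '1/2'."""
    frac = Fraction(value)  # Fraction normalizes to lowest terms on construction
    return f"{frac.numerator}/{frac.denominator}"


if __name__ == "__main__":
    print(reduced("6/12"))  # 1/2
```

Note that this sketch always prints a denominator, so a whole number such as "3" comes back as "3/1"; a real consumer would pick whichever convention its callers expect.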

Problems in google and yahoo searches

2015-03-28 Thread Karl Williamson
I had thought Tangut was going to be in Unicode 8, but the beta files didn't include it, so I tried simply searching on "tangut" from http://www.unicode.org/search/ Only bing showed a match on the pipeline page which had the answer. Recently I wanted to review the email correspondence when pro

Re: Question about the Sentence_Break property

2015-02-21 Thread Karl Williamson
On 02/20/2015 04:56 PM, Philippe Verdy wrote: 2015-02-20 6:14 GMT+01:00 Richard Wordingham <richard.wording...@ntlworld.com>: TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8. One thing that is missing is mention of the convention that a single newline charac

Question about the Sentence_Break property

2015-02-19 Thread Karl Williamson
UAX 29 says this: Break after paragraph separators. SB4.Sep | CR | LF Why are CR and LF considered to be paragraph separators? NEL and Line Break are as well. My mental model of plain text has it containing embedded characters, which I'll call \n, to allow it to be displayed in a ter

Re: UAX 29 questions

2015-01-29 Thread Karl Williamson
On 01/29/2015 08:19 PM, Philippe Verdy wrote: 2015-01-29 19:52 GMT+01:00 Karl Williamson <pub...@khwilliamson.com>: Rule WB4 is "Ignore Format and Extend characters, except when they appear at the beginning of a region of text.". Not clearly stated, b

Re: UAX 29 questions

2015-01-29 Thread Karl Williamson
er | MidNumLet) (ALetter | Hebrew_Letter) (WB57) Hebrew_Letter × Single_Quote But you cannot merge any of these two last rules in a single rule for WB56. 2015-01-25 7:26 GMT+01:00 Karl Williamson <pub...@khwilliamson.com>: I vaguely recall asking something like this before, but if so,

UAX 29 questions

2015-01-24 Thread Karl Williamson
I vaguely recall asking something like this before, but if so, I didn't save the answers, and a search of the archives didn't turn up anything. Some of the rules in UAX #29 don't make sense to me. For example, rule WB7a Hebrew_Letter × Single_Quote seems to say that a Hebrew_Le

Re: New Unicode Emoji draft, available for review

2014-11-07 Thread Karl Williamson
On 11/05/2014 02:48 PM, Rick McGowan wrote: FYI, Posting this on behalf of Mark Davis... Something in his original reply message is apparently toxic to our mail gateway that it can't get through. (Investigating.) May be the literal U+1F4A9, which I have (I'm sorry) redacted below. Rick T

Re: New Unicode Emoji draft, available for review

2014-11-05 Thread Karl Williamson
On 11/03/2014 08:17 PM, announceme...@unicode.org wrote: The Unicode Consortium has released the draft “Unicode Emoji” document, whose main goal is to help improve the interoperability of emoji characters across implementations

Re: What happened to...?

2014-09-22 Thread Karl Williamson
On 09/20/2014 03:32 AM, Mark Davis ☕️ wrote: I agree that we should minute at least some reason for declining. It need only be a sentence or two. I would hope that the requesters get a detailed explanation of the rejection. It would be very wrong not to do so. If so, then the minutes could

Re: Question about WordBreak property rules

2014-07-24 Thread Karl Williamson
On 07/24/2014 01:38 PM, Karl Williamson wrote: http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no

Question about WordBreak property rules

2014-07-24 Thread Karl Williamson
http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no break between a Hebrew_Letter and a Single_Quote

Re: Corrigendum #9

2014-07-14 Thread Karl Williamson
I ran across this in Section 3.7.4 of http://www.unicode.org/reports/tr36/ "Use pairs of noncharacter code points in the range FDD0..FDEF. These are "super" private-use characters, and are discouraged for general interchange. The transformation would take each nibble of a byte Y, and add to FD

Re: Corrigendum #9

2014-07-03 Thread Karl Williamson
On 07/03/2014 02:48 PM, Asmus Freytag wrote: On 7/3/2014 11:02 AM, Richard COOK wrote: On Jul 2, 2014, at 8:02 AM, Karl Williamson wrote: Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters, and that the default should

Re: Corrigendum #9

2014-07-02 Thread Karl Williamson
On 06/12/2014 11:14 PM, Peter Constable wrote: From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson Sent: Wednesday, June 11, 2014 9:30 PM I have something like a library that was written a long time ago (not by me) assuming that noncharacters were illegal in open

Unencoded cased scripts and unencoded titlecase letters

2014-07-02 Thread Karl Williamson
It's my sense that there are very few cased scripts in existence that are ever likely to be encoded by Unicode that haven't already been so-encoded. I also suspect that very few new titlecased letters will ever be added to Unicode, as I believe these all come to maintain roundtrip compa

Re: Corrigendum #9

2014-06-11 Thread Karl Williamson
On 06/02/2014 09:48 AM, Markus Scherer wrote: On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <d...@ewellic.org> wrote: I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped arou

Apparent discrepancy between FAQ and Age.txt

2014-06-10 Thread Karl Williamson
The FAQ http://www.unicode.org/faq/private_use.html#sentinels says that the last 2 code points on the planes except BMP were made noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. "The conformance wording about U+FFFE and U+FFFF changed somewhat in Unicode 2.0, but these were stil
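For reference, the full set of noncharacters under discussion can be described in a few lines. This is a sketch of the current definition (the range FDD0..FDEF plus the last two code points of every plane); the function name is_noncharacter is invented here, and the code says nothing about which Unicode version first designated each code point, which is the actual question in this message.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # U+xxFFFE and U+xxFFFF at the end of each of the 17 planes.
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or last_two_of_plane


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)  # 66
```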

Re: Corrigendum #9

2014-06-08 Thread Karl Williamson
On 06/07/2014 10:33 PM, Asmus Freytag wrote: On 6/7/2014 9:19 PM, Karl Williamson wrote: On 06/02/2014 11:00 AM, Shawn Steele wrote: To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is

Re: Corrigendum #9

2014-06-07 Thread Karl Williamson
On 06/02/2014 11:00 AM, Shawn Steele wrote: To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processo

Re: Corrigendum #9

2014-06-01 Thread Karl Williamson
On 06/01/2014 10:07 AM, Markus Scherer wrote: On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson <pub...@khwilliamson.com> wrote: Thanks, I had not thought about that. I'm thinking wording something like this is more appropriate "Noncharacters may be openly i

Re: Corrigendum #9

2014-06-01 Thread Karl Williamson
On 05/30/2014 12:49 PM, Asmus Freytag wrote: One of the concerns was that people felt that they had to have "data pipeline" style implementations (tools) go and filter these out - even if there was no intent for the implementation to use them internally in any way. Making clear that the standard

Corrigendum #9

2014-05-30 Thread Karl Williamson
I'm having a problem with this http://www.unicode.org/versions/corrigendum9.html Some people now think it means that noncharacters are really no different from private-use characters, and should be treated very similarly if not identically. It seems to me that they should be illegal in open i

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Karl Williamson
On 04/24/2014 01:56 PM, Steffen Nurpmeso wrote: Markus Scherer wrote: |I strongly recommend you parse the derived properties rather than trying to |follow the derivation formula, because that can change over time. ..this file includes only those core properties that have themselves a deriva

Re: Is it save to dig into comment contents of PropList.txt?

2013-11-14 Thread Karl Williamson
On 11/07/2013 05:58 AM, Steffen Daode Nurpmeso wrote: Karl Williamson wrote: |On 11/06/2013 03:43 AM, Steffen Daode Nurpmeso wrote: |> Philippe Verdy wrote: |>|2013/11/5 Steffen Daode |>|> (The problem i'm facing is that _PRINT and _GRAPH cannot be set |>|> f

Re: Is it save to dig into comment contents of PropList.txt?

2013-11-06 Thread Karl Williamson
On 11/06/2013 03:43 AM, Steffen Daode Nurpmeso wrote: Philippe Verdy wrote: |2013/11/5 Steffen Daode |> (The problem i'm facing is that _PRINT and _GRAPH cannot be set |> for some properties from PropList.txt, say, _PRINT can't be set |> for U+0009, CHARACTER TABULATION (ht), since it's

Re: What to backup after corruption of code units?

2013-08-28 Thread Karl Williamson
On 08/28/2013 06:52 PM, Asmus Freytag wrote: On 8/28/2013 5:19 PM, Doug Ewell wrote: Actually 0xC2, according to the rules of UTF-8. Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because of the requirement for minimal length encoding. However, a check for >=0xC0 will give t
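The byte ranges being debated here matter for resynchronizing in a UTF-8 buffer. A minimal sketch, assuming the usual structural rule that continuation bytes lie in 80..BF (the function name start_of_character is hypothetical): it backs up to the nearest byte that is not a continuation byte and deliberately does not check whether that byte is a valid lead byte (C2..F4) or whether the sequence is well-formed.

```python
def start_of_character(buf: bytes, pos: int) -> int:
    """Back up from pos to the nearest byte that is not a continuation byte."""
    while pos > 0 and 0x80 <= buf[pos] <= 0xBF:
        pos -= 1
    return pos


if __name__ == "__main__":
    data = "a\u00e9\u20ac".encode("utf-8")  # 61 | C3 A9 | E2 82 AC
    print(start_of_character(data, 5))  # 3, the E2 lead byte
```

After corruption the byte found this way may itself be invalid, so a caller would still need to validate the sequence starting there.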

Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Karl Williamson
On 07/19/2013 11:51 AM, Costello, Roger L. wrote: Hi Folks, Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is. Perhaps it is 8859-1, in which case the message consists of four 1-byte characters: C3 = Ã 83 = the “no b
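The competing interpretations of these four bytes can be tried directly. The sketch below decodes them as ISO 8859-1 and as UTF-8, then tests a third guess that is an assumption of this example rather than something stated in the message: that the text was UTF-8 which got mistakenly re-encoded once.

```python
raw = bytes.fromhex("C3 83 C2 B1")

# Interpretation 1: ISO 8859-1, one character per byte (four characters).
as_latin1 = raw.decode("latin-1")

# Interpretation 2: UTF-8, giving two characters, U+00C3 and U+00B1.
as_utf8 = raw.decode("utf-8")

# Interpretation 3 (assumed): double-encoded UTF-8; undo one layer.
undone = as_utf8.encode("latin-1").decode("utf-8")

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in as_latin1])
    print([f"U+{ord(c):04X}" for c in as_utf8])
    print([f"U+{ord(c):04X}" for c in undone])  # ['U+00F1']
```

All three decodings succeed without error, which is the core of the problem: the bytes alone do not say which reading was intended.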

Bing now translates to/from Klingon

2013-05-17 Thread Karl Williamson
http://www.bing.com/translator

Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Karl Williamson
On 03/21/2013 04:48 PM, Richard Wordingham wrote: For linguistic analysis, you need the normalisation appropriate to the task. This is a case where Unicode normalisation generally throws away information (namely, how the author views the characters), whereas in analysing Burmese you may want to

Re: Rendering Raised FULL STOP between Digits

2013-03-20 Thread Karl Williamson
On 03/09/2013 07:52 PM, Richard Wordingham wrote: On Sat, 09 Mar 2013 16:21:17 -0700 Karl Williamson wrote: Sorry, for the delayed reply; I've been under deadline Rendering is not the only consideration. Processing textual content for 0387 is broken because it is considered to

Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Karl Williamson
On 03/09/2013 03:41 PM, Asmus Freytag wrote: Due to another unfortunate unification (or semi-unification), 0387 (Greek ano teleia) has been defined as canonical equivalent to 00B7, with the note “00B7 is the preferred character”. This means that glyph design for 00B7 needs to take this into acco
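The canonical equivalence described here is easy to observe: any normalization form replaces U+0387 with U+00B7. A short sketch using the standard unicodedata module:

```python
import unicodedata

ano_teleia = "\u0387"  # GREEK ANO TELEIA
middle_dot = "\u00b7"  # MIDDLE DOT

nfc = unicodedata.normalize("NFC", ano_teleia)
nfd = unicodedata.normalize("NFD", ano_teleia)

if __name__ == "__main__":
    # Both forms yield U+00B7, so the distinction is lost after normalizing.
    print(f"U+{ord(nfc):04X}", f"U+{ord(nfd):04X}")
```

This is why the glyph for 00B7 has to work in Greek contexts too: normalized text no longer records which of the two the author typed.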

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Karl Williamson
On 12/23/2012 09:56 AM, Jukka K. Korpela wrote: 2012-12-23 18:09, Karl Williamson wrote: As another poster said, this quotation would be considered fair use under USA law. It was not a quotation but an excerpt posted without permission. Quotations are allowed when they are needed to back up

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Karl Williamson
that as a native USA English speaker, I found nothing wrong with the original post. And I found Jukka's response objectionable. They say that the road to hell is paved with good intentions. Sincerely, Erkki -Original message- From: unicode-bou...@unicode.org [mailto:unicode-

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-22 Thread Karl Williamson
On 12/22/2012 03:45 PM, Jukka K. Korpela wrote: 2012-12-22 23:56, Costello, Roger L. wrote: I figure the people on this list can truly appreciate this: I don’t. You are posting an excerpt from a copyrighted book as such, not as a legal quotation for an acceptable purpose. Moreover, you have d

Re: Are Named sequences always going to be graphemes?

2012-06-21 Thread Karl Williamson
On 06/21/2012 01:45 PM, Asmus Freytag wrote: OK. Will they always be in NFC? To apply Ken's dictum to this case: That seems like a straitjacket looking for an unwilling wearer. ;-) Unless it's excluded from the start, anytime you limit it, when the time comes you need something li

Re: Are Named sequences always going to be graphemes?

2012-06-21 Thread Karl Williamson
On 06/20/2012 07:43 PM, Ken Whistler wrote: On 6/20/2012 3:22 PM, Karl Williamson wrote: All current named sequences appear to be each a single grapheme. That seems like it should always be the case. Possibly, but keep in mind that neither the Unicode Standard nor UAX #29 in particular

Are Named sequences always going to be graphemes?

2012-06-20 Thread Karl Williamson
All current named sequences appear to be each a single grapheme. That seems like it should always be the case. If I'm right, should UAX #34 say this?

Turkic casefolding rules

2012-05-12 Thread Karl Williamson
In CaseFolding.txt, it says the following: "Note that the Turkic mappings do not maintain canonical equivalence without additional processing. See the discussions of case mapping in the Unicode Standard for more information." I couldn't find any more detail about these in the 6.1 Unicode sta

Re: Origins of w

2012-04-16 Thread Karl Williamson
On 04/16/2012 12:04 PM, Asmus Freytag wrote: On 4/16/2012 9:23 AM, arno.s wrote: On 16/04/2012 15:55, Andreas Prilop wrote: On Sun, 15 Apr 2012, David Starner wrote: At Wiktionary, we're looking at (U+1E98) and we can't figure out where it came from. It's from Unicode 1.1, which makes it har

Three character canonical decompositions in version 2 releases

2012-04-03 Thread Karl Williamson
http://unicode.org/policies/stability_policy.html says that effective starting in Version 2.0, "Canonical mappings (Decomposition_Mapping property values) are always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping." I noti
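One way to check the quoted policy against a given data set is to scan every Decomposition_Mapping value. The sketch below does this for whatever Unicode version the running Python interpreter bundles, which will be far newer than the Version 2 releases this message asks about, so it only demonstrates the technique. The function name longest_canonical_mapping is invented for this example.

```python
import sys
import unicodedata


def longest_canonical_mapping() -> int:
    """Return the largest number of code points in any canonical mapping."""
    longest = 0
    for cp in range(sys.maxunicode + 1):
        mapping = unicodedata.decomposition(chr(cp))
        # Compatibility mappings start with a tag such as "<compat>"; skip them.
        if mapping and not mapping.startswith("<"):
            longest = max(longest, len(mapping.split()))
    return longest


if __name__ == "__main__":
    print(longest_canonical_mapping())
```

If the policy holds for the bundled data, the printed value should not exceed 2. Applying the same scan to an old UnicodeData file would require parsing that file directly instead of relying on unicodedata.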

Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Karl Williamson
On 11/16/2011 07:25 AM, Asmus Freytag wrote: Peter, in principle, the idea of a provisional status is a useful concept whenever one wants to "publish" something based on potentially doubtful or possibly incomplete information. And you are correct, that, in principle, such an approach could be mo

Re: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Karl Williamson
On 09/09/2011 02:36 PM, Kent Karlsson wrote: On 2011-09-09 21:24, "Karl Williamson" wrote: On 07/06/2011 04:23 PM, Ken Whistler wrote: I'm not sure whether the FB05/FB06 instance is important enough to add or not. Neither of those compatibility ligatures should ordinarily

How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Karl Williamson
On 07/06/2011 04:23 PM, Ken Whistler wrote: I'm not sure whether the FB05/FB06 instance is important enough to add or not. Neither of those compatibility ligatures should ordinarily be used in text, anyway ... --Ken I'm wondering what other characters might not ordinarily be used in text, or

Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-31 Thread Karl Williamson
On 08/30/2011 06:27 PM, Philippe Verdy wrote: After looking at the effective reason why this PRI #202 emerged (a request from Perl authors), exposed in UTC document number "L2/2011/11281", I think now that even *all* these aliases were not needed. The bug emerged in Perl only because a character

Slide show: Survey of current programming language support for Unicode

2011-07-30 Thread Karl Williamson
Tom Christiansen recently gave a talk at the OSCON conference concerning the varying levels of support for Unicode in some current programming languages. It is accessible via this link http://training.perl.com/OSCON2011/index.html The talk is entitled "Unicode Support Shootout", and is one

Re: Proposed Update UAXes for Unicode 6.1

2011-07-07 Thread Karl Williamson
On 07/07/2011 02:33 PM, announceme...@unicode.org wrote: Proposed updates for most Unicode Standard Annexes for Version 6.1 of the Unicode Standard have been posted for public review. See http://www.unicode.org/review/ for details and links to the various documents. Review periods for provision

Re: Questions about UAX #29

2011-07-05 Thread Karl Williamson
On 07/05/2011 09:29 AM, Mark Davis ☕ wrote: Ah, you're right; I wasn't looking carefully enough at what you wrote. Yes, an unassigned code point (Cn) is treated as a base character. Unassigned code points are peculiar beasts, since we don't know really how they should behave until (and if) they

Re: Questions about UAX #29

2011-07-04 Thread Karl Williamson
On 07/03/2011 05:52 PM, Mark Davis ☕ wrote: Mark /— Il meglio è l’inimico del bene —/ On Sat, Jul 2, 2011 at 14:58, Karl Williamson <pub...@khwilliamson.com> wrote: I have two questions about this. 1) In UAX #44, it says for information about the Grapheme_Base pr

Questions about UAX #29

2011-07-02 Thread Karl Williamson
I have two questions about this. 1) In UAX #44, it says for information about the Grapheme_Base property, to see UAX #29, but that document doesn't mention this property. 2) The definition in UAX #29 for both legacy and extended grapheme clusters effectively says that any Gc=Cn code points fo

Unicode game

2010-11-17 Thread karl williamson
I'm posting this Perl program so the author doesn't have to subscribe to this list. We thought people here might appreciate it. As the sample output shows, it takes input text and reverses and mirrors it. [This is completely silly, just an afternoon programming game.] Witness "leo" in acti

Irrational numeric values in TUS

2010-10-12 Thread karl williamson
The Unicode standard only gives numeric values to rational numbers. Is the reason for this merely because of the difficulty of representing irrational ones? In looking through the list of code points, I actually found only one case where a character totally unambiguously refers to a particula
