Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Markus Scherer via Unicode
On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode wrote: > I thought the whole premise of GB18030 was that it was Unicode mapped into > a GB2312 framework. What characters exist in GB18030 that don't exist in > Unicode, and have they been proposed for Unicode yet, and why was none of > the

Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Markus Scherer via Unicode
On Wed, Feb 12, 2020 at 11:37 AM Marius Spix via Unicode < unicode@unicode.org> wrote: > In my opinion, this is an invalid character, which should not be > included in Unicode. > Please remember that feedback that you want the committee to look at needs to go through http://www.unicode.org/report

Fwd: ICU 66preview available

2019-12-05 Thread Markus Scherer via Unicode
rence documents are published on unicode-org.github.io/icu-docs/ – follow the “Dev” links there. Best regards, Markus Scherer for the ICU Project

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode
On Mon, Dec 2, 2019 at 5:47 PM विश्वासो वासुकिजः (Vishvas Vasuki) via Unicode wrote: > But that says that the definitions are at >> > >> https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml >> , >> but all one currently gets from that is an error message 'XML Parsing

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode
On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode < unicode@unicode.org> wrote: > You don't need an ISO 15924 script code. You need to think in terms of BCP > 47. Sanskrit in Latin would be sa-Latn. > Right! Now, if you want to distinguish the different transcription systems for > wri

Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Markus Scherer via Unicode
On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode < unicode@unicode.org> wrote: > But first there's still no code in ISO 15924 (first step easy to complete > before encoding in the UCS). > That's not first; it's nearly last. The script code standard says "In general, script codes shall

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Markus Scherer via Unicode
On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Thu, 10 Oct 2019 15:23:00 -0700 > Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > > the alternation -- so
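The point about order in the alternation can be tried out with a short sketch. It uses Python's backtracking `re` module as a stand-in: `re` does not support the UTS #18 `\q{...}` notation, so the literal cluster is written out as an explicit alternation, and the variable names are only illustrative. This shows how one backtracking engine behaves, not how literal clusters must be implemented.

```python
import re

# Approximation of [c \q{ch}]h: the longer alternative is tried first.
longest_first = re.compile(r"(?:ch|c)h")
# Same alternatives in the opposite order.
shortest_first = re.compile(r"(?:c|ch)h")

text = "chh"
m1 = longest_first.match(text)
m2 = shortest_first.match(text)
print(m1.group(0))  # chh
print(m2.group(0))  # ch

# With backtracking, the longest-first form still matches plain "ch":
# "ch" is tried first, the following "h" fails, and the engine falls back to "c".
print(longest_first.fullmatch("ch") is not None)  # True
```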

Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Markus Scherer via Unicode
On Fri, Oct 11, 2019 at 4:37 AM Fred Brennan via Unicode < unicode@unicode.org> wrote: > Many users are asking me and I'm not sure of the answer (nor how to find > it > out). > You can find out by looking at the data files that are being developed for Unicode 13. Look at the latest UnicodeData.tx

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Markus Scherer via Unicode
On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c". Thus th

Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Markus Scherer via Unicode
On Fri, Oct 4, 2019 at 2:05 PM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > > >> Is the use of the Meitei script aspirational or customary? > > >> Which script is being used for major newspapers, popular books, > > >> and video captions? > > > > > > This may give you some more in

Manipuri/Meitei customary writing system

2019-10-03 Thread Markus Scherer via Unicode
Dear Unicoders, Is Manipuri/Meitei customarily written in Bangla/Bengali script or in Meitei script? I am looking at https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems to describe writing practice in transition, and I can't quite tell where it stands. Is the use of the Mei

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode
There are lots of ways to implement the UCA. When you want fast string comparison, the zero weights are useful for processing -- and you don't actually assemble a sort key. People who want sort keys usually want them to be short, so you spend time on compression. You probably also build sort keys

Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Markus Scherer via Unicode
On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode < unicode@unicode.org> wrote: > ... The only > operation that can cause problems is 'capitalize'. > > When I say "cause problems", I mean producing mixed-case output. I > originally thought that 'capitalize' would be fine. It is fine for

Re: Diacritic marks in parentheses

2018-07-26 Thread Markus Scherer via Unicode
I would not expect for Ä+combining () above = Ä᪻ to look right except with specialized fonts. http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0 Even if it worked widely, I think it would be confusing. I think you are best off writing Arzt/Ärztin. Viele Grüße, markus

Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-15 Thread Markus Scherer via Unicode
On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode < unicode@unicode.org> wrote: > Dear Unicode list members, > > I wish to get feedback about a new symbol submission proposal. > Just to clarify, this is a discussion list where you may get some useful feedback. This is not where you woul

Re: More scripts, not more emoji (Re: Accessibility Emoji)

2018-04-14 Thread Markus Scherer via Unicode
On Sat, Apr 14, 2018 at 5:50 PM, Marcel Schneider via Unicode < unicode@unicode.org> wrote: > We need to get more scripts into Unicode, not more emoji. > > That is — somewhat inflated — the core message of a NYT article published > six months ago, > and never shared here (no more than so many arti

Re: [Unicode] Re: Fonts and font sizes used in the Unicode

2018-03-05 Thread Markus Scherer via Unicode
On Mon, Mar 5, 2018 at 9:03 AM, suzuki toshiya via Unicode < unicode@unicode.org> wrote: > I have a question; if some people try to make a > translated version of Unicode, they should contact > all font contributors and ask for the license? > Unicode Consortium cannot give any sublicense? > If yo

Re: Fonts and font sizes used in the Unicode

2018-03-04 Thread Markus Scherer via Unicode
On Sun, Mar 4, 2018 at 6:10 AM, Helena Miton via Unicode < unicode@unicode.org> wrote: > Greetings. Is there a way to know which font and font size have been used > in the Unicode charts (for various writing systems)? Many thanks! > What are you trying to do? Many of the fonts are unique to the

Re: Emoji blooper

2018-02-13 Thread Markus Scherer via Unicode
On my machine (Chromebox+Gmail), the trumpets point down to the lower left. If you want to convey precise images, then send images... markus

Re: Internationalization & Unicode Conference 2018

2018-01-24 Thread Markus Scherer via Unicode
If your presentation is accepted for the conference, you should get a hotel discount. markus

Re: Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Markus Scherer via Unicode
On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > May a collation algorithm that always compares all strings as equal be a > compliant implementation of the Unicode Collation Algorithm (UTS #10)? > If not, by which clause is it not compliant? Formally,

Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode
On Wed, Sep 27, 2017 at 4:07 PM, James Tauber wrote: > Ah yes, I was just going by membership in the CJK Unified Ideographs > Extension E block, not actual assignment. > > So the lack of assignment means it should fail the Unified_Ideograph > membership in http://unicode.org/reports/tr10/#Values_

Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode
On Wed, Sep 27, 2017 at 1:49 PM, James Tauber via Unicode < unicode@unicode.org> wrote: > I recently updated pyuca[1], my pure Python implementation of the Unicode > Collation Algorithm to work with 8.0.0, 9.0.0, and 10.0.0 but to get all > the tests to work, I had to special case the implicit wei

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-09-22 Thread Markus Scherer via Unicode
FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending code review). Proposal & description: https://sourceforge.net/p/icu/mailman/message/35990833/ Code changes: http://bugs.icu-project.org/trac/review/13311 Best regards, markus On Thu, Aug 3, 2017 at 5:34 PM, Mark Davis ☕️

Re: Emoji Space

2017-07-17 Thread Markus Scherer via Unicode
On Mon, Jul 17, 2017 at 5:25 AM, Christoph Päper via Unicode < unicode@unicode.org> wrote: > As you may know, the combined original Japanese emoji set included three > whitespace characters: one was the full width of a (square) emoji, one was > half that and the last one was a quarter blank. Their

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-03 Thread Markus Scherer via Unicode
On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen wrote: > On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode > wrote: > > There is plenty of time for public comment, since it was targeted at > Unicode > > 11, the release for about a year from now, not Unicode 10, due this year. > > When th

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote: > But there's plenty in the text that makes it absolutely clear that some > things cannot be included. In particular, it says > > > The term “maximal subpart of an ill-formed subsequence” refers to the code > units that were collected i

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: > On 05/24/2017 12:46 AM, Martin J. Dürst wrote: > >> That's wrong. There was a public review issue with various options and >> with feedback, and the recommendation has been implemented and in use >> widely (among else, in major programming

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode < unicode@unicode.org> wrote: > So, if the proposal for Unicode really was more of a "feels right" and not > a "deviate at your peril" situation (or necessary escape hatch), then we > are better off not making a RECOMMENDATION that goes aga

Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Markus Scherer via Unicode
On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its short

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode
Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section

Re: Emoji Compatibility Symbols

2017-04-04 Thread Markus Scherer
There were some symbols, mostly proprietary logos, that we did not propose for encoding in Unicode. See pages 83-89 of http://www.unicode.org/L2/L2010/10132-emojidata.pdf You could also mine the defunct symbols subcommittee page for more information: https://sites.google.com/site/unicodesymbols/Ho

Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Markus Scherer
On Mon, Apr 3, 2017 at 2:33 PM, Michael Everson wrote: > On 3 Apr 2017, at 18:51, Markus Scherer wrote: > > > It seems to me that higher-level layout (e.g., HTML+CSS) is appropriate for > the board layout (e.g., via a table), board frame style, and cell/field > shading.

Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Markus Scherer
It seems to me that higher-level layout (e.g., HTML+CSS) is appropriate for the board layout (e.g., via a table), board frame style, and cell/field shading. In each field, the existing characters should suffice. markus

Re: Unicode Emoji 5.0 characters now final

2017-03-29 Thread Markus Scherer
I think "recommended" could be renamed to "(expected to be) widely implemented". markus

Re: Unicode Emoji 5.0 characters now final

2017-03-28 Thread Markus Scherer
On Tue, Mar 28, 2017 at 11:41 AM, Doug Ewell wrote: > Mark Davis wrote: > > > 3. Valid, but not recommended: "usca". Corresponds to the valid > > Unicode subdivision code for California according to > > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > > and CLDR, but is n

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer
On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy wrote: > I followed the links. Check your links, you are referencing the proposal, > and this contradicts the published version 4.0 of TR51. Where is stability ? > Of course I am pointing to the proposal. The version of TR 51 under review adds a me

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer
On Mon, Mar 27, 2017 at 4:58 PM, Philippe Verdy wrote: > This only describes the sequences encoded with 2 characters, not the newer > longer sequences for flags of subnational regions. the > unicode_region_subtag data does not contain anything about the flags for > the first 3 regions in GB. > P

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer
On Mon, Mar 27, 2017 at 1:34 PM, Ken Whistler wrote: > Anybody could *attempt* to convey a flag of Pomerania (a rather handsome > black gryphon on a yellow background, btw) with an emoji tag sequence right > now, I suppose. I suppose not. Since it's bound to ISO 3166 subdivision codes (possibly

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer
On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to ensure > this stability by only allowing sequences that are effectively registered > in http://www.unico

Re: Encoding of old compatibility characters

2017-03-27 Thread Markus Scherer
I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. In my opinion, this is a good use of the Private Use Area among a very small group of people. See also https://en.wikipedia.org/wiki/ConScript_Unicod

Re: Implications of Logical Order Exception Property

2017-01-25 Thread Markus Scherer
On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > > > 2) Claims that logical_order_exception is relevant for searching > > > (TUS, as above) > > > It informs the construction of the DUCET and could be used to > > suppress_contractions in a search tail

Re: Implications of Logical Order Exception Property

2017-01-25 Thread Markus Scherer
On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > I now have a clutch of errors to report on Unicode's use of the term > 'logical order' and references to logical_order_exception: > > 1) Claims that Thai is not encoded in logical order in > Technica

Re: IdnaTest.txt and RFC 5893

2017-01-04 Thread Markus Scherer
On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton < alast...@alastairs-place.net> wrote: > RFC 5893 seems pretty clear to me, and the problem really is that the test > vectors (which come from unicode.org) seem (to me) to be incorrect. https://tools.ietf.org/html/rfc5893#section-2 says "*The fol

Re: About standardized variants of characters in Dingbat block

2016-12-25 Thread Markus Scherer
On Sun, Dec 25, 2016 at 8:33 AM, Yifán Wáng <747.neut...@gmail.com> wrote: > I'm curious about the reason why U+270C VICTORY HAND ✌ has > standardized text and emoji styles defined but not with U+270A RAISED > FIST ✊ and U+270B RAISED HAND ✋. > http://www.unicode.org/Public/9.0.0/ucd/StandardizedV

Re: Best practices for replacing UTF-8 overlongs

2016-12-20 Thread Markus Scherer
On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler wrote: > You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the > text there about best practices for using U+FFFD was the discussion and > resolution of PRI #121 in August, 2008: > > http://www.unicode.org/review/pr-121.html > Yes

Re: Best practices for replacing UTF-8 overlongs

2016-12-19 Thread Markus Scherer
On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson wrote: > It seems counterintuitive to me that the two byte sequence C0 80 should be > replaced by 2 replacement characters under best practices, or that E0 80 80 > should also be replaced by 2. Each sequence was legal in early Unicode > versions,
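For readers who want to see the C0 80 case concretely, here is a minimal sketch using CPython's built-in UTF-8 decoder with `errors="replace"`. The helper name `decode_lenient` is made up for this example, and the output is an observation about CPython, not a statement of what the standard requires of every decoder.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each ill-formed subsequence."""
    return data.decode("utf-8", errors="replace")

# C0 80 is the overlong two-byte form of U+0000; neither byte can start
# or continue a well-formed sequence here.
overlong_nul = bytes([0xC0, 0x80])
decoded = decode_lenient(overlong_nul)
print([hex(ord(ch)) for ch in decoded])  # ['0xfffd', '0xfffd']

# Well-formed input passes through unchanged.
print(decode_lenient("é".encode("utf-8")))  # é
```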

Re: Mixed-Script confusables in prog.languages

2016-12-04 Thread Markus Scherer
On Sun, Dec 4, 2016 at 3:09 AM, Reini Urban wrote: > Is anybody aware of any other language implementation, which does > confusable or mixed-script protection? > I think R has something, because it has this header: > https://cran.r-project.org/bin/windows/extsoft/3.4/ > include/unicode/uspoof.h

Re: Emoji mappings in Shift JIS / CP932/943

2016-12-03 Thread Markus Scherer
On Sat, Dec 3, 2016 at 2:37 PM, Christoph Päper wrote: > If an existing character encoding forms the (sole) base of an addition to > Unicode, shouldn’t it be part of the UTC’s job to document these sources? > This was obviously done in the case of Japanese emoji, hence the existence > of EmojiSou

Re: Emoji mappings in Shift JIS / CP932/943

2016-12-02 Thread Markus Scherer
On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper wrote: > Could and should custom vendor extensions like the ones documented in > > - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt > > be included in these mappings? > They could, but it would be best for vendors to publish their actual

Re: IJ with accent

2016-09-28 Thread Markus Scherer
On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy wrote: > My opinion is to put an accent on each letter and join them with a joiner > I don't see a reason for the joiner. markus

Re: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech

2016-08-26 Thread Markus Scherer
On Fri, Aug 26, 2016 at 10:26 AM, Ken Whistler wrote: > On 8/26/2016 10:01 AM, John O'Conner wrote: > >> What I find more interesting is how emoji (a small digital image or icon) >> was ever interpreted as encodable text for the Unicode Standard. If our >> German newspaper friends have made a mis

Re: Whitespace characters in Unicode

2016-08-05 Thread Markus Scherer
On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote: > What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and > ZWNBSP not "whitespace" even though they clearly say "SPACE" in them? > I think "white space" basically wants to have an advance width (occupy space) but no ink (all w

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Markus Scherer
Interesting discussion! ICU does not support "is" nor "in" prefixes. I wasn't even aware that UAX #44 loose matching prescribes "is". ICU just implements what Property[Value]Aliases.txt say: # Loose matching should be applied to all property names and property values, with # the exception of Stri

Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Markus Scherer
Note that the Block property is an artifact of how the committee organizes the encoding of characters. It is not very useful for processing. For that, the Script property, Script_Extensions, and others are normally much better. markus

Re: The Hebrew Extended (Proposed) Block

2016-05-10 Thread Markus Scherer
FYI It seems like 08xx is reserved for RTL scripts. http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt # The unassigned code points that default to R are in the ranges: # [\u0590-\u05FF *\u07C0-\u089F* \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EDFF \U0001EF

Re: Case for letters j and J with acute

2016-02-09 Thread Markus Scherer
On Tue, Feb 9, 2016 at 3:18 AM, ACJ Unicode wrote: > [3] https://en.wikipedia.org/wiki/IJ_(digraph)#Stress > This says "in Unicode it is possible to combine characters into a *j* with an acute accent – "bí

Re: Case for letters j and J with acute

2016-02-09 Thread Markus Scherer
On Tue, Feb 9, 2016 at 7:58 AM, Michael Everson wrote: > On 9 Feb 2016, at 11:18, ACJ Unicode wrote: > > > This is taught in writing in primary school in the Netherlands (or at > least it was 30 years ago), but this practice is often abandoned soon > afterwards, probably because of the technical

Re: precomposed polytonic Greek characters with macrons and other diacritics

2016-02-08 Thread Markus Scherer
On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote: > Even with all this, though, my own work includes accentuation and > syllabification algorithms, all of which are made more cumbersome by the > lack of precomposed characters indicating vowel length. I'm currently > leaning towards adding a la

Re: Unicode password mapping for crypto standard

2016-01-05 Thread Markus Scherer
I would specify that UTF-8 must be used, without mapping. US-ASCII is a proper subset, so need not be mentioned explicitly, nor distinguished in the protocol. Mappings would require that all implementations carry relevant data, and are up to date to recent versions of Unicode, or else previously-un
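A minimal sketch of what "UTF-8, without mapping" could look like in practice. The helper `password_bytes` is hypothetical, and the example assumes that "mapping" includes normalization, which the message does not spell out.

```python
import unicodedata

def password_bytes(password: str) -> bytes:
    """Hypothetical helper: UTF-8 encode with no normalization or other mapping."""
    return password.encode("utf-8")

# US-ASCII is a subset of UTF-8, so ASCII passwords need no special case.
ascii_password = "correct horse battery staple"
same_as_ascii = password_bytes(ascii_password) == ascii_password.encode("ascii")
print(same_as_ascii)  # True

# Without mapping, canonically equivalent spellings stay distinct byte strings.
composed = unicodedata.normalize("NFC", "p\u00e4ssword")
decomposed = unicodedata.normalize("NFD", "p\u00e4ssword")
forms_differ = password_bytes(composed) != password_bytes(decomposed)
print(forms_differ)  # True
```

The second half shows the trade-off of this design: no Unicode data tables are needed, but the two ends have to agree on the exact code point sequence.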

Re: Hentaigana proposal

2015-12-10 Thread Markus Scherer
Dear Mr. Tranter, I can't tell whether you intend to start a discussion on this discussion mailing list, or intend to submit feedback on a proposal. Maybe you are looking for discussion before you formalize your feedback. If you do intend to submit feedback, then, once you have formulated a posit

Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Markus Scherer
On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy wrote: > (0xFF was reserved only in the old RFC version of UTF-8 when it allowed > code points up to 31 bits, but even this RFC is obsolete and should no > longer be used and it has never been approved by Unicode). > No, even in the original UTF-8 d

Re: Emoji data in UCD xml ?

2015-11-03 Thread Markus Scherer
About http://www.unicode.org/L2/L2015/15299-ucd-emoji-props.pdf which has Emoji_Presentation (EP) ● Non_Emoji (NE) ● Default_Text (DT) ● Default_Emoji (DE) ● NA Why do we need both Non_Emoji and NA? Can't Non_Emoji be the default for all code points that are not mentioned in the data? markus

Re: Unpaired surrogates (was: Re: Why Work at Encoding Level?)

2015-10-19 Thread Markus Scherer
On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell wrote: > > ICU (but perhaps it's actually Java) seems to have a culture of > > tolerating lone surrogates, and rules for handling lone surrogates are > > strewn across the Unicode standards and annexes. > > I suspect you have an example. I have exampl

Re: Deleting Lone Surrogates

2015-10-04 Thread Markus Scherer
I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of-range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors. markus
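To make "unpaired surrogates in 16-bit strings" concrete, here is a small sketch that only locates them in a sequence of 16-bit code units. The function name and the sample data are invented for illustration; in keeping with the advice above, it does nothing special with the lone surrogates beyond reporting where they are.

```python
from typing import List, Sequence

def unpaired_surrogate_indexes(units: Sequence[int]) -> List[int]:
    """Return the indexes of surrogate code units that are not part of a pair."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:  # trail surrogate without a preceding lead
            lone.append(i)
        i += 1
    return lone

# 'A', a well-formed pair (U+1F600), a lone lead, 'B', a lone trail
units = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
print(unpaired_surrogate_indexes(units))  # [3, 5]
```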

Re: Hentaigana and the Kana Supplement block

2015-07-27 Thread Markus Scherer
On Mon, Jul 27, 2015 at 4:46 PM, Garth Wallace wrote: > where > does that leave the Kana Supplement block? That block contains only > two encoded characters, but was allocated 256 code points, presumably > for the future encoding of hentaigana. With hentaigana handled by > SVSes, it seems unlikel

Re: ISO 15924

2015-07-09 Thread Markus Scherer
Thanks! markus

Re: Precomposed Cyrillic letters

2015-07-09 Thread Markus Scherer
On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell wrote: > From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, > "Addition of two letters from Montenegrin language, CYRILLIC script": > > > 9. Can any of the proposed characters be encoded using a composed > > character sequence of either

Re: Possible issue with Character Fallback Substitutions between version 24 and 25 ?

2015-06-18 Thread Markus Scherer
If the chart does not reflect the data, then please submit a bug ticket. http://unicode.org/cldr/trac/newticket The data is what counts. markus

Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread Markus Scherer
Looks all wrong to me. "don’t" is a contraction of two words, it is not one word. English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawaiʻian ʻOkina.) You can't use simple regular expressions to find word boundaries.
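One reading of the last sentence, sketched with Python's `re` module: a simple `\w+` pattern treats U+2019 as a non-word character, so it splits the contraction. The sample text is made up, and other regex engines may classify characters differently.

```python
import re

text = "I don\u2019t know"  # U+2019 RIGHT SINGLE QUOTATION MARK as apostrophe
tokens = re.findall(r"\w+", text)
print(tokens)  # ['I', 'don', 't', 'know']
```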

Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Markus Scherer
On Mon, May 18, 2015 at 11:19 AM, Doug Ewell wrote: > Is the new mechanism intended to allow flag tags that include either > "subtype" values or "contains" values? As far as I can tell from your quotes, CLDR will say what's valid (plus containment info), and Unicode permits you to show a flag f

Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Markus Scherer
On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy wrote: > 2015-05-09 5:13 GMT+02:00 Richard Wordingham < > richard.wording...@ntlworld.com>: > >> I can't think of a practical use for the specific concepts of Unicode >> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >> essentially the

Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-07 Thread Markus Scherer
I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters. Some code stores binary data (sequence of arbitrary 16-bit un
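As an illustration of a parser that accepts such input, the sketch below uses the `json` module from CPython's standard library. The behavior shown is what that module does, not something the JSON specification is claimed to require.

```python
import json

# JSON text containing an escaped lone lead surrogate.
value = json.loads('"\\ud800"')
print(len(value), hex(ord(value)))  # 1 0xd800

# The resulting string is not well-formed Unicode text, so a strict
# UTF-8 encode refuses it.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# A proper surrogate pair is combined into one supplementary code point.
paired = json.loads('"\\ud83d\\ude00"')
print(hex(ord(paired)))  # 0x1f600
```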

Re: Usage stats?

2015-03-27 Thread Markus Scherer
On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton < michaelanortons...@gmail.com> wrote: > Easy example: what's the code for [blank space] U+020 across all language > sets of Unicode? Is it the same ie: 100%? > I don't understand what you are asking, and I have a hunch you haven't said it in a way

Re: Fixing the sort order of the SignWriting symbols in Unicode 8

2015-02-24 Thread Markus Scherer
On Tue, Feb 24, 2015 at 9:38 AM, Stephen E Slevinski Jr < sle...@signpuddle.net> wrote: > Hi Unicode list, > This is a useful place for discussion, but once the discussion peters out please submit formal feedback: http://www.unicode.org/review/pri285/ I am concerned that the SignWriting symbols

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-20 Thread Markus Scherer
On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii wrote: > I think decomposition to NFKD solves these issues, doesn't it? > Not completely. Judging from your question, you expected more mappings than NFKD has. You might want to try the mappings that are used as input for deriving the DUCET (defaul

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Markus Scherer
On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii wrote: > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "²" when they > search for "2" etc. > Depe

Re: About cultural/languages communities flags

2015-02-09 Thread Markus Scherer
On Mon, Feb 9, 2015 at 1:11 PM, Joan Montané wrote: > AFAIK, this is done in font side. Emoji flags are just ligatures, so a > font can provide a ligature for 4 RIS characters. > Technically true, but a font that violates the encoding standard would cause large problems. Imagine a font that liga

Re: About cultural/languages communities flags

2015-02-09 Thread Markus Scherer
On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi < andrea.giammar...@gmail.com> wrote: > > if a cultural/language TLD is typed with Unicode RIS, then show the flag > for these culture/language: > This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to c
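A minimal sketch of the pair mechanism: each ASCII letter of a two-letter region code is shifted into the REGIONAL INDICATOR SYMBOL range. The function name is invented for this example, and whether a given pair is valid or rendered as a flag depends on data and fonts that this sketch does not check.

```python
RIS_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def regional_indicator_pair(region: str) -> str:
    """Map a two-letter ASCII region code to a pair of regional indicator symbols."""
    if len(region) != 2 or not (region.isascii() and region.isalpha()):
        raise ValueError("expected exactly two ASCII letters")
    return "".join(chr(RIS_A + ord(c) - ord("A")) for c in region.upper())

pair = regional_indicator_pair("DE")
print([hex(ord(ch)) for ch in pair])  # ['0x1f1e9', '0x1f1ea']
```

A four-letter tag does not fit this scheme; a sequence of four such symbols would be read as two pairs.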

Re: Wrong plane numbers

2015-02-06 Thread Markus Scherer
These are not block boundaries. These lines are for book chart production, where we don't need to print every unassigned code point. markus ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode

N'Ko - which character? 02BC vs. 2019

2015-01-31 Thread Markus Scherer
Dear Unicoders, which is the proper second character in "N'Ko"? See below for details. Thanks, markus -- Forwarded message -- From: Doug Ewell Date: Sat, Jan 31, 2015 at 9:16 AM Subject: Apostrophes (was: Re: ISO 639-3 changes) To: Philip Newton Cc: ietf-langua...@iana.org Phili

Re: Looking for a standard on historical countries

2014-10-31 Thread Markus Scherer
On Fri, Oct 31, 2014 at 6:20 AM, "Jörg Knappen" wrote: > Is someone here aware of a standard or a de facto standard for names > or codes of historical countries? For the requirement I have in mind, all > countries where there was a printing press would be optimal coverage, > anything going b

Re: Bliss?

2014-10-13 Thread Markus Scherer
As Michael said, I don't have information. But I found this which might help: http://en.wikipedia.org/wiki/Blissymbols#Towards_the_international_standardization_of_the_script markus

Re: Bliss?

2014-10-13 Thread Markus Scherer
On Mon, Oct 13, 2014 at 2:23 PM, Jean-François Colson wrote: > I’ve found a 16-year-old proposal for Blissymbolics ( > http://www.evertype.com/standards/iso10646/pdf/bliss.pdf ) but nothing more > recent. Was that script rejected? Was it forgotten? Are there any technical > difficulties related to

Re: Request for Information

2014-07-23 Thread Markus Scherer
Some of the data is available in the Unicode CLDR script metadata: http://unicode.org/cldr/trac/browser/trunk/common/properties/scriptMetadata.txt http://cldr.unicode.org/development/updating-codes/updating-script-metadata markus -- Google Internationalization Engineering

Re: Default case algorithms

2014-06-24 Thread Markus Scherer
On Tue, Jun 24, 2014 at 6:46 PM, Daniel Bünzli wrote: > Having a look at the data it seems that the Uppercase_Mapping property of > UCD includes (using the terminology of SpecialCasing.txt): > > * All the unconditional mappings of SpecialCasing.txt (context independent) > * None of the conditiona

Re: Default case algorithms

2014-06-24 Thread Markus Scherer
On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli wrote: > Does an algorithm that simply applies R1 *regardless of context* > constitute a default case algorithm or not ? I.e. does simply mapping each > character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the > XML UCD) constitute

Re: Default case algorithms

2014-06-24 Thread Markus Scherer
The context-sensitive and/or language-sensitive mappings are here: http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt Best regards, markus
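A short sketch of how this looks from one implementation's side, using Python's built-in `str.upper()`. As observed in CPython, it applies one-to-many mappings such as ß to SS but has no locale parameter, so language-sensitive mappings (for example the Turkish dotted/dotless i rules) are not applied. This describes CPython's behavior only.

```python
# One-to-many uppercase mappings are applied.
sharp_s = "\u00df".upper()      # LATIN SMALL LETTER SHARP S
fi_ligature = "\ufb01".upper()  # LATIN SMALL LIGATURE FI
print(sharp_s)      # SS
print(fi_ligature)  # FI

# No language-sensitive tailoring: "i" maps to "I", not to
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE as Turkish would need.
plain_i = "i".upper()
print(plain_i)  # I
```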

Re: Corrigendum #9

2014-06-12 Thread Markus Scherer
On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson wrote: > I have a something like a library that was written a long time ago (not by > me) assuming that noncharacters were illegal in open interchange. Programs > that use the library were guaranteed that they would not receive > noncharacters in t

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote: > I would especially discourage any web browser from handling > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebra

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele wrote: > To further my understanding, can someone provide examples of how these are > used in actual practice? > CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-col

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell wrote: > I suspect everyone can agree on the edge cases, that noncharacters are > harmless in internal processing, but probably should not appear in > random text shipped around on the web. > Right, in principle. However, it should be ok to include nonc

Re: Corrigendum #9

2014-06-01 Thread Markus Scherer
On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson wrote: > Thanks, I had not thought about that. I'm thinking wording something like > this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to do so > without prior agreement, since at each stage any of them m

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-01 Thread Markus Scherer
On Sun, Jun 1, 2014 at 1:49 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > D80: Unicode string: > A code unit sequence containing code units of a particular Unicode > encoding form... > Right -- in a Unicode 16-bit string, you have a sequence of any 16-bit value in any order.
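The difference between a Unicode 16-bit string (any 16-bit values in any order) and a well-formed UTF-16 string can be made concrete with a small check. This is a sketch only: the function name `is_well_formed_utf16` is hypothetical, and the code units are assumed to arrive as a sequence of Python integers.

```python
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate code unit is part of a lead+trail pair."""
    expecting_trail = False
    for unit in units:
        if not 0 <= unit <= 0xFFFF:
            raise ValueError(f"not a 16-bit code unit: {unit!r}")
        is_lead = 0xD800 <= unit <= 0xDBFF
        is_trail = 0xDC00 <= unit <= 0xDFFF
        if expecting_trail:
            if not is_trail:
                return False  # lead surrogate not followed by a trail
            expecting_trail = False
        elif is_trail:
            return False  # trail surrogate without a preceding lead
        elif is_lead:
            expecting_trail = True
    # A lead surrogate at the very end is also unpaired.
    return not expecting_trail
```

Every sequence of 16-bit integers is a valid input here, which mirrors the broader notion of a Unicode 16-bit string; only the subset for which the function returns `True` is well-formed UTF-16.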

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-31 Thread Markus Scherer
On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > Bear in mind that a pattern \uD808 shall not match anything in a > well-formed Unicode string. Depends. See the definitions of Unicode strings vs. UTF strings. \uD808\uDF45 specifies a sequence of tw

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-31 Thread Markus Scherer
On Sat, May 31, 2014 at 6:41 AM, Mark Davis ☕️ wrote: > I think you have a point here. We should probably change to: > > To meet this requirement, an implementation shall supply a mechanism for > specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to > U+10FFFF), using the hexad

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-30 Thread Markus Scherer
If you use Unicode 16-bit strings, it's easy to "pass through" unpaired surrogates and treat them like code points; it's often not productive or necessary to check for them all the time, that is, to be strict about UTF-16. On the other hand, I don't think anyone expects you to support invalid UTF-
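The lenient "pass through" handling described in this message can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the generator name `code_points` is hypothetical, and the input is assumed to be a sequence of 16-bit integers.

```python
from typing import Iterator, Sequence


def code_points(units: Sequence[int]) -> Iterator[int]:
    """Yield code points from 16-bit code units without rejecting anything.

    A lead surrogate followed by a trail surrogate is combined into one
    supplementary code point; any other surrogate is yielded unchanged,
    as if it were an ordinary code point.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        i += 1
        if 0xD800 <= unit <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            # Combine the surrogate pair into a supplementary code point.
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        yield unit


paired = list(code_points([0xD808, 0xDF45]))
unpaired = list(code_points([0x41, 0xD808, 0x42]))
```

With this approach no separate validation pass is needed: the pair D808 DF45 becomes the single code point U+12345, and a lone D808 between two letters simply comes out as 0xD808.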

Re: Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Markus Scherer
In addition, the Block property is not particularly useful even in regular expressions or other processing. It is almost always more useful to use Script, Alphabetic, Unified_Ideograph, etc. Blocks help with planning and allocation but little else. markus

Re: Guillements in Email

2014-05-02 Thread Markus Scherer
If there is a Gmail bug, then please report it. Either way, I suggest you go into Gmail Settings and set it to "Use Unicode (UTF-8) encoding for outgoing messages". markus

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Markus Scherer
On Fri, Apr 25, 2014 at 11:06 PM, Mathias Bynens wrote: > My initial question can be rephrased as the following remark/change > request: > > http://unicode.org/reports/tr31/#Default_Identifier_Syntax could make it > more clear that “stability extensions” means `Other_ID_Start` and > `Other_ID_Co
