Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Mathias Bynens via Unicode
Neat! Prior art: - https://github.com/watson/base64-emoji - https://github.com/nate-parrott/emojicode On Sun, Mar 11, 2018 at 6:04 AM, Keith Turner via Unicode < unicode@unicode.org> wrote: > I created a neat little project based on Unicode emojis. I thought > some on this list may find

Re: HTTPS

2017-10-04 Thread Mathias Bynens via Unicode
unicode.org and www.unicode.org are now available over HTTPS. E.g. https://unicode.org/Public/10.0.0/ On Thu, Mar 6, 2014 at 3:54 PM, Robbert wrote: > Hi, > > For tools that rely on the Unicode database it would be great if the > databases were available over HTTPS as

Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-06-30 Thread Mathias Bynens via Unicode
On Fri, Jun 30, 2017 at 5:34 PM, Michael Everson via Unicode wrote: > > It would be sensible to case-map ß to ẞ however. I’m hoping this can happen — converting ß to SS is lossy, so mapping to ẞ would be far superior. However,
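The lossiness mentioned above can be seen with Python's built-in string methods, which apply Unicode's default case mappings. This is only an illustrative sketch; `uppercase_keeping_sharp_s` is a hypothetical helper, not something proposed in the thread.

```python
LOWER_SHARP_S = "\u00DF"    # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S


def uppercase_keeping_sharp_s(text: str) -> str:
    """Hypothetical helper: map ß to ẞ first, then apply default uppercasing."""
    return text.replace(LOWER_SHARP_S, CAPITAL_SHARP_S).upper()


word = "stra\u00DFe"

# Default uppercasing expands ß to "SS", so lowercasing cannot restore it.
default_upper = word.upper()
default_round_trip = default_upper.lower()

# Mapping to ẞ keeps a single character that lowercases back to ß.
kept_upper = uppercase_keeping_sharp_s(word)
kept_round_trip = kept_upper.lower()
```

With the default mapping the round trip ends at "strasse"; with the ẞ mapping it returns to the original word.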

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-07 Thread Mathias Bynens
> On 7 Jun 2016, at 17:56, Doug Ewell wrote: > > Rather than changing the spec based on anecdotal evidence, […] > > It seems irresponsible to assume now that nobody anywhere needs > it. What assumption are you talking about? Markus and Nova provided actual examples of

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
> On 7 Jun 2016, at 00:39, Nova Patch wrote: > > […] Based on my past research for Unicode Regular Expression Engines at > IUC38, I suspect that there might not be any regex engine that actually > supports syntax like Script=IsGreek as described in UAX44-LM3! If anybody

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
> >> The `is` prefix doesn’t provide any functionality that would otherwise >> be unavailable. It doesn’t add any value, yet causes incompatibility, >> author confusion, and it increases implementation complexity. > > I don't see any evidence that it adds no value. Support for existing >

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
> On 6 Jun 2016, at 18:04, Ken Whistler wrote: > > UAX #44 doesn't *require* any regex engine to include this "is prefix" > handling. Are you referring to the fact that the first paragraph on http://unicode.org/reports/tr44/#Matching_Rules uses “strongly recommended”

UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix: > For loose matching of symbolic values, an initial prefix string "is" is > ignored. […] Ignoring any initial "is" on a symbolic value during loose > matching is likely to produce the best results in application areas such as
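A minimal sketch of what loose matching with the `is` prefix could look like, assuming the rule amounts to ignoring case, whitespace, underscores, hyphens, and an initial "is". The function name `loose_key` is hypothetical, and edge cases covered by the full text of UAX #44 are not handled here.

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic value to a comparison key (rough UAX44-LM3 approximation)."""
    # Drop whitespace, underscores, and hyphens, then fold case.
    key = re.sub(r"[\s_\-]", "", value).lower()
    # Drop one initial "is" prefix, which is the part under discussion.
    if key.startswith("is"):
        key = key[2:]
    return key


def loosely_equal(a: str, b: str) -> bool:
    """Two symbolic values match loosely when their keys are identical."""
    return loose_key(a) == loose_key(b)
```

Under this sketch, `loosely_equal("IsGreek", "greek")` holds, which is the behavior the thread argues few regex engines actually implement.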

Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens
> On 26 May 2016, at 20:07, Ken Whistler wrote: > > Well, let's take an example. The entry in Blocks.txt for the Arabic > Presentation Forms-A block is: > > FB50..FDFF; Arabic Presentation Forms-A > > The entry for that block in PropertyValueAliases.txt is: > > blk;

Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens
> On 26 May 2016, at 10:17, Mathias Bynens <math...@qiwi.be> wrote: > > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such > as `Cyrillic Supplement`. > > However, `PropertyValueAliases.txt` > (http://unicode.org/Public/UNIDATA/Prope

Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens
`Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`. However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space. Which
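To make the mismatch concrete, here is a small sketch that parses one line in the `Blocks.txt` layout and derives the underscore spelling used in `PropertyValueAliases.txt`. The sample line and both function names are assumptions for illustration, and the sketch only handles spaces, not other punctuation differences between the two files.

```python
def parse_blocks_line(line: str):
    """Parse a `START..END; Block Name` line; return None for comments and blanks."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code_range, name = (part.strip() for part in data.split(";", 1))
    start, end = (int(cp, 16) for cp in code_range.split(".."))
    return start, end, name


def underscore_form(block_name: str) -> str:
    """Replace spaces with underscores, as in the alias spelling quoted above."""
    return block_name.replace(" ", "_")


# Sample line written in the Blocks.txt layout (assumed for this example).
sample = "0500..052F; Cyrillic Supplement"
start, end, name = parse_blocks_line(sample)
alias_style = underscore_form(name)
```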

Re: Unicode in passwords

2015-10-01 Thread Mathias Bynens
> On 1 Oct 2015, at 07:19, Marc Durdin wrote: > > 2. The number of dots corresponds to the number of code points, which > is misleading with complex scripts or advanced input methods: you won’t > necessarily see one dot per keystroke; in some cases, typing a character
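The point about dots and code points can be illustrated with a short sketch. `masked_dots` is a hypothetical stand-in for a password field that draws one dot per code point; real input widgets may behave differently.

```python
import unicodedata


def masked_dots(text: str) -> str:
    """Hypothetical password mask: one bullet per code point."""
    return "\u2022" * len(text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
# but two code points, so the mask shows two dots.
decomposed = "e\u0301"

# The NFC form of the same text is a single code point (U+00E9), one dot.
composed = unicodedata.normalize("NFC", decomposed)

dots_decomposed = masked_dots(decomposed)
dots_composed = masked_dots(composed)
```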

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Mathias Bynens
On 23 Apr 2014, at 20:18, Markus Scherer markus@gmail.com wrote: > I strongly recommend you parse the derived properties rather than trying to follow the derivation formula, because that can change over time. No argument there! My initial question can be rephrased as the following
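The advice to parse the derived properties could be followed along these lines. This is a sketch under the assumption that the data uses the usual UCD layout (`RANGE ; Property # comment`), as in `DerivedCoreProperties.txt`; the sample lines and function names are illustrative, not taken from the thread.

```python
def parse_property_line(line: str):
    """Return (start, end, property) for a data line, or None for comments and blanks."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code_range, prop = fields[0], fields[1]
    if ".." in code_range:
        first, last = code_range.split("..")
    else:
        first = last = code_range
    return int(first, 16), int(last, 16), prop


def ranges_for(lines, wanted: str):
    """Collect (start, end) ranges for a single property name."""
    ranges = []
    for line in lines:
        parsed = parse_property_line(line)
        if parsed is not None and parsed[2] == wanted:
            ranges.append((parsed[0], parsed[1]))
    return ranges


# Sample lines in the assumed layout.
SAMPLE = [
    "# Derived Property: ID_Start",
    "0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR",
    "0030..0039    ; ID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE",
]
id_start_ranges = ranges_for(SAMPLE, "ID_Start")
```

Reading the ranges from the data file means the program does not need to track changes to the derivation formula.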

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Mathias Bynens
On 26 Apr 2014, at 17:06, Markus Scherer markus@gmail.com wrote: > I suggest you report it here: http://www.unicode.org/reporting.html Done. Thank you, Markus!

Re: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Mathias Bynens
On 23 Apr 2014, at 22:16, Mathias Bynens math...@qiwi.be wrote: > Let’s say I’m writing a program that strips combining characters and grapheme extenders from an input string. For combining marks, I’m looking for any non-combining marks (e.g. `a`) followed by one or more combining marks

Re: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Mathias Bynens
On 24 Apr 2014, at 21:38, Whistler, Ken ken.whist...@sap.com wrote: > Grapheme_Extend characters per se do not apply to anything. They are a mixture of different General_Category types -- mostly combining marks, but not all. The concept of applying to a base only refers to combining marks

ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines ID_Start as: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax
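The category list quoted above can be approximated with Python's `unicodedata` module. This sketch checks only the General_Category part of the definition; the exclusions and any additions that follow in the full definition are deliberately not handled, so it is not a substitute for the derived property. `approx_id_start` is a hypothetical name.

```python
import unicodedata

# General_Category values named in the quoted definition.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def approx_id_start(ch: str) -> bool:
    """Rough check: is the character in one of the listed categories?"""
    return unicodedata.category(ch) in ID_START_CATEGORIES


checks = {
    "a": approx_id_start("a"),            # Ll, lowercase letter
    "\u2160": approx_id_start("\u2160"),  # Nl, ROMAN NUMERAL ONE
    "1": approx_id_start("1"),            # Nd, a digit, not in the list
    "_": approx_id_start("_"),            # Pc, connector punctuation, not in the list
}
```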

Re: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
On 23 Apr 2014, at 19:18, Mathias Bynens math...@qiwi.be wrote: > http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines ID_Start as: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters

Re: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
On 23 Apr 2014, at 19:48, Whistler, Ken ken.whist...@sap.com wrote: > See the listings for Other_ID_Start and Other_ID_Continue in PropList.txt. Those are your stability extensions for the derivation of the identifier-related derived properties. This answered all my questions :) Thanks!

Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-23 Thread Mathias Bynens
Let’s say I’m writing a program that strips combining characters and grapheme extenders from an input string. For combining marks, I’m looking for any non-combining marks (e.g. `a`) followed by one or more combining marks (e.g. `̃`), and then I remove everything but the non-combining mark
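The combining-mark half of that idea might be sketched as follows. This is a simplification: it drops every character whose General_Category is a mark (Mn, Mc, Me) without checking that a base precedes it, and it does not cover grapheme extenders, which Python's standard library does not expose. `strip_combining_marks` is a hypothetical name.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Remove characters whose General_Category starts with "M" (Mn, Mc, Me)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# "a" followed by U+0303 COMBINING TILDE becomes plain "a".
stripped = strip_combining_marks("a\u0303")
```

Precomposed characters such as U+00E3 are left untouched by this sketch; decomposing the input first (for example with NFD) would be a separate design decision.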

Re: FYI: More emoji from Chrome

2014-04-01 Thread Mathias Bynens
On 1 Apr 2014, at 09:13, Philippe Verdy verd...@wanadoo.fr wrote: > April 1st joke... Sure – it really works, though. Try it out. Kinda cool :) I would’ve preferred if Google had finally implemented support for proper emoji in OS X, though:

Difference between ‘combining characters’ and ‘grapheme extenders’?

2014-02-20 Thread Mathias Bynens
What is the difference between ‘combining characters’ (http://www.unicode.org/faq/char_combmark.html) and ‘grapheme extenders’ (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode? They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is
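One way to explore the question is to look at General_Category values, since combining marks are the characters in categories Mn, Mc, and Me. The sketch below uses three sample characters chosen for illustration. Python's standard library does not expose `Grapheme_Extend`, so the comments about that property rest on my reading of the Unicode data files rather than on anything the code verifies.

```python
import unicodedata

MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def is_combining_mark(ch: str) -> bool:
    """True when the character's General_Category is a mark category."""
    return unicodedata.category(ch) in MARK_CATEGORIES


samples = {
    # Nonspacing mark: a combining mark (and, as I understand it, a grapheme extender).
    "\u0303": unicodedata.category("\u0303"),  # COMBINING TILDE
    # Spacing mark: a combining mark (as I understand it, not a grapheme extender).
    "\u0903": unicodedata.category("\u0903"),  # DEVANAGARI SIGN VISARGA
    # Format character: not a combining mark (listed, as I understand it,
    # under Other_Grapheme_Extend in PropList.txt).
    "\u200C": unicodedata.category("\u200C"),  # ZERO WIDTH NON-JOINER
}
```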