Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Mathias Bynens via Unicode
Neat! Prior art:

   - https://github.com/watson/base64-emoji
   - https://github.com/nate-parrott/emojicode


On Sun, Mar 11, 2018 at 6:04 AM, Keith Turner via Unicode <
unicode@unicode.org> wrote:

> I created a neat little project based on Unicode emojis.  I thought
> some on this list may find it interesting.  It encodes arbitrary data
> as 1024 emojis.  The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> 卵駱
>
> $ echo 卵駱   | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
>
> Keith
>
>
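
A base-1024 alphabet carries 10 bits per symbol, so every 5 input bytes (40 bits) map to 4 symbols. The sketch below illustrates only that bit-packing idea. It is not Ecoji's actual mapping or padding scheme: the alphabet (1024 consecutive code points starting at U+1F300), the function names, and the explicit `length` argument to `decode` are all assumptions made for illustration.

```python
# Hypothetical alphabet: 1024 consecutive code points starting at U+1F300.
# This is NOT the table Ecoji uses; it only keeps the sketch short.
ALPHABET_BASE = 0x1F300


def encode(data: bytes) -> str:
    """Pack bytes into 10-bit groups, one symbol per group."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits in the accumulator
    out = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 10:
            nbits -= 10
            out.append(chr(ALPHABET_BASE + ((bits >> nbits) & 0x3FF)))
        bits &= (1 << nbits) - 1  # keep only the leftover bits
    if nbits:
        # Zero-pad the final partial group on the right.
        out.append(chr(ALPHABET_BASE + ((bits << (10 - nbits)) & 0x3FF)))
    return "".join(out)


def decode(text: str, length: int) -> bytes:
    """Unpack 10-bit symbols into bytes; `length` removes the padding."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        bits = (bits << 10) | (ord(ch) - ALPHABET_BASE)
        nbits += 10
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1
    return bytes(out[:length])
```

A real encoder needs a padding convention instead of the `length` argument, because 4-byte and 5-byte inputs both produce 4 symbols here and cannot otherwise be told apart.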


Re: HTTPS

2017-10-04 Thread Mathias Bynens via Unicode
unicode.org and www.unicode.org are now available over HTTPS. E.g.
https://unicode.org/Public/10.0.0/

On Thu, Mar 6, 2014 at 3:54 PM, Robbert  wrote:

> Hi,
>
> For tools that rely on the Unicode database it would be great if the
> databases were available over HTTPS as well:
> https://www.unicode.org/Public/6.3.0/
>
> In addition to this it would be helpful if the archive also contains
> SHA512 checksum files for each Unicode version to verify the integrity of
> databases that have already been downloaded (over HTTP), e.g.:
>
> https://www.unicode.org/Public/6.3.0/SHA512SUMS
>
> Mozilla already offers such checksums, although unfortunately not over
> HTTPS, but they can serve as an example.
>
> http://releases.mozilla.org/pub/mozilla.org/firefox/releases/27.0/SHA512SUMS
>
> I think this would improve the security of many libraries that directly
> and indirectly depend on Unicode.
>
> Kind regards,
> Robbert Broersma
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
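
A rough sketch of how a tool might use such a checksum file. It assumes the file follows the usual `sha512sum` output format (hex digest, whitespace, file name); the thread does not say what format a Unicode `SHA512SUMS` file would use, and the function names are made up for this example.

```python
import hashlib
import hmac


def parse_sums(text):
    """Parse `sha512sum`-style lines into {file name: hex digest}."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(None, 1)
        # A leading "*" marks binary mode in sha512sum output.
        sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def verify(name, data, sums):
    """Return True if `data` hashes to the digest recorded for `name`."""
    expected = sums.get(name)
    if expected is None:
        return False
    actual = hashlib.sha512(data).hexdigest()
    return hmac.compare_digest(actual, expected)
```

The checksum file itself still has to be fetched over a trusted channel such as HTTPS, otherwise an attacker who can alter the download can alter the checksums too.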


Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-06-30 Thread Mathias Bynens via Unicode
On Fri, Jun 30, 2017 at 5:34 PM, Michael Everson via Unicode
 wrote:
>
> It would be sensible to case-map ß to ẞ however.

I’m hoping this can happen — converting ß to SS is lossy, so mapping
to ẞ would be far superior.

However,  says:

“If two characters form a case pair in a version of Unicode, they will
remain a case pair in each subsequent version of Unicode.

If two characters do not form a case pair in a version of Unicode,
they will never become a case pair in any subsequent version of
Unicode.”
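
The lossiness is easy to see with Python's built-in string methods, which (as far as I know) apply the default Unicode case mappings. This is only an illustration; the helper name is made up, and results depend on the Unicode data shipped with the interpreter.

```python
SHARP_S = "\u00df"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S


def survives_round_trip(word):
    """True if uppercasing and then lowercasing gives the word back."""
    return word.upper().lower() == word


# ß uppercases to the two letters "SS", not to ẞ ...
print(SHARP_S.upper())
# ... although ẞ already lowercases to ß.
print(CAPITAL_SHARP_S.lower() == SHARP_S)
# So a lowercase word containing ß does not survive the round trip.
print(survives_round_trip("ma\u00dfe"))
```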





Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-07 Thread Mathias Bynens

> On 7 Jun 2016, at 17:56, Doug Ewell  wrote:
> 
> Rather than changing the spec based on anecdotal evidence, […]
> 
> It seems irresponsible to assume now that nobody anywhere needs
> it.

What assumption are you talking about? Markus and Nova provided actual examples 
of implementations not following the spec, and so far no one has been able to 
provide even a single counter-example.

> There must have been some basis for including the "is" case in the first
> place.

Now *that* sounds like an assumption to me.


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens

> On 7 Jun 2016, at 00:39, Nova Patch  wrote:
> 
> […] Based on my past research for Unicode Regular Expression Engines at 
> IUC38, I suspect that there might not be any regex engine that actually 
> supports syntax like Script=IsGreek as described in UAX44-LM3! If anybody 
> knows otherwise, I’d love to hear about it.

This seems like a cut-and-dried case of reality not matching the specification, 
which is not helpful in any way. The sensible thing to do is to update the 
specification accordingly, as proposed.


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
> 
>> The `is` prefix doesn’t provide any functionality that would otherwise
>> be unavailable. It doesn’t add any value, yet causes incompatibility,
>> author confusion, and it increases implementation complexity.
> 
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

It adds no value because it doesn’t enable any new functionality.
I agree support for existing implementations would have some value, but given 
that existing implementations disagree on the properties for which they support 
`is`, that is not going to happen anyway. It’s impossible to be compatible with 
all those different implementations at the same time.


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens

> On 6 Jun 2016, at 18:04, Ken Whistler  wrote:
> 
> UAX #44 doesn't *require* any regex engine to include this "is prefix" 
> handling.

Are you referring to the fact that the first paragraph on  
http://unicode.org/reports/tr44/#Matching_Rules uses “strongly recommended” and 
“should” instead of “required” and “must”?

> What UAX #44 does is recommend that all property and property value aliases 
> be correctly recognized, and then specifies a clear statement (in UAX44-LM3) 
> of the loose matching rule for recognizing the various forms of those aliases 
> that could be considered equivalent. I don't think messing with that rule 
> statement (which has been in place since 2010) would be helpful.

Why not? What I had in mind was adding a small sentence like:

> For compatibility reasons, implementations may optionally support any initial 
> prefix string "is".

This wouldn’t be a breaking change in any way, and it would enable new 
implementations that aim to follow UAX44 to do so without having to support 
`is`, and it would solve the problem everywhere the matching rules get applied 
rather than just for regular expressions.

> I think the target of concern here is wrong. 

Not sure I agree. It seems to me the `is` prefix is problematic (for the same 
reasons) wherever it’s used, whether that’s in regular expressions or not.

> The target instead should be in UTS #18, which happily, has a proposed update 
> available for comment right now:
> 
> http://www.unicode.org/review/pri325/
> 
> The relevant point is:
> 
> http://www.unicode.org/reports/tr18/tr18-18.html#RL1.2
> 
> That is the conformance part that requires that conformant Unicode regex 
> implementations "must follow the Matching rules from [UAX44]".

Thanks for the pointer! I will submit my feedback there as well. It seems more 
awkward / difficult to add an exception there rather than just slightly 
tweaking the UAX44-LM3 text as suggested above, though.


UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:

> For loose matching of symbolic values, an initial prefix string "is" is 
> ignored. […] Ignoring any initial "is" on a symbolic value during loose 
> matching is likely to produce the best results in application areas such as 
> regex. Removal of an initial "is" string for a loose matching comparison only 
> needs to be done once for a symbolic value, and need not be tested 
> recursively. There are no property aliases or property value aliases of the 
> form "isisisisistooconvoluted" defined just to test implementation edge cases.

UAX44 provides the reason for the existence of this “feature”:

> The reason for this is that APIs returning property values are often named 
> using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a 
> property value.

That seems like a rather weak argument. Specifically applying this to UTS18 
(Unicode regular expressions):

> "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

If there is already a way to match all symbols in the Greek script (not 
counting the use of aliases and other loose matching requirements), i.e. 
`Script=Greek` — what good does it do to add support for yet another one?

Looking at implementations in the wild, Steven Levithan found 
(https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
 that some regex flavors use `Is` for scripts, some for blocks, some for 
scripts and blocks, some for neither. Since some script and block names 
collide, this causes problems, especially when porting regexes across flavors.

The `is` prefix doesn’t provide any functionality that would otherwise be 
unavailable. It doesn’t add any value, yet causes incompatibility, author 
confusion, and it increases implementation complexity. UAX 44 includes two 
entire paragraphs pointing out that last part:

> Removal of an initial "is" string for a loose matching comparison only needs 
> to be done once for a symbolic value, and need not be tested recursively. 
> There are no property aliases or property value aliases of the form 
> "isisisisistooconvoluted" defined just to test implementation edge cases.
> 
> Existing and future property aliases and property value aliases are 
> guaranteed to be unique within their relevant namespaces, even if an initial 
> prefix string "is" is ignored. The existing cases of note for aliases that do 
> start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The 
> Decomposition_Type value alias does not cause any problem, because there is 
> no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note 
> that the "IS" is the entire property value alias, and is not a prefix. There 
> is no null value for the Line_Break property for it to contrast with, but 
> implementations of loose matching should be careful of this edge case, so 
> that "lb=IS" is not misinterpreted as matching a null value.


Backwards compatibility seems to be the only good reason to continue supporting 
the `is` prefix *for existing implementations*, such as the one in Perl. But 
why is it still a requirement for new engines to support it as part of 
UAX44-LM3?

I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix 
optional for new implementations.
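
For reference, a minimal sketch of what the loose matching discussed above amounts to, with the `is` handling made optional as proposed. The function name and the `ignore_is_prefix` flag are hypothetical; the guard for a bare `IS` is one possible way to handle the `lb=IS` edge case, not something UAX44 prescribes.

```python
import re


def loose_key(value, ignore_is_prefix=True):
    """Build a comparison key for a symbolic value: drop whitespace,
    underscores and hyphens, fold case, and optionally drop one
    initial "is"."""
    key = re.sub(r"[\s_\-]", "", value).lower()
    # Strip the prefix once only, and never reduce a bare "IS" (as in
    # lb=IS) to the empty string.
    if ignore_is_prefix and key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key


print(loose_key("Is_Greek") == loose_key("Greek"))
print(loose_key("isGreek", ignore_is_prefix=False) == loose_key("Greek"))
```

Both sides of a comparison have to go through the same function, which is why `Iso` and `Isolated` still match only each other even though their keys lose the leading `is`.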




Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens

> On 26 May 2016, at 20:07, Ken Whistler  wrote:
> 
> Well, let's take an example. The entry in Blocks.txt for the Arabic 
> Presentation Forms-A block is:
> 
> FB50..FDFF; Arabic Presentation Forms-A
> 
> The entry for that block in PropertyValueAliases.txt is:
> 
> blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; Arabic_Presentation_Forms-A
> 
> So then which would it be? Should Blocks.txt be changed to the long preferred 
> alias:
> 
> FB50..FDFF; Arabic_Presentation_Forms_A
> 
> or to the abbreviated preferred alias:
> 
> FB50..FDFF; Arabic_PF_A
> 
> which would be more consistent with the XML attribute and with most regex 
> usage?

This sounds like a strawman argument (?). The long preferred alias definitely 
seems more suitable for a ‘canonical’ name.

> I suppose a proposal to the UTC to further modify the UCD handling of block 
> names could change this situation. But I'm not convinced that we shouldn't 
> just leave things as they stand -- for stability. And then live with the 
> complications required for scripts or other parsing algorithms that actually 
> need to deal with Blocks.txt to either parse out block ranges (its main 
> function) or to get usable block names (its subsidiary function).

Perhaps the “Note:” in the commented header in `Blocks.txt` could be extended 
to point out that the ~~canonical block names~~, nay, ++preferred block 
aliases++ are listed in `PropertyValueAliases.txt`? That would’ve been enough 
to avoid the question that spawned this thread.
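
A sketch of the kind of script in question: it reads block ranges from `Blocks.txt`-style lines and looks up the long alias from `PropertyValueAliases.txt`-style lines by comparing names loosely. The function names are made up, the assumption that the second alias field is the long name is taken from the example quoted above, and the only sample data used is the Arabic Presentation Forms-A entry from this thread.

```python
import re


def loose(name):
    """Comparison key: ignore case, whitespace, underscores, hyphens."""
    return re.sub(r"[\s_\-]", "", name).lower()


def parse_blocks(text):
    """Return (start, end, name) tuples from Blocks.txt-style lines."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        code_range, name = (part.strip() for part in line.split(";", 1))
        start, end = (int(cp, 16) for cp in code_range.split(".."))
        blocks.append((start, end, name))
    return blocks


def parse_block_aliases(text):
    """Map the loose key of every `blk` alias to its long alias."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] != "blk":
            continue
        long_alias = fields[2]  # assumed: short alias first, long second
        for alias in fields[1:]:
            aliases[loose(alias)] = long_alias
    return aliases


blocks = parse_blocks("FB50..FDFF; Arabic Presentation Forms-A\n")
aliases = parse_block_aliases(
    "blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; "
    "Arabic_Presentation_Forms-A\n"
)
start, end, name = blocks[0]
print(hex(start), hex(end), aliases[loose(name)])
```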


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens

> On 26 May 2016, at 10:17, Mathias Bynens <math...@qiwi.be> wrote:
> 
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such 
> as `Cyrillic Supplement`.
> 
> However, `PropertyValueAliases.txt` 
> (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this 
> block as `Cyrillic_Supplement`, with an underscore instead of a space.
> 
> Which is it?
> 
> If proper canonical block names use spaces instead of underscores, why 
> doesn’t `PropertyValueAliases.txt` reflect that? 
> If proper canonical block names use underscores instead of spaces, why 
> doesn’t `Blocks.txt` reflect that?
> 

Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas 
`PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in 
addition to the underscores, the case of the `A` changed as well. Which is the 
canonical name?

The same goes for other blocks with “and” in the name, e.g. `Miscellaneous 
Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.


Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens
`Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such 
as `Cyrillic Supplement`.

However, `PropertyValueAliases.txt` 
(http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this 
block as `Cyrillic_Supplement`, with an underscore instead of a space.

Which is it?

If proper canonical block names use spaces instead of underscores, why doesn’t 
`PropertyValueAliases.txt` reflect that? 
If proper canonical block names use underscores instead of spaces, why doesn’t 
`Blocks.txt` reflect that?




Re: Unicode in passwords

2015-10-01 Thread Mathias Bynens

> On 1 Oct 2015, at 07:19, Marc Durdin  wrote:
> 
> 2.   The number of dots corresponds to the number of code points, which 
> is misleading with complex scripts or advanced input methods: you won’t 
> necessarily see one dot per keystroke; in some cases, typing a character may 
> replace a dot with another dot or even delete a dot.

Lots of systems have a bug where supplementary code points show up as two dots 
instead of one, due to UTF-16 being used internally. OS X is an example. Demo 
(open in your browser):

data:text/html,
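
The mismatch can be reproduced without a browser. This sketch counts code points and UTF-16 code units for the same string; the helper name is made up for the example.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store `text` in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


password = "pa\U0001F4A9s"  # contains one supplementary code point

# Python strings are indexed by code point, so len() counts code points.
print(len(password))
# A UTF-16 based text field sees one extra unit for the surrogate pair.
print(utf16_units(password))
```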


Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Mathias Bynens
On 23 Apr 2014, at 20:18, Markus Scherer markus@gmail.com wrote:

> I strongly recommend you parse the derived properties rather than trying to 
> follow the derivation formula, because that can change over time.

No argument there!

My initial question can be rephrased as the following remark/change request:

http://unicode.org/reports/tr31/#Default_Identifier_Syntax  could make it more 
clear that “stability extensions” means `Other_ID_Start` and 
`Other_ID_Continue`, respectively. At the moment it lists an incomplete 
formula: it’s explicit about all the categories and properties to include to 
form `ID_Start` and `ID_Continue` _except for those_, for seemingly no good 
reason.

Regards,
Mathias
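
Parsing the derived file, as recommended above, takes only a few lines. This is a sketch: the function name is made up, and the `SAMPLE` lines are written from memory in the style of `DerivedCoreProperties.txt` rather than copied from a particular release.

```python
SAMPLE = """\
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
2118          ; ID_Start # Sm       SCRIPT CAPITAL P
00B7          ; ID_Continue # Po       MIDDLE DOT
"""


def parse_property(text, prop):
    """Collect the code points listed for `prop` in UCD-style lines."""
    code_points = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point
        code_points.update(range(start, end + 1))
    return code_points


id_start = parse_property(SAMPLE, "ID_Start")
print(0x2118 in id_start)
```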


Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Mathias Bynens
On 26 Apr 2014, at 17:06, Markus Scherer markus@gmail.com wrote:

> I suggest you report it here: http://www.unicode.org/reporting.html

Done. Thank you, Markus!


Re: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Mathias Bynens
On 23 Apr 2014, at 22:16, Mathias Bynens math...@qiwi.be wrote:

> Let’s say I’m writing a program that strips combining characters and grapheme 
> extenders from an input string.
> 
> For combining marks, I’m looking for any non-combining marks (e.g. `a`) 
> followed by one or more combining marks (e.g. `̃`), and then I remove 
> everything but the non-combining mark (e.g. leaving only `a`). Is this a 
> correct approach?
> 
> What should the approach be for grapheme extenders? Should the program only 
> look for `Grapheme_Base` characters followed by `Grapheme_Extend` characters 
> (which includes the code points in `Other_Grapheme_Extend`)?

The email subject should have been “Do `Grapheme_Extend` characters only apply 
to `Grapheme_Base`?” — sorry for the confusion.

Does anyone know the answer?


Re: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Mathias Bynens
On 24 Apr 2014, at 21:38, Whistler, Ken ken.whist...@sap.com wrote:

> Grapheme_Extend characters per se do not apply to anything.
> They are a mixture of different General_Category types -- mostly combining
> marks, but not all. The concept of applying to a base only refers to
> combining marks proper.
> 
> The proper use of the Grapheme_Extend property is in the context of the
> text segmentation algorithms defined in UAX #29, and in particular:
> 
> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
> 
> See that document for the proper use. They are relevant to the determination 
> of grapheme cluster boundaries.
> 
> And by the way, it is a very bad idea to be writing a program to just 
> unilaterally strip away grapheme extenders from input strings. In particular, 
> many dependent vowels in Indic scripts are defined as grapheme extenders. If 
> you strip them away, the input string will just end up as random trash. That 
> is very, very different from something which is trying to strip diacritics 
> and accent marks off of Latin letters.

I agree. Don’t worry — I am not actually writing such a program, it was just an 
example to simplify my question.

The real program attempts to reverse a string while accounting for combining 
marks and grapheme extenders. Before reversing the code points one by one, some 
things need to happen:

* For combining marks, I use a regular expression that looks for non-combining 
marks followed by any number of combining marks, and then I swap the combining 
marks with the preceding character.
* Now I’m trying to figure out what to do about grapheme extenders (if 
anything). I was thinking: look for any non-grapheme extender symbol (or should 
it be only `Grapheme_Base` characters? Your reply suggested it shouldn’t) 
followed by a single grapheme extender (or should it be several, like with 
combining marks?), and then swap them. Would that be a correct approach?

I realize reversing a string has nothing to do with text segmentation – but 
ignoring grapheme extenders leads to unexpected results (since after reversing 
the code points, the grapheme extender might extend the wrong character): 
https://github.com/mathiasbynens/esrever/issues/5
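
The combining-mark half of that approach can be sketched as follows. This is not the esrever implementation: the function name is made up, it groups by General_Category (any category starting with `M`) rather than by `Grapheme_Extend`, which Python's standard library does not expose, and it makes no attempt at the UAX #29 segmentation rules.

```python
import unicodedata


def reverse_keep_marks(text):
    """Reverse `text` while keeping each combining mark attached to the
    character it followed in the input."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "man\u0303ana"  # "n" followed by U+0303 COMBINING TILDE
print(reverse_keep_marks(word) == "anan\u0303am")
# A plain code point reversal moves the tilde onto the wrong letter.
print(word[::-1] == "ana\u0303nam")
```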


ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines ID_Start 
as:

 Characters having the Unicode General_Category of uppercase letters (Lu), 
 lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other 
 letters (Lo), letter numbers (Nl), minus Pattern_Syntax and 
 Pattern_White_Space code points, plus stability extensions. Note that “other 
 letters” includes ideographs.

What are the “stability extensions” this document refers to?

I noticed that parsing `DerivedCoreProperties.txt` for `ID_Start` leads to 
slightly different results than parsing `UnicodeData.txt` for category names 
and then adding the categories together, minus `Pattern_Syntax` and 
`Pattern_White_Space`, which you can get by parsing `PropList.txt`.

For example, U+2118 SCRIPT CAPITAL P is included in `ID_Start` as per 
`DerivedCoreProperties.txt`, but it doesn’t match any of the above categories. 
Is this an example of such a “stability extension”, or was this an oversight?

Regards,
Mathias


Re: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
On 23 Apr 2014, at 19:18, Mathias Bynens math...@qiwi.be wrote:

> http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines 
> ID_Start as:
> 
> Characters having the Unicode General_Category of uppercase letters (Lu), 
> lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other 
> letters (Lo), letter numbers (Nl), minus Pattern_Syntax and 
> Pattern_White_Space code points, plus stability extensions. Note that “other 
> letters” includes ideographs.
> 
> What are the “stability extensions” this document refers to?
> 
> I noticed that parsing `DerivedCoreProperties.txt` for `ID_Start` leads to 
> slightly different results than parsing `UnicodeData.txt` for category names 
> and then adding the categories together, minus `Pattern_Syntax` and 
> `Pattern_White_Space`, which you can get by parsing `PropList.txt`.
> 
> For example, U+2118 SCRIPT CAPITAL P is included in `ID_Start` as per 
> `DerivedCoreProperties.txt`, but it doesn’t match any of the above 
> categories. Is this an example of such a “stability extension”, or was this 
> an oversight?

Here are the code points that match the respective property according to 
`DerivedCoreProperties.txt`, yet don’t match these properties if you’re 
adding/removing the categories manually based on the property definition in 
TR31.

`ID_Start`:

* U+2118
* U+212E
* U+309B
* U+309C

`ID_Continue`:

* U+00B7
* U+0387
* U+1369
* U+1370
* U+1371
* U+19DA

Why these differences?
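
The discrepancy for `ID_Start` can be reproduced with Python's `unicodedata` module. In this sketch the function names are made up, `OTHER_ID_START` is just the four code points listed above (the set in `PropList.txt` may differ between Unicode versions), and the `Pattern_Syntax` / `Pattern_White_Space` subtraction is left out because the standard library does not expose those properties.

```python
import unicodedata

# Categories named in the TR31 definition of ID_Start.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# The four code points listed above; assumed to be the stability
# extensions for ID_Start at the time of this thread.
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}


def naive_id_start(cp):
    """Category-only test (Pattern_* subtraction omitted)."""
    return unicodedata.category(chr(cp)) in START_CATEGORIES


def id_start(cp):
    """Category test plus the stability extensions."""
    return naive_id_start(cp) or cp in OTHER_ID_START


for cp in sorted(OTHER_ID_START):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)),
          naive_id_start(cp), id_start(cp))
```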




Re: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Mathias Bynens
On 23 Apr 2014, at 19:48, Whistler, Ken ken.whist...@sap.com wrote:

> See the listings for Other_ID_Start and Other_ID_Continue in PropList.txt.
> Those are your stability extensions for the derivation of the 
> identifier-related derived properties.

This answered all my questions :) Thanks!



Do `Grapheme_Extend` characters only apply to `Grapheme_Extend`?

2014-04-23 Thread Mathias Bynens
Let’s say I’m writing a program that strips combining characters and grapheme 
extenders from an input string.

For combining marks, I’m looking for any non-combining marks (e.g. `a`) 
followed by one or more combining marks (e.g. `̃`), and then I remove 
everything but the non-combining mark (e.g. leaving only `a`). Is this a 
correct approach?

What should the approach be for grapheme extenders? Should the program only 
look for `Grapheme_Base` characters followed by `Grapheme_Extend` characters 
(which includes the code points in `Other_Grapheme_Extend`)?
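
For the combining-mark half of the question, one common approach is sketched below: decompose first, then drop every character whose General_Category starts with `M`. The function name is made up, and this is only reasonable for cases like Latin diacritics; applied to scripts where marks carry vowels, it destroys the text.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("ma\u00f1ana"))  # precomposed ñ is decomposed first
print(strip_marks("a\u0303"))      # base letter plus combining tilde
```

Decomposing first matters because a precomposed letter such as U+00F1 contains no separate combining mark until it is normalized.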


Re: FYI: More emoji from Chrome

2014-04-01 Thread Mathias Bynens
On 1 Apr 2014, at 09:13, Philippe Verdy verd...@wanadoo.fr wrote:

> April 1st joke...

Sure – it really works, though. Try it out. Kinda cool :)

I would’ve preferred if Google had finally implemented support for proper emoji 
in OS X, though: https://code.google.com/p/chromium/issues/detail?id=62435


Difference between ‘combining characters’ and ‘grapheme extenders’?

2014-02-20 Thread Mathias Bynens
What is the difference between ‘combining characters’ 
(http://www.unicode.org/faq/char_combmark.html) and ‘grapheme extenders’ 
(http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode?

They seem to do the same thing, as far as I can tell – although the set of 
grapheme extenders is larger than the set of combining characters. I’m clearly 
missing something here. Why the distinction?

I’ve also posted this question on Stack Overflow: 
http://stackoverflow.com/q/21722729/96656
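
One way to see that the two sets are not the same is to look at General_Category. Python's `unicodedata` has no `Grapheme_Extend` lookup, so the three sample code points below are hand-picked and assumed to be listed under `Other_Grapheme_Extend` in `PropList.txt`; that assumption should be checked against the data file for the Unicode version in use.

```python
import unicodedata

# Assumed members of Other_Grapheme_Extend (hand-picked, unverified here):
# U+200C ZERO WIDTH NON-JOINER and the two halfwidth katakana sound marks.
SAMPLE_OTHER_GRAPHEME_EXTEND = ["\u200c", "\uff9e", "\uff9f"]


def is_combining_mark(ch):
    """True if the General_Category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


for ch in SAMPLE_OTHER_GRAPHEME_EXTEND + ["\u0303"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_combining_mark(ch))
```

If the assumption holds, the first three are grapheme extenders without being combining marks, while U+0303 COMBINING TILDE is both.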