Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) 
> wrote:
>  
> > Re-reading the text I suspect I should not restart the rules from the first 
> > one when a WB4 rewrite occurs but only apply the subsequent rules. Is that 
> > correct?
>  
> However even if that's correct I don't understand how this test case works:
>  
> ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO 
> WIDTH JOINER (ZWJ_FE)  
> × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>  
> Here the first two chars get rewritten with WB4 to ExtPict, then if only 
> subsequent rules are applied we end up in WB999 and a break between 200D 
> and 1F6D1. 

That's nonsense and not the operational model of the algorithm, which IIRC was 
once clearly stated on this list by Mark Davis (sorry, I failed to dig out the 
message): take each boundary position candidate, apply the rules in sequence, 
take the first one that matches, and then start over with the next candidate.

In that case, applying the rules between 1F6D1 and 200D leads to WB4, which 
then implicitly adds a non-boundary condition for that boundary position -- 
this is not really evident from the formalism, but see the comment above WB4. 
Then we start again, applying the rules between 200D and the last 1F6D1, and 
WB3c matches before WB4 kicks in. 

I think the behaviour of → rules should be clarified: it's not clear which 
data you apply them to w.r.t. the boundary position candidate. If I understand 
correctly, a match that spans the boundary position candidate simply turns it 
into a non-boundary; otherwise you apply the rule on the left of the boundary 
position candidate. 
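
To make the operational model concrete, here is a minimal OCaml sketch of the 
scheme described above, reduced to the handful of word-break rules mentioned 
in this thread (WB3c, WB3d, WB4, WB5, WB999). All names are illustrative, not 
taken from any real implementation: for each boundary candidate the rules are 
tried in order and the first match settles the boundary; WB4 is realized not 
as a rewrite pass but by making the rules after it look through 
Extend/Format/ZWJ on the left.

```ocaml
(* Word-break classes, reduced to the ones relevant in this thread. *)
type wb = ALetter | ExtPict | ZWJ | Extend | Format | WSegSpace | Other

(* WB4's effect on later rules: look through Extend/Format/ZWJ. *)
let rec skip_ignorable = function
  | (Extend | Format | ZWJ) :: tl -> skip_ignorable tl
  | l -> l

(* [before] holds the classes left of the candidate, most recent first;
   [next] is the class right of it. Rules are tried top to bottom and
   the first match settles the boundary ([true] means break). *)
let boundary before next = match before, next with
  | [], _ -> true                              (* WB1: sot ÷ *)
  | ZWJ :: _, ExtPict -> false                 (* WB3c: ZWJ × ExtPict *)
  | WSegSpace :: _, WSegSpace -> false         (* WB3d *)
  | _ :: _, (Extend | Format | ZWJ) -> false   (* WB4: X × Ignorable *)
  | before, next ->
      (match skip_ignorable before, next with
       | ALetter :: _, ALetter -> false        (* WB5 *)
       | _, _ -> true)                         (* WB999: Any ÷ Any *)

let () =
  (* ÷ 1F6D1 × 200D × 1F6D1 ÷ : between 200D and the second 1F6D1,
     WB3c matches before WB4 even gets a chance. *)
  assert (boundary [ExtPict] ZWJ = false);            (* 1F6D1 × 200D *)
  assert (boundary [ZWJ; ExtPict] ExtPict = false);   (* 200D × 1F6D1 *)
  (* ÷ 0020 × 0308 ÷ 0020 ÷ : WB3d needs an *immediate* WSegSpace on
     the left, so the 0308 blocks it and WB999 breaks. *)
  assert (boundary [WSegSpace] Extend = false);       (* 0020 × 0308 *)
  assert (boundary [Extend; WSegSpace] WSegSpace = true);
  (* letter 0308 letter : WB4 absorbs the 0308, then WB5 joins them. *)
  assert (boundary [Extend; ALetter] ALetter = false)
```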

Regarding the question of my original message, it seems that at some point I 
knew better: 

  https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html

Sorry for the noise. 

Daniel

P.S. I still think UAX #29 and UAX #14 could benefit from clarifying the 
operational model of the rules a bit (I also have the impression that the 
formalism used to express all that may not be the right one, but then I don't 
have something better to propose at this time). Also, it would be nicer for 
implementers if they didn't have to factorize rules themselves (e.g. as in the 
new LB30 rules of UAX #14), so that the correctness of implemented rules is 
easier to assert. 





Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> Re-reading the text I suspect I should not restart the rules from the first 
> one when a WB4 rewrite occurs but only apply the subsequent rules. Is that 
> correct?

However, even if that's correct, I don't understand how this test case works:

÷ 1F6D1 × 200D × 1F6D1 ÷ #  ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH 
JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]

Here the first two chars get rewritten with WB4 to ExtPict; then, if only 
subsequent rules are applied, we end up in WB999 and a break between 200D and 
1F6D1. The justification in the comment indicates to use WB3c on the ZWJ, but 
that one should have been rewritten to ExtPict by WB4. 

Best,

Daniel





UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
Hello, 

My implementation of word break chokes only on the following test case from the 
file [1]: 

÷ 0020 × 0308 ÷ 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS 
(Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3] 

I find: 

÷ 0020 × 0308 × 0020 ÷

Basically my implementation uses WB4 to rewrite the first two characters to 
WSegSpace and then applies WB3d, resulting in the non-break between 0308 and 
0020.

Re-reading the text, I suspect I should not restart the rules from the first 
one when a WB4 rewrite occurs but only apply the subsequent rules. Is that 
correct? 

Best, 

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt








UAX #14 for 13.0.0: LB27's first line is obsolete

2020-03-03 Thread Daniel Bünzli via Unicode
Hello, 

I think (more precisely, my compiler thinks [1]) that the first line of LB27 
is already handled by the new LB22 rule and can be removed. 

Best, 

Daniel

[1]
File "uuseg_line_break.ml", line 206, characters 38-40:

206 |   | (* LB27 *)  _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s
                                            ^^
Warning 12: this sub-pattern is unused.



Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Daniel Bünzli via Unicode
Thanks for your answer.

> The compromise that has generally been reached is that 'delete' deletes
> a grapheme cluster and 'backspace' deletes a scalar value. (There are
> good editors like Emacs that delete only a single character.)

Just to make things clear: when you say "character" in your message, you 
consistently mean scalar value, right?

Best, 

Daniel





Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Daniel Bünzli via Unicode
On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode 
(unicode@unicode.org) wrote:

> When it comes to the second sentence of the text of Slide 7 'Grapheme
> Clusters', my overwhelming reaction is one of extreme anger. Slide 8
> does nothing to lessen the offence. The problem is that it gives the
> impression that in general it is acceptable for backspace to delete the
> whole grapheme cluster.

Let's turn extreme anger into knowledge. 

I'm not very knowledgeable in ligature-heavy scripts (I suspect that's what 
you refer to) and what you describe is the first thing I went with for a 
readline editor data structure. 

Would you maybe care to expand on when exactly you think it's not acceptable, 
and what kind of tools or standards I can find in the Unicode toolbox to 
implement an acceptable behaviour for backspace on general Unicode text? 

Best, 

Daniel





Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji))

2019-10-12 Thread Daniel Bünzli via Unicode
On 12 October 2019 at 02:05:23, Martin J. Dürst via Unicode 
(unicode@unicode.org) wrote:

> I think it's less the format and much more the split personality of the
> Unicode Web site(s?) that I have problems with.

I also do. 

One thing that is particularly annoying is the fact that the "home" link on 
the "technical" (unchanged) subpart of the website goes back to the 
"marketing" home page, which is both inefficient (the links you are looking 
for are not above the fold on a laptop screen) and confusing (the whole layout 
shifts and the theme changes) when perusing the technical part of the website.

With all due respect for the work that has been done on the new website, I 
think the new structure has significantly decreased its usability for 
technical users.

Best, 

Daniel





Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:
 
> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1 would be excluding
> in the case of UTF-16 would be unpaired surrogates. That is, suppose the
> underlying storage is UTF-16 code units that don't satisfy #1.
>  
> 0061 D83D DC7D 0061 D83D
>  
> A code point API would return for those a sequence of 4 values, the last of
> which would be a surrogate code point.
>  
> 0061, 0001F47D, 0061, D83D
>  
> A scalar value API would return for those also 4 values, but since we
> aren't in #1, it would need to remap.
>  
> 0061, 0001F47D, 0061, FFFD

OK, understood. But I think that if you go to the length of providing a 
scalar-value API, you would also prevent the construction of strings that have 
such anomalies in the first place (e.g. by erroring in the constructor if you 
provide it with malformed UTF-X data), i.e. maintain #1. From a programmer's 
perspective I really don't get anything from #2 except confusion.
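
As a sketch of that constructor-checking approach: OCaml's standard library 
provides String.is_valid_utf_8 (since 4.14), which lets a string-like 
constructor reject malformed data up front; make_text below is a hypothetical 
name, not an existing API.

```ocaml
(* Hypothetical constructor that refuses malformed UTF-8 up front, so
   that every constructed value satisfies invariant #1 by construction. *)
let make_text s =
  if String.is_valid_utf_8 s then Ok s else Error `Malformed

let () =
  (* Well-formed UTF-8 is accepted as-is. *)
  assert (make_text "h\xc3\xa9llo" = Ok "h\xc3\xa9llo");
  (* "\xed\xa0\xbd" encodes the lone surrogate D83D: not valid UTF-8. *)
  assert (make_text "\xed\xa0\xbd" = Error `Malformed)
```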

> If it is a real datatype, with strong guarantees that it *never* contains
> values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
> from number will require checking. And in my experience, without a strong
> guarantee the datatype is in practice pretty useless.

Sure. My point was that the places where you perform this check are few in 
practice: mainly at the IO boundary of your program, where you actually need 
to deal with encodings, and, additionally, whenever you define scalar value 
constants (a check that could actually be performed by your compiler if your 
language provides a literal notation for values of this type).

Best, 

Daniel





Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:

> There are two main choices for a scalar-value API:
>  
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure #2 really makes sense in practice: it would mean you can't access 
scalar values which need surrogates to be encoded. 

Also, regarding #1, you can always define an API that has this property 
regardless of the actual storage; it's only that your indexing operations 
might be costly, as they do not directly map to the underlying storage array.
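
For instance, here is a minimal sketch of a scalar-value get over UTF-8 
storage, using OCaml's stdlib UTF decoding support (String.get_utf_8_uchar, 
available since 4.14). The point is the cost: get s i is O(i) because it must 
decode from the start instead of indexing the byte array directly.

```ocaml
(* Scalar-value indexing over UTF-8 bytes: O(i), not O(1), because
   scalar value positions do not map directly to byte positions. *)
let get s i =
  let rec loop pos k =
    let d = String.get_utf_8_uchar s pos in
    if k = 0 then Uchar.utf_decode_uchar d
    else loop (pos + Uchar.utf_decode_length d) (k - 1)
  in
  loop 0 i

let () =
  let s = "a\xc3\xa9b" in                       (* "aéb" in UTF-8 *)
  assert (Uchar.to_int (get s 1) = 0xE9);       (* é, a 2-byte sequence *)
  assert (Uchar.to_int (get s 2) = Char.code 'b')
```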

That being said, I don't think direct indexing/iterating for Unicode text is 
such an interesting operation, due of course to the normalization/segmentation 
issues. Basically, if your API provides them, I only see these indexes as 
useful ways to define substrings. APIs that identify/iterate boundaries (and 
thus substrings) are more interesting given the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF]. 

Yes. But note that in practice, if you are in #1 above, you usually perform 
this check only at the point of decoding, where you are already performing a 
lot of other checks. Once done, you no longer need to check anything as long 
as the operations you perform on the values preserve the invariant. Also, 
converting back to an integer if you need one is a no-op: it's the identity 
function. 

The OCaml Uchar module does this. This is the interface: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli

which defines the type t as abstract and here is the implementation: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml

which defines the implementation as type t = int, meaning that values of this 
type are *unboxed* OCaml integers (and will be stored as such in, say, an 
OCaml array). However, since the module system enforces type abstraction, the 
only way of creating such values is to use the constants or the constructors 
(e.g. of_int), which all maintain the scalar value invariant (if you disregard 
the unsafe_* functions). 
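
Concretely, with that interface (is_valid, of_int, to_int are all part of the 
stdlib Uchar module):

```ocaml
let () =
  assert (Uchar.is_valid 0x1F47D);          (* a scalar value *)
  assert (not (Uchar.is_valid 0xD83D));     (* a lone surrogate is not *)
  let u = Uchar.of_int 0x1F47D in
  assert (Uchar.to_int u = 0x1F47D);        (* back to int: the identity *)
  (* of_int performs the range check once, at construction. *)
  match Uchar.of_int 0xD83D with
  | exception Invalid_argument _ -> ()
  | _ -> assert false
```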

Note that it would be perfectly possible to adopt a similar approach in C via 
a typedef, though given C's rather loose type system a little more discipline 
would be required from the programmer (always going through the constructor 
functions to create values of the type).

Best, 

Daniel





Re: Unicode String Models

2018-10-02 Thread Daniel Bünzli via Unicode
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:

> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates 
the problem of confusing "Unicode characters" (or integers, or scalar values) 
with their *encoding* (how to represent these integers as byte sequences), 
which is a source of endless confusion among programmers. 

This confusion is easily lifted once you explain that there exist certain 
integers, the scalar values, which are your actual characters, and that you 
then have different ways of encoding your characters; one can then explain 
that a surrogate is not a character per se, it's a hack, and there's no point 
in indexing them except if you want trouble.

This may also suggest another taxonomy for classifying the APIs: those in 
which you work directly with the character data (the scalar values) and those 
in which you work with an encoding of the actual character data (e.g. a 
JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's. 

That reality depends on your programming language. If the latter supports type 
abstraction, you can define an abstract type for scalar values (whose 
implementation may simply be an integer). If you always go through the 
constructor to create these "integers", you can maintain the invariant that a 
value of this type is an integer in the ranges [0x0000;0xD7FF] and 
[0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed 
your "character" data to other processes like UTF-X encoders: it guarantees 
the correctness of their output regardless of what the programmer does.

Best, 

Daniel





Re: Unicode String Models

2018-09-09 Thread Daniel Bünzli via Unicode
Hello, 

I find your notion of "model" and its presentation a bit confusing, since it 
conflates what I would call the internal representation and the API. 

The internal representation defines how the Unicode text is stored and should 
not really matter to the end user of the string data structure. The API 
defines how the Unicode text is accessed, expressed by what the result of an 
indexing operation on the string is. The latter is really what matters to the 
end user and is what I would call the "model".

I think the presentation would benefit from making a clear distinction between 
the internal representation and the API; you could then easily summarize them 
in a table which would make a nice summary of the design space.

I also think you are missing one API, which is the one with ECG I would 
favour: indexing returns Unicode scalar values; internally it can be whatever 
you wish, UTF-{8,16,32} or a custom encoding. Maybe that's what you intended 
by the "Code Point Model: Internal 8/16/32", but that's not what it says; the 
distinction between code point and scalar value is an important one and I 
think it would be good to insist on it to clarify minds in such documents.

Best, 

Daniel





UAX #42 update for 11.0.0 & \p{Extended_Pictographic}

2018-04-04 Thread Daniel Bünzli via Unicode
Hello, 

Is there any ETA for an update to the ucdxml for 11.0.0? 

Also, while reviewing the proposed update to UAX #29, I noticed it refers to a 
property (\p{Extended_Pictographic}) that doesn't seem to be formally part of 
the UCD but is to be found in UTS #51.

Is there any chance for this property to be part of a possible update to UAX 
#42 for 11.0.0? That would significantly help implementers whose pipelines 
rely on the ucdxml to implement the standard. 

Best, 

Daniel







Re: emoji props in the ucdxml ?

2017-07-06 Thread Daniel Bünzli via Unicode
Ken, 

Thanks for your explanations. 

I would just like to note that UAX #42 specifies a general XML data format for 
associating properties with code points. So it would be possible for the 
standard's maintainers to publish, independently of the UCD and alongside the 
ad-hoc text files, XML files that carry these properties.

Best, 

Daniel

P.S. I don't have a particular obsession or love for XML, but when I started 
to implement bits of the standard a few years ago I apparently made the 
mistake of thinking that the UTC would eventually move away from creating 
ad-hoc text files in favour of the structured data format of the ucdxml. So 
most of my implementation pipeline is geared towards consuming character 
properties from these files.





emoji props in the ucdxml ?

2017-07-05 Thread Daniel Bünzli via Unicode
Hello, 

I know the emoji properties [1] are not formally part of the UCD (not sure 
exactly why, though), but are there any plans to integrate the data into the 
ucdxml [2] (possibly as separate files)? 

Thanks, 

Daniel

[1] http://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[2] http://www.unicode.org/reports/tr42/