Re: UAX #29 and WB4
On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:
>
> > Re-reading the text I suspect I should not restart the rules from the
> > first one when a WB4 rewrite occurs but only apply the subsequent rules.
> > Is that correct ?
>
> However even if that's correct I don't understand how this test case works:
>
> ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>
> Here the first two chars get rewritten with WB4 to ExtPict, then if only
> subsequent rules are applied we end up in WB999 and a break between 200D
> and 1F6D1.

That's nonsense, and not the operational model of the algorithm, which IIRC was once clearly stated on this list by Mark Davis (sorry, I failed to dig out the message): take each boundary position candidate, apply the rules in sequence taking the first one that matches, and then start over with the next candidate.

In that case, applying the rules between 1F6D1 and 200D leads to WB4, but then that implicitly adds a non-boundary condition for that boundary position -- this is not really evident from the formalism, but see the comment above WB4. Then we start again, applying the rules between 200D and the last 1F6D1, and WB3c matches before WB4 kicks in.

I think the behaviour of → rules should be clarified: it's not clear on which data you apply them w.r.t. the boundary position candidate. If I understand correctly, if the match spans over the boundary position candidate it simply turns it into a non-boundary; otherwise you apply the rule on the left of the boundary position candidate.

Regarding the question of my original message, it seems at a certain point I knew better: https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html

Sorry for the noise.

Daniel

P.S.
I still think UAX #29 and UAX #14 could benefit from clarifying the operational model of the rules a bit (I also have the impression that the formalism used to express all that may not be the right one, but then I don't have anything better to propose at the moment). Also, it would be nicer for implementers if they didn't have to factorize rules themselves (e.g. like in the new LB30 rules of UAX #14), so that the correctness of implemented rules is easier to assert.
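To make the per-candidate model above concrete, here is a minimal OCaml sketch. The property and rule names are illustrative and the rule set is tiny (WB3, WB3c, WB4, WB999 only), so this is a reading of the operational model, not a conforming implementation:

```ocaml
type prop = CR | LF | ZWJ | Extend | Format | ExtPict | Other
type decision = Break | No_break

(* [left] is the reversed property sequence before the candidate
   position, [right] the sequence after it. Rules before WB4 see the
   raw characters. *)
let wb3 left right = match left, right with
| CR :: _, LF :: _ -> Some No_break            (* WB3: CR x LF *)
| _ -> None

let wb3c left right = match left, right with
| ZWJ :: _, ExtPict :: _ -> Some No_break      (* WB3c: ZWJ x ExtPict *)
| _ -> None

(* WB4: X (Extend | Format | ZWJ)* -> X. Operationally: the candidate
   right before an ignorable character is a non-break, and rules after
   WB4 see the left context with ignorables skipped. *)
let wb4 _left right = match right with
| (Extend | Format | ZWJ) :: _ -> Some No_break
| _ -> None

let rec skip_ignorable = function
| (Extend | Format | ZWJ) :: tl -> skip_ignorable tl
| left -> left

(* One boundary candidate: try the rules in order, first match decides. *)
let decide left right =
  match wb3 left right with
  | Some d -> d
  | None ->
      match wb3c left right with
      | Some d -> d
      | None ->
          match wb4 left right with
          | Some d -> d
          | None ->
              (* rules after WB4 would consult [skip_ignorable left];
                 only WB999 (Any ÷ Any) remains in this sketch *)
              ignore (skip_ignorable left);
              Break
```

On the test case above, the candidate between 1F6D1 and 200D is settled by WB4, and the candidate between 200D and the trailing 1F6D1 by WB3c, before WB4 is ever consulted.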
Re: UAX #29 and WB4
On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> Re-reading the text I suspect I should not restart the rules from the
> first one when a WB4 rewrite occurs but only apply the subsequent rules.
> Is that correct ?

However even if that's correct I don't understand how this test case works:

÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]

Here the first two chars get rewritten with WB4 to ExtPict, then if only subsequent rules are applied we end up in WB999 and a break between 200D and 1F6D1. The justification in the comment indicates to use WB3c on the ZWJ, but that one should have been rewritten to ExtPict by WB4.

Best,

Daniel
UAX #29 and WB4
Hello,

My implementation of word break chokes on only the following test case from the file [1]:

÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]

I find:

÷ 0020 × 0308 × 0020 ÷

Basically my implementation uses WB4 to rewrite the first two characters to WSegSpace and then applies WB3d, resulting in the non-break between 0308 and 0020. Re-reading the text I suspect I should not restart the rules from the first one when a WB4 rewrite occurs but only apply the subsequent rules. Is that correct ?

Best,

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt
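One way to read this test case under the per-candidate model, sketched in OCaml with illustrative property types (not a full implementation): WB3d precedes WB4 in the rule order, so it matches on the raw characters and never sees a rewritten WSegSpace.

```ocaml
type prop = WSegSpace | Extend | Other
type decision = Break | No_break

(* [left] is the reversed property sequence before the candidate. *)
let decide left right = match left, right with
| WSegSpace :: _, WSegSpace :: _ -> No_break  (* WB3d, on raw characters *)
| _, Extend :: _ -> No_break                  (* WB4 absorbs the Extend *)
| _, _ -> Break                               (* WB999, for this sketch *)
```

At the candidate between 0308 and the second 0020, the raw left character is the COMBINING DIAERESIS, so WB3d does not match and WB999 yields the break the test file expects.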
UAX #14 for 13.0.0: LB27's first line is obsolete
Hello,

I think (more precisely, my compiler thinks [1]) that the first line of LB27 is already handled by the new LB22 rule and can be removed.

Best,

Daniel

[1]
File "uuseg_line_break.ml", line 206, characters 38-40:
206 | | (* LB27 *) _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s
                                                ^^
Warning 12: this sub-pattern is unused.
Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)
Thanks for your answer.

> The compromise that has generally been reached is that 'delete' deletes
> a grapheme cluster and 'backspace' deletes a scalar value. (There are
> good editors like Emacs that delete only a single character.)

Just to make things clear: when you say character in your message, you consistently mean scalar value, right ?

Best,

Daniel
Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)
On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode (unicode@unicode.org) wrote:

> When it comes to the second sentence of the text of Slide 7 'Grapheme
> Clusters', my overwhelming reaction is one of extreme anger. Slide 8
> does nothing to lessen the offence. The problem is that it gives the
> impression that in general it is acceptable for backspace to delete the
> whole grapheme cluster.

Let's turn extreme anger into knowledge. I'm not very knowledgeable in ligature-heavy scripts (I suspect that's what you refer to) and what you describe is the first thing I went with for a readline editor data structure. Would you care to expand on when exactly you think it's not acceptable, and on what kind of tools or standards I can find in the Unicode toolbox to implement an acceptable behaviour for backspace on general Unicode text ?

Best,

Daniel
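For reference, the delete-a-cluster/backspace-a-scalar compromise mentioned later in this thread can be sketched over a buffer of scalar values (held as ints here for brevity). The `cluster_len_at` parameter stands in for a real grapheme cluster segmenter implementing UAX #29; it is assumed, not provided:

```ocaml
(* backspace: remove the single scalar value before the cursor. *)
let backspace buf cursor =
  if cursor = 0 then (buf, 0)
  else (List.filteri (fun i _ -> i <> cursor - 1) buf, cursor - 1)

(* delete: remove the whole grapheme cluster starting at the cursor;
   [cluster_len_at buf i] is assumed to return the cluster's length. *)
let delete ~cluster_len_at buf cursor =
  let n = cluster_len_at buf cursor in
  (List.filteri (fun i _ -> i < cursor || i >= cursor + n) buf, cursor)
```

Whether deleting the whole cluster is acceptable for backspace too is exactly the question raised above; the sketch only separates the two mechanics.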
Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji))
On 12 October 2019 at 02:05:23, Martin J. Dürst via Unicode (unicode@unicode.org) wrote:

> I think it's less the format and much more the split personality of the
> Unicode Web site(s?) that I have problems with.

I also do. One thing that is especially annoying is that the "home" link on the "technical" (unchanged) subpart of the website goes back to the "marketing" home page, which is particularly inefficient (the links you are looking for are not above the fold on a laptop screen) and confusing (the whole layout shifts and the theme changes) when perusing the technical part of the website.

With all due respect for the work that has been done on the new website, I think the new structure significantly decreased the usability of the website for technical users.

Best,

Daniel
Re: Unicode String Models
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote:

> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1 would be excluding
> in the case of UTF-16 would be unpaired surrogates. That is, suppose the
> underlying storage is UTF-16 code units that don't satisfy #1.
>
> 0061 D83D DC7D 0061 D83D
>
> A code point API would return for those a sequence of 4 values, the last of
> which would be a surrogate code point.
>
> 0061, 0001F47D, 0061, D83D
>
> A scalar value API would return for those also 4 values, but since we
> aren't in #1, it would need to remap.
>
> 0061, 0001F47D, 0061, FFFD

Ok, understood. But I think that if you go to the length of providing a scalar value API, you would also prevent the construction of strings that have such anomalies in the first place (e.g. by erroring in the constructor if you provide it with malformed UTF-X data), i.e. maintain 1. From a programmer's perspective I really don't get anything from 2. except confusion.

> If it is a real datatype, with strong guarantees that it *never* contains
> values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
> from number will require checking. And in my experience, without a strong
> guarantee the datatype is in practice pretty useless.

Sure. My point was that the places where you perform this check are few in practice: mainly at the IO boundary of your program, where you actually need to deal with encodings, and, additionally, wherever you define scalar value constants (a check that could actually be performed by your compiler if your language provides a literal notation for values of this type).

Best,

Daniel
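The two behaviours Mark describes can be sketched as a single decoder over an array of 16-bit code units that is not guaranteed to be well-formed UTF-16 (illustrative code, not any particular library's API):

```ocaml
let is_high u = 0xD800 <= u && u <= 0xDBFF
let is_low u = 0xDC00 <= u && u <= 0xDFFF
let is_surrogate u = 0xD800 <= u && u <= 0xDFFF

(* Returns the values the API would yield over [units]. With
   ~scalar:false this is the code point API (lone surrogates come out
   as-is); with ~scalar:true, the scalar value API (remapped to U+FFFD). *)
let decode ~scalar units =
  let n = Array.length units in
  let rec loop acc i =
    if i >= n then List.rev acc else
    let u = units.(i) in
    if is_high u && i + 1 < n && is_low units.(i + 1) then
      let cp = 0x10000 + ((u - 0xD800) lsl 10) + (units.(i + 1) - 0xDC00) in
      loop (cp :: acc) (i + 2)              (* a well-formed pair *)
    else if is_surrogate u && scalar then
      loop (0xFFFD :: acc) (i + 1)          (* scalar value API remaps *)
    else
      loop (u :: acc) (i + 1)               (* code point API passes it on *)
  in
  loop [] 0
```

On Mark's example `0061 D83D DC7D 0061 D83D`, the two modes differ only in the final value, D83D versus FFFD.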
Re: Unicode String Models
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote:

> There are two main choices for a scalar-value API:
>
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure 2. really makes sense in practice: it would mean you can't access scalar values which need surrogates to be encoded.

Also, regarding 1., you can always define an API that has this property regardless of the actual storage; it's only that your indexing operations might be costly, as they do not directly map to the underlying storage array. That being said, I don't think direct indexing/iterating for Unicode text is such an interesting operation, due of course to the normalization/segmentation issues. Basically, if your API provides them, I only see these indexes as useful ways to define substrings. APIs that identify/iterate boundaries (and thus substrings) are more interesting given the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF].

Yes. But note that in practice, if you are in 1. above, you usually perform this check only at the point of decoding, where you are already performing a lot of other checks. Once done, you no longer need to check anything as long as the operations you perform on the values preserve the invariant. Also, converting back to an integer if you need one is a no-op: it's the identity function. The OCaml Uchar module does this.
This is the interface: https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli which defines the type t as abstract, and here is the implementation: https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml which defines type t = int, which means values of this type are an *unboxed* OCaml integer (and will be stored as such in, say, an OCaml array). However, since the module system enforces type abstraction, the only way of creating such values is to use the constants or the constructors (e.g. of_int), which all maintain the scalar value invariant (if you disregard the unsafe_* functions).

Note that it would be perfectly possible to adopt a similar approach in C via a typedef, though given C's rather loose type system a little more discipline would be required from the programmer (always go through the constructor functions to create values of the type).

Best,

Daniel
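Concretely, with the stdlib Uchar module (OCaml >= 4.03), the check happens once in the constructor and the conversion back to an integer is free:

```ocaml
(* A scalar value: construction succeeds, to_int is the identity. *)
let u = Uchar.of_int 0x1F47D
let () = assert (Uchar.to_int u = 0x1F47D)

(* A lone surrogate: is_valid rejects it and of_int raises, so a
   [Uchar.t] holding a surrogate can never be constructed. *)
let () = assert (not (Uchar.is_valid 0xD83D))
let () = match Uchar.of_int 0xD83D with
| exception Invalid_argument _ -> ()
| _ -> assert false
```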
Re: Unicode String Models
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote:

> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates the confusion between "unicode characters" (or integers, or scalar values) and their *encoding* (how to represent these integers as byte sequences), which is a source of endless confusion among programmers. This confusion is easily lifted once you explain that there exist certain integers, the scalar values, which are your actual characters, and that you then have different ways of encoding your characters; one can then explain that a surrogate is not a character per se, it's a hack, and there's no point in indexing them except if you want trouble.

This may also suggest another taxonomy for classifying the APIs: those in which you work directly with the character data (the scalar values) and those in which you work with an encoding of the actual character data (e.g. a JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's.

That reality depends on your programming language. If the latter supports type abstraction you can define an abstract type for scalar values (whose implementation may simply be an integer). If you always go through the constructor to create these "integers", you can maintain the invariant that a value of this type is an integer in the ranges [0x0000;0xD7FF] and [0xE000;0x10FFFF].
Knowing this invariant holds is quite useful when you feed your "character" data to other processes like UTF-X encoders: it guarantees the correctness of their outputs regardless of what the programmer does.

Best,

Daniel
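The abstract type idea above can be sketched in a few lines of OCaml. The module name `Usv` and its tiny signature are illustrative (the stdlib's Uchar is the real-world version); outside the module, values of `Usv.t` can only come from `of_int`, which checks the scalar value ranges:

```ocaml
module Usv : sig
  type t
  val of_int : int -> t        (* raises Invalid_argument on non-scalars *)
  val to_int : t -> int        (* operationally the identity *)
end = struct
  type t = int
  let of_int i =
    if (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)
    then i
    else invalid_arg "not a Unicode scalar value"
  let to_int i = i
end
```

Since `t` is an `int` underneath, there is no boxing or conversion cost; the abstraction is purely a compile-time guarantee.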
Re: Unicode String Models
Hello,

I find your notion of "model" and its presentation a bit confusing, since it conflates what I would call the internal representation and the API. The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure. The API defines how the Unicode text is accessed, expressed by what the result of an indexing operation on the string is. The latter is really what matters to the end user and is what I would call the "model". I think the presentation would benefit from making a clear distinction between the internal representation and the API; you could then easily summarize them in a table, which would make a nice summary of the design space.

I also think you are missing one API, which is the one I would favour: indexing returns Unicode scalar values; internally it can be whatever you wish, UTF-{8,16,32} or a custom encoding. Maybe that's what you intended by "Code Point Model: Internal 8/16/32", but that's not what it says. The distinction between code point and scalar value is an important one, and I think it would be good to insist on it to clarify the minds in such documents.

Best,

Daniel
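The favoured API above (scalar value access over an arbitrary internal representation) can be sketched for a UTF-8 internal store using the OCaml stdlib decoder available since 4.14; this is an illustrative fold, not a proposal for any particular library. Malformed bytes surface as U+FFFD, so consumers only ever see scalar values:

```ocaml
(* Fold [f] over the scalar values of the UTF-8 encoded string [s]. *)
let fold_uchars f acc s =
  let rec loop acc i =
    if i >= String.length s then acc else
    let d = String.get_utf_8_uchar s i in
    loop (f acc (Uchar.utf_decode_uchar d)) (i + Uchar.utf_decode_length d)
  in
  loop acc 0

let to_scalar_values s =
  List.rev (fold_uchars (fun acc u -> Uchar.to_int u :: acc) [] s)
```

Indexing by scalar value position would be O(n) over such a store, which is one reason to prefer iteration and boundary-based APIs, as argued elsewhere in this thread.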
UAX #42 update for 11.0.0 & \p{Extended_Pictographic}
Hello,

Is there any ETA for an update of the ucdxml to 11.0.0 ?

Also, while reviewing the proposed update to UAX #29, I noticed it refers to a property (\p{Extended_Pictographic}) that doesn't seem to be formally part of the UCD but is instead to be found in UTS #51. Is there any chance for this property to be part of a possible update to UAX #42 for 11.0.0 ? That would significantly help implementers whose pipeline relies on the ucdxml to implement the standard.

Best,

Daniel
Re: emoji props in the ucdxml ?
Ken,

Thanks for your explanations. I would just like to note that UAX #42 expresses a general XML data format for associating properties with code points. So it would be possible for the standard maintainers to publish, independently from the UCD and alongside the ad-hoc text files, XML files that carry these properties.

Best,

Daniel

P.S. I don't have a particular obsession or love for XML, but when I started to implement bits of the standard a few years ago I apparently made the mistake of thinking that the UTC would eventually move away from creating ad-hoc text files and favour the structured data format of the ucdxml. So most of my implementation pipeline is geared at consuming character properties from these files.
emoji props in the ucdxml ?
Hello,

I know the emoji properties [1] are not formally part of the UCD (not sure exactly why, though), but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files) ?

Thanks,

Daniel

[1] http://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[2] http://www.unicode.org/reports/tr42/