Maybe; but there is real doubt that a regular expression that needs this property would be severely broken if the property were corrected. There are many other properties that are more useful (and much more used) whose associated set of code points changes regularly across versions.
I don't see any specific interest in maintaining noncharacters in that block, as it effectively reduces the reusability of this property. In fact it would be highly preferable to no longer state that these code points in ArabicPresentationForm are noncharacters, and to treat them like C1 controls or PUA instead (because they will never be reassigned to something more useful). Making them PUA would not radically change the fact that these characters are not recommended, but we would no longer have to bother checking whether they are valid or not. They remain there only as a legacy of old, outdated versions of Unicode, for a mysterious need that I have not clearly identified.

Let's assume we change them into PUA: some applications will start accepting them while others won't. That is not a problem, given that they are already not interoperable.

And regular expressions that try to use character properties have many more caveats to handle. The most serious are canonical equivalences and discontinuous or partial matches, when searches focus only on exact sets of code points instead of sets of canonically equivalent texts. The other complication comes from the effect of collation and its variable strength, which matches more or fewer parts of the text spanning ignorable collation elements, i.e., possibly also discontinuous runs of ignorable code points, if we want consistent results independent of the normalization form. More complicated is how to handle "partial matches", such as a combining character within a precomposed character that is canonically equivalent to a string in which this combining character appears.

Even more tricky is how to handle substitution with regexps, for example when performing a search at the primary collation level, ignoring letter case, when we want to replace base letters but preserve case in the substituted string. This requires a specific lookup of characters using properties **not** specified in the UCD but in the collation tailoring data. Then there is the question of how to ensure that the result of the substitution in the plain-text source remains a valid text that does not create new unexpected canonical equivalences, that it does not break basic orthographic properties such as syllabic structures in a specific language+script pair, and that it does not produce unexpected collation equivalents at the same collation strength, which would later cause unexpected never-ending loops of substitutions, for example on large websites with bots operating text corrections.

Regexps are still a very experimental proposal: they remain very difficult to make interoperable except in a small set of tested cases. For this reason I really doubt that the "character encoding block" property is very productive for now with regexps (notably not with this "compatibility" block, whose characters will remain used in isolation, independently of their context, if they are still used in rare cases). I see little value in keeping this old complication in this block, only more interoperability problems for implementations.

So these noncharacters should be treated mostly like PUA, except that they would have a few more properties: direction=RTL, script=Arabic, and starters working in isolation for the Arabic joining type. These properties can help limit their generic reusability relative to regular PUAs, but at least all other processes, and notably generic validators, won't have to bother about them.
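To make the canonical-equivalence and "partial match" caveats above concrete, here is a minimal sketch in Python. The helper names `find_code_points` and `find_canonical` are hypothetical, and normalizing both sides to NFD is only one naive approach assumed for illustration; it says nothing about collation strength or about discontinuous matches.

```python
import re
import unicodedata


def find_code_points(needle: str, haystack: str) -> bool:
    """Search on exact code point sequences, as a plain regexp engine does."""
    return re.search(re.escape(needle), haystack) is not None


def find_canonical(needle: str, haystack: str) -> bool:
    """Naive canonical-equivalence search: normalize both sides to NFD first."""
    nfd_needle = unicodedata.normalize("NFD", needle)
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    return re.search(re.escape(nfd_needle), nfd_haystack) is not None


precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
acute = "\u0301"

# The two strings are canonically equivalent, yet the code point search misses.
print(find_code_points(precomposed, decomposed))  # False
print(find_canonical(precomposed, decomposed))    # True

# "Partial match": the combining accent is only visible after decomposition.
print(find_code_points(acute, precomposed))       # False
print(find_canonical(acute, precomposed))         # True
```

Whether the last result is desirable is exactly the open question: the accent is found "inside" a precomposed character, and any substitution applied to that match would have to decide what to do with the base letter.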
2014-05-31 18:17 GMT+02:00 Asmus Freytag <asm...@ix.netcom.com>:
> On 5/31/2014 4:09 AM, Philippe Verdy wrote:
>
> 2014-05-30 20:49 GMT+02:00 Asmus Freytag <asm...@ix.netcom.com>:
>
>> This might have been possible at the time these were added, but now it is
>> probably not feasible. One of the reasons is that block names are exposed
>> (for better or for worse) as character properties and as such are also
>> exposed in regular expressions. While not recommended, it would be really
>> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
>> were to fail, because we split the block into three (with the middle one
>> being the noncharacters).
>>
>
> If you think about pseudocode testing for properties then nothing
> forbids the test IsInArabicPresentationFormB(x) to check two ranges instead
> of just one.
>
> Besides the point.
>
> The issue is that the result of evaluation of an expression would change
> over time.
>
> A./
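
Both positions in the quoted exchange can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the ranges used are U+FB50..U+FDFF for the block that contains the noncharacters U+FDD0..U+FDEF (the current UCD assigns that range to Arabic Presentation Forms-A; the pseudo-code name in the quoted messages is kept as written there).

```python
# Hypothetical helpers; only the code point ranges are taken from the UCD.
BLOCK_START, BLOCK_END = 0xFB50, 0xFDFF        # block containing the noncharacters
NONCHAR_START, NONCHAR_END = 0xFDD0, 0xFDEF    # the 32 noncharacters


def in_block_single_range(cp: int) -> bool:
    """Block test as it works today: one contiguous range."""
    return BLOCK_START <= cp <= BLOCK_END


def in_block_split(cp: int) -> bool:
    """Block test after a hypothetical split: two ranges around the noncharacters."""
    return BLOCK_START <= cp < NONCHAR_START or NONCHAR_END < cp <= BLOCK_END


# Code points whose answer would differ between the two definitions.
changed = [
    cp for cp in range(BLOCK_START, BLOCK_END + 1)
    if in_block_single_range(cp) != in_block_split(cp)
]
print(len(changed))  # 32
```

Testing two ranges is trivial to implement, and the only inputs whose result changes over time are the 32 noncharacters themselves.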