[webkit-dev] Fwd: Fwd: Fwd: HTML5 & MathML3 entities

Alexey Proskuryakov Fri, 17 Sep 2010 18:08:15 -0700


Begin forwarded message:


> From: David Carlisle <[email protected]>
> Date: 17 September 2010, 18:02:39 Pacific Daylight Time
> To: Alexey Proskuryakov <[email protected]>
> Subject: Re: [webkit-dev] Fwd: Fwd: HTML5 & MathML3 entities
> 
> On 18/09/2010 00:05, Alexey Proskuryakov wrote:
>> 
>> On 17.09.2010, at 15:32, David Carlisle wrote:
>> 
>>> adding a canonical decomposition doesn't imply deprecation.
>>> Depending on which canonical form is chosen, the canonicalisation
>>> mapping can go either way, loosely speaking some forms prefer
>>> composite characters, some use combining characters in preference
>>> (not that combining characters are involved here)
>> 
>> This is not accurate. For singleton decomposition, both NFC and NFD
>> contain the decomposed form. See Unicode 5.2.0 section D113 (full
>> composition exclusion) for details.
> 
> yes NFC and NFD are the same in these cases, but that doesn't really change 
> the main point that deprecation here is nothing to do with the character 
> having a different normal form. Compare
> ANGSTROM SIGN (212B) and
> LATIN CAPITAL LETTER A WITH RING ABOVE (00C5)
> these are similarly related by canonical form and so clearly C5 is preferred 
> but 212B is not deprecated in the same way as 2329  is.
> see the entry for 212B in
> 
> http://www.unicode.org/charts/PDF/U2100.pdf
> 
> 2329 is deprecated because it is replaced by 27E8  not because it
> maps to something else in NFC.
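
The comparison above can be checked against the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module as the data source; the helper name `normal_forms` is purely illustrative. It lists the NFC and NFD forms of U+212B and U+2329 as code points. For U+2329 both forms are the single character U+3008, and for U+212B the NFC form is U+00C5 while NFD decomposes one step further into A plus a combining ring. Neither normalization ever yields U+27E8, which is consistent with the point that the replacement by U+27E8 is separate from the canonical mapping.

```python
import unicodedata


def normal_forms(ch):
    """Return the NFC and NFD forms of one character as lists of code points.

    Illustrative helper only; it simply wraps unicodedata.normalize.
    """
    return {
        form: [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, ch)]
        for form in ("NFC", "NFD")
    }


ANGSTROM_SIGN = "\u212B"
LEFT_POINTING_ANGLE_BRACKET = "\u2329"

for ch in (ANGSTROM_SIGN, LEFT_POINTING_ANGLE_BRACKET):
    # Print the character's code point, its Unicode name, and both normal forms.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {normal_forms(ch)}")
```

The exact results depend on the Unicode version bundled with the Python build, although canonical decompositions are stable once assigned.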
> 
>> 
>>> 2329  was deprecated some years after the canonical mapping was
>>> added because it was realised that that mapping was wrong, but
>>> mappings are never changed once added. It became deprecated not
>>> when the mapping to 3008 was added; it became deprecated when it
>>> was replaced by 27E8. I described it as a two step process because
>>> it happened in two stages.
>> 
>> Because of the above, I don't see how it could happen in two stages.
>> Adding a singleton decomposition logically implies deprecation. And
>> it wasn't until Unicode 5.2 that "deprecated" had a clearly defined
>> meaning anyway.
> If that were the meaning of deprecated in this case, the deprecated character 
> would be deprecated in favour of its canonically equivalent character but 
> that isn't the case. It is deprecated _because_ that incorrect decomposition 
> exists, and is deprecated in favour of a new character added specifically to 
> avoid the problem.
>> 
>>> It was conformant to unicode 2 yes, the fact that unicode then
>>> added a canonical form to 3xxx doesn't make them non conformant,
>>> systems don't have to use NFC form and they don't have to use any
>>> particular glyph, so for either reason it's perfectly conformant to
>>> use a math character for 2329.
>> 
>> Again, both composition and decomposition of U+2329 produces U+3008.
> Yes but a system isn't obliged to compose or decompose (and most do not 
> automatically in my experience)
>> 
>>> The point is that there have been documents using those entities as
>>> math character names in continuous use since the '80s why should
>>> they all be broken? Not to mention the fact that the vast majority
>>> of use of those entities in html will also be expecting a
>>> mathematical bracket (even if on some systems, with some fonts the
>>> character glyph used was actually designed for CJK punctuation).
>>> 
>>> In fact where classical ISO usage and HTML usage differed I
>>> followed HTML usage in all cases (for all the obvious reasons) even
>>> when the HTML definitions make no sense at all (eg asymp) but in
>>> this case external factors (ie Unicode moving the goalposts) meant
>>> that the "new" Unicode 3.2 character should be used here.
>> 
>> Do these documents use the entities with the same "&...;" notation?
> 
> yes, of course.
> 
>> MathML didn't exist in the 80's, so what are the documents that
>> actually conflict with HTML, or with compound XHTML documents?
> 
> Well the point of breaking the mathml (and html) entities into a separate 
> spec was to get a uniform set of definitions across different uses. If (as 
> was the case) the same entity name (used via the same syntax) means different 
> things in docbook, mathml and html, then formally you may argue that 
> everything is OK and consistent, each document obeys its own language 
> definition, but in practice moving fragments between documents results in 
> silent data corruption.
> the entity spec was separated out from the mathml spec in 2003 and went 
> through numerous public revisions, people in the old and the new HTML groups 
> were asked to comment on it, people in the UTC/Unicode list and people on the 
> original ISO working groups who defined the entity names originally, after 7 
> years of open review it went to REC earlier this year (and MathML3 depending 
> on it will hopefully go to REC this month)
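
The risk of one entity name meaning different things in different vocabularies can be illustrated with two entity tables that happen to ship in Python's standard library. This is only a sketch under that assumption: `html.entities.name2codepoint` holds the HTML 4 definitions and `html.entities.html5` holds the HTML5 ones, and neither table says anything about what WebKit or any other browser does. The helper name `entity_definitions` is hypothetical.

```python
import html
from html.entities import html5, name2codepoint


def entity_definitions(name):
    """Return the code point a named entity maps to in each table.

    Illustrative helper only; it reads Python's bundled HTML 4 and HTML5
    entity tables and formats the results as U+XXXX strings.
    """
    return {
        "html4": f"U+{name2codepoint[name]:04X}",
        "html5": f"U+{ord(html5[name + ';']):04X}",
        # html.unescape resolves references using the HTML5 table.
        "unescape": f"U+{ord(html.unescape('&' + name + ';')):04X}",
    }


for name in ("lang", "rang"):
    print(name, entity_definitions(name))
```

A fragment containing `&lang;` that is expanded with one table and then pasted into a document governed by the other ends up with a different character and no error, which is the silent corruption described above.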
> 
> 
>>>>> the only fix the UTC suggest for that is just not using 2329 at
>>>>> all and use 27E8 instead. Which is what the entity spec
>>>>> recommends.
>>>> 
>>>> 
>>>> Did they actually suggest to use it for the lang entity in HTML,
>>>> or did they suggest to use it when a math character is desired?
> 
> the comments were in relation to the entities draft which has the explicit 
> intention of being a common set of definitions for any uses of these entity 
> names.
> 
>>> xhtml entities have document scope it is not possible for an
>>> xhtml+mathml document to have different definitions for html and
>>> mathml use, but even for pure html use it is fairly clear that 27e8
>>> is the correct choice.
>> 
>> I wasn't asking about HTML vs. XHTML - both used to define &rang; in
>> the same way.
> 
> The same way as MathML2, actually. This change isn't about matching XHTML or 
> MathML2, it's about tracking changes to Unicode.
> 
>> I can re-phrase my question as "Did they actually
>> suggest to use it for the lang entity in (X)HTML, or did they suggest
>> to use it when a math character is desired?"
>> 
> 
>> 
>> I don't think that characterizing what we did in WebKit as bizarre in
>> the extreme is fair.
> 
> fair or not, I think it was clearly the wrong thing to do (even if well 
> intentioned) nothing in HTML or XHTML specifications would licence such a 
> definition. You could claim perhaps that you were using HTML followed by NFC 
> normalisation, but that's a very weak argument I think.
> 
> The Unicode spec (or at least the code chart page at 
> http://www.unicode.org/charts/PDF/U2300.pdf which is what I have to hand) 
> doesn't say it is deprecated in favour of 3009 it says that it is deprecated 
> _because_ of the equivalence to CJK punctuation and that mathematical use is 
> strongly recommended to use 27e8 instead.
> 
> It is very hard to think that anyone using CJK characters (and so presumably 
> with access to some convenient keyboarding scheme for those code ranges) 
> suddenly requires an ascii entity name reference to access a punctuation 
> character. Conversely mathematical usage habitually uses long ascii names for 
> characters. It is clear that rang and lang have always been intended as 
> mathematical characters, and I ask again whether you really think that 
> (barring artificial test cases) anyone writing in CJK languages uses these 
> english ascii entity references for just those two characters? I don't see 
> how it is possible to read Unicode as saying anything other than rang ought 
> to point to 27e8
> 
> Unicode technical report 25 says
> 
> Unicode 3.2 added two new mathematical angle bracket characters ⟨ ⟩ (U+27E8 
> and U+27E9) that are unequivocally intended for mathematical use and should 
> be used instead of U+2329 and U+232A.
> 
> 
> 
> David
> 
> 

- WBR, Alexey Proskuryakov


_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
