Re: [whatwg] Entity parsing
On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
> 1) Is it useful to handle unterminated entities followed by an alphanumerical character like IE does? The number of documents for which this actually helps might be small compared to the number of documents that contain other, incorrigible errors. The process also introduces errors, albeit not in conforming documents. Is the gain worth the added complexity? If so, then should this apply to all entities? (Probably not.) Would it be useful to add to/remove from the set supported by IE7? (This may seem insane, but we should try to avoid premature decisions.)
> 2) HTML 4.01 allows the semicolon to be omitted in certain cases. Does this cause problems? Firefox and Safari both support this, and it would seem meaningless to change the way conforming documents are parsed unless it can be shown that, e.g., &ndash actually is supposed to mean &amp;ndash more often than &ndash; . (Conformance is a separate issue.)
> 3) Will new entities ever be needed? If yes, can new entities adopt existing conformance criteria and parsing rules?
> 4) Similar considerations for entities in attribute values.

New entities have since been added, and the rules for parsing entities (sorry, named character references) have been changed a bit. However, I am reluctant to change this from what we have now, since what we have now works well. How strongly do you feel about this?

-- Ian Hickson U+1047E http://ln.hixie.ch/ U+263A
Things that are impossible just take longer.
Re: [whatwg] Entity parsing
On Thu, 28 Jun 2007 04:53:09 +0200, Øistein E. Andersen [EMAIL PROTECTED] wrote: I would really like an informed decision, and I currently get the impression that rules are changed to follow IE by default rather than to handle existing content, which may lead to unnecessary complicated rules that do not actually handle existing documents optimally. 1) It was quite easy to implement. Took me about thirty minutes including updating several tests and adding a few extra tests. (In html5lib, Python.) 2) You're saying that content breaks in IE? -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]
I had a look at the reference page you have directed me to: it actually states that the ISO-8859-1 character set can be used for English. Although my hypothesis that the word œovre is not English remains valid (see also the citations in the appendix), I admit that the fact that the ligature œ is not included in the character set (and, consequently, that the character set ISO-8859-1 cannot be used for encoding French text, which I find kind of stunning because of the popularity of the French language) provides a much simpler explanation to the observable phenomenon. My fault, I should have checked that up first. Best regards Chris APPENDIX Other Wikipedia entries also disagree, e.g. http://en.wikipedia.org/wiki/%C5%92 Borrowings into English from Latin words featuring œ are often spelled with the letter e, especially in American English. For example, fœderal became federal in English, while fœtus became fetus only in American English. Other œs in English spell out as 2 separate letters oe. http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligat ure The use of the œ and æ is obsolescent in modern English, and has been used predominantly in British English. It is usually used to evoke archaism, or in literal quotations of historic sources. http://en.wikipedia.org/wiki/American_and_British_English_spelling_differen ces#Simplification_of_ae_.28.C3.A6.29_and_oe_.28.C5.93.29 In English, which has imported words from all three languages, it is now usual to replace Æ/æ with Ae/ae and Œ/œ with Oe/oe. Microsoft Word does not accept hors d'œuvre but it has no problem with hors d'oeuvre. The American English International keyboard does not provide a way to type the ligature œ. The Microsoft Encarta dictionary does not recognize such a spelling, nor does Reference.com. The word coeur is not mentioned in any English dictionary I know. -Original Message- From: Oistein E. 
Andersen [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 27, 2007 11:44 PM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut] You might want to have a look at http://pl.wikipedia.org/wiki/ISO_8859-1 . Afterwards, consider the following: 1) Latin-1 does not contain all the characters that are required for typesetting of English.
Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]
On Jun 28, 2007, at 14:51, Křištof Želechovski wrote:
> I admit that the fact that the ligature œ is not included in the character set (and, consequently, that the character set ISO-8859-1 cannot be used for encoding French text, which I find kind of stunning because of the popularity of the French language) provides a much simpler explanation to the observable phenomenon.

This discussion is not relevant to the WHATWG or HTML5. HTML5 is defined in terms of Unicode, and Unicode covers both English and French (and quite a bit more). Anyone is free to use all that expressiveness directly by encoding documents as UTF-8. Entities and legacy encodings don't add any expressiveness. They just expand to Unicode. The details of how this is handled are constrained by legacy, not by political correctness.

P.S. Before anyone slaps me for being politically incorrect or insensitive, I'd like to point out that my native language uses characters whose entity names are biased towards German terminology. But this isn't the slightest technical problem. Let's move on.

-- Henri Sivonen [EMAIL PROTECTED] http://hsivonen.iki.fi/
Re: [whatwg] Entity parsing
On 28 Jun 2007, at 9:4AM, Anne van Kesteren wrote: 1) It was quite easy to implement. Sorry, I never meant to say that it was difficult to implement, merely that it is counter-intuitive and probably suboptimal. 2) You're saying that content breaks in IE? Surprising as it may sound, such content demonstrably exists, and available data do not support the presupposition that doing exactly what IE does is actually the best solution for handling existing content. -- Øistein E. Andersen
Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]
How does it influence the case fianc&eacutee vs &oeliguvre? The only difference is that the first one is used in English.
Chris

-----Original Message-----
From: Oistein E. Andersen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 26, 2007 10:55 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

On 26 Jun 2007, at 7:49AM, Křištof Želechovski wrote:
> Internet Explorer apparently chose to support English natively while SGML preferred remaining language-agnostic.

To be fair, this is not how things developed. Microsoft first chose to make the semicolon optional not only when allowed by SGML rules (notably before whitespace and tags), but in any position, for all named entities /that existed at the time/, i.e., latin-1. Unfortunately, this meant that new entities could not be added without changing the interpretation of already existing pages (e.g., if a page contained less&less, adding the entity &le to the list would result in its being interpreted as less≤ss), although most of the entities have names that are rather unlikely to appear by chance, and the ampersand should be spelt &amp;. Microsoft did not dare to risk this, so entities beyond latin-1 require a semicolon in IE, even in cases where it is optional according to SGML (and therefore will pass HTML 4.01 validation, I might add).

-- Oistein E. Andersen
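The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IE's actual algorithm: the function `expand` and its `unterminated` parameter are hypothetical names, the entity table is the HTML 4.01 set from Python's `html.entities.name2codepoint`, and "latin-1" is approximated as every entity whose code point is at most U+00FF.

```python
import re
from html.entities import name2codepoint

# All HTML 4.01 entities, and the subset whose code point fits in Latin-1
# (an assumed approximation of the set IE accepts without a semicolon).
ALL = {name: chr(cp) for name, cp in name2codepoint.items()}
LATIN1 = {name: ch for name, ch in ALL.items() if ord(ch) <= 0xFF}

REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand(text, unterminated=LATIN1):
    """Always expand known terminated entities; otherwise expand only the
    longest prefix of the name that is found in `unterminated`."""
    def repl(match):
        name, semi = match.groups()
        if semi and name in ALL:
            return ALL[name]
        for end in range(len(name), 0, -1):
            if name[:end] in unterminated:
                return unterminated[name[:end]] + name[end:] + semi
        return match.group(0)  # nothing matched: leave the text untouched
    return REF.sub(repl, text)


print(expand("R&eacutesum&eacute"))            # Latin-1 names expand anywhere
print(expand("less&less"))                     # &le is beyond Latin-1
print(expand("less&less", unterminated=ALL))   # effect of adding &le to the set
```

The third call shows why the semicolon-free set cannot safely grow: once `le` is admitted, text that previously passed through unchanged is reinterpreted.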
Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]
On 27 Jun 2007, at 8:45PM, Křištof Želechovski wrote:
> How does it influence the case fianc&eacutee vs &oeliguvre?

You might want to have a look at http://pl.wikipedia.org/wiki/ISO_8859-1 . Afterwards, consider the following:
1) Latin-1 does not contain all the characters that are required for typesetting of English.
2) It does include characters that are never used in English at all.
3) In IE, the entities that can be used without a terminating semicolon are the ones that can be found in this character set.
How does this make Microsoft Anglocentric?

> The only difference is that the first one is used in English.

They are both used in English, actually (and the spelling with a ligature should not be considered obsolete in words borrowed from French, unlike those of Latin origin).

-- Øistein E. Andersen
Re: [whatwg] Entity parsing
On 26 Jun 2007, at 4:35AM, Ian Hickson wrote:
> The informal research I did when updating the spec suggests that the current state of the spec is what is better.

(It is difficult to say anything sensible without knowing either the nature of the research undertaken or the options under consideration.)

> I don't really know how to do more research -- it's quite hard to programatically tell when an entity should be expanded and when it shouldn't.

True, but this is not completely insurmountable, or rather: useful information can be extracted without necessarily making these decisions explicitly. I do not know what you have done already, but something like the following for each entity &ref; would be useful for the discussion:
- total number of &ref
- number of &ref;
- number of &ref followed by /[a-zA-Z0-9]/
- the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9]+/

Without any real data, arguing, e.g., that conforming HTML 4.01 documents that are currently handled correctly by Firefox and Safari must be handled differently in the future for the sake of backwards compatibility is not really persuasive.

The only argument for following IE that I have been able to find in the archives is the following in a post from Simon Pieters on 14th Aug 2006 in the thread “Parsing Entities”:
> I guess that for compat with IE and the Web[1] we have to treat R&eacutesum&eacute as if it were R&eacute;sum&eacute;. [...]
> [1] http://www.google.com/search?q=R%26eacutesum%C3%A9

The implication seems to be that R&eacutesum&eacute can be found on the Web and therefore should be supported. But Google also tells us something else:
(1) r&eacutesumé: 572
(2) +résumé: 114,000,000
(3) r&eacute;sum&eacute -r&eacute;sum&eacute;s: 16,300
(4) +résumé: 1,000

Actually, (1) does not only cover r&eacutesum&eacute, but also code like r&amp;eacutesumé, so the number of occurrences that can be saved by parser quirks is lower than 572.

As could be expected, (1) is quite rare compared to (2), all the correctly encoded variants. Whether 0.0005% should be regarded as significant (supposing that résumé is representative) may be a contentious issue, but it is interesting to note that other errors (unwanted conversion of & to &amp; in (3) and a typical encoding problem in (4)) are actually significantly more common, and these cannot be corrected at all.

-- Øistein E. Andersen
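The four counts proposed in this message could be gathered with a short script along the following lines. It is a minimal sketch, not something used in the thread: the function name `survey`, its parameters and the sample string are all hypothetical, and a real survey would run over a crawl of raw page source rather than one string.

```python
import re
from collections import Counter


def survey(text, ref, top_n=5):
    """Collect the proposed statistics for one entity name `ref`
    (given without the leading ampersand) in raw markup `text`."""
    amp_ref = "&" + re.escape(ref)
    return {
        # every occurrence of &ref, terminated or not
        "total": len(re.findall(amp_ref, text)),
        # occurrences terminated by a semicolon
        "terminated": len(re.findall(amp_ref + ";", text)),
        # occurrences immediately followed by a letter or digit
        "followed_by_alnum": len(re.findall(amp_ref + r"(?=[A-Za-z0-9])", text)),
        # most frequent words in which the unterminated reference is embedded
        "top_words": Counter(
            re.findall(r"[A-Za-z0-9]*" + amp_ref + r"[A-Za-z0-9]+", text)
        ).most_common(top_n),
    }


# Made-up sample input, for illustration only.
sample = "R&eacutesum&eacute; caf&eacute;s fianc&eacutee R&eacutesum&eacute;"
print(survey(sample, "eacute"))
```

As the message notes, these numbers do not reveal what an author intended; they only show how often each pattern occurs, which is enough to tell whether a construct is too rare to matter.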
Re: [whatwg] Entity parsing
On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
>> I don't really know how to do more research -- it's quite hard to programatically tell when an entity should be expanded and when it shouldn't.
> True, but this is not completely insurmountable, or rather: useful information can be extracted without necessarily making these decisions explicitly. I do not know what you have done already, but something like the following for each entity &ref; would be useful for the discussion:
> - total number of &ref
> - number of &ref;
> - number of &ref followed by /[a-zA-Z0-9]/
> - the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9]+/
> Without any real data, arguing, e.g., that conforming HTML 4.01 documents that are currently handled correctly by Firefox and Safari must be handled differently in the future for the sake of backwards compatibility is not really persuasive.

Sadly none of the arguments in any direction right now are particularly persuasive. I'm not really convinced that the data that the above proposed survey might collect would actually help, since it doesn't tell us what was intended by the author. You'd be surprised at how often people use ampersands in text in ways that have nothing to do with entities but in ways which could get interpreted as entities.

> The implication seems to be that R&eacutesum&eacute can be found on the Web and therefore should be supported. But Google also tells us something else:
> (1) r&eacutesumé: 572
> (2) +résumé: 114,000,000
> (3) r&eacute;sum&eacute -r&eacute;sum&eacute;s: 16,300
> (4) +résumé: 1,000
> Actually, (1) does not only cover r&eacutesum&eacute, but also code like r&amp;eacutesumé, so the number of occurrences that can be saved by parser quirks is lower than 572.

The number of occurrences of r&eacutesumé is at least two (the two hits I looked at both worked in IE and did not in Firefox).

Am I correct in assuming that you would like the spec changed? What would you like the spec changed to, exactly?

-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Entity parsing
On 28 Jun 2007, at 12:43AM, Ian Hickson wrote:
> Sadly none of the arguments in any direction right now are particularly persuasive.

Indeed.

> I'm not really convinced that the data that the above proposed survey might collect would actually help, since it doesn't tell us the what was intended by the author.

To a certain extent, this depends on the results. Some conclusions can be drawn without actually knowing the author's intent at all: if, for instance, &foo[^;] is exceedingly rare, then what the author meant does not really matter, since the construct does not need to be supported anyway. I also tend to think that entities that are part of existing words are highly likely to be supposed to be expanded. Of course, 100% accuracy cannot be achieved, but this is not really needed for the results to be useful.

> Am I correct in assuming that you would like the spec changed? What would you like the spec changed to, exactly?

I would really like an informed decision, and I currently get the impression that rules are changed to follow IE by default rather than to handle existing content, which may lead to unnecessarily complicated rules that do not actually handle existing documents optimally.

More specifically, some of the points that probably should be addressed are the following:
1) Is it useful to handle unterminated entities followed by an alphanumerical character like IE does? The number of documents for which this actually helps might be small compared to the number of documents that contain other, incorrigible errors. The process also introduces errors, albeit not in conforming documents. Is the gain worth the added complexity? If so, then should this apply to all entities? (Probably not.) Would it be useful to add to/remove from the set supported by IE7? (This may seem insane, but we should try to avoid premature decisions.)
2) HTML 4.01 allows the semicolon to be omitted in certain cases. Does this cause problems? Firefox and Safari both support this, and it would seem meaningless to change the way conforming documents are parsed unless it can be shown that, e.g., &ndash actually is supposed to mean &amp;ndash more often than &ndash; . (Conformance is a separate issue.)
3) Will new entities ever be needed? If yes, can new entities adopt existing conformance criteria and parsing rules?
4) Similar considerations for entities in attribute values.

-- Øistein E. Andersen
Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]
Of course you are right; I was thinking of the tréma when I wrote that and I changed it to a dieresis afterwards to make it more English (to get rid of the red underlines). A general qui pro quo followed. Slovak ä is an original invention; the tréma palatalizes the preceding consonant. I did not consider capharnaüm invalid but irrelevant: it is a Hebrew (or Aramaic?) proper name and can be regarded as a transcription. Thanks Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen Sent: Monday, June 25, 2007 3:46 PM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut] On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote: To make it explicit and plain: the dieresis is a diacritical mark that has no intrinsic phonetic connotation, although it is used mostly for separating vowels; As you may know, diaresis derives from the Greek verb (diairein), which means to divide, and it does indeed have an intrinsic meaning. According to the OED, a diaresis is [t]he sign (¨) marking [a phonological diaresis], or, more usually, placed over the second of two vowels which otherwise make a diphthong or single sound, to indicate that they are to be pronounced separately. Similarly, umlaut is defined as [t]he diacritical sign (¨) placed over a vowel to indicate that [umlaut] has taken place. Hence, the use of either term when the double-dot diacritic is performing another linguistic function is equally abusive. the phonetic meaning of umlaut is generic and well-defined by its very name and it does not apply to the vowel I. Indeed. German umlaut notation is further restricted, and I am not quite sure if the phonetic phenomenon applies to y either, but this is rather far off topic. I did not intend to make HTML support all possible linguistic intricacies; I only wanted to eliminate the common nonsense of denoting i with iuml; [...] 
I only want the true umlaut to be distinct, not as a code point but as an entity name. [...] It would be up to the author to determine whether uuml; or utrema; is appropriate; both entities should denote the same character. Do you really think it is a good idea to introduce twelve new aliases that do not work in current browsers, do not make the language more expressive and require authors to make meaningless decisions? (Is Slovak ä borrowed from German [it is pronounced a or ?] and therefore auml; or does it have another origin? Should we use atrema; by default? How about Pinyin ü? Swedish words that contain an ö as a result of umlaut vs those that contain it for a different reason?) Trema or diaresis might have been a better choice than umlaut as a generic name, since umlaut does not apply to all Latin vowels, but it is really too late to fix this now. On 25 Jun 2007, at 11:51AM, Křištof Želechovski wrote: Could I have an example of otrema; please? The canonical example in Dutch seems to be coördinatie, see http://nl.wikipedia.org/wiki/Trema_in_de_Nederlandse_spelling . Something along the lines of zoölogy, but actually required? Well, such spellings are actually required in some varieties of English. The New Yorker mandates that authors must coöperate to reëducate our readership. - allegedly from the magazine's style manual. On 25 Jun 2007, at 11:16AM, Křištof Želechovski wrote: there is no language that could make use of this distinction by having both uuml; and utrema;. There are languages that use uuml; and theoretically there could be ones that use utrema;, although I do not know of any valid case (I consider the French case invalid). I have no idea why you consider capharnaüm to be invalid (if this is what you imply), but perhaps Spanish pingüino and Dutch reünie will be more convincing examples. -- Oistein E. Andersen
Re: [whatwg] Entity parsing
The difference between I.2 and I.3 is that I.2 is in English and I.3 is in French. Internet Explorer apparently chose to support English natively while SGML preferred remaining language-agnostic.
Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Tuesday, June 26, 2007 2:51 AM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing

On 25 Jun 2007, at 8:28AM, Ian Hickson wrote:
> 2) only IE expands fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
>    IE (correct): fiancée, cafés, naïve
>    SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve
> 3) neither expands &oeliguvre (719), c&oeligur (3,720)
>    both (incorrect): &oeliguvre, c&oeligur
>    intended: œuvre, cœur

It is also interesting to notice that reasonably common words belonging to class I.2), which are handled by IE, are apparently no more frequent than words from I.3), which no (popular) current browser handles correctly. I am looking forward to seeing more extensive research on this.

-- Oistein E. Andersen
Re: [whatwg] Entity parsing
On Sat, 23 Jun 2007, Allan Sandfeld Jensen wrote:
> What about the Gecko entity parsing extension?
> - IE consitently parses unterminated entities from latin-1
> - Gecko parses all unterminated entities, even those beyond latin-1, but only in text-content, not in attributes.
> (seems my recent firefox also supports the IE parsing in attributes now.)

Well we can't support two at once... There seems to be more of a case for having the spec support the IE model.

-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Entity parsing
On Sat, 23 Jun 2007, Sam Ruby wrote:
> With the latest changes to html5lib, we get a failure on a test named test_title_body_named_charref. Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B. Is that what we really want? Testing with Firefox, the old behavior is preferable.

What does IE do?

-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Entity parsing
On Sun, 24 Jun 2007, Øistein E. Andersen wrote:
> Personally, I would prefer something along these lines:
> I. All entities are created equal (the burden of carrying a semicolon shall be equally distributed amongst all).

For authors, this is now the case. For implementations, we are pretty much constrained by what IE does.

> II. Abuse of the semicolon shall not be legally enforced (its omission shall be conforming unless it separates the entity from a following [ASCII] letter or digit).

Well, I had that allowed before, but people complained. :-) For some of the entities, though, we have to have a semicolon, for compatibility. So if you want consistency, it has to be required everywhere.

> III. Entities living in attribute values are to be treated as first-class citizens (the same rules shall apply to them).

Again, for authors this is done, but for compatibility reasons we're constrained on what we can say for implementations.

> We clearly should, to the extent possible, try to avoid bizarre quirks, and the current rules for entity parsing are not exactly straightforward or intuitive. HTML5 currently follows IE7 much more closely than Safari, Firefox and Opera do, which seems to suggest that some of the quirks could be dispensed with.

It's possible, though people kept pointing out problems, which is how we ended up where we are now.

> At any rate, web pages containing & + entity name followed by [^A-Za-z0-9] are probably more likely not to have been authored for IE and therefore relying on standard SGML behaviour, so it would probably be more backwards-compatible to treat such occurrences as & + entity name + ; (i.e., expand the entity).

Well, we'd have to prove this somehow with real research.

> Of course, conformance checkers would be more than welcome to signal that a certain current browser is unable to handle A &mdash B as expected, but this need not mean that all future browsers should be required not to handle it properly (as per arguably [in the original sense] more sensible SGML rules).

Calling SGML sensible is a slippery slope! :-)

-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Entity parsing
On Monday 25 June 2007 09:19, Ian Hickson wrote: On Sat, 23 Jun 2007, Allan Sandfeld Jensen wrote: What about the Gecko entity parsing extension? - IE consitently parses unterminated entities from latin-1 - Gecko parses all unterminated entities, even those beyond latin-1, but only in text-content, not in attributes. (seems my recent firefox also supports the IE parsing in attributes now.) Well we can't support two at once... There seems to be more of a case for having the spec support the IE model. They are not incompatible. In Konqueror, we support both, and it appears by my little test that Firefox 2 does the same now. - In attributes all unclosed latin-1 tags are accepted. - In text-content ALL unclosed tags are accepted. A little inconsistent, but I believe there was a few websites, and a chat application that made me implement the Gecko quirk. Anyway I don't mind restricting it to latin-1, I just wanted to make sure it had been considered. `Allan
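The context-dependent behaviour described in this message (latin-1 only in attributes, everything in text content) can be sketched as below. This is an illustration under loud assumptions, not Konqueror's or Gecko's code: `expand` and its `in_attribute` flag are hypothetical names, the entity table is Python's HTML 4.01 list, "latin-1" means code points up to U+00FF, and the longest-prefix matching used for both contexts is a simplification that the message does not confirm.

```python
import re
from html.entities import name2codepoint

ALL = {name: chr(cp) for name, cp in name2codepoint.items()}
LATIN1 = {name: ch for name, ch in ALL.items() if ord(ch) <= 0xFF}
REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand(text, in_attribute):
    """Expand entity references, choosing the set of names that may omit
    the semicolon according to where the text appears."""
    loose = LATIN1 if in_attribute else ALL

    def repl(match):
        name, semi = match.groups()
        if semi and name in ALL:
            return ALL[name]          # properly terminated: always expand
        for end in range(len(name), 0, -1):
            if name[:end] in loose:   # unterminated: context decides
                return loose[name[:end]] + name[end:] + semi
        return match.group(0)

    return REF.sub(repl, text)


print(expand("A &mdash B", in_attribute=False))  # expanded in text content
print(expand("A &mdash B", in_attribute=True))   # left alone in an attribute
```

Restricting both contexts to latin-1, as offered at the end of the message, would amount to using `LATIN1` regardless of the flag.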
Re: [whatwg] Entity parsing
On Mon, 25 Jun 2007, Allan Sandfeld Jensen wrote:
> In Konqueror, we support both, and it appears by my little test that Firefox 2 does the same now.
> - In attributes all unclosed latin-1 tags are accepted.
> - In text-content ALL unclosed tags are accepted.
> A little inconsistent, but I believe there was a few websites, and a chat application that made me implement the Gecko quirk.

Interesting. It was specifically because of sites breaking if we didn't do the IE-like behaviour for attribute entity parsing that the spec is as it is now. :-)

> Anyway I don't mind restricting it to latin-1, I just wanted to make sure it had been considered.

Yup.

-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Entity parsing
If there is a character set that sports both, it must be used to put down some human language. My point there is no language that could make use of this distinction by having both uuml; and utrema;. There are languages that use uuml; and theoretically there could be ones that use utrema;, although I do not know of any valid case (I consider the French case invalid). Chëërs Chrïs _ From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sander Sent: Saturday, June 23, 2007 2:59 PM To: Kristof Zelechovski; [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing I hadn't thought of that one ;-) (in Dutch there are no native words with umlauts, only some of German or Scandinavian descent). My question was about char-sets that contain both a trema version and a (seperate) umlaut version of the same character. Are there any? cheers, Sander Kristof Zelechovski schreef: Only the vowel U can have either but I have not seen a valid example of utrema;. The orthography ambigüe has recently been changed to ambiguë for consistency. Polish nauka (science) and German beurteilen would make good candidates but the national rules of orthography do not allow this distinction because Slavic languages do not have diphthongs except in borrowed words and it would cause ambiguity in German (cf. geübt). (Incidentally, this leads to bad pronunciation often encountered even in Polish media.) Cheers Chris -Original Message- From: Sander [mailto:[EMAIL PROTECTED] Sent: Friday, June 22, 2007 9:26 PM To: Kristof Zelechovski Subject: Re: [whatwg] Entity parsing Kristof Zelechovski schreef: A dieresis is not an umlaut so I have to bite my tongue each time I write or read nonsense like iuml;. It feels like lying. Umlaut means mixed, a dieresis means standalone. Those are very different things, and I can never gets mixed so there is no ambiguïty. Since umlaut is borrowed from German, I can see no problem in borrowing tréma from French. 
I personally prefer itrema; to idier; because of readability, but I would not insist on that. In professional typography, umlaut dots are usually a bit closer to the letter's body than the dots of the trema. In handwriting, however, no distinction is visible between the two. This is also true for most computer fonts and encodings. [http://en.wikipedia.org/wiki/Umlaut_(diacritic)] Are there any char-sets that have both umlaut and trema variations of characters? If so, both entities could exist. cheers, Sander PS: I'd go for itrema; instead of idier; as well as the term trema is also the one that's used in Dutch.
Re: [whatwg] Entity parsing
On Mon, 18 Jun 2007 12:47:57 +0200, Simon Pieters [EMAIL PROTECTED] wrote: http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/ [...] I might create proper test cases on this later when this is specced. Done: http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/real/ -- Simon Pieters
Re: [whatwg] Entity parsing
Inconsistently, as of IE7: I got &ge verbatim from your test.
Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Allan Sandfeld Jensen
Sent: Saturday, June 23, 2007 2:55 PM
To: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Entity parsing

What about the Gecko entity parsing extension?
- IE consitently parses unterminated entities from latin-1
- Gecko parses all unterminated entities, even those beyond latin-1, but only in text-content, not in attributes.
(seems my recent firefox also supports the IE parsing in attributes now.)
See the attached test-case.
`Allan
Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]
A stressed schwa is present in Polish maritime dialect as well (Kaszëbszczi) and Slovaks write mäso for miaso (meat), but that is not the point. All such uses can be covered under the hood of the dieresis; I only want the true umlaut to be distinct, not as a code point but as an entity name.

BTW, to clear another misconception: the dieresis is not a double accent (it may be more verbosely described as double dot above) because unqualified accent means acute accent by default; the Adobe registry name for the double accent is Hungarian umlaut because it is used in Hungarian orthography only.

To make it explicit and plain: the dieresis is a diacritical mark that has no intrinsic phonetic connotation, although it is used mostly for separating vowels; the phonetic meaning of umlaut is generic and well-defined by its very name and it does not apply to the vowel I.

I did not intend to make HTML support all possible linguistic intricacies; I only wanted to eliminate the common nonsense of denoting ï with &iuml;, or at least allow the authors not to use this absurd denotation while still having an entity for that letter. &iuml; should be an alias for &itrema; for backward compatibility, that is the whole story. It would be up to the author to determine whether &uuml; or &utrema; is appropriate; both entities should denote the same character.

Cheers
Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Saturday, June 23, 2007 11:28 PM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

Sander wrote:
> Are there any char-sets that have both umlaut and trema variations of characters?

Unicode does not make the distinction, so this is somewhat unlikely. (Personally, I tend to think that the apparent preference for umlaut dots closer to the letter than trema dots can be linked to extrinsic phenomena like the preference for steep accents in French typography.)

Kristof Zelechovski wrote:
> Only the vowel U can have either

This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the trema/diaresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u can all be umlauted (ä, ö, ü in German). Moreover, the double-dot accent also has other uses (e.g., ä and ë both designate a stressed schwa in Luxembourgeois), so it is probably not advisable to attempt a complete classification in HTML.

-- Oistein E. Andersen

*) possibly only in the word capharnaüm (disregarding the highly unpopular rectifications orthographiques of 1990) and in proper names
**) only in proper names
Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]
On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote: A stressed schwa is present in Polish maritime dialect as well (Kaszëbszczi) and Slovaks write mäso for miaso (meat), but that is not the point. All such uses can be covered under the hood of the dieresis; I really do not understand why these uses of the double-dot diacritic should be considered as instances of the diæresis (see below). the dieresis is not a double accent I never said double accent, but you are right in pointing out that I should have called it a double-dot diacritic rather than a double-dot accent, since -- strictly speaking -- the only accents are acute, grave and circumflex. To make it explicit and plain: the dieresis is a diacritical mark that has no intrinsic phonetic connotation, although it is used mostly for separating vowels; As you may know, diæresis derives from the Greek verb διαιρεῖν (diairein), which means “to divide”, and it does indeed have an intrinsic meaning. According to the OED, a diæresis is “[t]he sign (¨) marking [a phonological diæresis], or, more usually, placed over the second of two vowels which otherwise make a diphthong or single sound, to indicate that they are to be pronounced separately.” Similarly, umlaut is defined as “[t]he diacritical sign (¨) placed over a vowel to indicate that [umlaut] has taken place.” Hence, the use of either term when the double-dot diacritic is performing another linguistic function is equally abusive. the phonetic meaning of umlaut is generic and well-defined by its very name and it does not apply to the vowel I. Indeed. German umlaut notation is further restricted, and I am not quite sure if the phonetic phenomenon applies to y either, but this is rather far off topic. I did not intend to make HTML support all possible linguistic intricacies; I only wanted to eliminate the common nonsense of denoting ï with &iuml; [...] I only want the true umlaut to be distinct, not as a code point but as an entity name. [...] 
It would be up to the author to determine whether &uuml; or &utrema; is appropriate; both entities should denote the same character. Do you really think it is a good idea to introduce twelve new aliases that do not work in current browsers, do not make the language more expressive and require authors to make meaningless decisions? (Is Slovak ä borrowed from German [it is pronounced æ or ɛ] and therefore &auml; or does it have another origin? Should we use &atrema; by default? How about Pinyin ü? Swedish words that contain an ö as a result of umlaut vs those that contain it for a different reason?) Trema or diæresis might have been a better choice than umlaut as a generic name, since umlaut does not apply to all Latin vowels, but it is really too late to fix this now. On 25 Jun 2007, at 11:51AM, Křištof Želechovski wrote: Could I have an example of &otrema; please? The canonical example in Dutch seems to be coördinatie, see http://nl.wikipedia.org/wiki/Trema_in_de_Nederlandse_spelling . Something along the lines of zoölogy, but actually required? Well, such spellings are actually required in some varieties of English. “The New Yorker mandates that authors must coöperate to reëducate our readership.” — allegedly from the magazine’s style manual. On 25 Jun 2007, at 11:16AM, Křištof Želechovski wrote: there is no language that could make use of this distinction by having both &uuml; and &utrema;. There are languages that use &uuml; and theoretically there could be ones that use &utrema;, although I do not know of any valid case (I consider the French case invalid). I have no idea why you consider capharnaüm to be invalid (if this is what you imply), but perhaps Spanish pingüino and Dutch reünie will be more convincing examples. 
French dictionaries require loan-words like angström, führer and länder (plural of land) to be spelt with an umlaut, but these are of course too rare for a differentiation tréma/umlaut to have developed, and I would imagine German imports with umlaut to be only slightly more common in Dutch. It would be interesting to see whether 19th-c. German actually made a distinction between umlaut on a, o, u and diæresis on e, i (e.g., Rhomboïd), but I do not know how consistently the diæresis was used, and words requiring it are typically foreign words that, unlike the rest, will not have been printed in Fraktur... -- Øistein E. Andersen
Re: [whatwg] Entity parsing
On 25 Jun 2007, at 11:57AM, Kristof Zelechovski wrote: Inconsistently, as of IE7: I got &ge verbatim from your test. &ge; is /not/ a Latin-1 entity. -- Øistein E. Andersen
Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]
Křištof Želechovski wrote: Could I have an example of &otrema; please? Something along the lines of zoölogy, but actually required? Not that I doubt your knowledge of Dutch but I would like to have it as a demonstration. Chris coördinaten BTW: neither of the quotes below are mine ;-) -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen Sent: Saturday, June 23, 2007 11:28 PM To: [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut] Sander wrote: Only the vowel U can have either This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the trema/diaresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u can all be umlauted (ä, ö, ü in German).
Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]
Øistein E. Andersen schreef: French dictionaries require loan-words like angström, führer and länder (plural of land) to be spelt with an umlaut, but these are of course too rare for a differentiation tréma/umlaut to have developed, and I would imagine German imports with umlaut to be only slightly more common in Dutch. In Dutch there are words with umlaut from both German and Scandinavian descent. Most of them are substantives (e.g. übermensch, knäckebröd). The only one I can think of right now that is not a substantive is überhaupt.
Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]
On Mon, 25 Jun 2007, Øistein E. Andersen wrote: On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote: A stressed schwa is present in Polish maritime dialect as well (Kaszëbszczi) and Slovaks write mäso for miaso (meat), but that is not the point. All such uses can be covered under the hood of the dieresis; I really do not understand why these uses of the double-dot diacritic should be considered as instances of the diæresis (see below). This really is out of scope of this working group, it's more a Unicode Consortium issue. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Entity parsing
On 25 Jun 2007, at 8:28AM, Ian Hickson wrote: On Sun, 24 Jun 2007, Øistein E. Andersen wrote: HTML5 currently follows IE7 much more closely than Safari, Firefox and Opera do, which seems to suggest that some of the quirks could be dispensed with. It's possible, though people kept pointing out problems, which is how we ended up where we are now. I have probably missed parts of this discussion, but most of the arguments I have seen seem to rely on the assumption that whatever IE does is more compatible with the Web as it is, which is probably a good approximation, but replicating each single detail is not necessarily the best thing to do. Calling SGML sensible is a slippery slope! :-) Sure, I did not mean to imply that all aspects of SGML are sensible :-) (Bad connotations aside, SGML’s rules for optional semicolons happen to be less contrived than IE’s.) [It might be a good idea to accept a missing semicolon at the end of words.] Well, we'd have to prove this somehow with real research. Yes, research is really missing here. Whatever we do, some pages will break, and it is not a priori impossible that a compromise of IE and SGML rules may be less quirky and more compatible with existing content at the same time. I am unable to do a proper corpus study on this, but the following examples suggest that following IE blindly may not be optimal. All markup is extracted from real Web pages, and the author’s intent was quite obvious from the context. The numbers in parentheses indicate the number of pages found using Google. 
I] Should be expanded
1) only SGML expands &mdash
   IE (incorrect): &mdash
   SGML (correct): —
2) only IE expands fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
   IE (correct): fiancée, cafés, naïve
   SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve
3) neither expands &oeliguvre (719), c&oeligur (3,720)
   both (incorrect): &oeliguvre, c&oeligur
   intended: œuvre, cœur
II] Should not be expanded
1) IE expands moral&ethics, roses&thorns
   IE (incorrect): moralðics, rosesþs
   SGML (correct): moral&ethics, roses&thorns
2) SGML expands Alpha&Omega, once&forall
   IE (correct): Alpha&Omega, once&forall
   SGML (incorrect): AlphaΩ, once∀
3) both expand rose&thorn
   both (incorrect): roseþ
   intended: rose&thorn
The examples I have found in category II] are all quite rare, but it is not unlikely that more common ones exist. Opera and Google both seem to err on the side of caution by only expanding entities when both IE and SGML do, i.e., in case II.3) above. It is also interesting to notice that reasonably common words belonging to class I.2), which are handled by IE, are apparently no more frequent than words from I.3), which no (popular) current browser handles correctly. I am looking forward to seeing more extensive research on this. -- Øistein E. Andersen
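[Editorial sketch] The two rule sets compared above can be approximated in a few lines of Python. This is only a rough model, not either parser's real implementation: ENTITIES is a tiny hand-picked subset of the entity table, LATIN1 stands in for the names IE accepts without a semicolon, and the function names expand_sgml and expand_ie are made up for this illustration.

```python
import re

# Tiny illustrative subset of the entity table (assumption: not the full list).
ENTITIES = {
    "eacute": "\u00e9", "iuml": "\u00ef", "eth": "\u00f0", "thorn": "\u00fe",
    "mdash": "\u2014", "oelig": "\u0153", "Omega": "\u03a9", "forall": "\u2200",
}
# Names assumed to work without a semicolon under the IE-like rule (Latin-1).
LATIN1 = {"eacute", "iuml", "eth", "thorn"}

# '&', then a run of name characters, then an optional semicolon.
NAME = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand_sgml(text):
    """SGML-like rule: the name runs to the first non-name character."""
    def repl(m):
        name = m.group(1)
        # Unknown names are left untouched, semicolon included.
        return ENTITIES.get(name, m.group(0))
    return NAME.sub(repl, text)


def expand_ie(text):
    """IE-like rule: with ';' any known name, otherwise longest Latin-1 prefix."""
    def repl(m):
        name, semi = m.groups()
        if semi and name in ENTITIES:
            return ENTITIES[name]
        for end in range(len(name), 0, -1):
            if name[:end] in LATIN1:
                return ENTITIES[name[:end]] + name[end:] + semi
        return m.group(0)
    return NAME.sub(repl, text)


for sample in ["A &mdash B", "fianc&eacutee", "c&oeligur",
               "moral&ethics", "Alpha&Omega", "rose&thorn"]:
    print(f"{sample!r}: IE={expand_ie(sample)!r} SGML={expand_sgml(sample)!r}")
```

Running the loop reproduces the six categories listed above for these samples; a browser that expands only when both functions agree would correspond to the cautious behaviour attributed to Opera and Google.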
Re: [whatwg] Entity parsing
On Tue, 26 Jun 2007, Øistein E. Andersen wrote: I am looking forward to seeing more extensive research on this. The informal research I did when updating the spec suggests that the current state of the spec is what is better. I don't really know how to do more research -- it's quite hard to programmatically tell when an entity should be expanded and when it shouldn't. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Entity parsing
On Sat, 23 Jun 2007 20:12:45 +0200, Sam Ruby [EMAIL PROTECTED] wrote: Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B. Is that what we really want? Testing with Firefox, the old behavior is preferable. Yeah, it makes sense to follow Internet Explorer 7 for this. -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Entity parsing
I hadn't thought of that one ;-) (in Dutch there are no native words with umlauts, only some of German or Scandinavian descent). My question was about char-sets that contain both a trema version and a (separate) umlaut version of the same character. Are there any? cheers, Sander Kristof Zelechovski wrote: Only the vowel U can have either but I have not seen a valid example of &utrema;. The orthography ambigüe has recently been changed to ambiguë for consistency. Polish nauka (science) and German beurteilen would make good candidates but the national rules of orthography do not allow this distinction because Slavic languages do not have diphthongs except in borrowed words and it would cause ambiguity in German (cf. geübt). (Incidentally, this leads to bad pronunciation often encountered even in Polish media.) Cheers Chris -Original Message- From: Sander [mailto:[EMAIL PROTECTED] Sent: Friday, June 22, 2007 9:26 PM To: Kristof Zelechovski Subject: Re: [whatwg] Entity parsing Kristof Zelechovski wrote: A dieresis is not an umlaut so I have to bite my tongue each time I write or read nonsense like &iuml;. It feels like lying. Umlaut means mixed, a dieresis means standalone. Those are very different things, and I can never gets mixed so there is no ambiguïty. Since umlaut is borrowed from German, I can see no problem in borrowing tréma from French. I personally prefer &itrema; to &idier; because of readability, but I would not insist on that. In professional typography, umlaut dots are usually a bit closer to the letter's body than the dots of the trema. In handwriting, however, no distinction is visible between the two. This is also true for most computer fonts and encodings. [http://en.wikipedia.org/wiki/Umlaut_(diacritic)] Are there any char-sets that have both umlaut and trema variations of characters? If so, both entities could exist. cheers, Sander PS: I'd go for &itrema; instead of &idier; as well as the term trema is also the one that's used in Dutch.
Re: [whatwg] Entity parsing
On Friday 15 June 2007 03:05, Ian Hickson wrote: On Sun, 5 Nov 2006, Øistein E. Andersen wrote: From section 9.2.3.1. Tokenising entities: For some entities, UAs require a semicolon, for others they don't. This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...] I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either. What about the Gecko entity parsing extension? - IE consistently parses unterminated entities from Latin-1 - Gecko parses all unterminated entities, even those beyond Latin-1, but only in text content, not in attributes. (It seems my recent Firefox also supports the IE parsing in attributes now.) See the attached test-case. `Allan Test of HTML entities in quirky mode: &amp; &amp &ample &not; &not &notat &notin; &notin &notina &ge; &ge &gel Test of entities in attributes:
Re: [whatwg] Entity parsing
On 6/14/07, Ian Hickson [EMAIL PROTECTED] wrote: On Sun, 5 Nov 2006, Øistein E. Andersen wrote: From section 9.2.3.1. Tokenising entities: For some entities, UAs require a semicolon, for others they don't. This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...] I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either. With the latest changes to html5lib, we get a failure on a test named test_title_body_named_charref. Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B. Is that what we really want? Testing with Firefox, the old behavior is preferable. - Sam Ruby
Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]
Sander wrote: Are there any char-sets that have both umlaut and trema variations of characters? Unicode does not make the distinction, so this is somewhat unlikely. (Personally, I tend to think that the apparent preference for umlaut dots closer to the letter than trema dots can be linked to extrinsic phenomena like the preference for steep accents in French typography.) Kristof Zelechovski wrote: Only the vowel U can have either This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u can all be umlauted (ä, ö, ü in German). Moreover, the double-dot accent also has other uses (e.g., ä and ë both designate a stressed schwa in Luxembourgeois), so it is probably not advisable to attempt a complete classification in HTML. -- Øistein E. Andersen *) possibly only in the word capharnaüm (disregarding the highly unpopular rectifications orthographiques of 1990) and in proper names **) only in proper names
Re: [whatwg] Entity parsing
On Fri, 22 Jun 2007, Kristof Zelechovski wrote: A dieresis is not an umlaut so I have to bite my tongue each time I write or read nonsense like &iuml;. It feels like lying. Umlaut means mixed, a dieresis means standalone. Those are very different things, and I can never gets mixed so there is no ambiguïty. Since umlaut is borrowed from German, I can see no problem in borrowing tréma from French. I personally prefer &itrema; to &idier; because of readability, but I would not insist on that. There are plenty of entity names that are suboptimal. I wouldn't lose too much sleep over it. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Entity parsing
A dieresis is not an umlaut so I have to bite my tongue each time I write or read nonsense like &iuml;. It feels like lying. Umlaut means mixed, a dieresis means standalone. Those are very different things, and I can never gets mixed so there is no ambiguïty. Since umlaut is borrowed from German, I can see no problem in borrowing tréma from French. I personally prefer &itrema; to &idier; because of readability, but I would not insist on that. Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ian Hickson Sent: Friday, June 22, 2007 6:09 AM To: [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing On Fri, 15 Jun 2007, Křištof Želechovski wrote: Aside: I know that it can be changed but iuml is a very unfortunate name for i tréma. How about deprecating iuml in favor of itrema? We're not deprecating anything, and just introducing a new name for i-uml would be a dangerous slippery slope to start down. Anyway, i-umlaut is fine, and easier to spell than i-diaeresis; why would you call itrema? Trema doesn't seem any more common than umlaut...
Re: [whatwg] Entity parsing
On Sat, 16 Jun 2007 15:30:07 +0200, Anne van Kesteren [EMAIL PROTECTED] wrote: No, IE doesn't break them, and that's the point. Section 8.2.3.1. states "This definition is used when parsing entities in text and in attributes." - if I understand this correctly, this makes semicolon optional for entities in both attributes and text and &region in attribute would be interpreted as ®ion. If that's the case, it is not compatible with IE, because it parses entities differently in attributes and text. In attributes semicolon (any non-alphanumeric character actually) is required, but in text it is not. In IE6 <a href="&region">&region</a> is equivalent to <a href="&amp;region">®ion</a> Awesome. Guess we have to reverse engineer that too then... http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/ The tests aren't really digestible in their current state unless you know what they're doing, but well, I'll just say what the results are below. I might create proper test cases on this later when this is specced. Entity parsing works the same in different attributes (tested img alt and a href). Any character that is not in the range [a-zA-Z0-9] ends an entity -- i.e., the following are equivalent: <img alt="&AElig."> <img alt="&AElig;."> ...and the following are equivalent: <img alt="&AElig1"> <img alt="&amp;AElig1"> This means that the semi-colon is not part of the entity name, and we need to revert to the old entity table and instead have a third column that says which entities always require a semi-colon. You consume as many characters as possible that match the entity table, and for the longest match, check if the next character is in the abovementioned range. If yes, emit the consumed characters, otherwise emit the entity, or something along those lines. -- Simon Pieters
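[Editorial sketch] The longest-match rule for attribute values described in the message above could look roughly like this in Python. It is a minimal sketch under assumptions, not spec text or html5lib code: ENTITY_TABLE is a hypothetical four-entry table whose second column plays the role of the proposed "always requires a semicolon" flag, and expand_in_attribute is an invented name.

```python
import string

# Hypothetical table: name -> (character, semicolon always required).
ENTITY_TABLE = {
    "AElig": ("\u00c6", False),
    "amp": ("&", False),
    "reg": ("\u00ae", False),
    "mdash": ("\u2014", True),
}
ALNUM = set(string.ascii_letters + string.digits)


def expand_in_attribute(value):
    """Expand entities in an attribute value using the longest-match rule."""
    out = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            out.append(value[i])
            i += 1
            continue
        # Longest table name that matches the text right after '&'.
        match = max((n for n in ENTITY_TABLE if value.startswith(n, i + 1)),
                    key=len, default=None)
        if match is None:
            out.append("&")
            i += 1
            continue
        char, needs_semicolon = ENTITY_TABLE[match]
        end = i + 1 + len(match)
        nxt = value[end:end + 1]  # empty string at end of input
        if nxt == ";":
            out.append(char)      # terminated reference: always expand
            i = end + 1
        elif needs_semicolon or nxt in ALNUM:
            out.append(value[i:end])  # keep the consumed characters literally
            i = end
        else:
            out.append(char)      # unterminated, followed by a non-alphanumeric
            i = end
    return "".join(out)


print(expand_in_attribute("&AElig."))              # expanded
print(expand_in_attribute("&AElig1"))              # left alone
print(expand_in_attribute("?foo=bar&region=baz"))  # left alone
```

With this rule the unencoded query string survives intact, because the character after the longest match "reg" is a letter, while "&AElig." and "&AElig;." both expand to the same text.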
Re: [whatwg] Entity parsing
On Sat, 16 Jun 2007 00:58:21 +0200, MegaZone [EMAIL PROTECTED] wrote: Personally I prefer quoted attribute values too, but I don't feel that strongly about it. I just know that with the quotes optional someone is going to try to list space separated 'class' names. ;-) For what it's worth, they have _always_ been optional in HTML. And you're right, some people might do that. In fact, it was done wrong so often for <meta http-equiv="content-type" content="text/html; charset=utf-8"> that browsers now all support a charset= attribute on meta for indicating the document encoding. -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Entity parsing
On Fri, 15 Jun 2007 21:21:06 +0200, Kornel Lesinski [EMAIL PROTECTED] wrote: On Fri, 15 Jun 2007 19:37:46 +0100, Anne van Kesteren [EMAIL PROTECTED] wrote: I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. Rather not. This would break unencoded URLs: ?foo=bar&region=baz → ?foo=bar®ion=baz You mean that Internet Explorer breaks them already? That doesn't make much sense to me. No, IE doesn't break them, and that's the point. Section 8.2.3.1. states "This definition is used when parsing entities in text and in attributes." - if I understand this correctly, this makes semicolon optional for entities in both attributes and text and &region in attribute would be interpreted as ®ion. If that's the case, it is not compatible with IE, because it parses entities differently in attributes and text. In attributes semicolon (any non-alphanumeric character actually) is required, but in text it is not. In IE6 <a href="&region">&region</a> is equivalent to <a href="&amp;region">®ion</a> Awesome. Guess we have to reverse engineer that too then... -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Entity parsing
Once upon a time Anne van Kesteren shaped the electrons to say... For what it's worth, they have _always_ been optional in HTML. And you're right, some people might do that. In fact, it was done wrong so often for I know, it was one of the things that used to annoy me in other author's markup - not so much using them or not in general, but when someone would quote some attributes and not others. Pet peeve. Forcing the parens was something I liked about XHTML - on the other hand forcing lowercase elements took some getting used to, since I had been in the 'all caps' school since I first played with HTML in 1991. Win some, lose some. :-) <meta http-equiv="content-type" content="text/html; charset=utf-8"> that browsers now all support a charset= attribute on meta for indicating the document encoding. This is a bit cleaner, since the name=value structure is still intact. I see people doing things like: <a class=main title>text</a> When they mean: <a class="main title">text</a> And not: <a class="main" title="">text</a> Quotes are really only optional on single-value attributes, or it creates a parsing nightmare, trying to read the author's mind. -MZ -- megazone-at-megazone.org http://www.MegaZone.org/ Gweep, Geek, Human, me. http://www.TiVoLovers.com/ http://www.Eyrie-Productions.com/ -- Hail Eris A little nonsense now and then, is relished by the wisest men 508-852-2171
Re: [whatwg] Entity parsing
On Fri, 15 Jun 2007 03:05:05 +0200, Ian Hickson [EMAIL PROTECTED] wrote: On Sun, 5 Nov 2006, Øistein E. Andersen wrote: From section 9.2.3.1. Tokenising entities: For some entities, UAs require a semicolon, for others they don't. This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...] I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. Firefox, Opera and Safari treat na&iumlve as equivalent to na&amp;iumlve. So for compat with them, the semicolon should be made required. -- Simon Pieters
Re: [whatwg] Entity parsing
Aside: I know that it can be changed but iuml is a very unfortunate name for i tréma. How about deprecating iuml in favor of itrema? Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Simon Pieters Sent: Friday, June 15, 2007 8:49 AM To: Ian Hickson; Oistein E. Andersen Cc: [EMAIL PROTECTED] Subject: Re: [whatwg] Entity parsing On Fri, 15 Jun 2007 03:05:05 +0200, Ian Hickson [EMAIL PROTECTED] wrote: On Sun, 5 Nov 2006, Øistein E. Andersen wrote: From section 9.2.3.1. Tokenising entities: For some entities, UAs require a semicolon, for others they don't. This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...] I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. Firefox, Opera and Safari treat na&iumlve as equivalent to na&amp;iumlve. So for compat with them, the semicolon should be made required. -- Simon Pieters
Re: [whatwg] Entity parsing
On Fri, 15 Jun 2007 02:05:05 +0100, Ian Hickson [EMAIL PROTECTED] wrote: I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. Rather not. This would break unencoded URLs: ?foo=bar&region=baz → ?foo=bar®ion=baz -- regards, Kornel Lesiński
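[Editorial sketch] The breakage described above is easy to reproduce with Python's standard-library html.unescape, which is documented as following the HTML5 rules for character references. Note the assumption: the function has no notion of attribute context, so it models the "semicolon optional everywhere" reading that the message objects to, not what any particular browser does with an href.

```python
import html

# html.unescape applies text-content rules to whatever string it is given,
# so the legacy name "reg" is matched as a prefix of "region".
url = "?foo=bar&region=baz"
print(html.unescape(url))

# A legacy Latin-1 name is also expanded in the middle of a word...
print(html.unescape("na&iumlve"))

# ...while a non-legacy name without a semicolon is left untouched.
print(html.unescape("A &mdash B"))
```

The first call turns the query string into "?foo=bar" followed by the registered sign and "ion=baz", which is exactly the corrupted URL shown in the message.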
Re: [whatwg] Entity parsing
On Fri, 15 Jun 2007 20:32:45 +0200, Kornel Lesinski [EMAIL PROTECTED] wrote: On Fri, 15 Jun 2007 02:05:05 +0100, Ian Hickson [EMAIL PROTECTED] wrote: I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. Rather not. This would break unencoded URLs: ?foo=bar&region=baz → ?foo=bar®ion=baz You mean that Internet Explorer breaks them already? That doesn't make much sense to me. -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Entity parsing
Once upon a time Ian Hickson shaped the electrons to say... I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either. I think the semicolon is important for readability and clarity - where does the entity reference end? There is potential confusion with similarly named entities: &not; &notin; &or; &ordf; &ordm; &pi; &piv; &sigma; &sigmaf; &sub; &sube; &sup; &sup1; &sup2; &sup3; &supe; &theta; &thetasym; The semicolon eliminates confusion. Personally I prefer quoted attribute values too, but I don't feel that strongly about it. I just know that with the quotes optional someone is going to try to list space separated 'class' names. ;-) -MZ -- megazone-at-megazone.org http://www.MegaZone.org/ Gweep, Geek, Human, me. http://www.TiVoLovers.com/ http://www.Eyrie-Productions.com/ -- Hail Eris A little nonsense now and then, is relished by the wisest men 508-852-2171
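[Editorial sketch] The list of confusable names above can be derived mechanically. The sketch below uses html.entities.name2codepoint from the Python standard library (the HTML 4 entity set) and reports every name that is a proper prefix of another; prefix_collisions is a helper name invented for this illustration.

```python
from html.entities import name2codepoint


def prefix_collisions(names):
    """Map each entity name to the longer names that start with it."""
    names = sorted(names)
    collisions = {}
    for short in names:
        longer = [n for n in names if n != short and n.startswith(short)]
        if longer:
            collisions[short] = longer
    return collisions


collisions = prefix_collisions(name2codepoint)
for short, longer in collisions.items():
    print(f"&{short} is a prefix of: " + ", ".join("&" + n for n in longer))
```

Every pair printed is a case where a parser that accepts a missing semicolon has to choose between the shorter and the longer name, which is the ambiguity the message is pointing at.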
Re: [whatwg] Entity parsing
On Sun, 5 Nov 2006, Øistein E. Andersen wrote: From section 9.2.3.1. Tokenising entities: For some entities, UAs require a semicolon, for others they don't. This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...] I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Entity parsing
On 2007-06-14 at 21:05, Ian Hickson wrote: I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. I'd make it non-conforming for the sake of readability. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either. I'm perfectly fine with quotes being optional; I think unquoted attribute values are generally as easy to read as their quoted counterparts, if not sometimes easier since you don't have the noise of the quotes. On the other hand, it took me about a minute to figure out the word in your example -- na&iumlve -- simply because I couldn't find where to put the delimitation between the end of the entity name and the last few characters in the word. In other words, is this the entity iu, ium, iuml, iumlv or iumlve? Without a list of entities at hand, it takes a lot of guesswork to find the length it consumes and the name of the entity. And not everyone can remember all those entity names. Michel Fortin [EMAIL PROTECTED] http://www.michelf.com/