Re: [whatwg] Entity parsing

2008-05-22 Thread Ian Hickson

On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
 
 1) Is it useful to handle unterminated entities followed by an 
 alphanumerical character like IE does? The number of documents for which 
 this actually helps might be small compared to the number of documents 
 that contain other, incorrigible errors. The process also introduces 
 errors, albeit not in conforming documents. Is the gain worth the added 
 complexity?
 
 If so, then should this apply to all entities? (Probably not.) Would it 
 be useful to add to/remove from the set supported by IE7? (This may seem 
 insane, but we should try to avoid premature decisions.)
 
 2) HTML 4.01 allows the semicolon to be omitted in certain cases. Does 
 this cause problems? Firefox and Safari both support this, and it would 
 seem meaningless to change the way conforming documents are parsed 
 unless it can be shown that, e.g., "&ndash " actually is supposed to
 mean "&amp;ndash " more often than "&ndash; ". (Conformance is a
 separate issue.)
 
 3) Will new entities ever be needed? If yes, can new entities adopt 
 existing conformance criteria and parsing rules?
 
 4) Similar considerations for entities in attribute values.

New entities have since been added, and the rules for parsing entities 
(sorry, named character references) have been changed a bit. However, I 
am reluctant to change this from what we have now, since what we have now 
works well. How strongly do you feel about this?

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-28 Thread Anne van Kesteren

On Thu, 28 Jun 2007 04:53:09 +0200, Øistein E. Andersen  
[EMAIL PROTECTED] wrote:
I would really like an informed decision, and I currently get the
impression that rules are changed to follow IE by default rather than to
handle existing content, which may lead to unnecessarily complicated rules
that do not actually handle existing documents optimally.


 1) It was quite easy to implement. Took me about thirty minutes including
updating several tests and adding a few extra tests. (In html5lib, Python.)

 2) You're saying that content breaks in IE?



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/

Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

2007-06-28 Thread Křištof Želechovski

I had a look at the reference page you have directed me to: it actually
states that the ISO-8859-1 character set can be used for English.  Although
my hypothesis that the word œuvre is not English remains valid (see also the
citations in the appendix), I admit that the fact that the ligature œ is not
included in the character set (and, consequently, that the character set
ISO-8859-1 cannot be used for encoding French text, which I find kind of
stunning because of the popularity of the French language) provides a much
simpler explanation to the observable phenomenon.  My fault, I should have
checked that up first.
Best regards
Chris

APPENDIX

Other Wikipedia entries also disagree, e.g.
http://en.wikipedia.org/wiki/%C5%92
Borrowings into English from Latin words featuring œ are often spelled with
the letter e, especially in American English. For example, fœderal became
federal in English, while fœtus became fetus only in American English. Other
œs in English spell out as 2 separate letters oe.
http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
The use of the œ and æ is obsolescent in modern English, and has been used
predominantly in British English. It is usually used to evoke archaism, or
in literal quotations of historic sources.
http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#Simplification_of_ae_.28.C3.A6.29_and_oe_.28.C5.93.29
In English, which has imported words from all three languages, it is now
usual to replace Æ/æ with Ae/ae and Œ/œ with Oe/oe.

Microsoft Word does not accept hors d'œuvre but it has no problem with hors
d'oeuvre.  The American English International keyboard does not provide a
way to type the ligature œ.  The Microsoft Encarta dictionary does not
recognize such a spelling, nor does Reference.com.
The word coeur is not mentioned in any English dictionary I know.

-Original Message-
From: Oistein E. Andersen [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 27, 2007 11:44 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

You might want to have a look at
http://pl.wikipedia.org/wiki/ISO_8859-1 .

Afterwards, consider the following:
1) Latin-1 does not contain all the characters that are required
for typesetting of English.

Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

2007-06-28 Thread Henri Sivonen


On Jun 28, 2007, at 14:51, Křištof Želechovski wrote:


 I admit that the fact that the ligature œ is not included in the
 character set (and, consequently, that the character set ISO-8859-1
 cannot be used for encoding French text, which I find kind of stunning
 because of the popularity of the French language) provides a much
 simpler explanation to the observable phenomenon.


This discussion is not relevant to the WHATWG or HTML5. HTML5 is  
defined in terms of Unicode and Unicode covers both English and  
French (and quite a bit more). Anyone is free to use all that  
expressiveness straight by encoding documents as UTF-8.


Entities or legacy encodings don't add any expressiveness. They just
expand to Unicode. The details of how this is handled are constrained
by legacy, not by political correctness.


P.S. Before anyone slaps me for being politically incorrect or
insensitive, I'd like to point out that my native language uses
characters whose entity names are biased towards German terminology.
But this isn't the slightest technical problem. Let's move on.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

Re: [whatwg] Entity parsing

2007-06-28 Thread Øistein E . Andersen

On 28 Jun 2007, at 9:4AM, Anne van Kesteren wrote:

  1) It was quite easy to implement.

Sorry, I never meant to say that it was difficult to implement,
merely that it is counter-intuitive and probably suboptimal.

  2) You're saying that content breaks in IE?

Surprising as it may sound, such content demonstrably exists,
and available data do not support the presupposition that
doing exactly what IE does is actually the best solution for
handling existing content.

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

2007-06-27 Thread Křištof Želechovski

How does it influence the case fianc&eacutee vs &oeliguvre?  The only
difference is that the first one is used in English.
Chris

-Original Message-
From: Oistein E. Andersen [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 26, 2007 10:55 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

On 26 Jun 2007, at 7:49AM, Křištof Želechovski wrote:

 Internet Explorer apparently chose to support English natively
 while SGML preferred remaining language-agnostic.

To be fair, this is not how things developed.

Microsoft first chose to make the semicolon optional not only
when allowed by SGML rules (notably before whitespace and tags),
but in any position, for all named entities /that existed at the time/,
i.e., latin-1.

Unfortunately, this meant that new entities could not be added without
changing the interpretation of already existing pages (e.g., if a page
contained "less&less", adding the entity &le; to the list would result in
its being interpreted as "less≤ss"), although most of the entities have
names that are rather unlikely to appear by chance, and the ampersand
should be spelt &amp;.

Microsoft did not dare to risk this, so entities beyond latin-1 require
a semicolon in IE, even in cases where it is optional according
to SGML (and therefore will pass HTML 4.01 validation, I might add).
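
The effect described above can be sketched in a few lines of Python. This
is only an illustration, not IE's actual algorithm: the function
expand_unterminated and the two tiny entity tables are hypothetical, and
the longest-prefix matching is an assumption about how a parser that treats
the semicolon as optional in any position might behave.

```python
import re

# Hypothetical, deliberately tiny tables; real browsers ship far larger ones.
LATIN1 = {"eacute": "\u00e9", "iuml": "\u00ef", "amp": "&"}
EXTENDED = dict(LATIN1, le="\u2264")  # the same table after adding "le"


def expand_unterminated(text, table):
    """Expand &name even without ';', using the longest known prefix."""
    def repl(match):
        run = match.group(1)
        for end in range(len(run), 0, -1):  # try the longest prefix first
            name = run[:end]
            if name in table:
                rest = run[end:]
                # Swallow one terminating semicolon if it follows directly.
                if rest.startswith(";"):
                    rest = rest[1:]
                return table[name] + rest
        return match.group(0)  # no known entity: leave the text alone
    return re.sub(r"&([A-Za-z0-9]+;?)", repl, text)


print(expand_unterminated("less&less", LATIN1))
print(expand_unterminated("less&less", EXTENDED))
```

With the smaller table the string "less&less" is left untouched; once "le"
is in the table the same page text comes out as "less≤ss", which is the
kind of reinterpretation of existing pages that the paragraph above
describes.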

-- 
Oistein E. Andersen

Re: [whatwg] Entity parsing [trema/diaeresis vs umlaut]

2007-06-27 Thread Øistein E . Andersen

On 27 Jun 2007, at 8:45PM, Křištof Želechovski wrote:

 How does it influence the case fianc&eacutee vs &oeliguvre?

You might want to have a look at
http://pl.wikipedia.org/wiki/ISO_8859-1 .

Afterwards, consider the following:
1) Latin-1 does not contain all the characters that are required
for typesetting of English.
2) It does include characters that are never used in English at all.
3) In IE, the entities that can be used without a terminating semicolon
are the ones that can be found in this character set.

How does this make Microsoft Anglocentric?

 The only difference is that the first one is used in English.

They are both used in English, actually (and the spelling with
a ligature should not be considered obsolete in words borrowed
from French, unlike those of Latin origin).

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-27 Thread Øistein E . Andersen

On 26 Jun 2007, at 4:35AM, Ian Hickson wrote:

 The informal research I did when updating the spec suggests that the 
 current state of the spec is what is better.

(It is difficult to say anything sensible without knowing either the nature
of the research undertaken or the options under consideration.)

 I don't really know how to do more research
 -- it's quite hard to programmatically tell when an entity
 should be expanded and when it shouldn't.

True, but this is not completely insurmountable — or, rather: useful information
can be extracted without necessarily making these decisions explicitly.

I do not know what you have done already, but something like the following
for each entity &ref; would be useful for the discussion:
- total number of &ref;
- number of &ref;;
- number of &ref followed by /[a-zA-Z0-9]/;
- the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9]+/.
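
The four figures listed above could be gathered with something like the
following Python sketch. It assumes the corpus is available as one string
of raw markup; the function name entity_stats and the result keys are
hypothetical. Note that, as in the proposal itself, a count for a short
name also picks up longer names that start with it.

```python
import re
from collections import Counter


def entity_stats(text, ref, top_n=5):
    """Collect the four proposed figures for one entity name `ref`."""
    name = re.escape(ref)
    # Every occurrence of &ref, terminated or not.
    total = len(re.findall("&" + name, text))
    # Occurrences properly terminated with a semicolon.
    terminated = len(re.findall("&" + name + ";", text))
    # Unterminated occurrences directly followed by a letter or digit.
    before_alnum = len(re.findall("&" + name + "[a-zA-Z0-9]", text))
    # The surrounding "words" in which such unterminated references occur.
    words = Counter(
        re.findall("[a-zA-Z0-9]*&" + name + "[a-zA-Z0-9]+", text)
    )
    return {
        "total": total,
        "terminated": terminated,
        "before_alnum": before_alnum,
        "top_words": words.most_common(top_n),
    }


sample = "caf&eacute;s fianc&eacutee fianc&eacutee R&eacutesum&eacute"
print(entity_stats(sample, "eacute"))
```

The word list does not decide whether an author meant the reference to be
expanded, but it makes it possible to judge that by eye for the most
frequent cases.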

Without any real data, arguing, e.g., that conforming HTML 4.01 documents
that are currently handled correctly by Firefox and Safari must be handled
differently in the future for the sake of backwards compatibility is not
really persuasive.


The only argument for following IE that I have been able to find in the archives
is the following in a post from Simon Pieters on 14th Aug 2006 in the thread
“Parsing Entities”:

 I guess that for compat with IE and the Web[1] we have to treat
 R&eacutesum&eacute as if it were R&eacute;sum&eacute;. [...]
 [1] http://www.google.com/search?q=R%26eacutesum%C3%A9

The implication seems to be that R&eacutesum&eacute can be found on the Web
and therefore should be supported. But Google also tells us something else:

(1) r&eacutesumé: 572
(2) +résumé: 114,000,000
(3) r&eacute;sum&eacute -r&eacute;sum&eacute;s: 16,300
(4) +rÃ©sumÃ©: 1,000

Actually, (1) does not only cover r&eacutesum&eacute, but also code like
r&amp;eacutesumé, so the number of occurrences that can be saved
by parser quirks is lower than 572.

As could be expected, (1) is quite rare compared to (2), all the correctly
encoded variants. Whether 0.0005% should be regarded as significant
(supposing that résumé is representative) may be a contentious issue, but
it is interesting to note that other errors, namely unwanted conversion of
& to &amp; in (3) and a typical encoding problem in (4), are actually
significantly more common, and these cannot be corrected at all.
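
For reference, the percentage and the encoding problem in (4) can be
reproduced with a short Python sketch. The hit counts are simply the ones
quoted above; the variable names are illustrative only.

```python
# Hit counts as quoted in this message (Google, June 2007).
hits_unterminated = 572
hits_correct = 114_000_000

# Share of unterminated spellings relative to correct ones, as a percentage.
share = hits_unterminated / hits_correct * 100
print(f"{share:.4f}%")

# Query (4) is what the UTF-8 bytes of "résumé" look like when they are
# wrongly decoded as ISO-8859-1.
mojibake = "résumé".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

This prints 0.0005% and rÃ©sumÃ©, matching the figure and the search term
used above.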

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-27 Thread Ian Hickson

On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
  
  I don't really know how to do more research -- it's quite hard to 
  programmatically tell when an entity should be expanded and when it
  shouldn't.
 
 True, but this is not completely insurmountable — or, rather: useful 
 information can be extracted without necessarily making these decisions 
 explicitly.
 
 I do not know what you have done already, but something like the following
 for each entity &ref; would be useful for the discussion:
 - total number of &ref;
 - number of &ref;;
 - number of &ref followed by /[a-zA-Z0-9]/;
 - the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9]+/.
 
 Without any real data, arguing, e.g., that conforming HTML 4.01 
 documents that are currently handled correctly by Firefox and Safari 
 must be handled differently in the future for the sake of backwards 
 compatibility is not really persuasive.

Sadly none of the arguments in any direction right now are particularly 
persuasive.

I'm not really convinced that the data that the above proposed survey 
might collect would actually help, since it doesn't tell us what was
intended by the author. You'd be surprised at how often people use 
ampersands in text in ways that have nothing to do with entities but in 
ways which could get interpreted as entities.


 The implication seems to be that R&eacutesum&eacute can be found on the Web
 and therefore should be supported. But Google also tells us something else:
 
 (1) r&eacutesumé: 572
 (2) +résumé: 114,000,000
 (3) r&eacute;sum&eacute -r&eacute;sum&eacute;s: 16,300
 (4) +rÃ©sumÃ©: 1,000
 
 Actually, (1) does not only cover r&eacutesum&eacute, but also code like
 r&amp;eacutesumé, so the number of occurrences that can be saved by
 parser quirks is lower than 572.

The number of occurrences of r&eacutesumé is at least two (the two hits
I looked at both worked in IE and did not in Firefox).


Am I correct in assuming that you would like the spec changed? What would 
you like the spec changed to, exactly?

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-27 Thread Øistein E . Andersen

On 28 Jun 2007, at 12:43AM, Ian Hickson wrote:

 Sadly none of the arguments in any direction right now are particularly 
 persuasive.

Indeed.


 I'm not really convinced that the data that the above proposed survey 
 might collect would actually help, since it doesn't tell us what was
 intended by the author.

To a certain extent, this depends on the results.

Some conclusions can be drawn without actually knowing the author's intent
at all: if, for instance, &foo[^;] is exceedingly rare, then what the author
meant does not really matter, since the construct does not need to be
supported anyway.

I also tend to think that entities that are part of existing words are highly
likely to be supposed to be expanded. Of course, 100% accuracy cannot be
achieved, but this is not really needed for the results to be useful.


 Am I correct in assuming that you would like the spec changed? What would 
 you like the spec changed to, exactly?

I would really like an informed decision, and I currently get the impression
that rules are changed to follow IE by default rather than to handle existing
content, which may lead to unnecessarily complicated rules that do not
actually handle existing documents optimally.

More specifically, some of the points that probably should be
addressed are the following:

1) Is it useful to handle unterminated entities followed by an alphanumerical
character like IE does? The number of documents for which this actually helps
might be small compared to the number of documents that contain other,
incorrigible errors. The process also introduces errors, albeit not in
conforming documents. Is the gain worth the added complexity?

If so, then should this apply to all entities? (Probably not.) Would it be
useful to add to/remove from the set supported by IE7? (This may seem insane,
but we should try to avoid premature decisions.)

2) HTML 4.01 allows the semicolon to be omitted in certain cases. Does this
cause problems? Firefox and Safari both support this, and it would seem
meaningless to change the way conforming documents are parsed unless
it can be shown that, e.g., "&ndash " actually is supposed to mean
"&amp;ndash " more often than "&ndash; ". (Conformance is a separate issue.)

3) Will new entities ever be needed? If yes, can new entities adopt existing
conformance criteria and parsing rules? 

4) Similar considerations for entities in attribute values.

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

2007-06-26 Thread Křištof Želechovski



Of course you are right; I was thinking of the tréma when I wrote that and I
changed it to a dieresis afterwards to make it more English (to get rid of
the red underlines).  A general qui pro quo followed.
Slovak ä is an original invention; the tréma palatalizes the preceding
consonant.  I did not consider capharnaüm invalid but irrelevant: it is a
Hebrew (or Aramaic?) proper name and can be regarded as a transcription.
Thanks
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Monday, June 25, 2007 3:46 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote:

 To make it explicit and plain: the dieresis is a diacritical mark that has
 no intrinsic phonetic connotation, although it is used mostly for
 separating vowels;

As you may know, diæresis derives from the Greek verb διαιρεῖν (diairein),
which means to divide, and it does indeed have an intrinsic meaning.

According to the OED, a diæresis is "[t]he sign (¨) marking [a phonological
diæresis], or, more usually, placed over the second of two vowels which
otherwise make a diphthong or single sound, to indicate that they are to be
pronounced separately."

Similarly, umlaut is defined as "[t]he diacritical sign (¨) placed over a
vowel to indicate that [umlaut] has taken place."

Hence, the use of either term when the double-dot diacritic is performing
another linguistic function is equally abusive.

 the phonetic meaning of umlaut is generic and well-defined by its
 very name and it does not apply to the vowel I.

Indeed. German umlaut notation is further restricted, and I am not quite
sure if the phonetic phenomenon applies to y either, but this is rather far
off topic.

 I did not intend to make HTML support all possible linguistic intricacies;
 I only wanted to eliminate the common nonsense of denoting ï with &iuml;
 [...]
 I only want the true umlaut to be distinct, not as a code point but as an
 entity name.
 [...]
 It would be up to the author to determine whether &uuml; or &utrema;
 is appropriate; both entities should denote the same character.

Do you really think it is a good idea to introduce twelve new aliases
that do not work in current browsers, do not make the language more
expressive and require authors to make meaningless decisions?
(Is Slovak ä borrowed from German [it is pronounced a or ?] and
therefore &auml; or does it have another origin? Should we use
&atrema; by default? How about Pinyin ü? Swedish words that contain
an ö as a result of umlaut vs those that contain it for a different reason?)

Trema or diæresis might have been a better choice than umlaut as a generic
name, since umlaut does not apply to all Latin vowels, but it is really too
late to fix this now.


On 25 Jun 2007, at 11:51AM, Křištof Želechovski wrote:

 Could I have an example of &otrema; please?

The canonical example in Dutch seems to be coördinatie, see
http://nl.wikipedia.org/wiki/Trema_in_de_Nederlandse_spelling .

 Something along the lines of zoölogy, but actually required?

Well, such spellings are actually required in some varieties of English.
The New Yorker mandates that authors must coöperate to reëducate our
readership. - allegedly from the magazine's style manual.


On 25 Jun 2007, at 11:16AM, Křištof Želechovski wrote:

 there is no language that could make use of this distinction by having
 both &uuml; and &utrema;.  There are languages that use &uuml; and
 theoretically there could be ones that use &utrema;, although I do not
 know of any valid case (I consider the French case invalid).

I have no idea why you consider capharnaüm to be invalid (if this is what
you imply), but perhaps Spanish pingüino and Dutch reünie will be more
convincing examples.

-- 
Oistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-26 Thread Křištof Želechovski

The difference between I.2 and I.3 is that I.2 is in English and I.3 is in
French.  Internet Explorer apparently chose to support English natively
while SGML preferred remaining language-agnostic.
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Tuesday, June 26, 2007 2:51 AM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing

On 25 Jun 2007, at 8:28AM, Ian Hickson wrote:

2) only IE expands
fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
IE (correct): fiancée, cafés, naïve
SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve

3) neither expands
&oeliguvre (719), c&oeligur (3,720)
both (incorrect): &oeliguvre, c&oeligur
intended: œuvre, cœur

It is also interesting to notice that reasonably common words belonging to
class I.2), which are handled by IE, are apparently no more frequent than
words from I.3), which no (popular) current browser handles correctly.
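
The two classes above can be tried against a present-day implementation.
The following is a small sketch using Python's standard html.unescape,
which is documented as applying the HTML5 rules for character references;
it says nothing about what any 2007 browser did, and it only covers text
content, not attribute values. The sample strings are the ones listed
above.

```python
import html

# Class I.2 (Latin-1 names without ';') followed by class I.3 (oelig).
samples = ["fianc&eacutee", "caf&eacutes", "na&iumlve", "&oeliguvre", "c&oeligur"]

# Map each raw string to what html.unescape makes of it.
results = {raw: html.unescape(raw) for raw in samples}

for raw, cooked in results.items():
    print(f"{raw!r} -> {cooked!r}")
```

The first three come out as fiancée, cafés and naïve, while the two oelig
strings are returned unchanged, i.e. the split between I.2 and I.3
described in this message.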

I am looking forward to seeing more extensive research on this.

-- 
Oistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-25 Thread Ian Hickson

On Sat, 23 Jun 2007, Allan Sandfeld Jensen wrote:
 
 What about the Gecko entity parsing extension?

 - IE consistently parses unterminated entities from latin-1
 - Gecko parses all unterminated entities, even those beyond latin-1, but only 
 in text-content, not in attributes. (seems my recent firefox also supports 
 the IE parsing in attributes now.)

Well we can't support two at once... There seems to be more of a case for 
having the spec support the IE model.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-25 Thread Ian Hickson

On Sat, 23 Jun 2007, Sam Ruby wrote:
 
 With the latest changes to html5lib, we get a failure on a test named 
 test_title_body_named_charref.
 
 Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B.
 
 Is that what we really want?  Testing with Firefox, the old behavior is 
 preferable.

What does IE do?

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-25 Thread Ian Hickson

On Sun, 24 Jun 2007, Øistein E. Andersen wrote:
 
 Personally, I would prefer something along these lines:
 
 I. All entities are created equal (the burden of carrying a semicolon 
 shall be equally distributed amongst all).

For authors, this is now the case.

For implementations, we are pretty much constrained by what IE does.


 II. Abuse of the semicolon shall not be legally enforced (its omission 
 shall be conforming unless it separates the entity from a following 
 [ASCII] letter or digit).

Well, I had that allowed before, but people complained. :-) For some of 
the entities, though, we have to have a semicolon, for compatibility. So 
if you want consistency, it has to be required everywhere.


 III. Entities living in attribute values are to be treated as 
 first-class citizens (the same rules shall apply to them).

Again, for authors this is done, but for compatibility reasons we're 
constrained on what we can say for implementations.


 We clearly should, to the extent possible, try to avoid bizarre quirks, 
 and the current rules for entity parsing are not exactly straightforward 
 or intuitive. HTML5 currently follows IE7 much more closely than Safari, 
 Firefox and Opera do, which seems to suggest that some of the quirks 
 could be dispensed with.

It's possible, though people kept pointing out problems, which is how we 
ended up where we are now.


 At any rate, web pages containing & + entity name followed by
 [^A-Za-z0-9] are probably more likely not to have been authored for IE
 and therefore relying on standard SGML behaviour, so it would probably
 be more backwards-compatible to treat such occurrences as & + entity
 name + ; (i.e., expand the entity).

Well, we'd have to prove this somehow with real research.


 Of course, conformance checkers would be more than welcome to signal 
 that a certain current browser is unable to handle A &mdash B as
 expected, but this need not mean that all future browsers should be 
 required not to handle it properly (as per arguably [in the original 
 sense] more sensible SGML rules).

Calling SGML sensible is a slippery slope! :-)

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-25 Thread Allan Sandfeld Jensen

On Monday 25 June 2007 09:19, Ian Hickson wrote:
 On Sat, 23 Jun 2007, Allan Sandfeld Jensen wrote:
  What about the Gecko entity parsing extension?
 
  - IE consistently parses unterminated entities from latin-1
  - Gecko parses all unterminated entities, even those beyond latin-1, but
  only in text-content, not in attributes. (seems my recent firefox also
  supports the IE parsing in attributes now.)

 Well we can't support two at once... There seems to be more of a case for
 having the spec support the IE model.

They are not incompatible.

In Konqueror, we support both, and it appears by my little test that Firefox 2 
does the same now.

- In attributes all unclosed latin-1 tags are accepted.
- In text-content ALL unclosed tags are accepted.

A little inconsistent, but I believe there was a few websites, and a chat 
application that made me implement the Gecko quirk.

Anyway I don't mind restricting it to latin-1, I just wanted to make sure it 
had been considered.

`Allan

Re: [whatwg] Entity parsing

2007-06-25 Thread Ian Hickson

On Mon, 25 Jun 2007, Allan Sandfeld Jensen wrote:
 
 In Konqueror, we support both, and it appears by my little test that 
 Firefox 2 does the same now.
 
 - In attributes all unclosed latin-1 tags are accepted.
 - In text-content ALL unclosed tags are accepted.
 
 A little inconsistent, but I believe there was a few websites, and a 
 chat application that made me implement the Gecko quirk.

Interesting. It was specifically because of sites breaking if we didn't do 
the IE-like behaviour for attribute entity parsing that the spec is as it 
is now. :-)


 Anyway I don't mind restricting it to latin-1, I just wanted to make 
 sure it had been considered.

Yup.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-25 Thread Kristof Zelechovski

If there is a character set that sports both, it must be used to put down
some human language.  My point is that there is no language that could make
use of this distinction by having both &uuml; and &utrema;.  There are
languages that use &uuml; and theoretically there could be ones that use
&utrema;, although I do not know of any valid case (I consider the French
case invalid).

Chëërs

Chrïs

 


From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Sander
Sent: Saturday, June 23, 2007 2:59 PM
To: Kristof Zelechovski; [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing

 

I hadn't thought of that one ;-)  (in Dutch there are no native words with
umlauts, only some of German or Scandinavian descent).
My question was about char-sets that contain both a trema version and a
(seperate) umlaut version of the same character. Are there any?

cheers,
Sander


Kristof Zelechovski schreef: 

Only the vowel U can have either but I have not seen a valid example of
&utrema;.  The orthography ambigüe has recently been changed to ambiguë
for consistency.  Polish nauka (science) and German beurteilen would
make good candidates but the national rules of orthography do not allow this
distinction because Slavic languages do not have diphthongs except in
borrowed words and it would cause ambiguity in German (cf. geübt).
(Incidentally, this leads to bad pronunciation often encountered even in
Polish media.)
Cheers
Chris
 
-Original Message-
From: Sander [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 9:26 PM
To: Kristof Zelechovski
Subject: Re: [whatwg] Entity parsing
 
 
Kristof Zelechovski schreef:
  

A dieresis is not an umlaut so I have to bite my tongue each time I write or
read nonsense like &iuml;.  It feels like lying.  Umlaut means mixed, a
dieresis means standalone.  Those are very different things, and the letter
I can never get mixed so there is no ambiguïty.  Since umlaut is borrowed
from German, I can see no problem in borrowing tréma from French.  I
personally prefer &itrema; to &idier; because of readability, but I would
not insist on that.
  


 
In professional typography, umlaut dots are usually a bit closer to the 
letter's body than the dots of the trema. In handwriting, however, no 
distinction is visible between the two. This is also true for most 
computer fonts and encodings.
[http://en.wikipedia.org/wiki/Umlaut_(diacritic)]
 
Are there any char-sets that have both umlaut and trema variations of 
characters? If so, both entities could exist.
 
cheers,
Sander
 
 
PS: I'd go for &itrema; instead of &idier; as well, as the term
trema is also the one that's used in Dutch.

Re: [whatwg] Entity parsing

2007-06-25 Thread Simon Pieters

On Mon, 18 Jun 2007 12:47:57 +0200, Simon Pieters [EMAIL PROTECTED]  
wrote:



http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/

[...] I might create proper test cases on this later when this is  
specced.


Done:

   http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/real/

--
Simon Pieters

Re: [whatwg] Entity parsing

2007-06-25 Thread Kristof Zelechovski

Inconsistently, as of IE7: I got &ge verbatim from your test.
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Allan Sandfeld Jensen
Sent: Saturday, June 23, 2007 2:55 PM
To: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Entity parsing


What about the Gecko entity parsing extension?

- IE consistently parses unterminated entities from latin-1
- Gecko parses all unterminated entities, even those beyond latin-1, but
only 
in text-content, not in attributes. (seems my recent firefox also supports 
the IE parsing in attributes now.)

See the attached test-case.

`Allan

Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

2007-06-25 Thread Kristof Zelechovski



A stressed schwa is present in Polish maritime dialect as well (Kaszëbszczi)
and Slovaks write mäso for miaso (meat), but that is not the point.  All
such uses can be covered under the hood of the dieresis; I only want the
true umlaut to be distinct, not as a code point but as an entity name.  BTW,
to clear another misconception: the dieresis is not a double accent (it may
be more verbosely described as double dot above) because unqualified accent
means acute accent by default; the Adobe registry name for the double accent
is Hungarian umlaut because it is used in Hungarian orthography only.
To make it explicit and plain: the dieresis is a diacritical mark that has
no intrinsic phonetic connotation, although it is used mostly for separating
vowels; the phonetic meaning of umlaut is generic and well-defined by its
very name and it does not apply to the vowel I.  I did not intend to make
HTML support all possible linguistic intricacies; I only wanted to eliminate
the common nonsense of denoting ï with &iuml;, or at least allow the authors
not to use this absurd denotation while still having an entity for that
letter.  &iuml; should be an alias for &itrema; for backward compatibility,
that is the whole story.  It would be up to the author to determine whether
&uuml; or &utrema; is appropriate; both entities should denote the same
character.
Cheers
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Saturday, June 23, 2007 11:28 PM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

Sander wrote:

 Are there any char-sets that have both umlaut and trema variations of
 characters?

Unicode does not make the distinction, so this is somewhat unlikely.

(Personally, I tend to think that the apparent preference for umlaut dots closer
to the letter than trema dots can be linked to extrinsic phenomena like the
preference for steep accents in French typography.)

Kristof Zelechovski wrote:

 Only the vowel U can have either

This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the
trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u
can all be umlauted (ä, ö, ü in German).

Moreover, the double-dot accent also has other uses (e.g., ä and ë both
designate a stressed schwa in Luxembourgeois), so it is probably not advisable
to attempt a complete classification in HTML.

-- 
Oistein E. Andersen

*) possibly only in the word capharnaüm (disregarding the highly unpopular
rectifications orthographiques of 1990) and in proper names
**) only in proper names

Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]

2007-06-25 Thread Øistein E . Andersen

On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote:

 A stressed schwa is present in Polish maritime dialect as well (Kaszëbszczi)
 and Slovaks write mäso for miaso (meat), but that is not the point.  All
 such uses can be covered under the hood of the dieresis;

I really do not understand why these uses of the double-dot diacritic
should be considered as instances of the diæresis (see below).

 the dieresis is not a double accent

I never said double accent, but you are right in pointing out that I should
have called it a double-dot diacritic rather than a double-dot accent, since
-- strictly speaking -- the only accents are acute, grave and circumflex.

 To make it explicit and plain: the dieresis is a diacritical mark that has
 no intrinsic phonetic connotation, although it is used mostly for separating
 vowels;

As you may know, diæresis derives from the Greek verb διαιρεῖν (diairein),
which means “to divide”, and it does indeed have an intrinsic meaning.

According to the OED, a diæresis is “[t]he sign (¨) marking [a phonological
diæresis], or, more usually, placed over the second of two vowels which
otherwise make a diphthong or single sound, to indicate that they are to be
pronounced separately.”

Similarly, umlaut is defined as “[t]he diacritical sign (¨) placed over a vowel
to indicate that [umlaut] has taken place.”

Hence, the use of either term when the double-dot diacritic is performing
another linguistic function is equally abusive.

 the phonetic meaning of umlaut is generic and well-defined by its
 very name and it does not apply to the vowel I.

Indeed. German umlaut notation is further restricted, and I am not quite sure
if the phonetic phenomenon applies to y either, but this is rather far off 
topic.

 I did not intend to make HTML support all possible linguistic intricacies;
 I only wanted to eliminate the common nonsense of denoting ï with &iuml;
 [...]
 I only want the true umlaut to be distinct, not as a code point but as an
 entity name.
 [...]
 It would be up to the author to determine whether &uuml; or &utrema;
 is appropriate; both entities should denote the same character.

Do you really think it is a good idea to introduce twelve new aliases
that do not work in current browsers, do not make the language more
expressive and require authors to make meaningless decisions?
(Is Slovak ä borrowed from German [it is pronounced æ or ɛ] and
therefore &auml; or does it have another origin? Should we use
&atrema; by default? How about Pinyin ü? Swedish words that contain
an ö as a result of umlaut vs those that contain it for a different reason?)

Trema or diæresis might have been a better choice than umlaut as a generic name,
since umlaut does not apply to all Latin vowels, but it is really too late to
fix this now.


On 25 Jun 2007, at 11:51AM, Křištof Želechovski wrote:

 Could I have an example of &otrema; please?

The canonical example in Dutch seems to be coördinatie, see
http://nl.wikipedia.org/wiki/Trema_in_de_Nederlandse_spelling .

 Something along the lines of zoölogy, but actually required?

Well, such spellings are actually required in some varieties of English.
“The New Yorker mandates that authors must coöperate to reëducate our
readership.” — allegedly from the magazine’s style manual.


On 25 Jun 2007, at 11:16AM, Křištof Želechovski wrote:

 there is no language that could make use of this distinction by having both
 &uuml; and &utrema;.  There are languages that use &uuml; and theoretically
 there could be ones that use &utrema;, although I do not know of any valid
 case (I consider the French case invalid).

I have no idea why you consider capharnaüm to be invalid (if this is what you
imply), but perhaps Spanish pingüino and Dutch reünie will be more convincing
examples.

French dictionaries require loan-words like angström, führer and länder (plural
of land) to be spelt with an umlaut, but these are of course too rare for
a differentiation tréma/umlaut to have developed, and I would imagine
German imports with umlaut to be only slightly more common in Dutch.

It would be interesting to see whether 19th-c. German actually made a
distinction between umlaut on a, o, u and diæresis on e, i (e.g., Rhomboïd),
but I do not know how consistently the diæresis was used, and words
requiring it are typically foreign words that, unlike the rest, will not have
been printed in Fraktur...

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-25 Thread Øistein E . Andersen

On 25 Jun 2007, at 11:57AM, Kristof Zelechovski wrote:

 Inconsistently, as of IE7: I got &ge verbatim from your test.

&ge; is /not/ a Latin-1 entity.

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

2007-06-25 Thread Sander



Křištof Želechovski schreef:

Could I have an example of &otrema; please?  Something along the lines of
zoölogy, but actually required?  Not that I doubt your knowledge of Dutch
but I would like to have it as a demonstration.
Chris
  

coördinaten


BTW: neither of the quotes below are mine ;-)

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oistein E. Andersen
Sent: Saturday, June 23, 2007 11:28 PM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing [trema/diaresis vs umlaut]

Sander wrote:

  

Only the vowel U can have either



This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the
trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u
can all be umlauted (ä, ö, ü in German).

Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]

2007-06-25 Thread Sander



Øistein E. Andersen schreef:

French dictionaries require loan-words like angström, führer and länder (plural
of land) to be spelt with an umlaut, but these are of course too rare for
a differentiation tréma/umlaut to have developed, and I would imagine
German imports with umlaut to be only slightly more common in Dutch.


In Dutch there are words with umlaut from both German and Scandinavian
descent. Most of them are substantives (e.g. übermensch, knäckebröd). The
only one I can think of right now that is not a substantive is überhaupt.

Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]

2007-06-25 Thread Ian Hickson

On Mon, 25 Jun 2007, Øistein E. Andersen wrote:

 On 25 Jun 2007, at 11:44AM, Křištof Želechovski wrote:
 
  A stressed schwa is present in Polish maritime dialect as well 
  (Kaszëbszczi) and Slovaks write mäso for miaso (meat), but that 
  is not the point.  All such uses can be covered under the hood of the 
  dieresis;
 
 I really do not understand why these uses of the double-dot diacritic 
 should be considered as instances of the diæresis (see below).

This really is out of scope of this working group, it's more a Unicode 
Consortium issue.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-25 Thread Øistein E . Andersen

On 25 Jun 2007, at 8:28AM, Ian Hickson wrote:

 On Sun, 24 Jun 2007, Øistein E. Andersen wrote:

 HTML5 currently follows IE7 much more closely than Safari, 
Firefox and Opera do, which seems to suggest that some of the quirks 
could be dispensed with.

 It's possible, though people kept pointing out problems, which is how we 
 ended up where we are now.

I have probably missed parts of this discussion, but most of the arguments
I have seen seem to rely on the assumption that whatever IE does is more
compatible with the Web as it is, which is probably a good approximation,
but replicating each single detail is not necessarily the best thing to do.

 Calling SGML sensible is a slippery slope! :-)

Sure, I did not mean to imply that all aspects of SGML are sensible :-)

(Bad connotations aside, SGML’s rules for optional semicolons
happen to be less contrived than IE’s.)

 [It might be a good idea to accept a missing semicolon at the end of words.]

 Well, we'd have to prove this somehow with real research.

Yes, research is really missing here.

Whatever we do, some pages will break, and it is not a priori impossible
that a compromise of IE and SGML rules may be less quirky and more
compatible with existing content at the same time.

I am unable to do a proper corpus study on this, but the following
examples suggest that following IE blindly may not be optimal.
All markup is extracted from real Web pages, and the author’s intent
was quite obvious from the context. The numbers in parentheses indicate
the number of pages found using Google.


I] Should be expanded

1) only SGML expands
&mdash
IE (incorrect): &mdash
SGML (correct): —

2) only IE expands
fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
IE (correct): fiancée, cafés, naïve
SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve

3) neither expands
&oeliguvre (719), c&oeligur (3,720)
both (incorrect): &oeliguvre, c&oeligur
intended: œuvre, cœur

II] Should not be expanded

1) IE expands
moral&ethics, roses&thorns
IE (incorrect): moralðics, rosesþs
SGML (correct): moral&ethics, roses&thorns

2) SGML expands
Alpha&Omega, once&forall
IE (correct): Alpha&Omega, once&forall
SGML (incorrect): AlphaΩ, once∀

3) both expand
rose&thorn
both (incorrect): roseþ
intended: rose&thorn


The examples I have found in category II] are all quite rare, but it is not
unlikely that more common ones exist.

Opera and Google both seem to err on the side of caution by only expanding
entities when both IE and SGML do, i.e., in case II.3) above.

It is also interesting to notice that reasonably common words belonging to class
I.2), which are handled by IE, are apparently no more frequent than words from
I.3), which no (popular) current browser handles correctly.
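
To make the two rule sets in the categories above concrete, here is a rough
Python sketch. It is only an approximation under stated assumptions: the
tables are a tiny illustrative subset, `expand_ie` and `expand_sgml` are
hypothetical names, the "IE" rule is simplified to "Latin-1 names expand
without a semicolon, using the longest matching prefix", and the "SGML" rule
to "the name is the whole run of letters and digits, semicolon optional".

```python
import re

# Tiny illustrative subset of the entity tables (assumption: not the real lists).
LATIN1 = {"eacute": "é", "iuml": "ï", "eth": "ð", "thorn": "þ"}
OTHER = {"mdash": "—", "oelig": "œ", "Omega": "Ω", "forall": "∀"}
ALL = {**LATIN1, **OTHER}

def expand_sgml(text):
    """Simplified SGML-like rule: the name runs to the first non-alphanumeric."""
    def repl(m):
        # Unknown names are left untouched, including any trailing semicolon.
        return ALL.get(m.group(1), m.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);?", repl, text)

def expand_ie(text):
    """Simplified IE-like rule for text content."""
    def repl(m):
        name, semi = m.group(1), m.group(2)
        if semi and name in ALL:
            return ALL[name]
        # Otherwise only Latin-1 names expand, on the longest matching prefix.
        for end in range(len(name), 0, -1):
            if name[:end] in LATIN1:
                return LATIN1[name[:end]] + name[end:] + semi
        return m.group(0)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*)(;?)", repl, text)

for sample in ["fianc&eacutee", "A &mdash B", "moral&ethics", "Alpha&Omega", "rose&thorn"]:
    print(sample, "| IE-like:", expand_ie(sample), "| SGML-like:", expand_sgml(sample))
```

Under these assumptions the sketch reproduces the IE and SGML lines listed for
each category, which is all it is meant to show; it says nothing about how
often each case occurs on real pages.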

I am looking forward to seeing more extensive research on this.

-- 
Øistein E. Andersen

Re: [whatwg] Entity parsing

2007-06-25 Thread Ian Hickson

On Tue, 26 Jun 2007, Øistein E. Andersen wrote:
 
 I am looking forward to seeing more extensive research on this.

The informal research I did when updating the spec suggests that the
current state of the spec is the better option. I don't really know how to do
more research -- it's quite hard to programmatically tell when an entity
should be expanded and when it shouldn't.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-24 Thread Anne van Kesteren

On Sat, 23 Jun 2007 20:12:45 +0200, Sam Ruby [EMAIL PROTECTED]  
wrote:

Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B.

Is that what we really want?  Testing with Firefox, the old behavior
is preferable.


Yeah, it makes sense to follow Internet Explorer 7 for this.


--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/

Re: [whatwg] Entity parsing

2007-06-23 Thread Sander

I hadn't thought of that one ;-)  (in Dutch there are no native words 
with umlauts, only some of German or Scandinavian descent).
My question was about char-sets that contain both a trema version and a
(separate) umlaut version of the same character. Are there any?


cheers,
Sander


Kristof Zelechovski schreef:

Only the vowel U can have either but I have not seen a valid example of
&utrema;.  The orthography ambigüe has recently been changed to ambiguë
for consistency.  Polish nauka (science) and German beurteilen would
make good candidates but the national rules of orthography do not allow this
distinction because Slavic languages do not have diphthongs except in
borrowed words and it would cause ambiguity in German (cf. geübt).
(Incidentally, this leads to bad pronunciation often encountered even in
Polish media.)
Cheers
Chris

-Original Message-
From: Sander [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 9:26 PM

To: Kristof Zelechovski
Subject: Re: [whatwg] Entity parsing


Kristof Zelechovski schreef:

 A dieresis is not an umlaut so I have to bite my tongue each time I write or
 read nonsense like &iuml;.  It feels like lying.  Umlaut means mixed, a
 dieresis means standalone.  Those are very different things, and I can
 never get mixed so there is no ambiguïty.  Since umlaut is borrowed from
 German, I can see no problem in borrowing tréma from French.  I personally
 prefer &itrema; to &idier; because of readability, but I would not
 insist on that.



In professional typography, umlaut dots are usually a bit closer to the 
letter's body than the dots of the trema. In handwriting, however, no 
distinction is visible between the two. This is also true for most 
computer fonts and encodings.

[http://en.wikipedia.org/wiki/Umlaut_(diacritic)]

Are there any char-sets that have both umlaut and trema variations of 
characters? If so, both entities could exist.


cheers,
Sander


PS: I'd go for &itrema; instead of &idier; as well, as the term
trema is also the one that's used in Dutch.

Re: [whatwg] Entity parsing

2007-06-23 Thread Allan Sandfeld Jensen

On Friday 15 June 2007 03:05, Ian Hickson wrote:
 On Sun, 5 Nov 2006, Øistein E. Andersen wrote:
  From section 9.2.3.1. Tokenising entities:
For some entities, UAs require a semicolon, for others they don't.
 
  This applies to IE.
 
  FWIW, the entities not requiring a semicolon are the ones encoding
  Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
  well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
  and REG). [...]

 I've defined the parsing and conformance requirements in a way that
 matches IE. As a side-effect, this has made things like na&iumlve
 actually conforming. I don't know if we want this. On the one hand, it's
 pragmatic (after all, why require the semicolon?), and is equivalent to
 not requiring quotes around attribute values. On the other, people don't
 want us to make the quotes optional either.

What about the Gecko entity parsing extension?

- IE consistently parses unterminated entities from Latin-1
- Gecko parses all unterminated entities, even those beyond Latin-1, but only
in text content, not in attributes. (It seems my recent Firefox also supports
the IE parsing in attributes now.)

See the attached test-case.

`Allan



Test of HTML entities in quirky mode:

&amp;
&amp
&ample
&not;
&not
&notat
&notin;
&notin
&notina
&ge;
&ge
&gel

Test of entities in attributes:

Re: [whatwg] Entity parsing

2007-06-23 Thread Sam Ruby


On 6/14/07, Ian Hickson [EMAIL PROTECTED] wrote:

On Sun, 5 Nov 2006, Øistein E. Andersen wrote:

 From section 9.2.3.1. Tokenising entities:
   For some entities, UAs require a semicolon, for others they don't.

 This applies to IE.

 FWIW, the entities not requiring a semicolon are the ones encoding
 Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
 well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
 and REG). [...]

I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this. On the one hand, it's
pragmatic (after all, why require the semicolon?), and is equivalent to
not requiring quotes around attribute values. On the other, people don't
want us to make the quotes optional either.


With the latest changes to html5lib, we get a failure on a test named
test_title_body_named_charref.

Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B.

Is that what we really want?  Testing with Firefox, the old behavior
is preferable.

- Sam Ruby

Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]

2007-06-23 Thread Øistein E . Andersen

Sander wrote:

 Are there any char-sets that have both umlaut and trema variations of 
 characters?

Unicode does not make the distinction, so this is somewhat unlikely.

(Personally, I tend to think that the apparent preference for umlaut dots closer
to the letter than trema dots can be linked to extrinsic phenomena like the
preference for steep accents in French typography.)

Kristof Zelechovski wrote:

 Only the vowel U can have either

This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the
trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u
can all be umlauted (ä, ö, ü in German).

Moreover, the double-dot accent also has other uses (e.g., ä and ë both
designate a stressed schwa in Luxembourgeois), so it is probably not advisable
to attempt a complete classification in HTML.

-- 
Øistein E. Andersen

*) possibly only in the word capharnaüm (disregarding the highly unpopular
rectifications orthographiques of 1990) and in proper names
**) only in proper names

Re: [whatwg] Entity parsing

2007-06-22 Thread Ian Hickson

On Fri, 22 Jun 2007, Kristof Zelechovski wrote:

 A dieresis is not an umlaut so I have to bite my tongue each time I
 write or read nonsense like &iuml;.  It feels like lying.  Umlaut means
 mixed, a dieresis means standalone.  Those are very different
 things, and I can never get mixed so there is no ambiguïty.  Since
 umlaut is borrowed from German, I can see no problem in borrowing
 tréma from French.  I personally prefer &itrema; to &idier;
 because of readability, but I would not insist on that.

There are plenty of entity names that are suboptimal. I wouldn't lose too 
much sleep over it.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-22 Thread Kristof Zelechovski

A dieresis is not an umlaut so I have to bite my tongue each time I write or
read nonsense like &iuml;.  It feels like lying.  Umlaut means mixed, a
dieresis means standalone.  Those are very different things, and I can
never get mixed so there is no ambiguïty.  Since umlaut is borrowed from
German, I can see no problem in borrowing tréma from French.  I personally
prefer &itrema; to &idier; because of readability, but I would not
insist on that.
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ian Hickson
Sent: Friday, June 22, 2007 6:09 AM
To: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing

On Fri, 15 Jun 2007, Křištof Želechovski wrote:

 Aside: I know that it can be changed but iuml is a very unfortunate
 name for i tréma.  How about deprecating iuml in favor of itrema?

We're not deprecating anything, and just introducing a new name for i-uml
would be a dangerous slippery slope to start down. Anyway, i-umlaut is
fine, and easier to spell than i-diaeresis; why would you call it trema?
Trema doesn't seem any more common than umlaut...

Re: [whatwg] Entity parsing

2007-06-18 Thread Simon Pieters

On Sat, 16 Jun 2007 15:30:07 +0200, Anne van Kesteren [EMAIL PROTECTED]  
wrote:



No, IE doesn't break them, and that's the point.

Section 8.2.3.1. states "This definition is used when parsing entities
in text and in attributes." - if I understand this correctly, this
makes semicolon optional for entities in both attributes and text and
&region in attribute would be interpreted as ®ion.
If that's the case, it is not compatible with IE, because it parses
entities differently in attributes and text. In attributes semicolon
(any non-alphanumeric character actually) is required, but in text it
is not.


In IE6 <a href="&region">&region</a> is equivalent to
<a href="&amp;region">®ion</a>


Awesome. Guess we have to reverse engineer that too then...


   http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/

The tests aren't really digestible in their current state unless you know
what they're doing, but well, I'll just say what the results are below. I
might create proper test cases on this later when this is specced.



Entity parsing works the same in different attributes (tested img alt  
and a href).


Any character that is not in the range [a-zA-Z0-9] ends an entity -- i.e.,  
the following are equivalent:


   <img alt="&AElig.">
   <img alt="&AElig;.">

...and the following are equivalent:

   <img alt="&AElig1">
   <img alt="&amp;AElig1">


This means that the semi-colon is not part of the entity name, and we need  
to revert to the old entity table and instead have a third column that  
says which entities always require a semi-colon.


You consume as many characters as possible that match the entity table,  
and for the longest match, check if the next character is in the  
abovementioned range. If yes, emit the consumed characters, otherwise emit  
the entity, or something along those lines.
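
The procedure described above for attribute values might be sketched roughly
as follows. This is only an illustration, not the spec text: the table
`ENTITIES` is a hypothetical miniature, `expand_in_attribute` is an invented
name, and the proposed third column (entities that always require a
semicolon) is left out for brevity.

```python
import string

# Hypothetical miniature entity table; a real one has hundreds of entries.
ENTITIES = {"AElig": "Æ", "amp": "&", "reg": "®", "not": "¬", "notin": "∉"}
ALNUM = set(string.ascii_letters + string.digits)

def expand_in_attribute(value):
    """Expand entities in an attribute value using the longest-match rule."""
    out = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            out.append(value[i])
            i += 1
            continue
        # Longest entity name that matches right after the ampersand.
        match = max(
            (name for name in ENTITIES if value.startswith(name, i + 1)),
            key=len,
            default=None,
        )
        end = i + 1 + len(match) if match else i + 1
        following = value[end:end + 1]
        if match is None or following in ALNUM:
            # Followed by [a-zA-Z0-9]: emit the consumed characters literally.
            out.append("&")
            i += 1
        else:
            # Otherwise emit the entity and swallow an optional semicolon.
            out.append(ENTITIES[match])
            i = end + 1 if following == ";" else end
    return "".join(out)

print(expand_in_attribute("&AElig."))
print(expand_in_attribute("&AElig1"))
print(expand_in_attribute("?foo=bar&region=baz"))
```

With this rule an unencoded query string such as `?foo=bar&region=baz` stays
intact, because the longest match `reg` is followed by a letter.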


--
Simon Pieters

Re: [whatwg] Entity parsing

2007-06-16 Thread Anne van Kesteren


On Sat, 16 Jun 2007 00:58:21 +0200, MegaZone [EMAIL PROTECTED] wrote:

Personally I prefer quoted attribute values too, but I don't feel
that strongly about it.  I just know that with the quotes optional
someone is going to try to list space separated 'class' names. ;-)


For what it's worth, they have _always_ been optional in HTML. And you're  
right, some people might do that. In fact, it was done wrong so often for


  <meta http-equiv=content-type content=text/html; charset=utf-8>

that browsers now all support a charset= attribute on meta for  
indicating the document encoding.



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/

Re: [whatwg] Entity parsing

2007-06-16 Thread Anne van Kesteren

On Fri, 15 Jun 2007 21:21:06 +0200, Kornel Lesinski [EMAIL PROTECTED]  
wrote:
On Fri, 15 Jun 2007 19:37:46 +0100, Anne van Kesteren [EMAIL PROTECTED]  
wrote:

I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this.


Rather not. This would break unencoded URLs:

?foo=bar&region=baz → ?foo=bar®ion=baz


You mean that Internet Explorer breaks them already? That doesn't make  
much sense to me.


No, IE doesn't break them, and that's the point.

Section 8.2.3.1. states "This definition is used when parsing entities
in text and in attributes." - if I understand this correctly, this makes
semicolon optional for entities in both attributes and text and
&region in attribute would be interpreted as ®ion.
If that's the case, it is not compatible with IE, because it parses
entities differently in attributes and text. In attributes semicolon
(any non-alphanumeric character actually) is required, but in text it is
not.


In IE6 <a href="&region">&region</a> is equivalent to
<a href="&amp;region">®ion</a>


Awesome. Guess we have to reverse engineer that too then...


--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/

Re: [whatwg] Entity parsing

2007-06-16 Thread MegaZone

Once upon a time Anne van Kesteren shaped the electrons to say...
 For what it's worth, they have _always_ been optional in HTML. And you're  
 right, some people might do that. In fact, it was done wrong so often for

I know, it was one of the things that used to annoy me in other
author's markup - not so much using them or not in general, but when
someone would quote some attributes and not others.  Pet peeve.

Forcing the parens was something I liked about XHTML - on the other
hand forcing lowercase elements took some getting used to, since I had
been in the 'all caps' school since I first played with HTML in 1991.
Win some, lose some. :-)

   <meta http-equiv=content-type content=text/html; charset=utf-8>
 
 that browsers now all support a charset= attribute on meta for  
 indicating the document encoding.

This is a bit cleaner, since the name=value structure is still
intact.  I see people doing things like:

<a class=main title>text</a>

When they mean:

<a class="main title">text</a>

And not:

<a class=main title="">text</a>

Quotes are really only optional on single-value attributes, or it
creates a parsing nightmare, trying to read the author's mind.

-MZ
-- 
megazone-at-megazone.org  http://www.MegaZone.org/   Gweep, Geek, Human, me.
http://www.TiVoLovers.com/  http://www.Eyrie-Productions.com/ -- Hail Eris 
A little nonsense now and then, is relished by the wisest men 508-852-2171

Re: [whatwg] Entity parsing

2007-06-15 Thread Simon Pieters


On Fri, 15 Jun 2007 03:05:05 +0200, Ian Hickson [EMAIL PROTECTED] wrote:


On Sun, 5 Nov 2006, Øistein E. Andersen wrote:


From section 9.2.3.1. Tokenising entities:
  For some entities, UAs require a semicolon, for others they don't.

This applies to IE.

FWIW, the entities not requiring a semicolon are the ones encoding
Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
and REG). [...]


I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this.


Firefox, Opera and Safari treat na&iumlve as equivalent to
na&amp;iumlve. So for compat with them, the semicolon should be made
required.


--
Simon Pieters

Re: [whatwg] Entity parsing

2007-06-15 Thread Křištof Želechovski

Aside: I know that it can be changed but iuml is a very unfortunate name
for i tréma.  How about deprecating iuml in favor of itrema?
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Simon Pieters
Sent: Friday, June 15, 2007 8:49 AM
To: Ian Hickson; Oistein E. Andersen
Cc: [EMAIL PROTECTED]
Subject: Re: [whatwg] Entity parsing

On Fri, 15 Jun 2007 03:05:05 +0200, Ian Hickson [EMAIL PROTECTED] wrote:

 On Sun, 5 Nov 2006, Øistein E. Andersen wrote:

 From section 9.2.3.1. Tokenising entities:
   For some entities, UAs require a semicolon, for others they don't.

 This applies to IE.

 FWIW, the entities not requiring a semicolon are the ones encoding
 Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
 well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
 and REG). [...]

 I've defined the parsing and conformance requirements in a way that
 matches IE. As a side-effect, this has made things like na&iumlve
 actually conforming. I don't know if we want this.

Firefox, Opera and Safari treat na&iumlve as equivalent to
na&amp;iumlve. So for compat with them, the semicolon should be made
required.

-- 
Simon Pieters

Re: [whatwg] Entity parsing

2007-06-15 Thread Kornel Lesinski


On Fri, 15 Jun 2007 02:05:05 +0100, Ian Hickson [EMAIL PROTECTED] wrote:


I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this.


Rather not. This would break unencoded URLs:

?foo=bar&region=baz → ?foo=bar®ion=baz

--
regards, Kornel Lesiński

Re: [whatwg] Entity parsing

2007-06-15 Thread Anne van Kesteren

On Fri, 15 Jun 2007 20:32:45 +0200, Kornel Lesinski [EMAIL PROTECTED]  
wrote:

On Fri, 15 Jun 2007 02:05:05 +0100, Ian Hickson [EMAIL PROTECTED] wrote:


I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this.


Rather not. This would break unencoded URLs:

?foo=bar&region=baz → ?foo=bar®ion=baz


You mean that Internet Explorer breaks them already? That doesn't make  
much sense to me.



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/

Re: [whatwg] Entity parsing

2007-06-15 Thread MegaZone

Once upon a time Ian Hickson shaped the electrons to say...
 I've defined the parsing and conformance requirements in a way that 
 matches IE. As a side-effect, this has made things like na&iumlve
 actually conforming. I don't know if we want this. On the one hand, it's 
 pragmatic (after all, why require the semicolon?), and is equivalent to 
 not requiring quotes around attribute values. On the other, people don't 
 want us to make the quotes optional either.

I think the semicolon is important for readability and clarity - where
does the entity reference end?  There is potential confusion with
similarly named entities: &not; &notin; &or; &ordf; &ordm; &pi; &piv;
&sigma; &sigmaf; &sub; &sube; &sup; &sup1; &sup2; &sup3; &supe;
&theta; &thetasym;

The semicolon eliminates confusion.
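
The overlap between those names can be checked mechanically. The following is
a small Python sketch (the helper name `prefix_pairs` is made up for this
illustration) that lists every pair in which one entity name is a proper
prefix of another, which is exactly the situation where a missing semicolon
leaves the end of the reference ambiguous.

```python
# Entity names taken from the list above.
NAMES = [
    "not", "notin", "or", "ordf", "ordm", "pi", "piv",
    "sigma", "sigmaf", "sub", "sube", "sup", "sup1", "sup2", "sup3", "supe",
    "theta", "thetasym",
]

def prefix_pairs(names):
    """Return (short, long) pairs where short is a proper prefix of long."""
    ordered = sorted(names)
    return [
        (short, long_)
        for short in ordered
        for long_ in ordered
        if short != long_ and long_.startswith(short)
    ]

for short, long_ in prefix_pairs(NAMES):
    print(f"&{short} is a prefix of &{long_}")
```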

Personally I prefer quoted attribute values too, but I don't feel
that strongly about it.  I just know that with the quotes optional
someone is going to try to list space separated 'class' names. ;-)

-MZ
-- 
megazone-at-megazone.org  http://www.MegaZone.org/   Gweep, Geek, Human, me.
http://www.TiVoLovers.com/  http://www.Eyrie-Productions.com/ -- Hail Eris 
A little nonsense now and then, is relished by the wisest men 508-852-2171

Re: [whatwg] Entity parsing

2007-06-14 Thread Ian Hickson

On Sun, 5 Nov 2006, Øistein E. Andersen wrote:

 From section 9.2.3.1. Tokenising entities:
   For some entities, UAs require a semicolon, for others they don't.
 
 This applies to IE.
 
 FWIW, the entities not requiring a semicolon are the ones encoding 
 Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as 
 well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT 
 and REG). [...]

I've defined the parsing and conformance requirements in a way that 
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this. On the one hand, it's 
pragmatic (after all, why require the semicolon?), and is equivalent to 
not requiring quotes around attribute values. On the other, people don't 
want us to make the quotes optional either.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Entity parsing

2007-06-14 Thread Michel Fortin


Le 2007-06-14 à 21:05, Ian Hickson a écrit :

I've defined the parsing and conformance requirements in a way that  
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this.


I'd make it non-conforming for the sake of readability.

On the one hand, it's pragmatic (after all, why require the  
semicolon?), and is equivalent to not requiring quotes around  
attribute values. On the other, people don't want us to make the  
quotes optional either.


I'm perfectly fine with quotes being optional; I think unquoted
attribute values are generally as easy to read as their quoted
counterparts, if not sometimes easier since you don't have the noise
of the quotes.


On the other hand, it took me about a minute to figure out the word
in your example -- na&iumlve -- simply because I couldn't find
where to put the delimitation between the end of the entity name and
the last few characters in the word. In other words, is this the
entity &iu, &ium, &iuml, &iumlv or &iumlve? Without a list of
entities at hand, it takes a lot of guesswork to find how many characters
it consumes and the name of the entity. And not everyone can remember all
those entity names.



Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/
