[whatwg] Subsequent access to empty string URLs (Was: Base URL’s effect on an empty @src element)

2013-05-01 Thread Leif Halvard Silli
Anne van Kesteren on Wed May 1 09:46:50 PDT 2013:
> On Wed, May 1, 2013 at 5:39 PM, Boris Zbarsky wrote:
>> Interesting.  Certainly at the point when Gecko implemented the current
>> behavior I recall it matching the spec…

Thanks so much, Darin, Boris and Anne.

> Changed in: http://html5.org/r/4841
> 
> Context: 
> 
http://lists.w3.org/Archives/Public/public-whatwg-archive/2010Mar/thread.html#msg67

FOLLOW-UP on src="":

  If @src is empty (and there is no base URL), a 'subsequent access' via 
a contextual menu, such as 'Show/Open image' or 'Save/Download image', 
has no effect in Firefox 20, Opera 12 and IE 10, whereas Safari/Chrome do 
provide a contextual menu item for those features. (The UA results are 
the same - except with regard to Firefox - also when there *is* a base 
URL.)

  Webkit/Blink seems inconsistent/buggy, right?

  A special detail is the last paragraph of section '2.5.3 Dynamic 
changes to base URLs'[1], which implies that a change to the base URL 
should (even when @src is empty, one must assume?) affect the @src 
URL, so that a 'subsequent access' via the context menu could be used to 
e.g. open the image resource pointed to by the base URL. Is that meaningful? 

  As of now, only Webkit/Blink let the base URL affect the subsequent access.
  (And Firefox, but that is because of the bug.)


FOLLOW-UP w.r.t. cite="" and longdesc="":

   What if @cite or @longdesc are empty? Personally, I think it would 
be simplest to handle at least @longdesc - but probably @cite too - the 
same way that @src is handled. The relevance to subsequent access to 
empty @src is that @longdesc and @cite tend, from the users’ point of view, 
to be subsequently accessed (e.g. via the context menu).

   Currently, the HTML spec doesn't even require the @cite attribute to 
be a *non-empty* URL - thus it can be empty.[2] By contrast, the 
@longdesc cannot be empty.[3] What is the use case for an empty @cite 
attribute?

   For @longdesc, the ‘trend’ of implementations is to ignore the 
longdesc when it is the empty string.[4] And basically, my motivation 
for these letters is to make sure that the longdesc spec can safely say 
- without conflicting with anything else - that implementations should 
ignore empty longdesc attributes.[5]

[1] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#dynamic-changes-to-base-urls
[2] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#attr-blockquote-cite
[3] 
https://dvcs.w3.org/hg/html-proposals/raw-file/default/longdesc1/longdesc.html#attributes
[4] https://www.w3.org/Bugs/Public/show_bug.cgi?id=21778#c2
[5] https://www.w3.org/Bugs/Public/show_bug.cgi?id=21778#c4
-- 
leif halvard silli

[whatwg] Base URL’s effect on an empty @src element

2013-05-01 Thread Leif Halvard Silli
Given a document, where 

 1. the content of the img @src is empty, and thus invalid, 
 2. but there is a base URL which points to an image
 
 Example:
  
  
  

  Live DOM Viewer test:
  http://software.hixie.ch/utilities/js/live-dom-viewer/saved/2236
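
  A minimal sketch of the kind of document tested (the file name 
  'image.jpg' is taken from the results below; the rest of the markup 
  is an assumption):

  <!DOCTYPE html>
  <title>Empty @src versus base URL</title>
  <base href="image.jpg">
  <img src="" alt="ALT TEXT">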
   
Current UA behaviors:
 x) Firefox is the only one to render the image 'image.jpg'.
 y) Webkit/Blink, IE10 and Opera12 render it as the alt text.

Questions:
 A) Which of x) and y) is correct?
 B) If the correct behavior is already defined in a spec,
where is it defined?

Leif Halvard Silli


Re: [whatwg] Entity definitions in XHTML

2013-01-17 Thread Leif Halvard Silli
David Carlisle on Fri, 18 Jan 2013 00:03:12 +:
> To: Ian Hickson 
> On 17/01/2013 23:31, Ian Hickson wrote:
>> On Thu, 17 Jan 2013, David Carlisle wrote:

>>>>> that documents will be interpreted differently by an XHTML
>>>>> user agent and a standard XML toolchain.
>>>> 
>>>> I do not understand what this means. Can you give an example?

Though not XML: the trouble Anolis had with putting out the correct 
glyph values for the ⟩ and ⟨ entities was caused by a part 
of Anolis that interpreted those entities in the old, HTML5-
*in*compatible way. This in turn resulted in the wrong character when 
the entities were converted to normal characters before being output to 
the HTML5 spec:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=14430
This was a surprisingly long-lasting bug. (And perhaps it is not fully solved 
yet …) It had probably existed since HTML5 included named entities in 
the spec. And, as the reporter of the bug, I was asked time and again 
about whether the bug had been fixed or not ...

In this case, Anolis output "polyglot" character references, since 
it converted the named references to numeric references. (Please ignore 
HTML5's current shortcut: 
https://www.w3.org/Bugs/Public/show_bug.cgi?id=20702) But since the bug 
actually was in Anolis’ list of named character references, this 
nevertheless caused a misrepresentation of the named entities.

>>> There is more to compatibility than compatibility between the
>>> browsers. For XHTML there needs to be compatibility between
>>> Browsers and XML tools (otherwise why use XML at all, I know you
>>> would rather people didn't but so long as the spec allows then to
>>> it should not mandate a situation that makes document corruption so
>>> likely).
>> 
>> There is no such mandate. The spec merely provides a catalogue of
>> public identifiers and their modern meaning. Nothing stops XML users
>>  from using any other identifier, in particular SYSTEM identifiers.
>> The spec discourages people from using DTDs in general, because of
>> precisely the kinds of issues that are being discussed here, but the
>>  XML spec allows it, and that's what controls this at the end of the
>>  day (especially in the case of software that isn't using the HTML
>> spec's catalogue).
>> 
> As I note above there are many existing systems using the Public
> identifiers of XHTML1 to refer to the XHTML1 DTD and using validating
> parsers. They can not simply switch in a catalog that makes their
> existing document collections invalid. So they can not make documents
> using the XHTML1 public identifier load a DTD other than XHTML1 DTD.

1) If the legacy XHTML DTDs are so risky, shouldn't the spec
   explicitly warn against using them when authoring XHTML5
   documents?

2) David, have you considered the possibility of linking this named
   entity magic to the legacy-compat variant of the HTML5 doctype?

   http://www.w3.org/TR/html5/syntax.html#doctype-legacy-string

   The advantage of doing so would be that nothing new needs to be
   introduced.
   The disadvantage (but perhaps advantage in Ian's eyes) ;-)
   would be the name of this doctype variant - "legacy".
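
   (For reference, the legacy-compat variant is the doctype written as

   <!DOCTYPE html SYSTEM "about:legacy-compat">

   per the section linked above.)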
-- 
leif halvard silli

Re: [whatwg] Encoding sniffing algorithm

2012-09-09 Thread Leif Halvard Silli
Ian Hickson ian at hixie.ch  on Thu Sep 6 12:55:03 PDT 2012:
> On Fri, 27 Jul 2012, Leif Halvard Silli wrote:

>> Revised encoding sniffing algorithm proposal:
>> 
>> NEW! 0. document is XML format - opt out of the algorithm.
>> [This step is already implicit in the spec, but it would
>> make sense to explicitly include it to make sure that
>> one could e.g. write test cases to see that it is step
>> is implemented. Currently Safari, Chrome and Opera do 
>> not 100% implement this step.]
> 
> I don't understand the relevance of the algorithm to XML. Why would anyone 
> even look at this algorithm if they were parsing XML?

In principle it should not be needed. Agree. 

But many of those who are parsing XML are also parsing HTML - for that 
reason it should be natural for them to compare specs and requirements. 
Currently, Webkit and Chromium in particular seem to be colored by 
their HTML parsing when they parse XML. (See the table in my blog 
post.) Also, the spec does, a few times, include phrases similar to "if it 
is XML, then abort these steps" (for example in '3.4.1 Opening the 
input stream'),[*] so there is some precedent, I think.

[*] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#opening-the-input-stream

>> NEW! #. Alternative: The BOM signa­ture could go here instead of 
>> in step 5. There is a bug to move the BOM hereto and make
>> it override anything else. What speaks against this are:
>>   a) that Firefox, IE10 and Opera do not currently have
>>  this behavior.
>>   b) this revision of the sniffing algorithm, especially
>>  the revision in step 6 (required UTF-8 detection),
>>  might make the BOM-trumps-everything-else override
>>  less necessary
>> What speaks for this override:
>>   a) Safari, Chrome and legacy IE implement it.
>>   b) some legacy content may depend on it
> 
> Not sure what this means.

You will be dealing with it when you take care of Anne's bug: "Bug 
15359 Make BOM trump HTTP". [*] Thus, you can just ignore it. 
[*] https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359. 


>>  1. user override.
>> (PS: The spec should clarify whether user override is
>>  cacheable.)
> 
> This seems to be entirely a user interface issue.

But then, why do you go on to describe it in the new note? (See below.)


>> NEW! 2. iframe inherits user override from parent browsing context
>> [Currently not mentioned in the spec, despite that "all"
>>  UAs do have this step for HTML docs.]
> 
> That's a UI issue much like whether it's remembered or not. But I've added 
> a non-normative note.

Your new note:

"""1. Typically, user agents remember such user requests 
   across sessions, and in some cases apply them to 
   documents in iframes as well."""

My comments:

   1: How does that differ from the "info on the likely encoding" step?

   2: Could you define 'sessions' somewhere? It sounds to me like the 
'sessions' behavior that you describe resembles the Opera behavior. 
Which is bad, since the Opera behavior is the least typical one. (And 
the most annoying from a page developer's point of view.) The typical thing 
- which Opera breaks! - is to, in some way or another, limit the 
encoding override to the current *tab* only. Thus, if you insist on 
describing what UAs "typically" do, then you should, instead of 
describing the exception (Opera), say that browsers *differ*, but that 
the typical thing is to limit the encoding override, in some way or 
another, to the current tab. 

   3: Browsers differ enough for you to evaluate how they behave and 
pick the best behavior. However, I'd say Firefox is best, as it offers a 
compromise between IE and Webkit. (See below.)

Comments in more detail:

FIRSTLY: Regarding "across sessions": my assumption would be 
that a "single session" is equal to the lifespan of a single tab (or a 
single window, if there is no tab in the window). If so, then that is 
how Safari/Chrome behave: the override lasts as long as one stays in the 
current tab/window.

SECONDLY: Does 'sessions' relate to a particular document - as in 
"document during several sessions"? Or to a particular tab/window - as 
in "session = tab"?
  * Under FIRSTLY, I described how Safari/Chrome behave: They do not 
give heed to the document. They *only* give heed to the current 
tab/window: If you override a document to use the KOI8-R encoding then 
the next document you load in the same tab wil

Re: [whatwg] alt and title attribute exception

2012-08-01 Thread Leif Halvard Silli
Philip Jägenstedt Wed Aug 1 05:05:15 PDT 2012:
> On Tue, 31 Jul 2012 14:03:02 +0200, Steve Faulkner wrote:
> 
>> title has differing semantics to alt. In situations where alt it not
>> present on an img but title is, in webkit based browsers the title
>> attribute content is displayed on mouse hover and is also displayed in
>> place of the image when images are disabled or not available. This
>> implementation appears to contradict the must requirement in the spec.
>>
>> User agents must not present the contents of the alt attribute in the  
>> same way as content of the title attribute.
>>
>> As there is no way visual distinction between title content being  
>> displayed and of alt content in this case.
> 
> To be very clear, you agree with the spec, think that WebKit is wrong and  
> would not offer any applause if Opera were to use the title attribute to  
> replace images when images are disabled and there is no alt attribute?

[I suppose 'the spec' means the W3C HTML5 spec?] 

Question: It would be rather simple for Opera, would it not, to add some 
CSS that makes the @title be used as the @alt replacement when the @alt is 
lacking?
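
Something along these lines in the UA stylesheet, perhaps (a rough 
sketch - whether a 'content' rule like this is how Presto would do it 
is an assumption):

   /* sketch: show @title as the replacement text when @alt is missing */
   img[title]:not([alt]) { content: attr(title); }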
-- 
leif halvard silli

Re: [whatwg] alt and title attribute exception

2012-07-31 Thread Leif Halvard Silli
Steve Faulkner on Tue, 31 Jul 2012 13:03:02 +0100, wrote,
in reply to Philip Jägenstedt:
>> but I'm confused -- is falling back to title a Good Thing that people want
>> browsers to implement, or is it just a quirk that some legacy browser had?
> 
> Given that there is a semantic distinction in the spec between what alt
> content is and what title content is and a swathe of normative
> requirements/advice based on this distinction it would appear unwise to
> promote the use of title as fallback without providing normative
> requirements on provision of a method to distinguish between the two.

So, it is bad that the Webkittens fall back to using @title? 

I must admit that I don't understand how you reason. Because, when 
@title is used as fallback, then we _want_ @title to be treated as 
@alt. So why do we need a method to distinguish the two, then?

> *Note:* in terms of the accessible name calculation for an img element, if
> the image does not have aria-label or an aria-labelledby or an alt
> attribute, but does have a title attribute, then the title attribute is
> used as the accessible name. From an accessibility API perspective, no
> distinction is indicated as to the source of the accessible name (apart
> from in the Mac AX API).

On the old Mac I have at hand right now, AXImage (in the 
Accessibility Inspector) renders the @title content when the @alt is 
lacking. There is no info about the fact that the AXImage value stems from 
@title. But perhaps that has changed, so that AT users are informed when 
the accessible name stems from the @title?
 
> The last point is another reason why making the title attribute on images
> (without alt) conforming is that the semantics, for all users, are
> ambiguous.

And another place in the same letter you say:

>> User agents must not present the contents of the alt attribute
>> in the same way as content of the title attribute.
> 
> As there is no way visual distinction between title content 
> being displayed and of alt content in this case.

Comments:

(1) It does not follow, from the fact that the spec forbids @alt from 
being rendered as a tooltip, that a tooltip cannot be rendered as an 
@alt.

(2) If the spec did not forbid @alt from rendering as a tooltip, then 
authors could be confused into writing @alt texts that were excellent as 
tooltips but suboptimal as @alt content. 
(Thus, the rule is based on respect for how the two features are 
distinct.) Conversely, if @title renders as @alt, then authors would 
perhaps write tooltips that served OK as @alt. If that is bad, then why 
is it bad? 

(3) The fact that @title is used as a last resort when calculating the 
accessible name is because an accessible name is so important that even 
a tooltip can be useful for that purpose, when need be. So why would it 
be a big no-no that a lacking @alt causes the @title to be rendered as 
@alt content? 

I think the spec's motivation for the current "exception" might be 
similar to the generator exception: it is there so as not to trigger authors 
to e.g. create empty @alt or repeated, meaningless @alt text of the 
kind alt="image" - just in order to validate. I disagree strongly with 
the generator exception. But I cannot say I strongly disagree with the 
@title exception. With the introduction of ARIA, it has become even 
less critical to remove this exception, since ARIA includes the @title 
as a last resort anyhow.

I'm uncertain about how lack of keyboard access to @title can be used 
against this exception, when both Webkittens and ARIA give them access 
to it.
-- 
Leif Halvard Silli

Re: [whatwg] Suggest making and valid in

2012-07-31 Thread Leif Halvard Silli
Ian Yang on Thu, 19 Jul 2012 15:04:48 +0800, wrote:

>> From previous discussions, some people had suggested possible markup for
>> "life cycle" type contents. And personally I will stick to using  until
>> there is a better solution.
> 
> There is still one thing left unanswered. And that's whether we will be
> able to put  inside .
> 
> Let's consider  we used often. When coding a form, none of us make it
> like the following one because that's obviously very ugly and, most
> importantly, it hurts our eyes!
> 
> 
> Name
> 
  [...]
> Instead, we use  (some people use ) to group sub elements
  [...]
> 
> 
> Name
> 
> 

Would it not be better if, rather than , you used ? Then 
it would not only benefit your eyes but also the semantics:

  
 Name
 
  

There is even the option of wrapping the  around the input - 
then you can drop the @id too - and be semantic as well:

  Name
 
  

This way you can 'increase' both the semantics and the 'eye wellness'.
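
In concrete markup, the two patterns just described amount to roughly 
this (a sketch):

  <p>
 <label for="name">Name</label>
 <input id="name" name="name" type="text">
  </p>

  <p>
 <label>Name
<input name="name" type="text">
 </label>
  </p>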

> Like above examples, the following  is not well organized, and it's
> also a pain to read it:
> 
> 
> Lorem Ipsum
> Sit amet, consectetur adipiscing elit.
> Aliquam Viverra
> Fringilla
   [... etc ...]
> 
> 
> If developers could, *optionally*, use  to wrap each group, the code
> would be more organized:
> 
> 
> 
> Lorem Ipsum
> Sit amet, consectetur adipiscing elit.
> 
> 
> Aliquam Viverra
> Fringilla nulla nunc enim nibh, commodo sed cursus in.
> 
   [...]
> 
> 
> And usually "life cycle" type contents are presented as circles. Without
> (s), it will be hard to style them.

How about the following method - essentially a variant of 
"Egg: A white egg." [etc], as proposed by Ian:

Lorem Ipsum
  Sit amet, consectetur adipiscing elit.

Aliquam Viverra
  Fringilla nulla nunc enim nibh, commodo 
  sed cursus in.

Or, if one wishes, one could drop the … completely 
and instead e.g. do the following: 

figure figure{display:list-item}


  Lorem Ipsum
  Sit amet, consectetur adipiscing elit.


  Aliquam Viverra
  Fringilla nulla nunc enim nibh, commodo 
  sed cursus in.
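
Spelled out with explicit markup, that idea might look like this (the 
exact elements - nested figure with figcaption - are an assumption 
based on the rule above):

<style> figure figure { display: list-item } </style>

<figure>
  <figure>
 <figcaption>Lorem Ipsum</figcaption>
 Sit amet, consectetur adipiscing elit.
  </figure>
  <figure>
 <figcaption>Aliquam Viverra</figcaption>
 Fringilla nulla nunc enim nibh, commodo 
 sed cursus in.
  </figure>
</figure>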



> Since the *optional *use of  in  could solve many problems, may we
> have  being valid in ?

The most serious problem with that proposal seems to me to be that the 
 would only have styling functionality. I think one would have to define 
it as a new list type, where  has semantic meaning, and then it 
could perhaps work.
-- 
Leif Halvard Silli

Re: [whatwg] Suggest making and valid in

2012-07-31 Thread Leif Halvard Silli
Ian Hickson on Mon, 16 Jul 2012 04:31:44 + (UTC), wrote:

> It's certainly true that many element names are derived more from 
> historical accidents than their current semantics, but  and  are 
> semantically quite different, as the spec describes.
> 
> Specifically,  implies that the order of the list cannot be changed 
> without affecting the meaning of the page, whereas the order in a  
> list is merely aesthetic.

Thanks. I learned a lot from this thread.

Just now I caught myself writing the following in a Web page: "Regarding 
the last list-item, then …". And then I realized that that "last 
list-item" occurred inside a  list. Which meant that I had to (or 
at least I did) change the list from  to . I also replaced the 
numerical list-item numbering with circles, to signify that the items 
were not numbered.

In fact, I frequently deal with texts where there are "homework items" 
and each homework item contains one or more sub-items. For these 
sub-items, I use …… — which seems logical as long as 
there is more than one sub-item. But what if - at least for the time being - 
there is only one sub-item? I want the sub-item to have a bullet, or 
similar, to signify that it is a sub-item. I don't want a number. At 
the same time, there is no principal difference between that lone 
sub-item and the multiple sub-items in the nearby homework item.

So one option that comes to mind is to do the following, in order to be 
certain that sole-items have a different style:
ol>li:first-child:last-child {list-style-type:circle}

Should I want to add one item more, then I automatically get numbering.
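
A usage sketch (the homework markup itself is just an assumption):

<style> ol > li:first-child:last-child { list-style-type: circle } </style>
<ol>
  <li>First homework item
 <ol>
<li>The only sub-item - gets a circle</li>
 </ol>
  </li>
  <li>Second homework item
 <ol>
<li>First sub-item - gets a number</li>
<li>Second sub-item</li>
 </ol>
  </li>
</ol>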

What strikes me is that I almost never would like to use  anymore. 
Only when I would like to explicitly say that the meaning of this 
document does not change whichever way you list the list-items, only 
then would I pick .

Which makes me wonder: Why is value="" not allowed for  
inside ? E.g. I might want to add incidental numbers to the 
list-items while at the same time also wanting to say that the page 
does not change meaning whichever way you order the items.

I also wonder: Would it not make sense to advise authors, when uncertain 
about whether order is significant, to pick  over ? 
For instance the sub-items of our homework items: since the order of 
the sub-items often risks becoming significant, it seems smart to pick 
 and not  - even if  could sometimes work too.
-- 
Leif H Silli

[whatwg] Encoding sniffing algorithm - update proposal

2012-07-26 Thread Leif Halvard Silli
I have just written a document on how implementations prioritize 
encoding info for HTML documents.[1] (As that document shows, I have 
not tested Safari 6.) Based on my findings there, I would like to 
suggest that the spec's encoding sniffing algorithm should be updated 
to look as follows:

Revised encoding sniffing algorithm proposal:

NEW! 0. document is in the XML format - opt out of the algorithm.
[This step is already implicit in the spec, but it would
make sense to explicitly include it, to make sure that
one could e.g. write test cases to see that this step
is implemented. Currently Safari, Chrome and Opera do 
not 100% implement this step.]
 
NEW! #. Alternative: The BOM signature could go here instead of 
in step 5. There is a bug to move the BOM here and make
it override anything else. What speaks against this are:
  a) that Firefox, IE10 and Opera do not currently have
 this behavior.
  b) this revision of the sniffing algorithm, especially
 the revision in step 6 (required UTF-8 detection),
 might make the BOM-trumps-everything-else override
 less necessary
What speaks for this override:
  a) Safari, Chrome and legacy IE implement it.
  b) some legacy content may depend on it

 1. user override.
(PS: The spec should clarify whether user override is
 cacheable.)

NEW! 2. iframe inherits user override from parent browsing context
[Currently not mentioned in the spec, despite that "all"
 UAs do have this step for HTML docs.]

 3. explicit charset attribute in Content-Type header.

 4. BOM signature [or as the second step, see above]

 5. native markup label 

NEW! 6. UTF-8 detection.
I think we should separate UTF-8 detection from other
detection in order to make this step obligatory.
The newness here is only the limitation to UTF-8
detection, plus that it should be obligatory. 
(Thus: if it is not detected as UTF-8, then
the parser proceeds to the next step in the algorithm.)
This step would make browsers lean more strongly 
towards UTF-8.

NEW! 7. parent browsing context default.
The current spec does not mention this step at all,
despite the fact that Opera, IE, Safari, Chrome and Firefox
all implement it.

Regarding 6 and 7, the order is important. Chrome
does for instance perform UTF-8 detection, but it does it
only /after/ the parent browsing context default. Whereas everyone
else (Opera 12 by default, Firefox for some locales - I don't
know if there are others) lets it happen before the 'parent
browsing context default'.

NEW! 8. info on “the likely encoding”
The main newness is that this step is placed _after_ 
the (revised) UTF-8 detection and after the (new) parent
browsing context default.
The name 'the likely encoding' is from the current spec
text. I am a bit uncertain about what it means in the 
current spec, though. So I move here what I think makes
sense. The steps under this point should perhaps be
optional:

a. detection of other charsets than UTF-8
   (e.g. the optional Cyrillic detection in
   Firefox or legacy Asian encoding detection.
   The actual detection might happen in step 6,
   but it should only be made to count here.)
b. markup label of the sister language
   
   (Opera/Webkit/Chrome currently have this directly
   after the native encoding label step - step 5.)
c. Other things? What does "likely encoding" currently
   refer to, exactly?

 9. locale default
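
As one concrete illustration of how steps 3-5 are meant to interact 
(a sketch, with assumed markup): a UTF-8 file that is served without a 
charset parameter in the Content-Type header, that starts with the 
UTF-8 BOM (EF BB BF), but that carries a conflicting meta label. Under 
the order above, step 3 yields nothing, step 4 (the BOM) decides UTF-8, 
and the meta label in step 5 never gets to count:

<!DOCTYPE html>
<meta charset="windows-1252"> <!-- ignored; the BOM already decided UTF-8 -->
<title>BOM versus meta label</title>
<p>æøå</p>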

[1] 
http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm
[2] To the question of whether the BOM should trump everything else: 
I think it would be more important to get the other parts of 
this algorithm right. If we do get the rest of it right, then the 'BOM 
should trump' argument becomes less important.
-- 
Leif Halvard Silli

Re: [whatwg] alt="" and the exception

2012-07-25 Thread Leif Halvard Silli
Edward O'Connor on Tue, 24 Jul 2012 10:37:20 -0700
> We could address this problem by making changes along these lines:
> 
> 1. Drop the  alt="" exception.
> 2. Mint a global boolean attribute that, when present, indicates that
>the element and its descendants are outside of the page author's
>control (at least insofar as author conformance criteria are
>concerned).

How about simply introducing a @generator attribute:
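
Something like this, perhaps (a sketch - the attribute is only a 
proposal, and the file name is made up):

 <!-- hypothetical markup: @generator marks the image as inserted by a generator -->
 <img generator src="photo0042.jpg">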

 

> 3. Add a new exception to the "Guidance for conformance checkers"
>section which prevents conformance checkers from emitting errors for
>missing alt="" in subtrees marked with the new attribute.

Instead of a validator exception, how about simply letting the validator 
split the validation results into several parts based upon who the author 
is? Here is an example report which identifies two document 
authors/generators:

* Main author report:
  No alt errors detected.
* Report for generator 'foo': 
  10 valid images which the main author has not yet verified.
   
Here I assume that the verification flag is the *lack* of the 
generator attribute. I.e. when the main author somehow blesses the @alt 
text, then the generator attribute is removed.

> Some issues that come to mind:
> 
> 1. What other author conformance criteria should conformance checkers
>relax in such subtrees?
> 
> 2. Authors might start including such an attribute on the  element
>just to get some kind of "valid html5" badge without actually
>improving their pages.

This is easier to avoid if the validator identifies responsibility - 
see above.

> 3. What's a good name for such an attribute?

@generator.  :-D
-- 
Leif H Silli


Re: [whatwg] A link[scoped] usecase

2012-03-02 Thread Leif Halvard Silli
Gray Zhang on Fri, 02 Mar 2012 10:58:32 -0800:
> By now, for the reason that there is not link[scoped] and style[scoped] is
> not supported for any browser, my solution is add a data-theme attribute on
> wrapper element, and the theme .css file should add some extra selector:
> 
> .visual-root[data-theme="fireworks"] {
> background-color: #404040;
> color: #addede;
> }

Until support is available, would this help?

   
  #uniqueID + .visual-root {style}
  #uniqueID + .visual-root * {style}
   
   

At the very least, what this stylesheet will style depends on which 
exact *.visual-root element you place it adjacent to.
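
Spelled out a little more, the idea might look like this (the empty 
marker element and the example declarations are assumptions; the colors 
are taken from the theme rule quoted above):

   <style>
  #uniqueID + .visual-root   { background-color: #404040; color: #addede; }
  #uniqueID + .visual-root * { color: inherit; }
   </style>
   ...
   <span id="uniqueID" hidden></span>
   <div class="visual-root">
  <!-- only this adjacent 'theme root' and its descendants are styled -->
   </div>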
-- 
Leif Halvard Silli


[whatwg] Character-encoding-related threads

2012-02-13 Thread Leif Halvard Silli
Anne van Kesteren, Mon Feb 13 12:02:53 PST 2012:
> On Mon, 13 Feb 2012 20:46:57 +0100, Anne van Kesteren wrote:

>> The list starts with  and the moment you do not use UTF-8 (or UTF-16,  
>> but you really shouldn't) you can run into problems. I wonder how  
>> controversial it is to just require UTF-8 and not accept anything else.

Hear, hear!

> I guess one could argue that  is already captured by the requirements  
> around URL validation. That would leave  and potentially some  
> script-related features. It still seems sensible to me to flag everything  
> that is not labeled as UTF-8,

Indeed. Such a step would make it a must for HTML5-compliant authoring 
tools to default to UTF-8. It would also positively affect validators - 
they would have to give "mild" advice about how to use UTF-8 in the 
simplest way. (E.g. if the page is US-ASCII or US-ASCII with entities, then 
it is a simple move: just add an encoding declaration.) It is likely to have 
many, many positive side effects.

> but if we want something intermediate we  
> could start by flagging non-UTF-8 pages that use  and maybe obsolete  
>  or obsolete any other value than utf-8 (I filed a  
> bug on that feature already to at least restrict it to a single value).

The full way - all pages regardless of  - seems the simplest and 
best.
-- 
Leif H Silli


[whatwg] Comments before the DOCTYPE (warning message in validator.nu)

2012-02-04 Thread Leif Halvard Silli
Sat, 4 Feb 2012 04:28:35 + (UTC), Ian Hickson
> On Sat, 4 Feb 2012, Leif Halvard Silli wrote:
>> 
>> If one tries to validate [...]

> Just a reminder to everyone that first of all, this list isn't really the 
> right list for discussing implementation specifics (we have an 
> implementation list for that if you're an implementor, but if it's just a 
> bug report then the best thing to do is to approach the implementors 
> directly via their bug systems),

Sorry, I did not read Henri's info [*] well enough. He mentions WHATWG 
there, but I skimmed over the fact that he spoke about the list you mention. 
Point taken.

[*] http://about.validator.nu/#reporting-bugs
-- 
Leif H Silli


[whatwg] Comments before the DOCTYPE (warning message in validator.nu)

2012-02-03 Thread Leif Halvard Silli
If one tries to validate
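
a document along these lines (a sketch - the exact test markup is an 
assumption, e.g. a conditional comment placed before the doctype):

<!--[if IE]><![endif]-->
<!DOCTYPE html>
<title>Comment before the doctype</title>
<p>Test</p>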



then validator.nu gives this warning:

]]
Warning: Comments seen before doctype. 
Internet Explorer will go into the quirks mode.
From line 1, column 1; to line 1, column 7
http://xn--mlform-iua.no/blog/no-condition-comments.

Hence, I would suggest that e.g. "some versions of Internet Explorer 
risk entering quirks-mode" would be a more truthful message to send.
-- 
Leif Halvard Silli

Re: [whatwg] [Selectors4] case-insensitive attribute value matching (in XML)

2012-01-21 Thread Leif Halvard Silli
Ian Hickson on Fri Jan 20 14:31:01 PST 2012:
> On Tue, 26 Jul 2011, Christoph Päper wrote:
>> Anne van Kesteren:
>> > I'm still trying to get HTML and browsers to change so that attribute 
>> > values always match case-sensitively, irrespective of markup language. 
>> > The current magic attribute list in HTML whose values needs to be 
>> > matched ASCII case-insensitively is just ugly.

> The spec changed recently in response to Anne's efforts here. If this is 
> an area of interest, I encourage you to study the specification to see if 
> the current requirements are satisfactory.

The matching rule for attribute names and element names[1] doesn't 
match reality; see the demo:[2]

* Gecko uses ASCII case-insensitive matching (as specced by HTML5)
* Trident/Webkit/Presto use Unicode caseless matching (a variant).
  (Legacy Firefox 3.6 behaves like Trident/Webkit/Presto too.)

The differences affect @data-* and @x-* (and other extensions). 
Shouldn't the spec match Trident/Webkit/Presto?

[1] http://dev.w3.org/html5/spec/links#case-sensitivity
[2] http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1307
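
A minimal illustration of the difference described above (a sketch, 
with an assumed non-ASCII data attribute):

<style> p[data-öl] { color: green } </style>
<p DATA-ÖL="x">Matched by Trident/Webkit/Presto (Unicode caseless),
but not by Gecko (ASCII case-insensitive, so Ö and ö differ).</p>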
-- 
Leif Halvard Silli

Re: [whatwg] [encoding] utf-16

2012-01-03 Thread Leif Halvard Silli
Leif Halvard Silli, Tue, 3 Jan 2012 23:51:52 +0100:
> Henri Sivonen, Mon Jan 2 07:43:07 PST 2012
>> On Fri, Dec 30, 2011 at 12:54 PM, Anne van Kesteren wrote:
>>> And why should there be UTF-16 sniffing?
>> 
>> The reason why Gecko detects BOMless Basic Latin-only UTF-16
>> regardless of the heuristic detector mode is
>> https://bugzilla.mozilla.org/show_bug.cgi?id=631751
> 
> That bug was not solved perfectly. E.g. this page renders readable in 
> IE, but not in Firefox: <http://www.hughesrenier.be/actualites.html>. 
> (For some reason, it renders well if I download it to my harddisk.)

Oops, that was of course because the HTTP level said "ISO-8859-1".

>> It's quite possible that Firefox could have gotten away with not
>> having this behavior.
-- 
Leif H Silli


Re: [whatwg] [encoding] utf-16

2012-01-03 Thread Leif Halvard Silli
Henri Sivonen, Mon Jan 2 07:43:07 PST 2012
> On Fri, Dec 30, 2011 at 12:54 PM, Anne van Kesteren wrote:
>> And why should there be UTF-16 sniffing?
> 
> The reason why Gecko detects BOMless Basic Latin-only UTF-16
> regardless of the heuristic detector mode is
> https://bugzilla.mozilla.org/show_bug.cgi?id=631751

That bug was not solved perfectly. E.g. this page renders readably in 
IE, but not in Firefox: <http://www.hughesrenier.be/actualites.html>. 
(For some reason, it renders well if I download it to my hard disk.)
 
> It's quite possible that Firefox could have gotten away with not
> having this behavior.
-- 
Leif Halvard Silli


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Leif Halvard Silli
Henri Sivonen, Tue Jan 3 00:33:02 PST 2012:
> On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli wrote:

> Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
> UTF-8-encoded pages that say charset=unicode in  if alias
> resolution happens before UTF-16 labels are mapped to UTF-8.

Yup.
 
> Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
> pages that are (BOMless) UTF-16LE and that have charset=unicode in
> , because the  prescan doesn't see UTF-16-encoded metas.

Hm. Yes. I see that I misread something, and ended up believing that 
the  would *still* be used if the mapping from 'UTF-16' to 
'UTF-8' turned out to be incorrect. I guess I had not understood, well 
enough, that the meta prescan *really* doesn't see UTF-16-encoded 
metas. Also contributing was the fact that I did not realize that IE 
doesn't actually read the page as UTF-16 but as Windows-1252: 
<http://www.hughesrenier.be/actualites.html>. (Actually, browsers do 
see the UTF-16 , but only if the default encoding is set to be 
UTF-16 - see step 1 of '8.2.2.4 Changing the encoding while parsing' 
<http://dev.w3.org/html5/spec/parsing.html#change-the-encoding>.)

> Furthermore, it doesn't make sense to make the  prescan look for
> UTF-16-encoded metas, because it would make sense to honor the value
> only if it matched a flavor of UTF-16 appropriate for the pattern of
> zero bytes in the file, so it would be more reliable and straight
> forward to just analyze the pattern of zero bytes without bothering to
> look for UTF-16-encoded s.

Makes sense.

   [ snip ]
>> What we will instead see is that those using legacy encodings must be
>> more clever in labelling their pages, or else they won't be detected.
> 
> Many pages that use legacy encodings are legacy pages that aren't
> actively maintained. Unmaintained pages aren't going to become more
> clever about labeling.

But their non-UTF-8-ness should be picked up in the first 1024 bytes?

  [... sniff - sorry, meant snip ;-) ...]

> I mean the performance impact of reloading the page or, 
> alternatively, the loss of incremental rendering.)
>
> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte

Thus possibly a prescan of more than 1024 bytes? Is it faster to scan 
ASCII? (In Chrome, there does not seem to be an end to the prescan, as 
long as the source code is ASCII-only.)

> and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence.

Except for the specifics, that sounds more or less like the idea I 
tried to state. Maybe it could be made into a bug in Mozilla? (I could 
do it, but ...)

However, there is one thing that should be added: the parser should 
default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII. Is 
that part of your idea? Because, if it does not behave like that, then 
it would work the way Google Chrome now works. Which, for the following 
UTF-8 encoded (but charset-unlabelled) page, means that it defaults to 
UTF-8:

æøå

While for this - identical - page, it would default to the locale 
encoding, due to the use of ASCII-based character entities, which 
means that it does not detect any UTF-8-ish characters:
æøå

A weird variant of the latter example is UTF-8-based data URIs, where 
all browsers that I could test (IE only supports data URIs in the @src 
attribute, including ) default to the locale encoding 
(apart from Mozilla Camino - which has character detection enabled by 
default):

data:text/html,%C3%A6%C3%B8%C3%A5

All three examples above should default to UTF-8, if the "border on 
sane" approach were applied.

> But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

So here you say that it is better to start to present early, and 
possibly reload [I think] if, during the presentation, the encoding 
choice turns out to be wrong, than to investigate too much and be 
absolutely certain before starting to present the page.

Later, at Jan 3 00:50:26 PST 2012, you added:
> And it's worth noting that the above paragraph states a "solution" to
> the problem that is: "How to make it possible to use UTF-8 without
> declaring it?"

Indeed.

> Adding autodetection wouldn't actually force authors to use UTF-8, so
> the problem Faruk stated at the start of the thread (authors not using
> UTF-8 throughout systems that process user input) wouldn't be solved.

If we take that logic to its end, then it would not make sense for the 
validator to display

Re: [whatwg] [encoding] utf-16

2011-12-30 Thread Leif Halvard Silli
Anne van Kesteren  Fri, 30 Dec 2011 11:54:34 +0100
> On Fri, 30 Dec 2011 05:51:16 +0100, Leif Halvard Silli:
>> The Trident cache behaviour is a symptom of its over all UTF-16
>> behaviour: Apart from reading the BOM, it doesn't do any UTF-16
>> sniffing. I suspect that you want Opera/Firefox to become "as bad" at
>> 'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is
>> worse than IE - just to, once again, emphasize how difficult it is to
>> replicate IE.)
> 
> How is WebKit worse than IE?

For HTML: If HTTP says 'WINDOWS-1252' but the page is little-endian 
UTF-16 without the BOM, then IE will render the page as WINDOWS-1252, 
and this will actually work - at least in some circumstances ... Check: 
<http://www.acsd.k12.sc.us/wwes/>. (There could be other pages that IE 
handles, but which don't fall into this category.)

For XHTML: For 'nude' tests 
<http://malform.no/testing/utf/#xml-table-1>, Webkit is worse than 
Trident <http://malform.no/testing/utf/#xml-table-1-results>. (Trident 
performs a variant of the sniffing described in XML 1.0, whereas Webkit 
does not sniff at all unless there is an XML prolog.)

> And why should there be UTF-16 sniffing?

FIRST: What is 'UTF-16 sniffing'? The BOM is a form of sniffing. The 
HTML5 character encoding *sniffing* algorithm covers UTF-16 as well. 
Should we single out UTF-16 as something that should not be 
sniffed?

 What do browser vendors think?

Based on the tests at <http://malform.no/testing/utf/>, it seems 
like IE performs no UTF-16 detection/sniffing beyond using HTTP, using 
the BOM and - as a last resort - reading the META element (including the 
MS 'unicode' and 'unicodeFFFE' values, which Webkit also reads). 

But for HTML, Trident - unlike Webkit - does not make use of the 
XML encoding declaration for detecting the encoding: 
<http://malform.no/testing/utf/#html-table-4>. And for HTML, 
Trident - unlike Webkit - does not make use of the XML prolog (no, 
not the encoding declaration) for sniffing the endianness of UTF-16 
files: <http://malform.no/testing/utf/#html-table-9>.

Aligning with IE would mean that Opera, Mozilla and Webkit must 
'degenerate' their heuristics. Why would a vendor want to become less 
compatible with the Web?

>> But is the little endian defaulting really important?
>> Over all, proper UTF-16 treatment (read: sniffing) on IE/WEbkit's part,
>> would probably improve the situation more.
> 
> You mean there are sites that only work in Gecko/Presto?

'Sites' is perhaps a big word - 'UTF-16' pages are often lone pages, it 
seems. But yes, obviously. E.g. big-endian UTF-16 labelled pages 
without a BOM. 

But, oops: It seems like Firefox does not use the META element anymore. 
It used to use the META element, in Firefox 3. But apparently it stopped 
doing that - maybe they misread the HTML5 algorithm ... Nevertheless, 
I have come across pages that work in Firefox/Opera but not in Trident.

MS Word, which often makes these pages, can save both big-endian and 
little-endian.

>> I know ... And it precisely therefore that it would have been an
>> advantage to, for the Web, focus on *requiring* the BOM for UTF-16.
> 
> It seems simpler to focus on promoting only UTF-8.

It seems simple enough to say that the BOM must be used. Saying something 
like that is no different from saying that a certain range of 
WINDOWS-1252 must not be used, is it?

>>> Yeah, I'm going to file a new bug so we can reconsider although the  
>>> octet sequence the various BOMs represent can have legitimate meanings  
>>> in certain encodings,
>> 
>> You mean: In addition to the BOM meaning, I suppose.
> 
> No. In e.g. windows-1258 there is no BOM and FF FE simply means U+00FF  
> U+20AB.

I think we have the same thing in mind. And btw, Google Search displays 
many such letters in UTF-16 encoded pages ... instead of displaying the 
content. Apparently, Google *fails* to consider the BOM octets magic ... 
Or maybe it is UTF-16-negative ...

>>> it seems in practice people use them for Unicode.
>>> (Helped by the fact that Trident/WebKit behave this way of course.)
>> 
>> Don't forget the fact that Presto/Gecko do not move the BOM into the
>>  when you use UTF-16LE/BE, like they - per the spec of those
>> encodings - should do. See:
>> <http://bugzilla.validator.nu/show_bug.cgi?id=890>
> 
> Well yes, that's why I'm planning to define utf-16 more in line with  
> implementations (and render the current text obsolete I suppose).

You don't need, for that reason, to follow a strategy that nullifies 
UTF-16LE/UTF-16BE. I outlined another strategy: say that all HTML pages 
are interpreted as being 'UTF-16', even if they are mislabelled with the 
BOM-less UTF-16LE/UTF-16BE labels.
-- 
Leif H Silli


Re: [whatwg] [encoding] utf-16

2011-12-29 Thread Leif Halvard Silli
Anne van Kesteren - Thu Dec 29 04:07:14 PST 2011
> On Thu, 29 Dec 2011 11:37:25 +0100, Leif Halvard Silli wrote:
>> Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
>>> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>>>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>>>> to handle BOM-less little-endian as well as bom-less big-endian.
>>>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>>>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
>>>
>>> Right. I think we should do it like Trident.
>>
>> To behave like Trident is quite difficult unless one applies the logic
>> that Trident does. First and foremost, the BOM must be treated the same
>> way that Trident and Webkit treat them. Secondly: It might not be be
>> desirable to behave exactly like Trident because Trident doesn't really
>> handle UTF-16 *at all* unless the file starts wtih the BOM - [...]
> 
> Yeah I noticed the weird thing with caching too. Anyway, I meant  
> WebKit/Trident.

The Trident cache behaviour is a symptom of its overall UTF-16 
behaviour: apart from reading the BOM, it doesn't do any UTF-16 
sniffing. I suspect that you want Opera/Firefox to become "as bad" at 
'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is 
worse than IE - just to, once again, emphasize how difficult it is to 
replicate IE.) But is the little-endian defaulting really important? 
Overall, proper UTF-16 treatment (read: sniffing) on IE/Webkit's part 
would probably improve the situation more.

Note as well that Trident does not have the same endian problems when 
it comes to XML - for XML it tends to handle any endianness, with or 
without the BOM.

>>> I personally think everything but UTF-8 should be non-conforming,  
>>> because of the large number of gotchas embedded in the platform if you  
>>> don't use
>>> UTF-8. Anyway, it's not logical because I suggested to follow Trident
>>> which has different behavior for utf-16 and utf-16be.
>>
>> We simplify - remove a gotcha - if we say that BOM-less UTF-16 should
>> be non-conforming. From every angle, BOM-less UTF-16 as well as
>> "BOM-full" UTF-16LE and UTF-16BE, makes no sense.
> 
> That's only one. Form submission will use UTF-8 if you use UTF-16,  
> XMLHttpRequest is heavily tied to UTF-8, URLs are tied to UTF-8. Various  
> new formats such as Workers, cache manifests, WebVTT, are tied to UTF-8.  
> Using anything but UTF-8 is going to hurt and will end up confusing you  
> unless you know a shitload about encodings and the overall platform, which  
> most people don't.

I know ... And it is precisely therefore that it would have been an 
advantage, for the Web, to focus on *requiring* the BOM for UTF-16. 
Make UTF-16LE/BE non-conforming to use. Because it is only with 
reliable UTF-16 detection that the necessary 'conversion' (inside the 
UA) to UTF-8 (to the encoding of those formats you mentioned) is 
reliable. Anything with a BOM, whether UTF-16 or UTF-8, seems to go 
well together. (E.g. when I made my tests, I saw that the UTF-8 encoded 
CSS file was not used by several of the browsers - not until I made 
sure the CSS file included the BOM were the UAs able to get the 
CSS file to work with the UTF-16 encoded HTML files.)

I'm not a 'fan' of UTF-16. But I guess you could call me a fan - a devoted 
one - of the BOM.

>> You perhaps would like to see this bug, which focuses on how many
>> implementations, including XML-implementations, give precedence to the
>> BOM over other encoding declarations:
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
>>
>> *Before* paying attention to the actual encoding, you say. More
>> correct: Before deciding whether to pay attention to the 'actual'
>> encoding, they look for a BOM.
> 
> Yeah, I'm going to file a new bug so we can reconsider although the octet  
> sequence the various BOMs represent can have legitimate meanings in  
> certain encodings,

You mean: In addition to the BOM meaning, I suppose.

> it seems in practice people use them for Unicode.  
> (Helped by the fact that Trident/WebKit behave this way of course.)

Don't forget the fact that Presto/Gecko do not move the BOM into the 
 when you use UTF-16LE/BE, like they - per the spec of those 
encodings - should do. See: 
<http://bugzilla.validator.nu/show_bug.cgi?id=890>

More helping facts:

0 While theoretically legitimate, HTML (per HTML5) is geared towards
  UTF-8, and HTML clients are not required to support more than UTF-8.
  For that reason it seems legitimate to gear

Re: [whatwg] [encoding] utf-16

2011-12-29 Thread Leif Halvard Silli
Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>> Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
>>> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:
>>>> By "default" you supposedly mean "default, before error
>>>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>>>> fails to display utf-16 as often as Webkit - its defaulting behavior
>>>> not withstanding - it can't be a goal to replicate that, for instance.
>>>
>>> Do you mean heuristics when it comes to the decoding layer? Or before
>>> that? I do think any heuristics ought to be defined.
>>
>> Meant: While UAs may prepare for little-endian when seeing the 'utf-16'
>> label, they should also be prepared for detecting it as big-endian.
>>
>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>> to handle BOM-less little-endian as well as bom-less big-endian.
>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
> 
> Right. I think we should do it like Trident.

To behave like Trident is quite difficult unless one applies the logic 
that Trident does. First and foremost, the BOM must be treated the same 
way that Trident and Webkit treat it. Secondly: it might not be 
desirable to behave exactly like Trident, because Trident doesn't really 
handle UTF-16 *at all* unless the file starts with the BOM - just run 
this test to verify:

1)  visit this test suite with IE: 
<http://malform.no/testing/utf/caching/>
2)  Click yourself through 7 pages in the test, until the 
last, 'UTF-16' labelled, big-endian, BOM-less page
(which causes mojibake in IE).
3)  Now, use the Back (or Forward) button to go backward
(or Forward) page by page. (You will even be able
to see the last, mojibake-ish page, if you use the 
Forward button to visit it.)

RESULT: 4 of the 7 files in the test - namely the UTF-16 files without 
a BOM - fail when IE pulls them from cache. When loaded from cache, the 
non-ASCII letters become garbled. Note especially that it doesn't 
matter whether the file is big-endian or little-endian!

Surely, this is not something that we would like UAs to replicate.

Conclusions: a) BOM-less UTF-16 should simply be considered 
non-conforming on the Web, if Trident is the standard. b) There is no 
need to consider what Trident does with BOM-less files as conforming, 
irrespective of whether the page is big-endian or little-endian. (That 
it handles little-endian BOM-less files a little better than big-endian 
BOM-less files is just a marginal advantage.)

>>>>> utf-16le becomes a label for utf-16.
>>>>
>>>> * Logically, utf-16be should become a label for utf-16 then, as well.
>>>
>>> That's not logical.
>>
>> Care to elaborate?
>>
>> To not make 'utf-16be' a de-facto label for 'utf-16', only makes sense
>> if you plan to make it non-conforming to send files with the 'utf-16'
>> label unless they are little-endian encoded.
> 
> I personally think everything but UTF-8 should be non-conforming, because  
> of the large number of gotchas embedded in the platform if you don't use  
> UTF-8. Anyway, it's not logical because I suggested to follow Trident  
> which has different behavior for utf-16 and utf-16be.

We simplify - remove a gotcha - if we say that BOM-less UTF-16 should 
be non-conforming. From every angle, BOM-less UTF-16 as well as 
"BOM-full" UTF-16LE and UTF-16BE, makes no sense.

>> Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if
>> the ZWNBSP character at the beginning of a 'utf-16be' labelled file is
>> treated as the BOM, then we do not speak about the 'utf-16be' encoding,
>> but about a mislabelled 'utf-16' file.
> 
> I never spoke of any existing standard. The Unicode standard is wrong here  
> for all implementations.

Here, at least, you do speak about an existing standard ... It is 
exactly my point that the browsers don't interpret UTF-16be/le as 
UTF-16be/le but more like UTF-16. But in which way, exactly, do you mean 
that UTF-16 is not specified correctly?

>>> the first four bytes have special meaning.
>>> That does not all suggest we should do the same for numerous other
>>> encodings unrelated to utf-16.
>>
>> Why not? I see absolutely no difference here. When would you like to
>> render a page with a BOM as anything other than what the BOM specifies?
> 

[whatwg] [encoding] utf-16

2011-12-28 Thread Leif Halvard Silli
Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:

>> By "default" you supposedly mean "default, before error
>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>> fails to display utf-16 as often as Webkit - its defaulting behavior
>> not withstanding - it can't be a goal to replicate that, for instance.
> 
> Do you mean heuristics when it comes to the decoding layer? Or before  
> that? I do think any heuristics ought to be defined.

Meant: While UAs may prepare for little-endian when seeing the 'utf-16' 
label, they should also be prepared for detecting it as big-endian.

As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared 
to handle BOM-less little-endian as well as bom-less big-endian. 
Whereas if you send 'utf-16le' via HTTP, then it only accepts 
'utf-16le'. The same also goes for Opera. But not for Webkit and IE.

>>> utf-16le becomes a label for utf-16.
>>
>> * Logically, utf-16be should become a label for utf-16 then, as well.
> 
> That's not logical.

Care to elaborate?

To not make 'utf-16be' a de-facto label for 'utf-16' only makes sense 
if you plan to make it non-conforming to send files with the 'utf-16' 
label unless they are little-endian encoded.

Note that in 'utf-16be' and 'utf-16le', then - per the UTF-16 
specification - the BOM is not a BOM. Citing Wikipedia: «UTF-16BE or 
UTF-16LE as the encoding type. When the byte order is specified 
explicitly this way, a BOM is specifically not supposed to be prepended 
to the text, and a U+FEFF at the beginning should be handled as a 
ZWNBSP character.» (Which, in turn, should trigger quirks mode.)

Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if 
the ZWNBSP character at the beginning of a 'utf-16be' labelled file is 
treated as the BOM, then we do not speak about the 'utf-16be' encoding, 
but about a mislabelled 'utf-16' file.

>> Is that what you suggest? Because, if the BOM can change the meaning of
>> utf-16be, then it makes sense to treat the utf-16be label as well as
>> the utf-16le label as synonymous with utf-16. (Thus, effectively
>> utf-16le and utf-16be becomes defunct/unreliable on the Web.)
> 
> No, because utf-16be actually has different behavior in absence of a BOM.  
> It does mean they can share some common algorithm(s), but they have to  
> stay different encodings.

Per the UTF-16 specification, the 'utf-16' label covers both big-endian 
and little-endian. Thus it covers - in a way - two encodings. Hence, the 
fact that we have to treat little-endian BOM-less UTF-16 differently from 
big-endian BOM-less UTF-16 should not need to mean that they are 
different encodings.

>> SECONDLY: You effectively say that, for the UTF-16 BOM, then the BOM
>> should override the HTTP level charset info. OK. But then you should go
>> the full way, and give the BOM the same, overriding authority when it
>> comes to the UTF-8 BOM. For instance, if the HTTP server's Content-Type
>> header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file
>> itself contains a BOM (that contradicts the HTTP info), then the BOM
>> "wins" - in IE and WEbkit. (And, btw, w.r.t. IE, then the
>> X-Content-Type: header has no effect w.r.t. treating the HTTP's charset
>> info as authoritative - the BOM wins even then.)
> 
> No, I don't see why we have to go there at all. All this suggests is that  
> within the two utf-16 encodings

What are 'the two utf-16 encodings'? There are 3 UTF-16 encodings per 
the UTF-16 spec. There are 2 endian options but 3 encodings.

> the first four bytes have special meaning.  
> That does not all suggest we should do the same for numerous other  
> encodings unrelated to utf-16.

Why not? I see absolutely no difference here. When would you like to 
render a page with a BOM as anything other than what the BOM specifies? 
Use cases? Not treating it like the BOM would render the page in 
quirks mode - when does one want that?

The only way in which it can make some sense to not treat the UTF-8 BOM 
that way would be if we see both 'utf-16le' and 'utf-16be' as - on the 
Web - de-facto synonyms for 'utf-16'. (Because then UAs would have 
indirect permission from the UTF-16 spec to 'sniff' the UTF-16 flavour 
of the BOM even if HTTP says 'utf-16le' or 'utf-16be'.)

Note as well that this is not only related to 'numerous other 
encodings' but directly related to UTF-16 itself: If HTTP says 'utf-16' 
but the BOM is a UTF-8 BOM (or opposite, if HTTP says 'utf-8' but the 
BOM is a utf-16 BOM), then Webkit and IE both use the encoding that the 
BOM specifies.

If it is Trident/Webkit that are supposed to set the standard here, 
then please do. You are glossing over how Trident/Webkit behave if you 
fail to recognize that the issue here is them giving preference to the 
BOM over HTTP. (There is even long-standing precedent in the XML world for 
giving preference to the BOM.)
-- 
Leif Halvard Silli

Re: [whatwg] [encoding] utf-16

2011-12-28 Thread Leif Halvard Silli
Anne van Kesteren Tue Dec 27 06:52:01 PST 2011:

I spotted a shortcoming in your testing:

> I ran some utf-16 tests using 007A as input data, optionally preceded by  
> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the  
> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome  
> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera  
> 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
> 
> HTTP      BOM   Trident  WebKit  Gecko  Presto
> utf-16    -     7A00     7A00    007A   007A
> utf-16le  -     7A00     7A00    7A00   7A00
> utf-16be  -     007A     007A    007A   007A

The above test row is not complete. You should also run a BOM-less test 
using the UTF-16 label but where the 007A is represented in the 
big-endian way - a bit like I did here: 
. Then you get as a result 
that Opera and Firefox, unlike Trident and Webkit, do not simply take it 
for granted that files sent as 'utf-16' are little-endian:

  utf-16    -     gibb*    gibb*   007A   007A

*gibb = gibberish/mojibake.
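
(For reference, a small Python illustration - my own, not taken from the 
tests above - of why the byte order matters for those BOM-less bytes:)

  le = b'\x7a\x00'   # U+007A ('z') in little-endian UTF-16, no BOM
  be = b'\x00\x7a'   # U+007A ('z') in big-endian UTF-16, no BOM

  print(le.decode('utf-16-le'))            # 'z'
  print(be.decode('utf-16-be'))            # 'z'
  # A decoder that *assumes* little-endian turns the big-endian
  # bytes into U+7A00 instead of U+007A:
  print(hex(ord(be.decode('utf-16-le'))))  # 0x7a00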

> utf-16    FFFE  7A00     7A00    7A00   7A00
> utf-16le  FFFE  7A00     7A00    7A00   7A00
> utf-16be  FFFE  7A00     7A00    FFFD*  FFFD*
> 
> utf-16    FEFF  007A     007A    007A   007A
> utf-16le  FEFF  007A     007A    FFFD** FFFD**
> utf-16be  FEFF  007A     007A    007A   007A
> 
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the  
> 7A. Opera decodes it as FFFD 007A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the  
> 7A. Opera decodes it as FFFD 7A00.
> 
> It seems in Trident/WebKit utf-16 and utf-16le are labels for the same  
> encoding and the BOM is more important than the encoding. Gecko and Presto  
> match existing specifications around utf-16 with different error handling  
> (afaict).
> 
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should  
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in  
> absence of a BOM. utf-16le becomes a label for utf-16. A BOM overrides the  
> direction (of utf-16 / utf-16be) and is removed from the output.

That the BOM is removed from the output for utf-16be labelled files 
means that the 'utf-16be' labelled file nevertheless is treated as 
UTF-16 (per UTF-16's specification). (Otherwise, if it had not been 
removed, the BOM character would have caused quirks mode.)

Taking what you did not test for into account, it would make sense if 
'utf-16' continues to be treated as a label under which both big-endian 
and little-endian can be expected. And thus, that Webkit and IE start to 
detect when UTF-16 is big-endian but without a BOM.
-- 
Leif H Silli


Re: [whatwg] [encoding] utf-16

2011-12-27 Thread Leif Halvard Silli
Hi Anne. Overall, your findings correspond with mine, which are based 
on . I also agree with the direction of 
the conclusions, but I would like the encodings document to make some 
distinctions that it currently doesn't - and which you have not 
proposed either. See below.

Anne van Kesteren wrote:
 [ snip ]
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should 
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in
> absence of a BOM.

By "default" you supposedly mean "default, before error 
handling/heuristic detection". Relevance: On the "real" Web, no browser 
fails to display utf-16 as often as Webkit - its defaulting behavior 
not withstanding - it can't be a goal to replicate that, for instance.

> utf-16le becomes a label for utf-16.

* Logically, utf-16be should then become a label for utf-16 as well. 
Is that what you suggest? Because, if the BOM can change the meaning of 
utf-16be, then it makes sense to treat the utf-16be label as well as 
the utf-16le label as synonymous with utf-16. (Thus, effectively, 
utf-16le and utf-16be become defunct/unreliable on the Web.)

* W.r.t. making utf-16le and utf-16be into 'label[s] for utf-16', then 
OK, when it comes to how UAs should *treat* them. But I suppose it 
should not be considered conforming to *send* the UTF-16LE/UTF-16BE 
labels with text/html, due to their ambiguous status on the Web. Rather 
it should only be conforming to send 'utf-16'.

> A BOM overrides the direction (of utf-16 / utf-16be) and is removed from 
> the output.

FIRSTLY: Another way to see this is to say: IE and Webkit do not 
support 'utf-16le' or 'utf-16be' - they only support 'utf-16', but 
default to little endian rather than big endian when the BOM is omitted. 
When the BOM is "omitted" for utf-16be, they default to big endian, 
making 'utf-16le' an alias of Microsoft's private 'unicode' label and 
'utf-16be' an alias of Microsoft's private 'unicodeFFFE' label (each of 
which uses the BOM). In other words: On the Web, 'utf-16le' 
and 'utf-16be' become de-facto synonyms for MS 'unicode' and MS 
'unicodefffe'. 
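
A minimal Python sketch of that reading (the function is mine; it only 
models what the tests above suggest, it is not taken from any spec):

  def utf16_flavour(label, body):
      # The label only picks the *default* endianness; a BOM wins.
      label = label.lower()
      default = 'utf-16-be' if label in ('utf-16be', 'unicodefffe') \
                else 'utf-16-le'   # utf-16, utf-16le, MS 'unicode'
      if body.startswith(b'\xff\xfe'):
          return 'utf-16-le'
      if body.startswith(b'\xfe\xff'):
          return 'utf-16-be'
      return default

  print(utf16_flavour('utf-16be', b'\xff\xfez\x00'))  # utf-16-le (BOM wins)
  print(utf16_flavour('utf-16le', b'\x00z'))          # utf-16-le (label default)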

SECONDLY: You effectively say that, for the UTF-16 BOM, then the BOM 
should override the HTTP level charset info. OK. But then you should go 
the full way, and give the BOM the same, overriding authority when it 
comes to the UTF-8 BOM. For instance, if the HTTP server's Content-Type 
header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file 
itself contains a BOM (that contradicts the HTTP info), then the BOM 
"wins" - in IE and WEbkit. (And, btw, w.r.t. IE, then the 
X-Content-Type: header has no effect w.r.t. treating the HTTP's charset 
info as authoritative - the BOM wins even then.)

Summary: It makes no sense to treat the BOM as winning over the HTTP 
level charset parameter *only* for UTF-16 - the UTF-8 BOM must have 
same overriding effect as well. Thus: Any encoding info from the header 
would be overridden by the BOM. Of course, a documents where the BOM 
contradicts the HTTP charset, should not be considered conforming. But 
the UA treatment of them should still be uniform.

(PS: If you insert the BOM as  before , then IE 
will use UTF-8, when it loads the page from cache. Just say'in.)
--
Leif H Silli


Re: [whatwg] Default encoding to UTF-8?

2011-12-22 Thread Leif Halvard Silli
Henri Sivonen hsivonen Mon Dec 19 07:17:43 PST 2011
> On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli wrote:

Sorry for my slow reply.

> It surprises me greatly that Gecko doesn't treat "unicode" as an alias
> for "utf-16".
> 
>> Which must
>> EITHER mean that many of these pages *are* UTF-16 encoded OR that their
>> content is predominantly  US-ASCII and thus the artefacts of parsing
>> UTF-8 pages ("UTF-16" should be treated as "UTF-8 when it isn't
>> "UTF-16") as WINDOWS-1252, do not affect users too much.
> 
> It's unclear to me if you are talking about HTTP-level charset=UNICODE
> or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
> BOMless?

Charset=UNICODE in meta, as generated by MS tools (Office or IE, e.g.) 
seems to usually be "BOM-full". But there are still enough occurrences 
of pages without a BOM. I have found UTF-8 pages with the charset=unicode 
label in meta. But the few pages I found contained either a BOM or an 
HTTP-level charset=utf-8. I have too little "research material" when it 
comes to UTF-8 pages with charset=unicode inside.

>>  (2) for the user tests you suggested in Mozilla bug 708995 (above),
>> the presence of <meta charset=UNICODE> would trigger a need for Firefox
>> users to select UTF-8 - unless the locale already defaults to UTF-8;
> 
> Hmm. The HTML spec isn't too clear about when alias resolution
> happens, so I (incorrectly, I now think) mapped only "UTF-16",
> "UTF-16BE" and "UTF-16LE" (ASCII-case-insensitive) to UTF-8 in meta
> without considering aliases at that point. Hixie, was alias resolution
> supposed to happen first? In Firefox, alias resolution happens after,
> so <meta charset=unicode> is ignored per the non-ASCII
> superset rule.

Waiting to hear what Hixie says ...
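
Meanwhile, the two readings can be sketched like this in Python (purely 
my own illustration of the question; the alias table is just an example):

  UTF16_FAMILY = {'utf-16', 'utf-16le', 'utf-16be'}
  ALIASES = {'unicode': 'utf-16', 'unicodefffe': 'utf-16be'}

  def meta_encoding(label, resolve_aliases_first):
      # The rule under discussion: a UTF-16 flavour declared in meta
      # is mapped to UTF-8. Open question: are aliases such as
      # 'unicode' resolved before or after that mapping?
      label = label.strip().lower()
      if resolve_aliases_first:
          label = ALIASES.get(label, label)
      return 'utf-8' if label in UTF16_FAMILY else label

  print(meta_encoding('UNICODE', True))   # -> utf-8
  print(meta_encoding('UNICODE', False))  # -> 'unicode' (unknown label)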

>>> While UTF-8 is possible to detect, I really don't want to take Firefox
>>> down the road where users who currently don't have to suffer page load
>>> restarts from heuristic detection have to start suffering them. (I
>>> think making incremental rendering any less incremental for locales
>>> that currently don't use a detector is not an acceptable solution for
>>> avoiding restarts. With English-language pages, the UTF-8ness might
>>> not be apparent from the first 1024 bytes.)
>>
>> FIRSTLY, HTML5:
>>
>> ]] 8.2.2.4 Changing the encoding while parsing
>> [...] This might happen if the encoding sniffing algorithm described
>> above failed to find an encoding, or if it found an encoding that was
>> not the actual encoding of the file. [[
>>
>> Thus, trying to detect UTF-8 is second last step of the sniffing
>> algorithm. If it, correctly, detects UTF-8, then, while the detection
>> probably affects performance, detecting UTF-8 should not lead to a need
>> for re-parsing the page?
> 
> Let's consider, for simplicity, the locales for Western Europe and the
> Americas that default to Windows-1252 today. If browser in these
> locales started doing UTF-8-only detection, they could either:
>  1) Start the parse assuming Windows-1252 and reload if the detector 
> says UTF-8.

When the detector says UTF-8 - that is step 7 of the sniffing algorithm, 
no?
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

>  2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
> detector says non-UTF-8.
> 
> (Buffering the whole page is not an option, since it would break
> incremental loading.)
> 
> Option #1 would be bad, because we'd see more and more reloading over
> time assuming that authors start using more and more UTF-8-enabled
> tools over time but don't go through the trouble of declaring UTF-8,
> since the pages already seem to "work" without declarations.

So the so-called badness is only a theory about what will happen - how 
the web will develop. As it is, there is nothing particularly bad about 
starting out with UTF-8 as the assumption.

I think you are mistaken there: If parsers perform UTF-8 detection, 
then unlabelled pages will be detected, and no reparsing will happen - 
nor will reparsing increase over time. You at least need to explain this 
negative-spiral theory better before I buy it ... Step 7 will *not* lead 
to reparsing unless the default encoding is WINDOWS-1252. If the default 
encoding is UTF-8 and step 7 detects UTF-8, then parsing 
can continue uninterrupted.
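
A minimal sketch of what I mean (the helper names are mine, and the 
"detector" here only answers the UTF-8 question):

  def looks_like_utf8(body):
      try:
          body.decode('utf-8')
          return True
      except UnicodeDecodeError:
          return False

  def parse(body, default_encoding):
      # A UTF-8-only detector: either the bytes are UTF-8, or we keep
      # the encoding the parse started out with.
      detected = 'utf-8' if looks_like_utf8(body) else default_encoding
      if detected != default_encoding:
          return 'reparse as ' + detected   # only hit when the default is not UTF-8
      return 'single parse as ' + default_encoding

  page = 'blåbær'.encode('utf-8')          # unlabelled UTF-8 page
  print(parse(page, 'windows-1252'))       # -> reparse as utf-8
  print(parse(page, 'utf-8'))              # -> single parse as utf-8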

What we will instead see is that those using legacy encodings must be 
more clever in labelling their pages, or else they won't be detected. 

I am a bit baffled here: It sounds like you say that there will be bad 
consequences if browsers become more reliable ...

> Option #2 

Re: [whatwg] Unicode as alias for UTF-16 (was Re: Default encoding to UTF-8?)

2011-12-22 Thread Leif Halvard Silli
Henri Sivonen on Tue Dec 20 01:13:45 PST 2011:
> On Mon, Dec 19, 2011 at 9:44 PM, L. David Baron wrote:

>>> > I discovered that "UNICODE" is
>>> > used as alias for "UTF-16" in IE and Webkit.
>>> ...
>>> > Seemingly, this has not affected Firefox users too much.
>>>
>>> It surprises me greatly that Gecko doesn't treat "unicode" as an alias
>>> for "utf-16".
>>
>> Why?
> 
> From playing with IE, I thought it was known that "unicode" is an
> alias for "utf-16" and it had never occurred to me to check if that
> was true in Gecko.

MS 'unicode' is only to a 50% degree (sic) an alias for 'utf-16', 
namely for the *little-endian* "half" of *UTF-16*. (Thus: It is not 
UTF-16LE, since MS 'unicode' usually includes the BOM.)  There is also 
MS 'unicodeFFFE' that represents big-endian UTF-16. See: 
http://mail.apps.ietf.org/ietf/charsets/msg02030.html

>> If it's not needed, why shouldn't WebKit and IE drop it?

Actually, UTF-16 fails in Webkit much, much more often than in any 
other browser. E.g. this page is (not that it is related, though) labelled 
as MS 'unicode': http://sacredheartbayhead.com/. Firefox, Opera and IE 
all display it. But Chrome/Safari fails to detect the encoding.

So despite that Webkit aligns with IE by understanding MS 'unicode' and 
MS 'unicodeFFFE', it does other things wrong when it comes to UTF-16. 
So, you should only look at Webkit if you want to see how well a 
browser can do in the market when it has below average UTF-16 support 
... (Chrome is maybe a bit better than Safari, though - Chrome at least 
allows me to *select* UTF-16, whereas Safari does not offer UTF-16 in 
its encoding menu. Chrome also uses character set detection more 
actively.)

> Needed is relative. So far, I haven't seen data about how much
> existing content there is out there that depends on this. It could be
> that some users somewhere have rejected Firefox or Opera for this and
> there just isn't enough of a feedback loop.

Feedback loop for you: UTF-16LE or UTF-16BE pages without any other 
encoding info. (The HTML5 encoding sniffing tells UAs to *do* read the 
meta @charset *if* all other tests fail.) And, voila, I just now found 
one such page: <http://www.hughesrenier.be/actualites.html>. This page 
works fine in IE - and in IE only. (That it fails in Webkit is because of 
some bug in its encoding sniffing - see below.) Offline, on my 
computer, when I switched the value of the meta @charset for that page 
to 'UTF-16', then Firefox and Opera would also pick up the encoding. 

   Other pages of the same kind: 
<http://www.sunsetridgebusinesspark.com/BusinessListing.html>
<http://www.rpmcmillen.com/taxes.html>
<http://www.hughesrenier.be/illustration.html>
<http://memphismitchellathletics.com/pages/2010football.html>

   There are also pages like these, which work fine in IE, but which 
in Firefox, if I manually select UTF-16, display 
broken-character signs - I don't know if the UTF-16 code is buggy?:
<http://www.casamobile.org/BoardMembersStaff.html>
<http://comfortablerentals.com/Our%20Services.html>
<http://lergp.cce.cornell.edu/IPM/Home.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/regina.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/macchie.htm>
<http://web.tiscali.it/marcokiller/Mappa_del_sito.htm>
<http://familienlundorff.dk/familienLundorff.dk/genealogi/Andreas_1769/Niels_1813_Johanne_1854.html>
<http://www.prcflow.com/orifice_meter_runs_plates.htm>
<http://healthactioncenter.com/aboutus.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/mago.htm>
<http://www.trascaucristian.3x.ro/> (shows BOM sign)
<http://www.casamobile.org/history.html>
<http://www.hawkpages.com/> (See 'embedded' code on right page side)

I found them via Google, which for certain UTF-16 pages renders the 
source code as the search result (which makes Google Search very similar to 
how Webkit handles UTF-16, btw):
<http://www.google.com/search?q=%22%3Cmeta+content%3D%27text/html%3B+charset%3Dunicode%27%22>

Not the same thing, but speaking about necessity: This page declares 
"UTF-8" 3 times and also includes the BOM. However, the HTTP 
charset says ISO-8859-1, and hence ... the page fails in Firefox and 
Opera, but not in Webkit and IE: <http://www.bozze.1.vg/>.

> Maybe it isn't needed, but it seems that from the WebKit or IE point
> of view, the potential upside from dropping this alias is about
> non-existent while there could be a downside. I'd expect it to be hard
> to get IE and WebKit to drop the alias.

Btw, one thing: A big source of Google findings for the search string 
"http://stsk.no/pipermail/drill-aspiranter_stsk.no/attachments/20101230/8335fbe4/attachment-0001.html
-- 
Leif Halvard Silli

Re: [whatwg] Default encoding to UTF-8?

2011-12-11 Thread Leif Halvard Silli
Leif Halvard Silli Sun Dec 11 03:21:40 PST 2011

> W.r.t. iframe, then the "big in Norway" newspaper Dagbladet.no is 
> declared ISO-8859-1 encoded and it includes a least one ads-iframe that 
  ...
> * Let's say that I *kept* ISO-8859-1 as default encoding, but instead 
> enabled the Universal detector. The frame then works.
> * But if I make the frame page very short, 10 * the letter "ø" as 
> content, then the Universal detector fails - on a test on my own 
> computer, it guesses the page to be Cyrillic rather than Norwegian.
> * What's the problem? The Universal detector is too greedy - it tries 
> to fix more problems than I have. I only want it to guess on "UTF-8". 
> And if it doesn't detect UTF-8, then it should fall back to the locale 
> default (including fall back to the encoding of the parent frame).

The above illustrates that the current charset-detection solutions are 
starting to get old: They are not geared and optimized towards UTF-8 as 
the firmly recommended and - in principle - anticipated default.

The above may also catch a real problem with switching to UTF-8: that 
one may need to embed pages which do not use UTF-8. If one could trust 
UAs to attempt UTF-8 detection (but not "Universal" detection) before 
defaulting, then it would become virtually risk free to switch a page to 
UTF-8, even if it contains iframe pages. Not?
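
A sketch of what I mean by 'UTF-8 detection only' (my own illustration; 
a real implementation would work incrementally on the byte stream):

  def utf8_or_default(body, fallback='windows-1252'):
      # Guess only between UTF-8 and the locale/parent-frame default.
      # Pure ASCII is left to the fallback, since the two agree on it.
      try:
          text = body.decode('utf-8')
      except UnicodeDecodeError:
          return fallback
      return 'utf-8' if any(ord(c) > 0x7f for c in text) else fallback

  short_frame = 'øøøøøøøøøø'
  print(utf8_or_default(short_frame.encode('utf-8')))        # -> utf-8
  print(utf8_or_default(short_frame.encode('windows-1252'))) # -> windows-1252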

Leif H Silli

Re: [whatwg] Default encoding to UTF-8?

2011-12-11 Thread Leif Halvard Silli
Henri Sivonen Fri Dec 9 05:34:08 PST 2011:
> On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli:
>> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
>> These localizations are nevertheless live tests. If we want to move
>> more firmly in the direction of UTF-8, one could ask users of those
>> 'live tests' about their experience.
> 
> Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

This is brilliant. Looking forward to the results!

>>> (which means
>>> *other-language* pages when the language of the localization doesn't
>>> have a pre-UTF-8 legacy).
>>
>> Do you have any concrete examples?
> 
> The example I had in mind was Welsh.

Logical candidate. What do you know about the Farsi and Arabic locales? 
HTML5 specifies UTF-8 for them - due to the way Firefox behaves, I 
think. IE seems to be the big dominator for these locales - at least in 
Iran. Firefox was number two in Iran, but still only at around 5 
percent, in the stats I saw.

Btw, as I looked into Iran a bit ... I discovered that "UNICODE" is 
used as an alias for "UTF-16" in IE and Webkit. And, for XML, then Webkit, 
Firefox and Opera see it as a non-fatal error (but Opera just treats 
all illegal names that way), while IE9 seems to see it as fatal. I filed an 
HTML5 bug:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

Seemingly, this has not affected Firefox users too much. Which must 
EITHER mean that many of these pages *are* UTF-16 encoded OR that their 
content is predominantly US-ASCII and thus the artefacts of parsing 
UTF-8 pages ("UTF-16" should be treated as "UTF-8" when it isn't 
"UTF-16") as WINDOWS-1252 do not affect users too much.

I mention it here for 3 reasons: 

 (1) charset=Unicode inside <meta> is caused by MSHTML, including Word. 
And Boris mentioned Word's behaviour as a reason for keeping the legacy 
defaulting. However, when MSHTML saves with charset=UNICODE, then 
legacy defaulting is not the correct behaviour for browsers. (I don't know 
exactly when MSHTML spits out charset=UNICODE, though - or whether it 
is locale-dependent whether MSHTML spits out charset=UNICODE - or what.)

 (2) for the user tests you suggested in Mozilla bug 708995 (above), 
the presence of <meta charset=UNICODE> would trigger a need for Firefox 
users to select UTF-8 - unless the locale already defaults to UTF-8; 

 (3) That HTML5 bug 15142 (see above) has been unknown (?) till now, 
despite the fact that it affects Firefox and Opera, hints that, for the 
"WINDOWS-1252 languages", when they are served as UTF-8 but parsed as 
WINDOWS-1252 (by Firefox and Opera), then users survive. (Of course, 
some of these pages will be "picked up" by an Apache Content-Type 
header declaring the encoding, or perhaps by chardet.)

>> And are there user complaints?
> 
> Not that I know of, but I'm not part of a feedback loop if there even
> is a feedback loop here.
> 
>> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
>> Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
>> repository correctly.
> 
> OS-dependent differences are *very* suspicious. :-(

Mmm, yes. 

>>> I think that defaulting to UTF-8 is always a bug, because at the time
>>> these localizations were launched, there should have been no unlabeled
>>> UTF-8 legacy, because up until these locales were launched, no
>>> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
>>> UTF-8 is harmful, because it makes it possible for locale-siloed
>>> unlabeled UTF-8 content come to existence
>>
>> The current legacy encodings nevertheless creates siloed pages already.
>> I'm also not sure that it would be a problem with such a UTF-8 silo:
>> UTF-8 is possible to detect, for browsers - Chrome seems to perform
>> more such detection than other browsers.
> 
> While UTF-8 is possible to detect, I really don't want to take Firefox
> down the road where users who currently don't have to suffer page load
> restarts from heuristic detection have to start suffering them. (I
> think making incremental rendering any less incremental for locales
> that currently don't use a detector is not an acceptable solution for
> avoiding restarts. With English-language pages, the UTF-8ness might
> not be apparent from the first 1024 bytes.)

FIRSTLY, HTML5:

]] 8.2.2.4 Changing the encoding while parsing
[...] This might happen if the encoding sniffing algorithm described 
above failed to find an encoding, or if it found an encoding that was 
not the actual encoding of the file. [[

Thus, trying to detect UTF-8 is second last step of the sniffing 
algorithm. If it, correctly, detects UTF-8, then, while the detection 
probably affects performance, detecting UTF-8 s

Re: [whatwg] Default encoding to UTF-8?

2011-12-08 Thread Leif Halvard Silli
Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
> On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:

> Mozilla grants localizers a lot of latitude here. The defaults you see
> are not carefully chosen by a committee of encoding strategists doing
> whole-Web optimization at Mozilla.

We could use such a committee for the Web!

> They are chosen by individual
> localizers. Looking at which locales default to UTF-8, I think the
> most probable explanation is that the localizers mistakenly tried to
> pick an encoding that fits the language of the localization instead of
> picking an encoding that's the most successful at decoding unlabeled
> pages most likely read by users of the localization

These localizations are nevertheless live tests. If we want to move 
more firmly in the direction of UTF-8, one could ask users of those 
'live tests' about their experience.

> (which means
> *other-language* pages when the language of the localization doesn't
> have a pre-UTF-8 legacy).

Do you have any concrete examples? And are there user complaints?

The Serb localization uses UTF-8. The Croat one uses Win-1252, but only on 
Windows and Mac: on Linux it appears to use UTF-8, if I read the HG 
repository correctly. As for Croat and Windows-1252, it does not 
even support the Croat alphabet (in full) - I am thinking of the digraphs. 
But I'm not sure about the pre-UTF-8 legacy for Croatian.

Some language communities in Russia have a similar minority situation 
to Serb Cyrillic, only that their minority script is Latin: They use 
Cyrillic but they may also use Latin. But in Russia, Cyrillic 
dominates. Hence it seems to be the case - according to my earlier 
findings - that those few letters that, per each language, do not occur 
in Windows-1251 are inserted as NCRs (that is: when UTF-8 is not used). 
That way, Win-1251 can be used for Latin with non-ASCII inside. But 
given that Croat defaults to Win-1252, they could in theory just use 
NCRs too ...

Btw, for Safari on Mac, I'm unable to see any effect of switching 
locale: always Win-1252 (Latin) - it used to have an effect before ... But 
maybe there is a parameter I'm unaware of - like Apple's knowledge of 
where in the world I live ...

> I think that defaulting to UTF-8 is always a bug, because at the time
> these localizations were launched, there should have been no unlabeled
> UTF-8 legacy, because up until these locales were launched, no
> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
> UTF-8 is harmful, because it makes it possible for locale-siloed
> unlabeled UTF-8 content come to existence

The current legacy encodings nevertheless creates siloed pages already. 
I'm also not sure that it would be a problem with such a UTF-8 silo: 
UTF-8 is possible to detect, for browsers - Chrome seems to perform 
more such detection than other browsers.

Today, perhaps especially for English users, it happens all the time 
that a page - without notice - falls back to the default encoding - and 
this causes the browser - when used as an authoring tool - to default to 
Windows-1252: http://twitter.com/#!/komputist/status/144834229610614784 
(I suppose he used that browser-based spec authoring tool that is in 
development.) 

In another message you suggested I 'lobby' against authoring tools. OK. 
But the browser is also an authoring tool. So how can we have authors 
output UTF-8, by default, without changing the parsing default?

> (instead of guiding all Web
> authors always to declare their use of UTF-8 so that the content works
> with all browser locale configurations).

One must guide authors to do this regardless.

> I have tried to lobby internally at Mozilla for stricter localizer
> oversight here but have failed. (I'm particularly worried about
> localizers turning the heuristic detector on by default for their
> locale when it's not absolutely needed, because that's actually
> performance-sensitive and less likely to be corrected by the user.
> Therefore, turning the heuristic detector on may do performance
> reputation damage. )

W.r.t. the heuristic detector: Testing the default encoding behaviour of 
Firefox was difficult. But in the end I understood that I had to delete 
the cached version of the Profile folder - only then would the 
encodings 'fall back' properly. But before I got that far, I tried 
with e.g. the Russian version of Firefox, and discovered that it 
enabled the encoding heuristics: Thus it worked! Had it not done that, 
then it would instead have used Windows-1252 as the default ... So you 
perhaps need to be careful before telling them to disable heuristics ...

Btw: In Firefox, then in one sense, it is impossible to disable 
"automatic" character detection: In Firefox, overriding of the encoding 
only lasts until the next reload. However, I just di

Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Leif Halvard Silli
Jukka K. Korpela Tue Dec 6 13:27:11 PST 2011
> 2011-12-06 22:58, Leif Halvard Silli wrote:
> 
>> There is now a bug, and the editor says the outcome depends on "a
>> browser vendor to ship it":
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=15076
>>
>> Jukka K. Korpela Tue Dec 6 00:39:45 PST 2011
>>
>>> what is this proposed change to defaults supposed to achieve. […]
>>
>> I'd say the same as in XML: UTF-8 as a reliable, common default.
> 
> The "bug" was created so that the argument given was:
> "It would be nice to minimize number of declarations a page needs to 
> include."

I just wanted to cite Kornel's original statement. But just because 
Kornel cited an authoring use case does not mean that it doesn't have 
other use cases. This entire thread started with a user problem. Also, 
HTML5 as a whole argues in favour of UTF-8, so that seemed not so 
important to justify further.

> That is, author convenience - so that authors could work sloppily and 
> produce documents that could fail on user agents that haven't 
> implemented this change.

There already are locales where UTF-8 is the default, and the fact that 
this could benefit some sloppy authors within those locales is not a 
relevant argument against it. In the Western European locales, one can 
make documents that fail on UAs which don't operate within our 
locales. Thus, either way, some sloppy authors will "benefit" ... But 
with the proposed change, even users *outside* the locales that 
share the default encoding of the sloppy author's locale would benefit.

> This sounds more absurd than I can describe.
> 
> XML was created as a new data format; it was an entirely different issue.

HTML5 includes some features that are meant to make it easier to "jump" back 
and forth between HTML and XML, and this would be one more 
such feature.

>>> If there's something that should be added to or modified in the
>>> algorithm for determining character encoding, the I'd say it's error
>>> processing. I mean user agent behavior when it detects, [...]
>>
>> There is already an (optional) detection step in the algorithm - but UA
>> treat that step differently, it seems.
> 
> I'm afraid I can't find it - I mean the treatment of a document for 
> which some encoding has been deduced (say, directly from HTTP headers) 
> and which then turns out to violate the rules of the encoding.

Sorry, I thought you meant a document where there were no meta data 
about the encoding available - (as  described in step 7 - 'attempt to 
auto-detect' etc).

Leif H Silli

Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Leif Halvard Silli
There is now a bug, and the editor says the outcome depends on "a 
browser vendor to ship it": 
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15076

Jukka K. Korpela Tue Dec 6 00:39:45 PST 2011

> what is this proposed change to defaults supposed to achieve. […]

I'd say the same as in XML: UTF-8 as a reliable, common default.

> (Basic character encoding issues are of course 
> not that complex to you and me or most people around here; but most 
> authors are more or less confused with them, and I don't think we should 
> add to the confusion.) [...]

That's an authoring issue. I bet the authoring gains would outweigh the 
authoring pains.

> If the purpose is UTF-8 evangelism, then it would be just the kind
> of evangelism that produces angry people, not converts.

Evangelism is already in place. The purpose is change of UA behaviour.

> If there's something that should be added to or modified in the 
> algorithm for determining character encoding, the I'd say it's error 
> processing. I mean user agent behavior when it detects, [...]

There is already an (optional) detection step in the algorithm - but UA 
treat that step differently, it seems.

NARUSE, Yui Tue Dec 6 05:59:27 PST 2011
> (2011/12/06 17:39), Jukka K. Korpela wrote:
>> 2011-12-06 6:54, Leif Halvard Silli wrote:

> I found it: http://rink77.web.fc2.com/html/metatagu.html
> It uses the HTML5 doctype and does not declare the encoding, and its encoding is 
> Shift_JIS, the default encoding of the Japanese locale.

Let's hope it is a single incident and not a cult, yet ...
-- 
Leif H Silli

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011:
> On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

>> I said I agreed with him that Faruk's solution was not good. However, I
>> would not be against treating <!DOCTYPE html> as a 'default to UTF-8'
>> declaration
> 
> This might work, if there hasn't been too much cargo-culting yet.  Data 
> urgently needed!

Yeah, it would be a pity if it had already become a widespread 
cargo-cult to - all at once - use the HTML5 doctype without using UTF-8 
*and* without using some encoding declaration *and* thus effectively 
relying on the default locale encoding ... Who has a data corpus? 
Henri, as the Validator.nu developer?

This change would involve adding one more step to the HTML5 parser's 
encoding sniffing algorithm. [1] The question then is when, upon seeing 
the HTML5 doctype, the default to UTF-8 ought to happen in order to be 
useful. It seems it would have to happen after the processing of the 
explicit meta data (steps 1 to 5) but before the last 3 steps - steps 6, 
7 and 8:

Step 6: 'if the user agent has information on the likely encoding'
Step 7: UA 'may attempt to autodetect the character encoding'
Step 8: 'implementation-defined or user-specified default'

The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure 
that steps 6 to 8 do not happen. 
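
In rough Python terms, the proposed ordering would look something like 
this (my own sketch of the steps, not spec text):

  def sniff(explicit_encoding, has_html5_doctype, likely_encoding,
            detected_encoding, locale_default):
      # Steps 1-5: explicit information (user override, BOM, HTTP,
      # meta, parent browsing context) wins as today.
      if explicit_encoding:
          return explicit_encoding
      # Proposed extra step: the HTML5 doctype pins the default ...
      if has_html5_doctype:
          return 'utf-8'
      # ... so that steps 6-8 never get a say for such documents.
      if likely_encoding:                  # step 6: likely encoding
          return likely_encoding
      if detected_encoding:                # step 7: autodetection
          return detected_encoding
      return locale_default                # step 8: locale default

  print(sniff(None, True, None, None, 'windows-1252'))   # -> utf-8
  print(sniff(None, False, None, None, 'windows-1252'))  # -> windows-1252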

[1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
-- 
Leif H Silli


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
On 12/5/11 6:14 PM, Leif Halvard Silli wrote:
>> It is more likely that there is another reason, IMHO: They may have
>> tried it, and found that it worked OK
> 
> Where by "it" you mean "open a text editor, type some text, and save". 
> So they get whatever encoding their OS and editor defaults to.

If that is all they tested, then I'd said they did not test enough.

> And yes, then they find that it works ok, so they don't worry about 
> encodings.

Ditto.
 
>>> No.  He's describing a problem using UTF-8 to view pages that are not
>>> written in English.
>>
>> And why is that a problem in those cases when it is a problem?
> 
> Because the characters are wrong?

But the characters will be wrong many more times than exactly those 
times when he tries to read a Web page in a Western European 
language that is not declared as WIN-1252. Do English locale users 
have particular expectations with regard to exactly those Web pages? 
What about Polish Web pages etc.? English locale users are a very 
multiethnic lot.

>> Do he read those languages, anyway?
> 
> Do you read English?  Seriously, what are you asking there, exactly?

Because if it is an issue, then it is about expectations for exactly 
those pages. (Plus the quote problem, of course.)

> (For the record, reading a particular page in a language is a much 
> simpler task than reading the language; I can't "read German", but I can 
> certainly read a German subway map.)

Or Polish subway map - which doesn't default to said encoding.

>> The solution I proposed was that English locale browsers should default
>> to UTF-8.
> 
> I know the solution you proposed.  That solution tries to avoid the 
> issues David was describing by only breaking things for people in 
> English browser locales, I understand that.

That characterization is only true with regard to the quote problem. 
That German pages "break" would not be any more important than the 
fact that Polish pages would. For that matter: it happens that UTF-8 
pages break as well.

I only suggest it as a first step, so to speak. Or rather - since some 
locales apparently already default to UTF-8 - as a next step. 
Thereafter, more locales would be expected to follow suit - as the 
development of each locale permits.

>>> Why does it matter?  David's default locale is almost certainly en-US,
>>> which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
>>> actually means on the web) in his browser.  But again, he's changed the
>>> default encoding from the locale default, so the locale is irrelevant.
>>
>> The locale is meant to predominantly be used within a physical locale.
> 
> Yes, so?

So then we have a set of expectations for the language of that locale. 
If we look at how the locale settings handle other languages, then we 
are outside the issue that the locale-specific encodings are supposed 
to handle.

>> If he is at another physical locale or a virtually other locale, he
>> should not be expecting that it works out of the box unless a common
>> encoding is used.
> 
> He was responding to a suggestion that the default encoding be changed 
> to UTF-8 for all locales.  Are you _really_ sure you understood the 
> point of his mail?

I said I agreed with him that Faruk's solution was not good. However, I 
would not be against treating <!DOCTYPE html> as a 'default to UTF-8' 
declaration, as suggested by some - if it were possible to agree about 
that. Then we could keep things as they are, except for the HTML5 
DOCTYPE. I guess the HTML5 doctype would become 'the default before the 
default': if everything else fails, then UTF-8 if the DOCTYPE is 
<!DOCTYPE html>, or else the locale default.

It sounded like Darin Adler thinks it possible. How about you?
 
>> Even today, if he visits Japan, he has to either
>> change his browser settings *or* to rely on the pages declaring their
>> encodings. So nothing would change, for him, when visiting Japan — with
>> his browser or with his computer.
> 
> He wasn't saying it's a problem for him per se.  He's a somewhat 
> sophisticated browser user who knows how to change the encoding for a 
> particular page.

If we are talking about an English locale user visiting Japan, then I 
doubt a change in the default encoding would matter - Win-1252 as the 
default would be wrong anyway.

> What he was saying is that there are lots of pages out there that aren't 
> encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
> and that he's run into them a bunch while traveling in particular, so 
> they were not pages in English.  So far, you and he seem to agree.

So far we agree, yes.
 
>> Yes, there would b

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 13:49:45 PST 2011:
> On 12/5/11 12:42 PM, Leif Halvard Silli wrote:
>> Last I checked, some of those locales defaulted to UTF-8. (And HTML5
>> defines it the same.) So how is that possible?
> 
> Because authors authoring pages that users of those locales  
> tend to use use UTF-8 more than anything else?

It is more likely that there is another reason, IMHO: They may have 
tried it, and found that it worked OK. But they of course have the same 
need for reading non-English museum and railway pages as Mozilla 
employees.

>> Don't users of those locales travel as much as you do?

> I think you completely misunderstood his 
> comments about travel and locales.  Keep reading.

I'm pretty sure I haven't misunderstood very much.

>> What kind of trouble are you actually describing here? You are
>> describing a problem with using UTF-8 for *your locale*.
> 
> No.  He's describing a problem using UTF-8 to view pages that are not 
> written in English.

And why is that a problem in those cases when it is a problem? Do he 
read those languages, anyway? Don't we expect some problems when we 
thread out of our borders?
 
> Now what language are the non-English pages you look at written in? 
> Well, it depends.  In western Europe they tend to be in languages that 
> can be encoded in ISO-8859-1, so authors sometimes use that encoding 
> (without even realizing it).  If you set your browser to default to 
> UTF-8, those pages will be broken.
> 
> In Japan, a number of pages are authored in Shift_JIS.  Those will 
> similarly be broken in a browser defaulting to UTF-8.

The solution I proposed was that English locale browsers should default 
to UTF-8. Of course, such users could then, "when in Japan", 
get problems on some Japanese pages, which is a small nuisance, 
especially if they read Japanese.

>> What is your locale?
> 
> Why does it matter?  David's default locale is almost certainly en-US, 
> which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
> actually means on the web) in his browser.  But again, he's changed the 
> default encoding from the locale default, so the locale is irrelevant.

The locale is meant to predominantly be used within a physical locale. 
If he is at another physical locale or a virtually other locale, he 
should not be expecting that it works out of the box unless a common 
encoding is used. Even today, if he visits Japan, he has to either 
change his browser settings *or* to rely on the pages declaring their 
encodings. So nothing would change, for him, when visiting Japan — with 
his browser or with his computer.

Yes, there would be a change, w.r.t. English quotation marks (see 
below) and w.r.t. visiting Western European language pages: For those, 
a number of pages which don't fail with Win-1252 as the default 
would start to fail. But relatively speaking, it is less important that 
non-English pages fail for the English locale.

>> (Quite often it sounds as
>> if some see Latin-1 - or Windows-1252 as we now should say - as a
>> 'super default' rather than a locale default. If that is the case, that
>> it is a super default, then we should also spec it like that! Until
>> further, I'll treat Latin-1 as it is specced: As a default for certain
>> locales.)
> 
> That's exactly what it is.

A default for certain locales? Right.

>> Since it is a locale problem, we need to understand which locale you
>> have - and/or which locale you - and other debaters - think they have.
> 
> Again, doesn't matter if you change your settings from the default.

I don't think I have misunderstood anything.
 
>> However, you also say that your problem is not so much related to pages
>> written for *your* locale as it is related for pages written for users
>> of *other* locales. So how many times per year do Dutch, Spanish or
>> Norwegian  - and other non-English pages - are creating troubles for
>> you, as a English locale user? I am making an assumption: Almost never.
>> You don't read those languages, do you?
> 
> Did you miss the "travel" part?  Want to look up web pages for museums, 
> airports, etc in a non-English speaking country?  There's a good chance 
> they're not in English!

There is a very good chance, also, that only very few of the Web pages 
for such professional institutions would fail to declare their encoding.

>> This is also an expectation thing: If you visit a Russian page in a
>> legacy Cyrillic encoding, and gets mojibake because your browser
>> defaults to Latin-1, then what does it matter to you whether your
>> browser defaults to Latin-1 or UTF-8? Answer: Nothing.
> 
> Yes.  So?

So

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
>> (And HTML5 defines it the same.)
> 
> No. As far as I understand, HTML5 defines US-ASCII to be the default and
> requires that any other encoding is explicitly declared. I do like this
> approach.

We are here discussing the default *user agent behaviour* - we are not 
specifically discussing how web pages should be authored.

For user agents, please be aware that HTML5 maintains a table of 
'Suggested default encodings': 
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

When you say 'requires': Of course, HTML5 recommends that you declare 
the encoding (via HTTP/a higher protocol, via the BOM 'sideshow' or via 
<meta charset>). I just now also discovered that Validator.nu 
issues an error message if it does not find any of those *and* the 
document contains non-ASCII. (I don't know, however, whether this error 
message is just something Henri added at his own discretion - it would 
be nice to have it literally in the spec too.)
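
The check itself is simple to sketch (this is my guess at the rule, in 
Python, not Validator.nu's actual code):

  def missing_encoding_declaration(http_charset, has_bom, meta_charset, body):
      # Flag documents that declare no encoding anywhere (HTTP, BOM,
      # or meta) yet contain non-ASCII bytes.
      declared = bool(http_charset) or has_bom or bool(meta_charset)
      has_non_ascii = any(byte > 0x7f for byte in body)
      return has_non_ascii and not declared

  print(missing_encoding_declaration(None, False, None,
                                     'blåbær'.encode('utf-8')))  # True
  print(missing_encoding_declaration(None, False, None,
                                     b'plain ascii'))            # False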

(The problem is of course that many English pages expect the whole 
"Unicode alphabet" even if they only contain US-ASCII from the start.)

HTML5 says that validators *may* issue a warning if UTF-8 is *not* the 
encoding. But so far, validator.nu has not picked that up.
 
> We should also lobby for authoring tools (as recommended by HTML5) to
> default their output to UTF-8 and make sure the encoding is declared.

HTML5 already says: "Authoring tools should default to using UTF-8 for 
newly-created documents. [RFC3629]" 
http://dev.w3.org/html5/spec/semantics.html#charset

> As
> so many pages, supposedly (I have not researched this), use the incorrect
> encoding, it makes no sense to try to clean this mess by messing with
> existing defaults. It may fix some pages and break others. Browsers have
> the ability to override an incorrect encoding and this a reasonable
> workaround.

Do you use an English locale computer? If you do, without being a native 
English speaker, then you are some kind of geek ... Why can't you work 
around the troubles - as you are used to doing anyway?

Starting a switch to UTF-8 as the default UA encoding for English 
locale users should *only* affect how English locale users experience 
languages which *both* need non-ASCII *and* historically have been 
using Windows-1252 as the default encoding *and* which additionally do 
not include any encoding declaration.
-- 
Leif Halvard Silli


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
he US English locale of their OS and/or browser. We should not 
consider the needs of geeks - they will follow (read: lead) the way, so 
the fact that *they* may see mojibake, should not be a concern.

See? We would have a plan. Or what do you think? Of course, we - or 
rather: the browser vendors - would need to market this as an important 
change. The HTML5 spec already justifies the use of UTF-8 several 
places - it says that pages might not work as expected e.g. w.r.t. 
URLs, unless UTF-8 is used. So there are enough of arguments that can 
be used.

There are other technical ideas I have, such as treating the BOM the 
way Webkit and IE treat it - that would increase the number of pages 
treated as UTF-8 by all browsers a little bit [1]. However that can 
wait or whatever: The most important thing is to *initiate* the default 
encoding change.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

Leif Halvard Silli


[whatwg] second example

2010-06-27 Thread Leif Halvard Silli
In www-archive@ you asked Sam for a technical reason to remove the 
second example [=]:

>> either apply Lachlan's change in a way that does not cause the documents 
>> to diverge
> 
> If you want a change to the WHATWG version of the specifications, please 
> provide rationale that argues for that change. So far the only rationale 
> you have provided convincingly argues for keeping both examples. Unless 
> informed otherwise by the WHATWG charter members, I will assume that in 
> the absence of a good technical reason, the example should remain in the 
> WHATWG specifications.

As an HTMLWG member, I cannot ask you to change an example that isn't in 
the HTMLWG spec. Hence I joined the WHATWG mailing list in order to suggest 
some technical reasons why the second example should be removed or 
changed.

First, I struggle to understand the justification for the second 
example. But it could seem as if the technical reason is to "mention 
proprietariness" [#]: 

> Lachlan's rationale is not the real rationale for the change: it argues 
> for what the WHATWG spec says (have two examples), not what the HTMLWG 
> spec says (not mention proprietariness).

Lachlan's rationale, in contrast, was that your example did not 
demonstrate best practice[&]: "The ideal fallback should instead 
provide some kind of alternative content." So my working theory is that 
demonstration of "proprietariness" is the reason why you have the 
second example. And it might be that you consider your example a good 
and honest fallback text whenever proprietary plug-ins are used - who 
knows - there is absolutely nothing in the WHATWG spec which explains why 
the WHATWG spec has the second example. 

My technical argument for why you should not have the second example is 
that it demonstrates the exact same thing as the first example 
demonstrates. 

Namely, both of them demonstrate how a plug-in is required in order to 
view _plug-in specific_ content. Another use of plug-ins - irrespective 
of whether the plug-in is proprietary or not - is to use them for 
displaying _standards compatible content_ which the user agent is 
lacking support for. E.g. think of the plug-ins for viewing MathML and 
SVG that exist. Hey, there are even Flash players (Google's SVGweb) for 
playing SVG, as well as different Microsoft VML solutions for playing 
SVG (such as AmpleSDK) in Internet Explorer. If/When Flash starts to 
support it, then one can also use the Flash plug-in for playing  the 
WebM format.

So, in conclusion, even the open source plugin (the first example) 
demonstrates "proprietariness", if we define "proprietariness" as 
"requiring a plug-in in order to view plug-in specific content". And 
that is why you should remove the second example, or provide an example 
that demonstrates a real difference.

[=] http://lists.w3.org/Archives/Public/www-archive/2010Jun/0066
[#] http://lists.w3.org/Archives/Public/www-archive/2010Jun/0066
[&] http://www.w3.org/mid/4bc5c6a5.9090...@lachy.id.au
-- 
leif halvard silli


Re: [whatwg] Link rot is not dangerous

2009-05-17 Thread Leif Halvard Silli

Geoffrey Sneddon On 09-05-16 21.38:

On 16 May 2009, at 07:08, Leif Halvard Silli wrote:

Geoffrey Sneddon Fri May 15 14:27:03 PDT 2009


On 15 May 2009, at 18:25, Shelley Powers wrote:

> One of the very first uses of RDF, in RSS 1.0, for feeds, is still
> in existence, still viable. You don't have to take my word, check it
> out yourselves:

>
> http://purl.org/rss/1.0/

Who actually treats RSS 1.0 as RDF? Every major feed reader just 
uses  a generic XML parser for it (quite frequently a non-namespace 
aware one) and just totally ignores any RDF-ness of it.


What does it mean to "treat as RDF"?  [  ... snip ... ]


I mean using an RDF processor, and treating it as an RDF graph. 
Everything just creates from an XML stream (or object model) a bunch 
of items with a certain title, date, and description, and acts on that 
(and parses it out in a format specific manner, so it creates the same 
sort of item for, e.g., Atom) — it doesn't actually use an RDF graph 
for it. If you can find any widely used software that actually treats 
it as an RDF graph I'd be interested to know.


"OpenLink Data Explorer" [1] treats the W3 stream[2] as RDF.

[1] https://addons.mozilla.org/en-US/firefox/search?q=openlink
[2] http://www.w3.org/2000/08/w3c-synd/home.rss
--
leif halvard silli


Re: [whatwg] Annotating structured data that HTML has no semantics

2009-05-16 Thread Leif Halvard Silli

Tab Atkins Jr. On 09-05-15 22.15:
On Wed, May 13, 2009 at 10:04 AM, Leif Halvard Silli 
  

Toby Inkster on Wed May 13 02:19:17 PDT 2009:


Hear hear.  Lets call it "Cascading RDF Sheets".


http://buzzword.org.uk/2008/rdf-ease/spec
http://buzzword.org.uk/2008/rdf-ease/reactions
  
RDFa is better though.
  

What does 'better' mean in this context? Why and how? Because it is easier
to process? But EASE seems more compatible with microformats, and is
"better" in that sense.



I'd also like clarification here.  I dislike *all* of the inline
metadata proposals to some degree, for the same reasons that I dislike
inline @style and @onfoo handlers.  A Selector-based way of applying
semantics fits my theoretical needs much better.
  


A possibly 10 year old use case where I think EASE  - or GRDDL as such - 
should fit in:


Shelley and Geoffrey reminded us that "RSS 1.0" stands for "RDF Site 
Summary 1.0".  The W3 also uses RSS 1.0. for its feed[1].  The feed is 
generated via a profile transformation [2] that happens with XSLT. The 
profile defines the  as news items (note the 
combination of element and class - as in EASE)-  But the profile also 
implements particular rules for particular elements without looking at 
the @class. (E.g. each  must contain  or , for 
example.)


All in all, it sounds very similar to what the newer technology GRDDL 
does, since it is all happening based on a profile and some class names 
and specific element structures. And, this is possible to test with the 
W3 GRDDL service, which produces a "feed" that in fact, when you look 
with the right eyes, is the same as the published homepage feed[3].


If the "microdata" becomes part of the final version of HTML 5, then 
GRDDL  (with or without EASE) will probably prosper, since it probably 
doesn't matter to GRDDL whether it looks into @class or @item, as long 
as "the thing" is part of the profile and the profiletransformation. 
(But it would be interesting if someone in the know could test if the 
"triples" would be the same, etc ...) And if so, then the introduction 
of "microdata" increases the need for @profile in HTML 5.


[1] http://www.w3.org/2000/08/w3c-synd/home.rss
[2] http://www.w3.org/2000/08/w3c-synd/
[3] 
http://www.w3.org/2007/08/grddl/?docAddr=http%3A%2F%2Fwww.w3.org%2F&output=rdfxml



I read all the reactions you pointed to. Some made the claim that EASE would 
move semantics out of the HTML file, and that microformats were better as they 
keep the semantics inside the file. But I of course agree with you that 
EASE just underlines/outlines the semantics already in the file.



Yup.  The appropriate critique of separated metadata is that the
*data* is moved out of the document, where it will inevitably decay
compared to the live document.  RDF-EASE keeps all the data stored in
the live document, and merely specifies how to extract it.  The only
way you can lose data then is by changing the html structure itself,
which is much less common than just changing the content.
  


That the structure changes seldom /could/ be a reason for using RDFa to 
store the meta info in the very element instead of using EASE (or even 
Dublin Core in <meta> elements in the <head>). OTOH, that the structure 
changes little could also be something that /permits/ the use of GRDDL 
... So it depends on how you see it.



From the EASE draft:


All properties in RDF-EASE begin with the string -rdf-, as per §4.1.2.1
Vendor-specific extensions in [CSS21]. This allows RDF-EASE and CSS to be
safely mixed in one file, [...]
  

I wonder why you think it is so important to be able to mix CSS and EASE. It
seems better to separate the two completely.



I'm not thrilled with the mixture of CSS and metadata either.  Just
because it uses Selectors doesn't mean it needs to be specifiable
alongside CSS.  jQuery uses Selectors too, but it stays where it
belongs.  ^_^  (That being said, there's a plugin for it that allows
you to specify js in your CSS, and it gets applied to the matching
elements from the block's selector.)
  


But maybe, after all, it ain't so bad. It is good to have the 
opportunity. :-) (Since you, as I perceived it, disagreed with yourself 
above, I continue the tradition.) :-)

--
leif halvard silli


[whatwg] Link rot is not dangerous

2009-05-15 Thread Leif Halvard Silli

Geoffrey Sneddon Fri May 15 14:27:03 PDT 2009


On 15 May 2009, at 18:25, Shelley Powers wrote:

> One of the very first uses of RDF, in RSS 1.0, for feeds, is still  
> in existence, still viable. You don't have to take my word, check it  
> out yourselves:

>
> http://purl.org/rss/1.0/

Who actually treats RSS 1.0 as RDF? Every major feed reader just uses  
a generic XML parser for it (quite frequently a non-namespace aware  
one) and just totally ignores any RDF-ness of it.


What does it mean to "treat as RDF"? An "RSS 1.0" feed is essentially a 
stream of "items" that has been lifted from the page(s) and placed in an 
RDF/XML feed. When I read e.g. 
http://www.w3.org/2000/08/w3c-synd/home.rss in Safari, I can sort the 
news items according to date, source, title. Which means - I think - 
that Safari sees the feed as "machine readable".  It is certainly 
possible to do more - I guess, and Safari does the same to non-RDF 
feeds, but still. And search engines should have the same opportunities 
w.r.t. creating indexes based on "RSS 1.0" as on RDFa. (Though here, 
perhaps, the fact that search engines prefer to help us 
locate HTML pages rather than feeds comes in between.)

--
leif halvard silli


[whatwg] Annotating structured data that HTML has no semantics for

2009-05-13 Thread Leif Halvard Silli

Toby Inkster on Wed May 13 02:19:17 PDT 2009:

Leif Halvard Silli wrote:

> Hear hear.  Lets call it "Cascading RDF Sheets".

http://buzzword.org.uk/2008/rdf-ease/spec

http://buzzword.org.uk/2008/rdf-ease/reactions

I have actually implemented it. It works.


Oh! Thanks for sharing.


RDFa is better though.


What does 'better' mean in this context? Why and how? Because it is 
easier to process? But EASE seems more compatible with microformats, and 
is "better" in that sense.


I read all the reactions you pointed to. Some made the claim that EASE 
would move semantics out of the HTML file, and that microformats were 
better as they keep the semantics inside the file. But I of course agree 
with you that EASE just underlines/outlines the semantics already in the file.


The thing that probably is most different from (most) microformats (and 
RDFa?) is that EASE can apply semantics even to bare naked elements 
without any @class, @id or other attributes. However, EASE do not 
/require/ one to use it like that. One may choose to create an entirely 
class based EASE document.


It would even be possible to use EASE together with Ian's microdata, 
don't you think?


From the EASE draft:
All properties in RDF-EASE begin with the string -rdf-, as per 
§4.1.2.1 Vendor-specific extensions in [CSS21]. This allows RDF-EASE 
and CSS to be safely mixed in one file, [...]
I wonder why you think it is so important to be able to mix CSS and 
EASE. It seems better to separate the two completely.


From the EASE draft:
The algorithm assumes that the document is held in a DOM-compatible 
representation,
Side kick:  is proposed as part of microdata. But both Firefox and 
Safari will in the DOM render  as part of , regardless.

--
leif halvard silli


Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-12 Thread Leif Halvard Silli

Tab Atkins Jr. on Tue, 12 May 2009 12:30:27 -0500:


On Tue, May 12, 2009 at 5:55 AM, Eduard Pascual:
  

> [...] It would be preferable to be able
> to state something like "each (row)  in the  describes an
> iguana: the s are each iguana's picture, the contents of the
> 's are the names, and the @href of the 's are the URLs to their
> main pages" just once.

Indeed.

> If I only need to state the table headings once
> for the users to understand this concept, why should a micro-data
> consumer require me to state it 20 times, once for each row?
> Please note how such a page would be quite painful to maintain: any
> mistake in the micro-data mark-up would generate invalid data and
> require a manual harvest of the data on the page, thus killing the
> whole purpose of micro-data.


Indeed. (But of course, for "copy-paste" safety, the format has to be 
"wordy" and repetitive.)



 And repeating something 20 (or more)
> times brings a lot of chances to put a typo in, or to miss an
> attribute, or any minor but devastating mistake like these.



Well, he didn't quite *ignore* it - he did explicitly call out that
requirement to say that his solution didn't solve it at all.  He also
laid down the reason why - it's unlikely that any reasonable simple
in-place metadata solution would allow you to do that.  You either
need significant complexity, some reliance on language semantics (like
tables can rely on their headers), or moving to out-of-band
specification, likely through a Selectors-based model.
  
Indeed. And Ian's argument against a selector-based model (the claim 
that authors have problems understanding selectors) was one of the least 
convincing arguments he made, I think.  CSS and its selectors appear to be 
among the best understood technologies of the web.

The last is likely the best solution for that, and is even easier to
implement within Ian' simplified proposal.  I don't see a good reason
why that can't advance on a separate track, as (being out-of-band) it
doesn't require changes to HTML to be usable.

I floated a basic proposal for Cascading RDF[1] several months ago,
and someone else (I think Eduard?  I'd have to check my archives) did
something very similar.

[1]: http://www.xanthir.com/rdfa-vs-crdf.php
  


Hear hear.  Let's call it "Cascading RDF Sheets". It could be used for 
the following purposes:


1. The IRI of the Cascading RDF Sheet could serve the role of profile URI;
2. The Cascading RDF Sheet itself could serve the role of a profile 
document; (Finally we could get some kind of registered profile format.)
3. Just as CSS sheets today, a cRDFsheet could be used as an authoring 
aid when authoring with a microformat. HTML editing programs could 
offer the elements + classes in the Cascading RDF Sheet to authors, the 
same way that some editors today use the selectors in stylesheets as 
a "vocabulary repository" for the current file or project. CSS selectors 
are already a well-known format. (One may then, of course, already use a 
CSS style sheet for this, kind of. But this soon becomes clumsy. Better 
to separate styling from semantics and structure.)


In fact, I myself began looking into creating something along these 
lines ... Though rather than a "Cascading RDF Sheet", I looked into 
creating a "Profile Style Sheet" which could be used to define a machine 
readable microformat profile. My motivation for doing this was the 
authoring side of things, as I have been using a text editor which more 
or less uses CSS selectors the same way. (Instead of only offering me to 
pick "" it also offers me to pick  etc.) Ian's 
proposal does not give much thought to the authoring side, I feel, 
except for the more casual author. For authors, it is helpful to have a 
"recipe" document and to avoid repetition and "data rot", as you 
mentioned in another message.


Ian's microdata format is easy to grasp the inner logic of - that is a 
good side of the proposal, and could help it get used.  But when 
it comes to authors' and author groups' ability to define their own, 
decentralised semantics etc., a decent profile format which could 
be easily and simply integrated with authoring tools seems like just 
as important an issue as a super simple microdata format.


The microformats.org community does not really have a machine-parsable 
profile format. If there were such a format, I believe we would see more 
decentralized microformats.

--
leif halvard silli


Re: [whatwg] Helping people seaching for content filtered by license

2009-05-10 Thread Leif Halvard Silli

Ben Adida ben at adida.net  Sun May 10 15:29:53 PDT 2009:

Julian wrote:
> You are aware of MNot's "Web Linking" draft
> 
(<http://greenbytes.de/tech/webdav/draft-nottingham-http-link-header-05.html>),

> and the fact that it seems to enjoy support from the TAG?

Julian, you continue to bring this up as if we hadn't already discussed
this:


Where and when has it been discussed?


there are significant differences of opinion with mnot on whether
his interpretation of @rel values is correct in prior HTML versions, 


He has issued a Request For Comments, so that can be corrected, no?


and  there are a number of folks who disagree (not just us in RDFa),
including at least two RECs (RDFa and GRDDL).


Is this claim based on a mere comparison of the descriptions of those 
link relations in said specifications? Perhaps some of the disagreements 
are merely a matter of different wording?



The point is: if you assume that @rel="foo" always means the same thing,
then many folks believe you're already violating the HTML spec, which
specifically uses @profile to modulate the meaning of @rel, and
sometimes via another level of indirection.


Where does the Nottingham draft define anything that contradicts the default 
HTML 4.01 profile?  Authors will often assume that rel="foo" does mean 
the same thing wherever it appears, hence a central register is a 
benefit: specification writers and profile writers can then know what 
the standard semantics are.


As to modifying semantics, it is probably not wise to profile or specify 
semantics that differ from the central register. But having a central 
register cannot in itself prevent profiles (default profiles or 
linked-in profiles) from defining their own semantics when necessary.


It does, by the way, seem like an unfortunate mix of semantics and other 
issues that HTML 5 does not allow the @rev attribute. Rather, HTML 5 
should allow the @rev attribute, but could simply say that it 
hasn't defined any values for it. Thus authors who are linking to a 
profile that does define values for @rev could still use @rev without 
producing an invalid HTML 5 document.
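
A rough sketch of the kind of markup I have in mind (the profile URI is 
made up; rel="copyright" and rev="made" are classic HTML 4.01 link types):

  <head profile="http://example.org/my-link-profile">
    <!-- the profile (hypothetical URI) can define or modulate the meaning
         of the rel/rev values used below -->
    <link rel="copyright" href="/about/copyright.html">
    <link rev="made" href="mailto:author@example.org">
  </head>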

--
leif halvard silli



Re: [whatwg] P element's content model restrictions

2007-05-29 Thread Leif Halvard Silli
On 2007-05-30 02:19:48 +0200 Jon Barnett <[EMAIL PROTECTED]> wrote:

> On 5/29/07, Anne van Kesteren <[EMAIL PROTECTED]> wrote:
>> 
>> I don't care much for the semantic side of things but changing section 8.2
>> (and Acid2) to make  not become  as per HTML4
>> would be fine with me. We discussed this recently in #whatwg. Simon has
>> some ideas about it.

Fingers x-ed. 

I updated my little testpage 
 with a version in 
the «opposite mode»  
so you can e.g. see how TABLE in IE inherits font color and font size from P 
when in Standards mode, but not when in Quirks mode.

> Is there a link to any of this discussion?  I imagine my searching for 
> or "p element" might be futile.

I suppose Anne referred to IRC, for which Whatwg.org points to this log 
.

> Given:
> This is a lengthy paragraph that talks about the following table
> ...
> 
> Breaking scripts that depend on p.nextSibling to be the table and styles
> that depend on p + table to work (and other various DOM issues) is an
> obvious point, and I'm sure it's been discussed.

That script would then already be broken, cross-browser-wise, I suppose?

The worst case is probably not when authors use p+table{}, because then clean 
IE-targeted styling stands ready to pick up. The worst is if important TABLE 
styles were targeted via table{} for OperaSafariFirefox, but via _table{} for 
IE. (But then the underscore hack is not considered good coding style either 
...) This will fix itself when IE fixes its browser (and removes the #table{} 
option and other hack-arounds).

Although there is also lots - most? - of content out there for which it is 
quite irrelevant whether TABLE is a sibling or a child of the nearest P. P and 
TABLE are often contained in a DIV which carries the styling that positions 
that container on the page. That MSIE and the others see  so 
differently, without most of us recognising any difference on the page, should 
be quite telling. 

Collapsing vertical margins plays an important role in hiding the effects, I 
suppose. For instance, if you have texttext, then the empty P 
that Opera-Safari-Firefox currently see in standards mode may become 
completely collapsed, unless you add padding-top/-bottom or border-top/-bottom 
to it. The reason collapsing vertical margins is such a useful default CSS 
behaviour is probably simply that it is so typical to not add 
padding-top/padding-bottom or border-top/border-bottom. (When one does add 
padding/border, the vertical margin collapsing behaviour stops working, 
and the author gets a whole lot more to think about with regard to the 
vertical space between the elements.)
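
A small sketch of that effect (my own illustration; the class name is 
arbitrary):

  <style>
    p       { margin: 1em 0; }      /* adjacent vertical margins collapse */
    p.boxed { border: 1px solid; padding: 2px; } /* border/padding stop it */
  </style>
  <p>Some text before.</p>
  <p></p>               <!-- empty P: margins collapse through it, invisible -->
  <p class="boxed"></p> <!-- empty, but border and padding keep it visible -->
  <p>Some text after.</p>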
-- 
leif



Re: [whatwg] P element's content model restrictions

2007-05-29 Thread Leif Halvard Silli
On 2007-05-29 07:18:09 +0200 Martin Payne <[EMAIL PROTECTED]> wrote:

> Leif Halvard Silli wrote:
>> I'd like to question these restricions. I think that at the very least, 
>> TABLE should be allowed inside the P element. The reason is that MSIE (I 
>> tested version 6 and 7) accept TABLE in P, regardless of whether it is in 
>> Quirks-Mode or in Standards-Mode. Even Firefox-Opera-Safari (FirOpSa) allow 
>> TABLE inside P - allthough they only do so when in Quirks-Mode.
>> 
> 
> When would it ever make sense to do this though? Surely you would never want 
> to put a table inside a paragraph, because tables and paragraphs are two 
> totally different things. I also can’t see why you would put a paragraph 
> inside a paragraph—surely it should either just be one paragraph, or be two 
> completely separate paragraphs.

Hi Martin, I thought the whole «real world» would come down on me for suggesting 
this, but you are the only critical voice so far. Many thanks for the support 
from Benjamin as well. I would love to hear from «browser vendors» also. When 
the spec cites «historical reasons», then I suppose they mean «real world» or 
something. But then please note that IE treats TABLE within P «better» in 
standards-mode than in quirks-mode. :-) (Fewer inheritable styles get inherited 
by TABLE in quirks-mode than in standards-mode.) 

First, you do realize that what I suggested is *in line* with what the proposed 
specification permits for XHTML5? (Actually, I think the spec should make it 
much clearer that it is. For instance, it could start with clearly stating that 
section 8 about «The HTML syntax» only deals with HTML5, and not with the 
syntax of XHTML5. Again - shoot me if I am wrong.)

Second, I think TABLE inside P would be useful - often (and not just 
sometimes). Let's say you wanted a left border on all P-elements. Or that you 
wanted to enumerate all P-elements (all paragraphs).

| Para with table
| 
| +----+----+
| | TH | TH |
| +----+----+
| | TD | TD |
| +----+----+
|
| End of para.

If P has any semantics, then this example shows very well why the TABLE 
_should_ be inside the P. The alternatives are two P-elements with a TABLE in 
the middle of them, or simply no paragraph at all - a DIV. But with two 
P-elements, if you want to enumerate all the P-elements via CSS, you will get 
one number per P = 2 numbers each time you have a TABLE, instead of just 1 number. 
There is of course a plethora of workarounds ... But it would be better to make 
the language as simple as possible for authors. That will in itself prevent 
much «invention/workaround/hacks» - which browser vendors then will have to 
«quirk» with.
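
The enumeration itself could be done with CSS counters, roughly like this 
(my own sketch):

  <style>
    body     { counter-reset: para; }
    p        { counter-increment: para; }
    p:before { content: counter(para) ". "; }
  </style>
  <!-- if TABLE may live inside P, this is one paragraph and gets one number;
       if the parser splits it into two Ps around the TABLE, the very same
       text suddenly gets two numbers -->
  <p>Para with table
    <table><tr><th>TH</th><th>TH</th></tr>
           <tr><td>TD</td><td>TD</td></tr></table>
    End of para.</p>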

Academic documents are also something that would benefit. Do you think a 
person who just wants to write a thesis-like document in HTML would think that 
he in fact cannot have lists, tables and blockquotes within paragraphs? And 
in this context, I will also mention e.g. Prince XML, the XML formatter, which 
can also format HTML. What if I make something that prints well in Prince and 
want to put it online - do things start to break down?

Yet another thing is that if you may place lists and tables inside P-elements, 
then CSS becomes simpler. You spend your energy on getting the CSS right for the 
P-elements, so they constitute a column or something. Then you stuff your TABLE 
elements inside the P-elements. And so the TABLE will be placed relative to the 
P - and not relative to the parent of P. Simple. Logical.

The fact that the above is possible in XHTML5, but not in HTML5 gives XHTML5 an 
advantage. But it was my impression that the WHATwg wants to «lift» the status 
of HTML to a higher level - if I may put it that way. 

The HTML5 proposal has a section where it discusses how paragraphs are 
represented in HTML: 
<http://www.whatwg.org/specs/web-apps/current-work/#paragraphs>. It does not 
say that TABLE can represent a paragraph. Or that DIV can. But the effect of 
not being able to place TABLE inside P is that either DIV or TABLE will be 
used as paragraphs. And it also means that sometimes, things that really are 
one paragraph will look as if they are 2 or 3 paragraphs. I.e. we cannot trust 
that the P represents what was meant to be expressed.
-- 
leif halvard silli



Re: [whatwg] P element's content model restrictions

2007-05-28 Thread Leif Halvard Silli

On 2007-05-29 04:07:11 +0200 Leif Halvard Silli <[EMAIL PROTECTED]> wrote:

The MAP element is also worth mentioning in this context. It may 
appear 
inside P.


Sorry - I was wrong. HTML5, as it stands, defines new rules for MAP - 
it may no longer appear in P, unlike in HTML4.

--
leif



[whatwg] P element's content model restrictions

2007-05-28 Thread Leif Halvard Silli
The subsection 8.1.2.5. «Restrictions on content models» puts restrictions «for 
historical reasons» on the content model of, amongst others, the P element (my 
understanding is that the restrictions are only valid for HTML5, but not 
for XHTML5 - please correct me if I am wrong):

A p element must not contain blockquote, dl, menu, ol, pre,
table, or ul elements, even though these elements are technically
allowed inside p elements according to the content models described
in this specification. (In fact, if one of those elements is put
inside a p element in the markup, it will instead imply a p element 
end tag before it.)  

I'd like to question these restrictions. I think that at the very least, TABLE 
should be allowed inside the P element. The reason is that MSIE (I tested 
versions 6 and 7) accepts TABLE in P, regardless of whether it is in Quirks-Mode 
or in Standards-Mode. Even Firefox-Opera-Safari (FirOpSa) allow TABLE inside P 
- although they only do so when in Quirks-Mode. When in standards mode, 
FirOpSa behave according to the HTML5 restriction quoted above.  I have a test 
page here, where the DOCTYPE gets MSIE into standards mode and FirOpSa into 
Quirks-Mode, and with two TABLE elements in the second P-element of that page: 
<http://www.malform.no/prov/content-model/index.html>.

In my view, it isn't desirable to limit the containment of TABLE inside P to 
quirks-mode (and XML ...). Quirks-Mode should only deal with CSS quirks. This 
is a «content model quirk».

As the test page also shows, it would not be _that_ simple to allow BLOCKQUOTE, 
DL, MENU, PRE or UL in today's browsers. However, if you stuff any of those 
elements in a SPAN element, then they become more digestible - for Firefox and 
Safari and little brother iCab (but not for Opera or MSIE). (And when reading 
section 3.12.18. «The span element», SPAN is said to be allowed to contain 
«Otherwise: any inline-level content», i.e. structured inline-level content as 
well – thus stuffing e.g. UL in SPAN seems to be in line with the 
content model. It is unclear to me whether the restrictions in 8.1.2.5 are meant 
to also apply to SPAN inside a P - but I assume they are.)

I'd also like to mention that both Firefox, Opera and Safari allow the 
restricted elements, as well as P itself, to become nested inside a P provided 
you stuff them inside an OBJECT. (See the mentioned test page.) Whether this is 
allowed according to HTML401 is unclear to me. But HTML401 gives many code 
examples, using P without a closing tag (thus sometimes open for 
interpretation), where P inside OBJECT inside P is used. And it is at least 
not forbidden. And the HTML validator accepts it. Note that FirOpSa _nest_ P 
inside P this way: you will see a P with margin, padding and border inside its 
«parent P». MSIE does not allow P inside OBJECT, though.
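
A sketch of the kind of markup in question (my own illustration, not taken 
from the test page; URLs and style values are arbitrary):

  <p style="border: 1px solid; padding: 1em">
    Outer paragraph text.
    <object data="missing-resource.png" type="image/png">
      <!-- fallback content, shown when the embedded resource is unavailable -->
      <p style="border: 1px dotted; margin: 1em; padding: 1em">
        A paragraph nested inside the outer P, via OBJECT.
      </p>
    </object>
    More outer paragraph text.
  </p>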

The MAP element is also worth mentioning in this context. It may appear inside 
P. And it may also contain block-level elements. Thus, via MAP, a P element may 
contain another P element. It is my experience that those browsers that nest P 
inside P via OBJECT (I am thinking about when the embedded content of 
OBJECT is unavailable) do not handle P inside MAP inside P. (To see whether User 
Agents handle P inside MAP inside P or P inside OBJECT inside P, one 
must apply BORDER, MARGIN and/or PADDING, and perhaps background color - then 
you will see how it falls apart. Opera, for instance, does not handle P inside 
MAP inside P.)

Anyway, the whole question of what P may or may not contain in HTML is much 
more blurred than a quick reference to «historical reasons» can tell. Hence it 
would be better to try to un-restrict the content model of P as much as 
possible. Because if HTML5 continues to apply these restrictions on P, then 
authors must continue to work with paragraphs in HTML in quite unintuitive ways.
-- 
leif halvard silli



Re: [whatwg] Style sheet loading and parsing (over HTTP)

2007-05-28 Thread Leif Halvard Silli
On 2007-05-24 09:50:47 +0200 Anne van Kesteren <[EMAIL PROTECTED]> wrote:

> The HTML WG accepted to review the HTML 5 proposal. Presumably members of  
> the HTML WG are doing that. I'm not sure why they would need tutorials as  
> well to do such a thing.


You should interpret me a bit «symbolically». The spec and its ideas need to be 
*presented*.
-- 
leif



Re: [whatwg] Style sheet loading and parsing (over HTTP)

2007-05-28 Thread Leif Halvard Silli
On 2007-05-24 02:45:44 +0200 Ian Hickson <[EMAIL PROTECTED]> wrote:
> On Thu, 24 May 2007, Leif Halvard Silli wrote:
> The chairman of the HTML WG asked that we stop discussing, that's why I 
> haven't been posting on that list. :-(

Ok - I see. Yes it is true he did.

> Note, though, that there are a number of ways to see what progress is being 
> made on the W3C HTML 5 specification; they are all listed at the top of the 
> spec, as required by teh W3C publication rules:
> 
>
> http://dev.w3.org/cvsweb/~checkout~/html5/spec/Overview.html?content-type=text/html;%20charset=utf-8
> 
> So it's not like the progress has been hidden from the HTML working group.

You have a point there. And especially with the thing you said above.

>> On 2007-05-23 23:20:40 +0200 Ian Hickson <[EMAIL PROTECTED]> replied to 
>> Julian:
>> 
>>> If the spec I'm working on isn't that spec, then I'll stop working on > 
>>> it, and return to working on the spec with real-world relevance.)
>> 
>> I think many would feel that the whole process would pretty much falls 
>> apart if this should happen. On the other side, it doesn't sound as if you 
>> are open to much debate. You better think about how you present this to the 
>> HTMLwg. No one likes to discuss under a Damocles sword. On the other side, 
>> it is just fair to say that there are some limites on what one can accept. 
>> But then again, the HTMLwg has been conveened pretty much because of WHATwg 
>> - so it would be a bit strange.
> 
> I'm not really sure I understand what you mean here. I think I've been pretty 
> open about my position. I don't really understand what you want me to say or 
> do.

No. Those things I had in mind are probably not all that important ... I support 
the view that the spec must have a strong focus on «real-world relevance». And 
I think I'll interpret what you said as a sign of your strong convictions about 
the way forward ... :-) 
-- 
leif



Re: [whatwg] Style sheet loading and parsing (over HTTP)

2007-05-23 Thread Leif Halvard Silli
On 2007-05-23 22:59:19 +0200 Julian Reschke <[EMAIL PROTECTED]> wrote:

> Ian, I understand that this is what the WHATWG'S HTML5 document does today; I 
> just don't see how it can become the W3C's HTML5 spec while doing this. It 
> seems to me it's pointless to continue this thread over here, but the same 
> issues will come up again on the W3C mailing list for sure.

I just joined this list and the HTMLwg because I read Anne's blog. I might only 
be a fly in the soup. On the other side, the members of the WHATwg have been 
telling us for months that «this is important, please join». Hence, I raise my voice.

Because Julian makes some good points here. This is in line with what I have been 
thinking for a while. You - the WHATwg - need to face the lions of the HTMLwg.  
For instance, Ian, to announce on _this_ list that predefined classes are taken 
out of the spec, but not inform (as far as I noticed, anyway) the forum which 
caused the removal about the same issue - what is that? [1] 

The WHATwg spec has become the starting point. Victory, said Anne van. It sounds 
more like Ian thinks the HTMLwg is a drag. Anne tells in his blog how he 
presents HTML5 to different audiences. And Karl Dubost began speaking about 
tutorials for users. But who needs a tutorial here, if not the HTMLwg itself? 
Doesn't the WHATwg spec as starting point mean that the WHATwg somehow has been 
given a responsibility here? To present its spec to the _HTMLwg_? Section by 
section. After all, you wanted the HTMLwg to accept it. And you are therefore 
obliged to present it - and deserve the space and time to do so. It is really 
difficult to discuss small bits such as class names unless we have a broader 
context.

On 2007-05-23 23:20:40 +0200 Ian Hickson <[EMAIL PROTECTED]> replied to Julian:
> If the spec I'm working on isn't that spec, then I'll stop working on it, and 
> return to working on the spec with real-world relevance.)

I think many would feel that the whole process would pretty much fall apart if 
this should happen. On the other side, it doesn't sound as if you are open to 
much debate. You'd better think about how you present this to the HTMLwg. No one 
likes to discuss under a sword of Damocles. On the other side, it is only fair to 
say that there are some limits on what one can accept. But then again, the 
HTMLwg has been convened pretty much because of the WHATwg - so it would be a bit 
strange.

[1] <http://dev.w3.org/cvsweb/html5/spec/Overview.html#rev1.26>
-- 
leif halvard silli





Re: [whatwg] Predefined classes are gone

2007-05-17 Thread Leif Halvard Silli
On 2007-05-18 00:37:53 +0200 Ian Hickson <[EMAIL PROTECTED]> wrote:

> On Thu, 17 May 2007, Adrienne Travis wrote:
>> 
>> Is there still room for discussion regarding the /role/ attribute with 
>> namespacing as an alternative? A lot of us loved the IDEA of predefined 
>> "classes", but didn't like the idea of confusing THAT mechanism with the 
>> CSS class mechanism.
> 
> I'd rather we discussed use cases first, and then tried to work out what 
> solutions fit those use cases. The main reason the predefined classes were 
> removed is that they had no convincing use cases.

But opposition would still be there even when those cases are found. Especially 
if A) one cannot avoid this new semantic being applied to old pages, B) authors 
cannot easily avoid the predefined meaning of these names when they want to.

Maciej Stachowiak interestingly told the HTMLwg list that 'rel=nofollow' is a 
microformat.  So, why not think 'microformat' in this context also? Not so much 
in the picking of the specific class _names_  as in how we enable their 
_meaning_. On the HTMLwg list it was also said by someone that Tantek's 
microformats are typically hierarchical - they are a whole structure of 
elements with specified class names. That way their CLASSes cannot easily clash 
with «meaningless» class names. So hierarchy seems like an important principle. 
Another principle is: the author must enable it. That is also how it is with 
REL=nofollow - the author enables it. :-)

«Predefined class names» is not an incentive for an HTML author. «Accessibility» 
_is_. In this message, I'll refer to these (now «gone») class names as 
«HTML5UniversalAccess». Then, to create a hierarchic «microformat», we could 
require that their meaning is «enabled» simply by applying 
class="HTML5UniversalAccess" either on HTML, BODY, LINK or STYLE. 

Namespace: Since CLASS on HTML, LINK and STYLE is not permitted in HTML401, 
using class on any of those elements would in effect create a kind of 
«namespace». For use in HTML401 documents, one could permit it on BODY. HTML or 
BODY may seem like the most natural place for such a class. But as authors do 
most things that are related to class & style in the STYLE and LINK elements, 
putting the «trigger class» there might also be recommendable.
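
A sketch of what the «trigger» could look like («copyright» is one of the 
names under discussion; the rest of the markup is just illustration):

  <html class="HTML5UniversalAccess">
    <!-- the class on HTML «enables» the predefined class names below -->
    <body>
      ...
      <p class="copyright">© 2007 Example Org. All rights reserved.</p>
    </body>
  </html>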

Also: it is easy to think of other predefined class names than those that HTML5 
will define. If we think «microformat» when we define the (mechanism for the) 
predefined class names of HTML5, then I think we can create a pattern for how 
authors and groups themselves can define such things in a way that can be used 
by UAs.

PS: if «copyright» etc. really are widely used - even before anything has become 
defined - in a way that is semantically in line with how HTML5 would define 
the same names, then I suggest UAs enable some kind of sniffing to handle those 
cases where, for instance, «class="HTML5UniversalAccess"» is lacking (or spelled 
incorrectly).
-- 
leif halvard silli, oslo
www.målform.no