Re: [Lynx-dev] rendering (0x97)

2020-07-01 Thread Halaasz Saandor via Lynx-dev

2020/06/30 10:31 ... David Woolley:
> Are you sure that the browser is given free rein?  I thought the HTML5
> principle is that every browser should produce the same output
> regardless of whether the document was syntactically valid, and that is
> why they define error cases in such detail.
>
> I think there is also a principle that pre-HTML5 invalid pages should
> produce results similar to those on mainstream pre-HTML5 browsers.


Seems I misstated it. I meant that in former interpretations of HTML, 
ere 5, in case of such error the webbrowser's behavior is not defined. 
And after Mouse's words, if every (mis)use of HTML gets a fixed 
interpretation, where is the error?


___
Lynx-dev mailing list
Lynx-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/lynx-dev


Re: [Lynx-dev] rendering (0x97)

2020-06-30 Thread Mouse
> I thought the HTML5 principle is that every browser should produce
> the same output regardless of whether the document was syntactically
> valid, and that is why they define error cases in such detail.

Then is "error" really an appropriate word?  If an interpretation is
defined for something, how much sense does it make to call it an error?

And, of course, it's completely impossible for every browser to produce
the same output, unless "the same" is taken in a ludicrously relaxed
sense.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B



Re: [Lynx-dev] rendering (0x97)

2020-06-30 Thread David Woolley

On 30/06/2020 14:49, Halaasz Saandor via Lynx-dev wrote:
> say text between  and , with the webbrowser free to make any
> interpretation of it.


Are you sure that the browser is given free rein?  I thought the HTML5 
principle is that every browser should produce the same output 
regardless of whether the document was syntactically valid, and that is 
why they define error cases in such detail.


I think there is also a principle that pre-HTML5 invalid pages should 
produce results similar to those on mainstream pre-HTML5 browsers.




Re: [Lynx-dev] rendering (0x97)

2020-06-30 Thread Halaasz Saandor via Lynx-dev

2020/06/28 13:34 ... Thomas Dickey:

> but in the meantime, the html5 crowd declared that iso-8859-1 is
> identical to cp1252


I, too, think the crowd crazie, for other reasons besides: when I heard 
of this crowd I glanced at the website, and found great effort exerted 
on the meaning of strings that had been considered bad HTML, say text 
between  and , with the webbrowser free to make any 
interpretation of it. Furthermore, I looked for limitless nesting, as 
the deprecated  is allowed within , to the webbrowser's 
limits, and found that such limitless nesting is not and never shall be 
part of HTML, right after claims that the element structure allows 
that--but only in style!
Sounds as if one has to use something like MSoft Word to generate it, 
because Word well fakes nesting by means of indenting. Ugh.




Re: [Lynx-dev] rendering (0x97)

2020-06-30 Thread Halaasz Saandor via Lynx-dev

2020/06/29 14:43 ... Mouse:

>> I hav seen that, and , in Microsoft HTML from Word.
>
> That means little.  Just because a Microsoft program generates
> something does not mean it's compatible with non-Microsoft software,
> and sometimes does not even mean it's compatible with other Microsoft
> software, and certainly does not mean it's correct.


My point was the perversity of Microsoft software. HT is almost useless 
in HTML, it is only another space.




Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Mouse
>> Content-Encoding=Windows-1252
> I meant Charset, and I hadn't read the other replies.

> If it is the document character set I'm not sure how one should
> interpret that for variable length codes.

As a codepoint, rather than as an encoding octet, I would guess.

Content-Type:'s charset= is actually two things.  (It arguably
shouldn't be, but since when has that made any difference to
HTTP-family protocols?)  It is a charset in the strict sense, a mapping
from integer codepoints to abstract characters, and it is an encoding,
a way of turning a stream of integer codepoints into a stream of
octets.  The latter really should be split out into a separate header;
I speculate that that wasn't done because everyone used the trivial
encoding for single-octet character sets, then added UTF-8, and nobody
noticed that they were silently adding an encoding spec to the charset
spec until after it got entrenched.
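A minimal Python sketch of the two roles described above, using the em dash as the example character. The variable names are illustrative only, and the codec names are the ones shipped with Python's standard library; nothing here is taken from the Lynx sources.

```python
# One abstract character and its Unicode codepoint (the "charset" sense:
# a mapping between integers and characters).
EM_DASH = "\u2014"
codepoint = ord(EM_DASH)

# The "encoding" sense: how that one codepoint becomes a stream of octets.
as_utf8 = EM_DASH.encode("utf-8")       # three octets
as_utf16 = EM_DASH.encode("utf-16-be")  # two octets
as_cp1252 = EM_DASH.encode("cp1252")    # one octet; here the octet value
                                        # and the codepage position coincide

print(hex(codepoint), as_utf8.hex(), as_utf16.hex(), as_cp1252.hex())
```

With a single-octet character set the two roles are indistinguishable, which is consistent with the speculation that nobody noticed the conflation until multi-octet encodings arrived.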

I could argue it either way whether something like &#151; should be
"octet 151 for the encoding specified by charset=" or "codepoint 151
for the character set specified by charset=".  I do strongly believe
it is broken for it to be "Unicode codepoint 151" even if the charset=
specifies something very non-Unicode like 8859-14 or KOI-8.  If nothing
else, it makes it completely impossible to represent non-single-octet
codepoints when using a character set that is not a subset of Unicode.
But what I believe doesn't matter

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B



Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Thorsten Glaser
David Woolley dixit:

> If it is the document character set I'm not sure how one should
> interpret that for variable length codes.

Right…

| 4.1 Character and Entity References
|
| [Definition: A character reference refers to a specific character in
| the ISO/IEC 10646 character set, for example one not directly
| accessible from available input devices.] Character Reference
|
| [66]CharRef::=

Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread David Woolley

On 29/06/2020 20:51, David Woolley wrote:
 Content-Encoding=Windows-1252

I meant Charset, and I hadn't read the other replies.

If it is the document character set I'm not sure how one should 
interpret that for variable length codes.




Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread David Woolley

On 29/06/2020 19:07, Halaasz Saandor via Lynx-dev wrote:
> What do you mean? The actual Unicode number is U+2014, or 8212, and
> &#151; is simply cp1252 in disguise. I hav seen that, and , in
> Microsoft HTML from Word.


I mean that &#151; sent with Content-Encoding=Windows-1252 is still 
interpreted as Unicode and therefore has no valid graphic.




Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Thorsten Glaser
Mouse dixit:

>I think the double-quoted text above is saying that &#151; is defined
>to be not "codepoint 151 in the encoding specified by the
>Content-Type:" but rather "Unicode codepoint 151".
>
>Is that actually true?  I don't know; I'm not au courant enough with

No, but the document character set is Unicode in UTF-8 encoding.

In both XML and HTML, numeric (decimal or hexadecimal) entities
are in the document character set.

bye,
//mirabilos
-- 
Yay for having to rewrite other people's Bash scripts because bash
suddenly stopped supporting the bash extensions they make use of
-- Tonnerre Lombard in #nosec



Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Mouse
>> but if they are sending &#151; over the wire, rather than a byte
>> containing the value 151, the contents encoding wouldn't matter, as
>> entities are interpreted in Unicode,

> What do you mean?  The actual Unicode number is U+2014, or 8212, and
> &#151; is simply cp1252 in disguise.

I think the double-quoted text above is saying that &#151; is defined
to be not "codepoint 151 in the encoding specified by the
Content-Type:" but rather "Unicode codepoint 151".

Is that actually true?  I don't know; I'm not au courant enough with
Web specs to know where to look - I have as little to do with the Web
as I can get away with.

> I hav seen that, and , in Microsoft HTML from Word.

That means little.  Just because a Microsoft program generates
something does not mean it's compatible with non-Microsoft software,
and sometimes does not even mean it's compatible with other Microsoft
software, and certainly does not mean it's correct.

For example, I've seen mail generated by Microsoft tools with
codepoints in the 128-159 range, obviously intended to be printable
characters, but labeled as being 8859-1.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B



Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Thorsten Glaser
Halaasz Saandor via Lynx-dev dixit:

> &#151; is simply cp1252 in disguise

It’s not, 

Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread Halaasz Saandor via Lynx-dev

2020/06/28 18:28 ... David Woolley:
> but if they are sending &#151; over the wire, rather than a byte
> containing the value 151, the contents encoding wouldn't matter, as
> entities are interpreted in Unicode,


What do you mean? The actual Unicode number is U+2014, or 8212, and 
&#151; is simply cp1252 in disguise. I hav seen that, and , in 
Microsoft HTML from Word.




Re: [Lynx-dev] rendering (0x97)

2020-06-29 Thread russellbell
Quoth David Woolley: 'Firefox on Debian also faults it:
'adventures '

Firefox from Slackware renders it as an em dash.  2 of my
resources identify it as em dash.

Usually you-all ignore my character-rendering comments.  I
don't mind; I edit the source to my preferences.  I bring it up on
this list in case it helps someone else who wants to customize theirs.

nytimes.com encodes pages that existed before digitization
variously.  It suits me to accommodate their mistakes if it doesn't
conflict with another character.  I don't need 'C1 special code'.  I
suspect it's left over from the good old TTY days - ah polar relays! -
I can hear them now.  They used to be kept behind plexiglass screens
to dampen the noise.

russell bell



Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread David Woolley

On 28/06/2020 18:40, Thorsten Glaser wrote:



>> but in the meantime, the html5 crowd declared that iso-8859-1 is
>> identical to cp1252
>
> WHAT‽


WHATWG!

Looking at the network traffic, on Firefox, I can see why they wouldn't 
like a non-WHATWG compliant browser.  The page even seems to report back 
frequently as to whether you still have it open!  It loads a large number 
of resources which are not there for your benefit.  You are completely 
breaking the real purpose of the page by trying to access just the 
editorial.


Firefox is reporting the & as , even in the source as recovered 
from the developer tools, so I'm not certain what is going over the 
wire, but if they are sending &#151; over the wire, rather than a 
byte containing the value 151, the contents encoding wouldn't matter, as 
entities are interpreted in Unicode, unless WHATWG have also dictated 
that control characters be overlaid with CP 1252 characters in WHATWG 
"Unicode".


The fact that Debian Firefox doesn't play ball on this makes me think 
they either haven't done that, or only did it very recently.
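One way to check this from outside a browser is Python's standard `html.unescape`, whose documentation says it follows the HTML 5 rules for both valid and invalid character references. The sketch below compares reading 151 literally as a Unicode codepoint with what that parser produces; it shows Python's behaviour only and is not a claim about what any particular Firefox build does.

```python
import html
import unicodedata

# Reading 151 literally as a Unicode codepoint gives U+0097, a C1 control.
literal = chr(151)

# Parsing the numeric reference under the HTML 5 rules, as implemented by
# the standard library.
parsed = html.unescape("&#151;")

print(f"literal: U+{ord(literal):04X} category={unicodedata.category(literal)}")
print(f"parsed:  U+{ord(parsed):04X} name={unicodedata.name(parsed)}")
```

In this implementation references in the 128-159 range are remapped through Windows-1252, so the overlay does exist in at least one HTML5-style parser.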





Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread Thomas Dickey


- Original Message -
| From: "Thorsten Glaser" 
| Cc: "lynx-dev" 
| Sent: Sunday, June 28, 2020 1:40:48 PM
| Subject: Re: [Lynx-dev] rendering  (0x97)

| Thomas Dickey dixit:
| 
|>but in the meantime, the html5 crowd declared that iso-8859-1 is
|>identical to cp1252
| 
| WHAT‽
| 
| I knew they were crazy, but… like THAT?

Here's something relevant:

https://encoding.spec.whatwg.org/#names-and-labels

I seem to recall reading that in one of those pages summarizing changes for 
html5.

On the other hand, it might be one of those "facts" created in Wikipedia 
(there's a lot of that).
And even if I saw it some other place, Wikipedia might still be the ultimate 
source.

Looking there, I see it evolving since

https://en.wikipedia.org/w/index.php?title=Windows-1252&type=revision&diff=267905312&oldid=262016711
https://en.wikipedia.org/w/index.php?title=Windows-1252&type=revision&diff=285015046&oldid=285011908

with the second edit referring to

https://web.archive.org/web/20090417231914/http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html

Here's the source for the first edit:

https://web.archive.org/web/20090204094727/http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

See "8.2.2.2 Character encoding requirements", which (seems familiar) says that 
ISO-8859-1 should be treated as if it were CP1252.

Move forward to 2012, and the wording is amended

https://web.archive.org/web/20120930155353/http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

and going to 2013, I don't see it anymore.

That is, I don't see it in whatwg at that point.  But Wikipedia's been updated, 
and so has whatwg...

As of today, here's the current page:

https://en.wikipedia.org/w/index.php?title=Windows-1252&oldid=964485118

which says

This is now standard behavior in the HTML5 specification, which requires that 
documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 
encoding.[5]

[5] "Encoding". WHATWG. 27 January 2015. sec. 5.2 Names and labels. Archived 
from the original on 4 February 2015. Retrieved 4 February 2015.

That is, it points to something that we can read on Internet Archive:

https://web.archive.org/web/20150204174315/https://encoding.spec.whatwg.org/#names-and-labels

...and that page does say (in effect) that ISO-8859-1 and several other 
charsets:

"ansi_x3.4-1968"
"ascii"
"cp1252"
"cp819"
"csisolatin1"
"ibm819"
"iso-8859-1"
"iso-ir-100"
"iso8859-1"
"iso88591"
"iso_8859-1"
"iso_8859-1:1987"
"l1"
"latin1"
"us-ascii"
"windows-1252"
"x-cp1252"

are to be interpreted as CP1252.  The current page gives the same information:

https://web.archive.org/web/20200613144751/https://encoding.spec.whatwg.org/
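A rough Python sketch of what that table implies for a client: every label in the list above resolves to the same decoder. The helper names `resolve_label` and `decode_as_labelled` are hypothetical, and trimming whitespace and folding case before the lookup is an assumption about how labels are matched; the label strings themselves are copied from the list.

```python
from typing import Optional

# Labels copied from the list above; per that page they all mean CP1252.
WINDOWS_1252_LABELS = frozenset({
    "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819",
    "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1",
    "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252",
    "x-cp1252",
})


def resolve_label(label: str) -> Optional[str]:
    """Return 'windows-1252' for any label in the list, otherwise None."""
    key = label.strip().lower()  # assumed matching rule, see lead-in
    return "windows-1252" if key in WINDOWS_1252_LABELS else None


def decode_as_labelled(data: bytes, label: str) -> str:
    """Decode data the way the table says a labelled document is read."""
    if resolve_label(label) is None:
        raise LookupError(f"label not in the windows-1252 group: {label!r}")
    return data.decode("cp1252")


# A document advertised as ISO-8859-1 containing the single octet 0x97.
print(repr(decode_as_labelled(b"\x97", "ISO-8859-1")))
```

Under this reading, octet 0x97 in a page labelled iso-8859-1 comes out as U+2014 rather than as a C1 control.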
 
| That being said this still is UTF-8, not ISO-8859-1…


-- 
Thomas E. Dickey 
http://invisible-island.net
ftp://ftp.invisible-island.net



Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread Thorsten Glaser
Thomas Dickey dixit:

>but in the meantime, the html5 crowd declared that iso-8859-1 is
>identical to cp1252

WHAT‽

I knew they were crazy, but… like THAT?

That being said this still is UTF-8, not ISO-8859-1…

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2



Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread Thomas Dickey
On Sun, Jun 28, 2020 at 06:07:47PM +0100, David Woolley wrote:
> On 28/06/2020 17:44, Thorsten Glaser wrote:
> > cp1252, “the” most common Windows/ANSI codepage, has it as U+2014
> > (em dash).
> 
> I found that eventually, but you got there first.
> 
> This, ten year old, thread explains, in more detail, why this is broken
> HTML: 
> ·

sure.

but in the meantime, the html5 crowd declared that iso-8859-1 is
identical to cp1252

-- 
Thomas E. Dickey 
https://invisible-island.net
ftp://ftp.invisible-island.net




Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread David Woolley

On 28/06/2020 17:44, Thorsten Glaser wrote:

cp1252, “the” most common Windows/ANSI codepage, has it as U+2014
(em dash).


I found that eventually, but you got there first.

This ten-year-old thread explains, in more detail, why this is broken 
HTML: 
·





Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread Thorsten Glaser
David Woolley dixit:

> UTF8, not a Microsoft code page. Actually, I have been unable to find 
> any common code page where that codepoint represents a character that 

cp1252, “the” most common Windows/ANSI codepage, has it as U+2014
(em dash).

| Content-Type: text/html; charset=utf-8

Although since the HTTP response declares UTF-8 encoding, it
obviously has no meaning.
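To make that concrete, here is a small Python sketch that decodes the lone octet 0x97 under three declarations. `try_decode` is a hypothetical helper written for this illustration; the codec names are Python's own.

```python
from typing import Optional


def try_decode(data: bytes, codec: str) -> Optional[str]:
    """Return the decoded text, or None if data is not valid in codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


octet = bytes([0x97])
for codec in ("cp1252", "iso-8859-1", "utf-8"):
    print(codec, repr(try_decode(octet, codec)))
```

Only cp1252 yields the em dash. Strict ISO-8859-1 yields the C1 control U+0097, and as UTF-8 the octet is a stray continuation byte that does not decode at all.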

bye,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL



Re: [Lynx-dev] rendering (0x97)

2020-06-28 Thread David Woolley

On 28/06/2020 17:03, russellb...@gmail.com wrote:

I encountered this in
https://www.nytimes.com/2003/09/26/obituaries/george-plimpton-urbane-and-witty-writer-dies-at-76.html
lynx un-rendered it.  I use def7_uni.tbl .  lynx renders it in
some other tables.  I added an entry to mine.


That character is C1 Controls: END OF GUARDED AREA.  There is no Unicode 
graphic associated with it and it is not one of the few control 
characters that have any significance in Unicode.


Firefox on Debian also faults it: 'adventures  as "professional" 
athlete, stand-up comedian, movie bad guy or circus performer '. 
It is probably a Microsoft code page character, but the page charset is 
UTF8, not a Microsoft code page. Actually, I have been unable to find 
any common code page where that codepoint represents a character that 
would make sense in context.


This is a bug with the New York Times article.



[Lynx-dev] rendering (0x97)

2020-06-28 Thread russellbell
I encountered this in
https://www.nytimes.com/2003/09/26/obituaries/george-plimpton-urbane-and-witty-writer-dies-at-76.html
lynx un-rendered it.  I use def7_uni.tbl .  lynx renders it in
some other tables.  I added an entry to mine.

russell bell
