Re: [Lynx-dev] rendering — (0x97)
>> Content-Encoding=Windows-1252

> I meant Charset, and I hadn't read the other replies.

> If it is the document character set I'm not sure how one should
> interpret that for variable length codes.

As a codepoint, rather than as an encoding octet, I would guess.

Content-Type:'s charset= is actually two things. (It arguably shouldn't be, but since when has that made any difference to HTTP-family protocols?) It is a charset in the strict sense, a mapping from integer codepoints to abstract characters, and it is an encoding, a way of turning a stream of integer codepoints into a stream of octets. The latter really should be split out into a separate header; I speculate that that wasn't done because everyone used the trivial encoding for single-octet character sets, then added UTF-8, and nobody noticed that they were silently adding an encoding spec to the charset spec until after it got entrenched.

I could argue it either way whether something like &#151; should be "octet 151 for the encoding specified by charset=" or "codepoint 151 for the character set specified by charset=". I do strongly believe it is broken for it to be "Unicode codepoint 151" even if the charset= specifies something very non-Unicode like 8859-14 or KOI-8. If nothing else, it makes it completely impossible to represent non-single-octet codepoints when using a character set that is not a subset of Unicode.

But what I believe doesn't matter

/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML                  mo...@rodents-montreal.org
/ \ Email!                        7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

___
Lynx-dev mailing list
Lynx-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/lynx-dev
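The distinction drawn in this message, between a character set (characters to integers) and an encoding (integers to octets), can be illustrated with a short Python sketch. It is not taken from Lynx or from any spec; the variable names are made up for the example, and Python's own codec names stand in for what a charset= label would select.

```python
# One abstract character has one Unicode codepoint but many octet encodings.
em_dash = "\u2014"                        # EM DASH

# The "charset" half: abstract character <-> integer codepoint.
codepoint = ord(em_dash)                  # 8212 == 0x2014

# The "encoding" half: codepoint -> octets, which differs per encoding.
as_utf8 = em_dash.encode("utf-8")         # three octets
as_utf16 = em_dash.encode("utf-16-be")    # two octets
as_cp1252 = em_dash.encode("cp1252")      # one octet, value 151 (0x97)

print(codepoint, as_utf8.hex(), as_utf16.hex(), as_cp1252.hex())
```

In Windows-1252 the codepoint-to-octet step is trivial, which is roughly why the number 151 can be read either as an octet or as a codepoint in that character set.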
Re: [Lynx-dev] rendering — (0x97)
David Woolley dixit:

> If it is the document character set I'm not sure how one should
> interpret that for variable length codes.

Right…

| 4.1 Character and Entity References
|
| [Definition: A character reference refers to a specific character in
| the ISO/IEC 10646 character set, for example one not directly
| accessible from available input devices.] Character Reference
|
| [66] CharRef ::= '&#' [0-9]+ ';'
|                | '&#x' [0-9a-fA-F]+ ';' [WFC: Legal Character]

I stand corrected.

Sorry, my mind’s on two different projects right now,
//mirabilos
-- 
22:20⎜ The crazy that persists in his craziness becomes a master
22:21⎜ And the distance between the craziness and geniality is only measured by the success
18:35⎜ "Psychotics are consistently inconsistent. The essence of sanity is to be inconsistently inconsistent
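As a rough illustration of the quoted production, here is a minimal Python sketch that recognises decimal and hexadecimal character references and resolves each one to the ISO/IEC 10646 (Unicode) character with that number. The function name `resolve_char_refs` is hypothetical, and the sketch does not implement the "Legal Character" well-formedness check.

```python
import re

# Production [66] CharRef: '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def resolve_char_refs(text: str) -> str:
    """Replace each numeric character reference with the character it names."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            number = int(hex_digits, 16)
        else:
            number = int(dec_digits)
        # chr() maps the number straight to a Unicode codepoint; it raises
        # ValueError for numbers above 0x10FFFF.
        return chr(number)

    return CHAR_REF.sub(replace, text)


print(repr(resolve_char_refs("a&#8212;b")))   # 'a\u2014b' (em dash)
print(repr(resolve_char_refs("a&#151;b")))    # 'a\x97b' (a C1 control)
```

Under this reading the reference number never passes through the transport encoding, so 151 lands on U+0097 no matter what charset= says.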
Re: [Lynx-dev] rendering — (0x97)
On 29/06/2020 20:51, David Woolley wrote:
> Content-Encoding=Windows-1252

I meant Charset, and I hadn't read the other replies.

If it is the document character set, I'm not sure how one should interpret that for variable-length codes.
Re: [Lynx-dev] rendering — (0x97)
On 29/06/2020 19:07, Halaasz Saandor via Lynx-dev wrote:
> What do you mean? The actual Unicode number is U+2014, or 8212, and
> &#151; is simply cp1252 in disguise. I have seen that, and , in
> Microsoft HTML from Word.

I mean that &#151; sent with Content-Encoding=Windows-1252 is still interpreted as Unicode and therefore has no valid graphic.
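The two readings of the number 151 argued over here can be compared directly. The following is a minimal Python sketch, not Lynx code; it also shows what Python's standard `html.unescape` does, which follows the HTML5 parsing rules and remaps references in the 128-159 range through Windows-1252 instead of treating them as C1 controls.

```python
import html
import unicodedata

# Reading 1: 151 as a Unicode codepoint, i.e. U+0097.
as_unicode = chr(151)

# Reading 2: 151 as an octet (equivalently a codepoint) in Windows-1252.
as_cp1252 = bytes([151]).decode("cp1252")

print(unicodedata.category(as_unicode))   # Cc: a control character, no graphic
print(unicodedata.name(as_cp1252))        # EM DASH
print(ord(as_cp1252))                     # 8212, i.e. U+2014

# The standard library's HTML5-style handling picks the cp1252 reading.
print(html.unescape("&#151;") == as_cp1252)   # True
```

Which reading a given browser applies is a separate matter; the sketch only shows that both are mechanically well defined.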
Re: [Lynx-dev] rendering — (0x97)
Mouse dixit:

>I think the double-quoted text above is saying that &#151; is defined
>to be not "codepoint 151 in the encoding specified by the
>Content-Type:" but rather "Unicode codepoint 151".
>
>Is that actually true? I don't know; I'm not au courant enough with

No, but the document character set is Unicode in UTF-8 encoding. In both XML and HTML, numeric (decimal or hexadecimal) entities are in the document character set.

bye,
//mirabilos
-- 
Yay for having to rewrite other people's Bash scripts because bash suddenly stopped supporting the bash extensions they make use of
	-- Tonnerre Lombard in #nosec
Re: [Lynx-dev] rendering — (0x97)
>> but if they are sending &#151; over the wire, rather than a byte
>> containing the value 151, the content encoding wouldn't matter, as
>> entities are interpreted in Unicode,

> What do you mean? The actual Unicode number is U+2014, or 8212, and
> &#151; is simply cp1252 in disguise.

I think the double-quoted text above is saying that &#151; is defined to be not "codepoint 151 in the encoding specified by the Content-Type:" but rather "Unicode codepoint 151".

Is that actually true? I don't know; I'm not au courant enough with Web specs to know where to look - I have as little to do with the Web as I can get away with.

> I have seen that, and , in Microsoft HTML from Word.

That means little. Just because a Microsoft program generates something does not mean it's compatible with non-Microsoft software, and sometimes does not even mean it's compatible with other Microsoft software, and certainly does not mean it's correct. For example, I've seen mail generated by Microsoft tools with codepoints in the 128-159 range, obviously intended to be printable characters, but labeled as being 8859-1.

/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML                  mo...@rodents-montreal.org
/ \ Email!                        7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
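The mislabeling described at the end of this message is easy to reproduce. Below is a minimal Python sketch in which the sample bytes are hypothetical, chosen only to fall in the 128-159 range: decoded as labeled (ISO 8859-1) they become C1 control characters, while decoded as Windows-1252 they become the printable characters that were presumably intended.

```python
import unicodedata

# Hypothetical mail body: two octets in the 128-159 range around ASCII text.
raw = b"\x93quoted\x94"

as_labeled = raw.decode("iso-8859-1")   # what the label says
as_intended = raw.decode("cp1252")      # what the sender presumably meant

# Under ISO 8859-1 the first character is a control code with no graphic.
print(unicodedata.category(as_labeled[0]))    # Cc

# Under Windows-1252 the same octet is a printable curly quote.
print(unicodedata.name(as_intended[0]))       # LEFT DOUBLE QUOTATION MARK
```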
Re: [Lynx-dev] rendering — (0x97)
Halaasz Saandor via Lynx-dev dixit:

> &#151; is simply cp1252 in disguise

It’s not; &#number; references are interpreted as decimal numbers in the document charset.

bye,
//mirabilos
-- 
Stéphane, I actually don’t block Googlemail, they’re just too utterly stupid to successfully deliver to me (or anyone else using Greylisting and not whitelisting their ranges). Same for a few other providers such as Hotmail. Some spammers (Yahoo) I do block.
Re: [Lynx-dev] rendering — (0x97)
2020/06/28 18:28 ... David Woolley:
> but if they are sending &#151; over the wire, rather than a byte
> containing the value 151, the content encoding wouldn't matter, as
> entities are interpreted in Unicode,

What do you mean? The actual Unicode number is U+2014, or 8212, and &#151; is simply cp1252 in disguise. I have seen that, and , in Microsoft HTML from Word.
Re: [Lynx-dev] rendering — (0x97)
Quoth David Woolley:
'Firefox on Debian also faults it: 'adventures —'

Firefox from Slackware renders it as an em dash. 2 of my resources identify it as em dash.

Usually you-all ignore my character-rendering comments. I don't mind; I edit the source to my preferences. I bring it up on this list in case it helps someone else who wants to customize theirs.

nytimes.com encodes pages that existed before digitization variously. It suits me to accommodate their mistakes if it doesn't conflict with another character. I don't need 'C1 special code'. I suspect it's left over from the good old TTY days - ah polar relays! - I can hear them now. They used to be kept behind plexiglass screens to dampen the noise.

russell bell