Hi all,
Following up on a discussion on yesterday's Tools Call --
On 4/14/25 2:40 AM, Colin Perkins wrote:
(belatedly, inline)
On 20 Mar 2025, at 6:27, Carsten Bormann wrote:
On 20. Mar 2025, at 07:11, Robert Sparks wrote:
On 3/20/25 11:09 AM, Carsten Bormann wrote:
On 20. Mar 2025, at 04:45, Jean Mahoney wrote:
[JM] TEXT is used for RFCs created in the RFCXML v3 era.
ASCII is for older RFCs. The TEXT label indicates the
file can contain non-ASCII characters [2].
There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
(And actually a couple that aren’t even UTF-8!)
Pointers to the non-UTF8 encoded RFCs please?
I didn’t take notes when I last checked this, but I can do the check
again.
Let’s start with:
rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
rfc316 rfc317 rfc323 rfc327 rfc367 rfc369
[JM] Each of the RFCs listed above has a note at the bottom of the file
recording that it was put into machine-readable form. These notes are the
only place non-ASCII is found in these files:
[This RFC was put into machine readable form for entry]
[into the online RFC archives by Kelly Tardif, Viag�nie 10/99]
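A check along the following lines can locate such bytes. This is only a minimal sketch, not the tooling used for the survey above: the function name classify_bytes is hypothetical, and the sample assumes the accented character in the note is stored as the single ISO 8859-1 byte 0xE9, which was not verified against the files.

```python
from typing import Optional, Tuple


def classify_bytes(data: bytes) -> Tuple[str, Optional[int]]:
    """Classify data as "ascii", "utf-8" or "other".

    The second element is the offset of the first non-ASCII byte
    (for "utf-8") or of the first byte that is not valid UTF-8
    (for "other"); it is None for pure ASCII.
    """
    try:
        data.decode("ascii")
        return ("ascii", None)
    except UnicodeDecodeError as ascii_err:
        first_non_ascii = ascii_err.start
    try:
        data.decode("utf-8")
        return ("utf-8", first_non_ascii)
    except UnicodeDecodeError as utf8_err:
        return ("other", utf8_err.start)


# Hypothetical sample: 0xE9 is assumed, not read from an RFC file.
print(classify_bytes(b"Viag\xe9nie"))  # ('other', 4)
```

Reading the file in binary mode and passing the bytes to such a function avoids any implicit decoding by the reader.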
rfc441
[JM] In addition to the note about putting the file into
machine-readable form, there's this:
U + 3 --> X�
rfc2497
[JM] Has an em dash in the [EU164] reference.
rfc2557
[JM] There's non-ASCII in an example:
E with acute accent becomes �.<br>
E with acute accent becomes É.<p>
rfc2708 rfc2875
[JM] A smart apostrophe was used:
assigned ID�s, there is...
rfc2875
[JM] Smart quotes and apostrophe were used:
TBS: the �text� for computing the SHA-1 HMAC.
Signature verification requires CA�s private key
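Smart quotes of this kind often originate from Windows-1252, where the byte values 0x91 to 0x94 are curly quotation marks; in ISO 8859-1 the same values are C1 control characters. The sketch below assumes the apostrophe is stored as byte 0x92. That is an assumption for illustration only: the code comments quoted later in this thread report a byte 0xC6 in RFC 2708, so the real files may differ.

```python
# Hypothetical bytes: 0x92 is assumed, not read from the RFC file.
sample = b"CA\x92s private key"

as_cp1252 = sample.decode("windows-1252")
as_latin1 = sample.decode("iso8859_1")

print(as_cp1252)           # CA’s private key
print(repr(as_latin1[2]))  # '\x92' (a C1 control character, not an apostrophe)
```

Both decodes succeed without error, so the choice of codec cannot be validated by decoding alone; it has to be judged by whether the result reads correctly.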
For info, here are a few RFCs that are not v3 but not ASCII either:
rfc8187
[JM] RFC 8187 is titled "Indicating Character Encoding and Language for
HTTP Header Field Parameters". It's the first RFC published with UTF-8
characters at the request of the authors.
rfc8264 rfc8265 rfc8266
[JM] These RFCs specify the preparation, enforcement, and comparison of
internationalized strings (PRECIS) and were published with UTF-8
characters.
Best regards,
Jean
And then there are the RFCs that contain NUL bytes, like RFC 674…
I didn’t do a full categorization of these critters.
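NUL is a valid ASCII and UTF-8 code point, so a decode-based check will not flag it; it has to be searched for explicitly. A minimal sketch, with a hypothetical function name and made-up sample bytes:

```python
from typing import List


def nul_offsets(data: bytes) -> List[int]:
    """Return the offsets of all NUL (0x00) bytes in data."""
    return [i for i, byte in enumerate(data) if byte == 0]


sample = b"abc\x00def\x00"          # made-up bytes, not taken from RFC 674
print(nul_offsets(sample))          # [3, 7]
print(sample.decode("utf-8") != "") # True: decoding succeeds despite the NULs
```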
We have the following, although it’s been many years since it was
checked for accuracy:
    def charset(self) -> str:
        """
        Most RFCs are UTF-8, or it's ASCII subset. A few are not. Return
        an appropriate encoding for the text of this RFC.
        """
        if   (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
             (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
             (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
             (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
             (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
             (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
             (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
             (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
             (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
             (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
             (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
             (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
             (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
             (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
             (self.doc_id == "RFC1305"):
            return "iso8859_1"
        elif self.doc_id == "RFC2166":
            return "windows-1252"
        elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
            return "iso8859_1"
        elif self.doc_id == "RFC2708":
            # This RFC is corrupt: line 521 has a byte with value 0xC6 that
            # is clearly intended to be a ' character, but that code point
            # doesn't correspond to ' in any character set I can find. Use
            # ISO 8859-1 which gets all characters right apart from this.
            #
            # According to Greg Skinner: "regarding the test in line 268
            # for RFC2708, as far as I can tell, U+0092 was introduced in
            # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
            # it was replaced with U+0027 everywhere except section 5.0.
            # Somehow, that stray character became the corrupt text you
            # identified."
            # (https://github.com/glasgow-ipl/ietfdata/issues/137)
            return "iso8859_1"
        elif self.doc_id == "RFC2875":
            # Both the text and PDF versions of this document have corrupt
            # characters (lines 754 and 926 of the text version). Using
            # ISO 8859-1 is no more corrupt than the original.
            return "iso8859_1"
        else:
            return "utf-8"
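For anyone wanting to reuse the mapping outside that class, the same decisions can be restated as a lookup table. This is a hypothetical, standalone sketch rather than the ietfdata code: the names CHARSET_OVERRIDES and charset_for are invented here, and the table simply copies the document IDs from the method quoted above, so it inherits the same caveat about not having been rechecked recently.

```python
# RFC numbers that the quoted method maps to ISO 8859-1.
_LATIN1_NUMBERS = (
    64, 101, 177, 178, 182, 227, 234, 235, 237, 243,
    270, 282, 288, 290, 292, 303, 306, 307, 310, 313,
    315, 316, 317, 323, 327, 367, 369, 441, 1305,
    2497, 2557, 2708, 2875,
)

# Document ID -> codec name; anything not listed is treated as UTF-8.
CHARSET_OVERRIDES = {f"RFC{n:04d}": "iso8859_1" for n in _LATIN1_NUMBERS}
CHARSET_OVERRIDES["RFC2166"] = "windows-1252"


def charset_for(doc_id: str) -> str:
    """Return the codec name to use when decoding the text of doc_id."""
    return CHARSET_OVERRIDES.get(doc_id, "utf-8")


print(charset_for("RFC0101"))  # iso8859_1
print(charset_for("RFC2166"))  # windows-1252
print(charset_for("RFC8187"))  # utf-8
```

The returned name can be passed directly to bytes.decode() when reading the text file in binary mode.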
Cheers,
Colin
_______________________________________________
rfc-interest mailing list -- [email protected]
To unsubscribe send an email to [email protected]