Hi all,

Following up on a discussion on yesterday's Tools Call --

On 4/14/25 2:40 AM, Colin Perkins wrote:
(belatedly, inline)

On 20 Mar 2025, at 6:27, Carsten Bormann wrote:

    On 20. Mar 2025, at 07:11, Robert Sparks wrote:

        On 3/20/25 11:09 AM, Carsten Bormann wrote:

            On 20. Mar 2025, at 04:45, Jean Mahoney wrote:

                [JM] TEXT is used for RFCs created in the RFCXML v3 era.
                ASCII is for older RFCs. The TEXT label indicates the
                file can contain non-ASCII characters [2].

            There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
            (And actually a couple that aren’t even UTF-8!)

        Pointers to the non-UTF8 encoded RFCs please?

    I didn’t take notes when I last checked this, but I can do the check
    again.

    Let’s start with:
    rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
    rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
    rfc316 rfc317 rfc323 rfc327 rfc367 rfc369

[JM] Each of the RFCs listed above has a note at the bottom of the file stating that it was put into machine-readable form. These notes are the only place non-ASCII is found in these files:

          [This RFC was put into machine readable form for entry]
      [into the online RFC archives by Kelly Tardif, Viagénie 10/99]


rfc441

[JM] In addition to the note about putting the file into machine-readable form, there's this:

      U + 3 --> X�


rfc2497

[JM] Has an em dash in the [EU164] reference.


rfc2557

[JM] There's non-ASCII in an example:

      E with acute accent becomes É.<br>
      E with acute accent becomes &Eacute;.<p>

rfc2708 rfc2875

[JM] A smart apostrophe was used:

   assigned ID’s, there is...

rfc2875

[JM] Smart quotes and apostrophe were used:

   TBS: the “text” for computing the SHA-1 HMAC.

   Signature verification requires CA’s private key


    For info, here are a few RFCs that are not v3 but not ASCII either:
rfc8187

[JM] RFC 8187 is titled "Indicating Character Encoding and Language for HTTP Header Field Parameters". It's the first RFC published with UTF-8 characters at the request of the authors.


rfc8264 rfc8265 rfc8266

[JM] These RFCs specify the preparation, enforcement, and comparison of internationalized strings (PRECIS) and were published with UTF-8 characters.

Best regards,
Jean



    And then there are the RFCs that contain NUL bytes, like RFC 674…
    I didn’t do a full categorization of these critters.
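
A minimal sketch of how such a categorization could be repeated, assuming the raw bytes of each RFC text file are already in hand. The function names are hypothetical and nothing here comes from the RFC Editor's or the Datatracker's tooling:

```python
from typing import Iterator, Tuple


def classify_rfc_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the raw bytes of a text file."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: some legacy 8-bit encoding, or corrupt bytes.
        return "other"


def has_nul(data: bytes) -> bool:
    """True if the file contains at least one NUL byte."""
    return b"\x00" in data


def non_ascii_lines(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, raw_line) for each line holding a byte >= 0x80."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        if any(byte >= 0x80 for byte in line):
            yield number, line
```

Note that a NUL byte is valid ASCII, so a file with NUL bytes and nothing else unusual would still be classified as 'ascii'; that is why the NUL check is kept separate.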

We have the following, although it’s been many years since it was checked for accuracy:

    def charset(self) -> str:
        """
        Most RFCs are UTF-8, or its ASCII subset. A few are not. Return
        an appropriate encoding for the text of this RFC.
        """
        if (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
           (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
           (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
           (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
           (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
           (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
           (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
           (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
           (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
           (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
           (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
           (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
           (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
           (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
           (self.doc_id == "RFC1305"):
            return "iso8859_1"
        elif self.doc_id == "RFC2166":
            return "windows-1252"
        elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
            return "iso8859_1"
        elif self.doc_id == "RFC2708":
            # This RFC is corrupt: line 521 has a byte with value 0xC6 that
            # is clearly intended to be a ' character, but that code point
            # doesn't correspond to ' in any character set I can find. Use
            # ISO 8859-1 which gets all characters right apart from this.
            #
            # According to Greg Skinner: "regarding the test in line 268
            # for RFC2708, as far as I can tell, U+0092 was introduced in
            # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
            # it was replaced with U+0027 everywhere except section 5.0.
            # Somehow, that stray character became the corrupt text you
            # identified."
            # (https://github.com/glasgow-ipl/ietfdata/issues/137)
            return "iso8859_1"
        elif self.doc_id == "RFC2875":
            # Both the text and PDF versions of this document have corrupt
            # characters (lines 754 and 926 of the text version). Using
            # ISO 8859-1 is no more corrupt than the original.
            return "iso8859_1"
        else:
            return "utf-8"
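
For illustration only, here is a small sketch of how a per-document encoding like the one returned above might be applied when reading the raw bytes. `LEGACY_CHARSETS` and `decode_rfc` are hypothetical names, and the table holds just two entries copied from the function above rather than the full list:

```python
from typing import Mapping

# Illustrative subset only; the complete mapping is the function above.
LEGACY_CHARSETS = {
    "RFC2166": "windows-1252",
    "RFC2708": "iso8859_1",
}


def decode_rfc(doc_id: str, raw: bytes,
               charsets: Mapping[str, str] = LEGACY_CHARSETS) -> str:
    """Decode raw RFC bytes with the listed codec, defaulting to UTF-8."""
    encoding = charsets.get(doc_id, "utf-8")
    return raw.decode(encoding)


# The same byte gives different text depending on the codec chosen:
# windows-1252 maps 0x92 to U+2019 (right single quotation mark), while
# ISO 8859-1 maps it to the C1 control character U+0092.
smart = decode_rfc("RFC2166", b"ID\x92s")
control = decode_rfc("RFC2708", b"ID\x92s")
```

Decoding as ISO 8859-1 never raises, since every byte value maps to a code point, so it is a safe fallback for files with corrupt bytes even when the result is not what the author intended.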

Cheers,
Colin


_______________________________________________
rfc-interest mailing list -- [email protected]
To unsubscribe send an email to [email protected]
