[rfc-i] Re: Mutable properties of RFCs

Jean Mahoney Wed, 16 Apr 2025 10:37:58 -0700

Hi all,

Following up on a discussion on yesterday's Tools Call --


On 4/14/25 2:40 AM, Colin Perkins wrote:

(belatedly, inline)

On 20 Mar 2025, at 6:27, Carsten Bormann wrote:

    On 20. Mar 2025, at 07:11, Robert Sparks wrote:

        On 3/20/25 11:09 AM, Carsten Bormann wrote:

            On 20. Mar 2025, at 04:45, Jean Mahoney wrote:

                [JM] TEXT is used for RFCs created in the RFCXML v3 era.
                ASCII is for older RFCs. The TEXT label indicates the
                file can contain non-ASCII characters [2].

            There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
            (And actually a couple that aren’t even UTF-8!)

        Pointers to the non-UTF8 encoded RFCs please?

    I didn’t take notes when I last checked this, but I can do the check
    again.

    Let’s start with:
    rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
    rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
    rfc316 rfc317 rfc323 rfc327 rfc367 rfc369

[JM] The RFCs listed above have a note at the bottom of the file stating
that they were put into machine-readable form. These notes are the only
place non-ASCII is found in these files:


          [This RFC was put into machine readable form for entry]
      [into the online RFC archives by Kelly Tardif, Viagénie 10/99]
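
One way to locate notes like this is to scan the raw bytes of a file for anything outside the ASCII range. The following is a minimal sketch in Python, not taken from any RFC Editor tooling; the function name `non_ascii_lines` and the sample bytes are illustrative assumptions:

```python
def non_ascii_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (1-based line number, non-ASCII bytes) for each affected line."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        # Collect every byte with the high bit set (outside 0x00-0x7F).
        bad = bytes(b for b in line if b > 0x7F)
        if bad:
            hits.append((lineno, bad))
    return hits


# Hypothetical sample: an ASCII line followed by a note containing a single
# ISO 8859-1 byte (0xE9, "e with acute accent").
sample = b"Plain ASCII text\n[... by Kelly Tardif, Viag\xe9nie 10/99]\n"
print(non_ascii_lines(sample))  # [(2, b'\xe9')]
```

Working on bytes rather than decoded text avoids having to guess the encoding before the offending lines are known.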

rfc441

[JM] In addition to the note about putting the file into
machine-readable form, there's this:


      U + 3 --> X�

rfc2497


[JM] Has an em dash in the [EU164] reference.

rfc2557


[JM] There's non-ASCII in an example:

      E with acute accent becomes É.<br>
      E with acute accent becomes &Eacute;.<p>

rfc2708 rfc2875


[JM] A smart apostrophe was used:

   assigned ID�s, there is...

rfc2875


[JM] Smart quotes and apostrophe were used:

   TBS: the �text� for computing the SHA-1 HMAC.

   Signature verification requires CA�s private key


    For info, here are a few RFCs that are not v3 but not ASCII either:

rfc8187

[JM] RFC 8187 is titled "Indicating Character Encoding and Language for
HTTP Header Field Parameters". It's the first RFC published with UTF-8
characters at the request of the authors.

rfc8264 rfc8265 rfc8266

[JM] These RFCs specify the preparation, enforcement, and comparison of
internationalized strings (PRECIS) and were published with UTF-8
characters.
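
The three groups discussed in this thread (pure ASCII, UTF-8, and neither) can be told apart by attempting to decode the raw bytes. This is a rough sketch only; `classify` and `contains_nul` are hypothetical helper names, and the sample inputs are made up for illustration:

```python
def classify(data: bytes) -> str:
    """Label raw file contents as 'ascii', 'utf-8', or 'other'."""
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8; the real encoding has to be determined by hand.
        return "other"


def contains_nul(data: bytes) -> bool:
    """NUL (0x00) is valid ASCII, so it needs its own check."""
    return b"\x00" in data


print(classify(b"plain text"))                   # ascii
print(classify("pr\u00e9cis".encode("utf-8")))   # utf-8
print(classify(b"Viag\xe9nie"))                  # other
print(contains_nul(b"a\x00b"))                   # True
```

Note that a file can pass the ASCII test and still contain NUL bytes, which is why the second helper is kept separate.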


Best regards,
Jean

    And then there are the RFCs that contain NUL bytes, like RFC 674…
    I didn’t do a full categorization of these critters.
We have the following, although it’s been many years since it was
checked for accuracy:

    def charset(self) -> str:
        """
        Most RFCs are UTF-8, or its ASCII subset. A few are not. Return an
        appropriate encoding for the text of this RFC.
        """
        if (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
           (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
           (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
           (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
           (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
           (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
           (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
           (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
           (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
           (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
           (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
           (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
           (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
           (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
           (self.doc_id == "RFC1305"):
            return "iso8859_1"
        elif self.doc_id == "RFC2166":
            return "windows-1252"
        elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
            return "iso8859_1"
        elif self.doc_id == "RFC2708":
            # This RFC is corrupt: line 521 has a byte with value 0xC6 that
            # is clearly intended to be a ' character, but that code point
            # doesn't correspond to ' in any character set I can find. Use
            # ISO8859-1 which gets all characters right apart from this.
            #
            # According to Greg Skinner: "regarding the test in line 268
            # for RFC2708, as far as I can tell, U+0092 was introduced in
            # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
            # it was replaced with U+0027 everywhere except section 5.0.
            # Somehow, that stray character became the corrupt text you
            # identified."
            # (https://github.com/glasgow-ipl/ietfdata/issues/137)
            return "iso8859_1"
        elif self.doc_id == "RFC2875":
            # Both the text and PDF versions of this document have corrupt
            # characters (lines 754 and 926 of the text version). Using
            # ISO 8859-1 is no more corrupt than the original.
            return "iso8859_1"
        else:
            return "utf-8"

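
The same mapping could also be written as a lookup table, which may be easier to audit against the list of RFCs earlier in the thread. This is a sketch under stated assumptions, not the code quoted above: the names `NON_UTF8_CHARSETS`, `charset_for`, and `decode_rfc` are hypothetical, only a few of the document IDs are shown, and the use of `errors="replace"` is an illustrative choice.

```python
# Hypothetical table: a partial excerpt of the document IDs handled by the
# quoted method, mapped to the same codec names.
NON_UTF8_CHARSETS = {
    "RFC0064": "iso8859_1",
    "RFC0101": "iso8859_1",
    "RFC2166": "windows-1252",
    "RFC2497": "iso8859_1",
    "RFC2708": "iso8859_1",
    "RFC2875": "iso8859_1",
}


def charset_for(doc_id: str) -> str:
    """Return the codec name for doc_id, defaulting to UTF-8."""
    return NON_UTF8_CHARSETS.get(doc_id, "utf-8")


def decode_rfc(doc_id: str, data: bytes) -> str:
    """Decode raw RFC bytes; undecodable bytes become U+FFFD (an assumption)."""
    return data.decode(charset_for(doc_id), errors="replace")


print(charset_for("RFC2166"))                  # windows-1252
print(charset_for("RFC9000"))                  # utf-8 (not in the table)
print(decode_rfc("RFC0101", b"Viag\xe9nie"))   # Viagénie
```

A table keeps the data separate from the control flow, so adding or removing a document ID is a one-line change.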
Cheers,
Colin


_______________________________________________
rfc-interest mailing list -- [email protected]
To unsubscribe send an email to [email protected]

