[Lynx-dev] fix for decoding utf-8 in CDATA sections

Hiltjo Posthuma Thu, 27 Jul 2023 13:43:41 -0700

Hi,

I use lynx to convert HTML to plain text, but I noticed that part of the
output goes missing when a CDATA section contains UTF-8 text.


Below is a small test-case to reproduce it:

<p>Works correctly:</p>
<p>a’b</p>

<p>Doesn't work correctly:</p>
<p><![CDATA[a’b]]></p>

The UTF-8 byte sequence for the ’ code point (U+2019) is: printf '\342\200\231'
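
To double-check which code point those three bytes stand for, here is a
minimal sketch in C. The helper name decode_utf8_3 is made up for this
example and is not part of lynx; it only checks the lead and continuation
bit patterns and does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not lynx code: decode one three-byte UTF-8
 * sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point.
 * Returns -1 if the bytes do not match that bit pattern. */
long decode_utf8_3(const unsigned char *s)
{
    assert(s != NULL);

    if ((s[0] & 0xF0) != 0xE0 ||    /* lead byte of a 3-byte sequence */
        (s[1] & 0xC0) != 0x80 ||    /* first continuation byte */
        (s[2] & 0xC0) != 0x80)      /* second continuation byte */
        return -1;

    /* 4 payload bits from the lead byte, 6 from each continuation byte */
    return ((long) (s[0] & 0x0F) << 12) |
           ((long) (s[1] & 0x3F) << 6) |
            (long) (s[2] & 0x3F);
}
```

Called on the bytes 0342 0200 0231 this returns 0x2019, the right single
quotation mark used in the test case.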


I use the following command to convert HTML to text:

        lynx -stdin -dump \
                -underline_links -image_links \
                -display_charset="utf-8" -assume_charset="utf-8"


My system information:
I tested on the latest lynx-cur: lynx2.9.0dev.12

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8


What I found:

I think lynx only prints the first byte of the sequence instead of printing
the processed code point (clong).  In WWW/Library/Implementation/SGML.c there
is similar handling for comments, for example at "S_comment_put_c:".

Below is a patch. I'm not sure it covers all lynx options, though I hope it
does:


diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
index 2534606..8632670 100644
--- a/WWW/Library/Implementation/SGML.c
+++ b/WWW/Library/Implementation/SGML.c
@@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
            me->state = S_text;
            break;
        }
-       HTChunkPutc(string, c);
-       break;
 
+       if (me->T.decode_utf8) {
+            HTChunkPutUtf8Char(string, clong);
+       } else {
+            HTChunkPutc(string, c);
+       }
+       break;
     case S_sgmlent:            /* Expecting ENTITY. - FM */
        if (!me->first_dash && c == '-') {
            HTChunkPutc(string, c);


Thank you for lynx,

-- 
Kind regards,
Hiltjo
