Hi,

I use lynx to convert HTML to plain text, but I noticed an issue: part of
the output is missing when a CDATA section contains UTF-8 text.

Below is a small test-case to reproduce it:

    <p>Works correctly:</p>
    <p>a’b</p>
    <p>Doesn't work correctly:</p>
    <p><![CDATA[a’b]]></p>

The UTF-8 byte sequence for this character (U+2019) is:

    printf '\342\200\231'

I use the following command to convert HTML to text:

    lynx -stdin -dump \
        -underline_links -image_links \
        -display_charset="utf-8" -assume_charset="utf-8"

My system information:

I tested on the latest lynx-cur: lynx2.9.0dev.12

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=en_US.UTF-8

What I found:

I think it only prints the first byte of the sequence instead of printing
the processed codepoint (clong). I noticed that the file
WWW/Library/Implementation/SGML.c has a similar case for comments, for
example at "S_comment_put_c:".

Below is a patch. I'm not sure it covers all lynx options though. I hope it
does:

diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
index 2534606..8632670 100644
--- a/WWW/Library/Implementation/SGML.c
+++ b/WWW/Library/Implementation/SGML.c
@@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
 	    me->state = S_text;
 	    break;
 	}
-	HTChunkPutc(string, c);
-	break;
+	if (me->T.decode_utf8) {
+	    HTChunkPutUtf8Char(string, clong);
+	} else {
+	    HTChunkPutc(string, c);
+	}
+	break;
 
     case S_sgmlent:		/* Expecting ENTITY. - FM */
 	if (!me->first_dash && c == '-') {
 	    HTChunkPutc(string, c);

Thank you for lynx,

-- 
Kind regards,
Hiltjo