Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections

Thomas Dickey Fri, 28 Jul 2023 01:49:09 -0700

On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
> 
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
> 
> Below is a small test-case to reproduce it:
> 
> <p>Works correctly:</p>
> <p>a’b</p>
> 
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>


agreed - I see

   Works correctly:

   a’b

   Doesn't work correctly:

   a?b
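
For reference, the character in the test case is U+2019 (right single
quotation mark), which UTF-8 encodes as the three bytes E2 80 99, i.e. the
octal 342 200 231 given in the report. A minimal sketch of decoding one such
sequence follows. The name utf8_decode is hypothetical and is not a function
from lynx; the sketch also skips the checks for overlong forms and surrogates
that a production decoder would need.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one UTF-8 sequence starting at s (at most len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated or malformed.
 * Illustrative helper only: not lynx code, and it does not reject
 * overlong encodings or surrogate code points.
 */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long value;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        need = 2;
        value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        need = 3;
        value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        need = 4;
        value = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }

    if (len < need)
        return 0;                         /* truncated input */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* continuation must be 10xxxxxx */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return need;
}
```

Called on the bytes 0342 0200 0231 with len 3, this returns 3 and sets the
code point to 0x2019; with len 2 it returns 0 because the sequence is cut
short.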

> The UTF-8 byte sequence for this codepoint is: printf '\342\200\231'
> 
> 
> I use the following command to convert HTML to text:
> 
>       lynx -stdin -dump \
>               -underline_links -image_links \
>               -display_charset="utf-8" -assume_charset="utf-8"
> 
> 
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
> 
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> 
> 
> What I found:
> 
> I think it only prints the first byte of the sequence instead of the
> processed codepoint (clong).  I noticed that the file
> WWW/Library/Implementation/SGML.c has a similar case for comments, for
> example at "S_comment_put_c:".
> 
> Below is a patch. I'm not sure it covers all lynx options, though I hope it
> does:

thanks - will review, etc
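
To make the reported symptom concrete, here is a toy model of the failure
mode as the report describes it: only the lead byte of a multibyte sequence
reaches the output. This is a sketch under assumptions and is neither the
submitted patch nor code from SGML.c; emit_text, seq_len and the
first_byte_only flag are hypothetical names, and the "fixed" branch simply
copies the whole sequence rather than re-encoding a decoded code point.

```c
#include <assert.h>
#include <stddef.h>

/* Length of a UTF-8 sequence judged by its lead byte; 0 if not a lead byte. */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Copy n bytes of UTF-8 text from in to out (out must hold at least n bytes)
 * and return the number of bytes written.
 * With first_byte_only set, a multibyte sequence contributes only its lead
 * byte, which models the behaviour described in the report; otherwise the
 * whole sequence is copied.
 * Toy model only: hypothetical names, no validation of continuation bytes.
 */
size_t emit_text(const unsigned char *in, size_t n,
                 unsigned char *out, int first_byte_only)
{
    size_t i = 0, o = 0, k;

    assert(in != NULL && out != NULL);
    while (i < n) {
        size_t len = seq_len(in[i]);

        if (len <= 1 || len > n - i) {
            /* ASCII, or a malformed/truncated sequence: pass one byte through */
            out[o++] = in[i++];
            continue;
        }
        if (first_byte_only) {
            out[o++] = in[i];             /* lead byte only: sequence is broken */
        } else {
            for (k = 0; k < len; k++)     /* all bytes of the sequence */
                out[o++] = in[i + k];
        }
        i += len;
    }
    return o;
}
```

For the five input bytes 'a' 0342 0200 0231 'b', the first_byte_only case
writes three bytes ('a', 0342, 'b'), leaving a lone lead byte that is not
valid UTF-8 on its own; the other case writes all five bytes unchanged.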

-- 
Thomas E. Dickey <[email protected]>
https://invisible-island.net
