On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
>
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
>
> Below is a small test-case to reproduce it:
>
> <p>Works correctly:</p>
> <p>a’b</p>
>
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>

agreed - I see

Works correctly:

a’b

Doesn't work correctly:

a?b

> This byte sequence for the UTF-8 codepoint is: printf '\342\200\231'
>
>
> I use the following command to convert HTML to text:
>
> lynx -stdin -dump \
>     -underline_links -image_links \
>     -display_charset="utf-8" -assume_charset="utf-8"
>
>
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
>
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
>
> What I found:
>
> I think it only prints the first byte instead of printing the processed
> codepoint (clong). I noticed in the file WWW/Library/Implementation/SGML.c
> there is a similar case for comments for example for "S_comment_put_c:".
>
> Below is a patch. I'm not sure it covers all lynx options though. I hope it
> does:

thanks - will review, etc

-- 
Thomas E. Dickey <dic...@invisible-island.net>
https://invisible-island.net
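
For reference, the byte sequence quoted in the report can be checked outside lynx. The sketch below is illustration only and is not lynx code: it decodes the three bytes from `printf '\342\200\231'` as one UTF-8 code point, then shows one plausible way a "?" could appear if only the lead byte were passed along. The names `raw`, `lead_only` and `lossy` are made up for this example, and the actual cause in SGML.c may differ.

```python
# Bytes from the report, written with the same octal escapes as the printf call.
raw = b"\342\200\231"

# Decoded together, the three bytes form a single code point.
char = raw.decode("utf-8")
codepoint = ord(char)
print(f"U+{codepoint:04X}")

# Assumed failure mode (hypothetical): only the first byte survives.
# A lone lead byte is an incomplete sequence, so a tolerant decoder
# substitutes a replacement character for it.
lead_only = raw[:1]
broken = lead_only.decode("utf-8", errors="replace")

# Rendering that replacement character in a limited charset yields "?".
lossy = ("a" + broken + "b").encode("ascii", errors="replace")
print(lossy)
```

The point of the comparison is that the character is only recoverable when all three bytes reach the decoder together, which matches the observation in the report that the processed code point, not the first byte, should be emitted.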