Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections

2023-10-03 Thread Thomas Dickey
On Tue, Oct 03, 2023 at 11:29:07PM +0200, Hiltjo Posthuma wrote:
> On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> > Hi,
> > 
> > I use lynx to convert HTML to plain-text, but noticed an issue where part of
> > the output is missing with UTF-8 in CDATA sections.
...
> Any updates on the status / review of this patch?

I applied it -

https://github.com/ThomasDickey/lynx-snapshots

and have been busy on other programs.  At the moment I anticipate spending
a week or two catching up with Lynx to make a new version.

-- 
Thomas E. Dickey 
https://invisible-island.net




Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections

2023-10-03 Thread Hiltjo Posthuma
On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
> 
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
> 
> Below is a small test-case to reproduce it:
> 
> Works correctly:
> a’b
> 
> Doesn't work correctly:
> 
> 
> The byte sequence for this UTF-8 code point (U+2019) is: printf '\342\200\231'
> 
> 
> I use the following command to convert HTML to text:
> 
>   lynx -stdin -dump \
>   -underline_links -image_links \
>   -display_charset="utf-8" -assume_charset="utf-8"
> 
> 
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
> 
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> 
> 
> What I found:
> 
> I think it only outputs the first byte instead of the decoded code point
> (clong).  In the file WWW/Library/Implementation/SGML.c there is a similar
> case for comments, for example at "S_comment_put_c:".
> 
> Below is a patch. I'm not sure it covers all lynx options, though. I hope it
> does:
> 
> 
> diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
> index 2534606..8632670 100644
> --- a/WWW/Library/Implementation/SGML.c
> +++ b/WWW/Library/Implementation/SGML.c
> @@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
>   me->state = S_text;
>   break;
>   }
> - HTChunkPutc(string, c);
> - break;
>  
> + if (me->T.decode_utf8) {
> +  HTChunkPutUtf8Char(string, clong);
> + } else {
> +  HTChunkPutc(string, c);
> + }
> + break;
>  case S_sgmlent:  /* Expecting ENTITY. - FM */
>   if (!me->first_dash && c == '-') {
>   HTChunkPutc(string, c);
> 
> 
> Thank you for lynx,
> 
> -- 
> Kind regards,
> Hiltjo
> 

Hi,

Any updates on the status / review of this patch?

Thank you,

-- 
Kind regards,
Hiltjo
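
For readers following the report, the symptom described above (only the
first byte of a multibyte character reaching the output) can be illustrated
with a small standalone sketch. It assumes nothing about Lynx internals:
utf8_encode is a hypothetical helper written for this illustration, not the
HTChunkPutUtf8Char function that the patch calls, and it only shows how one
code point such as U+2019 expands to the three bytes \342\200\231.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into out[] (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not a valid scalar
 * value (a surrogate, or above U+10FFFF).
 * Hypothetical helper for illustration only; not part of Lynx.
 */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx then 3 continuation bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling utf8_encode(0x2019, buf) fills buf with 0xE2 0x80 0x99, the same
bytes as the printf sequence in the report. A code path that stores only one
byte per input character cannot reproduce that sequence, which matches the
truncated output described above.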



Re: [Lynx-dev] 'Please enable JS and disable any ad blocker'

2023-10-03 Thread David Woolley

On 03/10/2023 14:52, Chime Hart wrote:
> Well, Russell, that is certainly a shame. Yesterday I visited
> punchbowlnews.com and, no matter which browser, I get a 403 error, but
> even checking or unchecking a user-agent, I receive the message you
> experienced.


You need to consider that, for most news web sites, text-only browser
users are defective products (you are the product and the advertisers
are the customers).




Re: [Lynx-dev] 'Please enable JS and disable any ad blocker'

2023-10-03 Thread Chime Hart
Well, Russell, that is certainly a shame. Yesterday I visited punchbowlnews.com
and, no matter which browser, I get a 403 error, but even checking or unchecking
a user-agent, I receive the message you experienced.

Chime




[Lynx-dev] 'Please enable JS and disable any ad blocker'

2023-10-03 Thread rbell--- via Lynx-dev


Starting Monday evening nytimes.com started giving me this.  I
can fetch the file with 'lynx -source'.  What's the diff?  When I
fetch an html file from nytimes.com with lynx it compresses it, but
not with 'lynx -source'.

russell bell