Re: [Lynx-dev] Extract links from html with application/ld+json script

2023-12-17 Thread Thorsten Glaser

David Woolley dixit:

> Lynx does not even have a JSON interpreter and I'm sure it doesn't
> have a JSON pretty printer.

Yeah, that’s totally out of scope. Use tools like cURL / GNU wget,
sed/tidy/xmlstarlet to extract the JSON, jq to parse it, instead.
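
As a rough illustration of that pipeline, the extract-then-parse steps can
also be sketched in one Python script using only the standard library. This
is a stand-in for the tools named above, not a replacement for them. The
names LdJsonExtractor and extract_ld_json are hypothetical, and the sketch
only sees blocks present in the HTML as served, not ones a script injects
later.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.blocks = []  # raw text of each ld+json script element

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the "or".
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld_json = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces; keep them all.
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self._in_ld_json = False
            self.blocks.append("".join(self._chunks))


def extract_ld_json(html_text):
    """Return the parsed value of each ld+json block, skipping invalid JSON."""
    parser = LdJsonExtractor()
    parser.feed(html_text)
    parser.close()
    parsed = []
    for raw in parser.blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # malformed block: ignore rather than abort
    return parsed
```

Given the text of a saved page, extract_ld_json(text) returns a list of
parsed values, and json.dumps(value, indent=2) pretty prints each one.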

bye,
//mirabilos
-- 
Support mksh as /bin/sh and RoQA dash NOW!
‣ src:bash (429 (458) bugs: 0 RC, 295 (315) I&N, 134 (143) M&W, 0 F&P) + 209
‣ src:dash (90 (104) bugs: 0 RC, 51 (54) I&N, 39 (50) M&W, 0 F&P) + 62 ubu
‣ src:mksh (1 bug: 0 RC, 0 I&N, 1 M&W, 0 F&P)
dash has two RC bugs they just closed because they don’t care about quality…

Re: [Lynx-dev] Extract links from html with application/ld+json script

2023-12-17 Thread David Woolley

Looking a bit further, ld+json is a database serialisation format, based
on JavaScript, but it is declarative.  It definitely isn't HTML, but one
could render it by basically pretty printing, without the need to handle
the generalities of JavaScript.  You may, though, have to extract it from
the page manually, as I suspect general execution of JavaScript may be
needed to actually find it reliably.


Lynx does not even have a JSON interpreter and I'm sure it doesn't have 
a JSON pretty printer.


Using  to pretty print it, 
the core of one of the items comes out as (I've just used an extract to 
minimise copyright issues):


  {
    "@type": "VideoObject",
    "name": "The Chokepoint (EGC Finals)",
    "url": "https://clips.twitch.tv/ElatedIncredulousPepperOpieOP-oUeW6hXXZs8nmWtX",
    "description": "Watch EGCTV's clip of Age of Empires IV on Twitch!",
    "thumbnailUrl": [
      "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-86x45.jpg",
      "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-260x147.jpg",
      "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-480x272.jpg"
    ],
    "uploadDate": "2023-12-17T16:16:18Z",
    "duration": "PT60S",
    "position": 2,
    "interactionStatistic": {
      "@type": "InteractionCounter",
      "interactionType": {
        "@type": "http://schema.org/WatchAction"
      },
      "userInteractionCount": 29
    },
    "embedUrl": "https://player.twitch.tv?video=1542310342&autoplay=true&parent=meta.tag"
  },

I'm pretty sure that most of the tags have no intrinsic meaning, and you 
still need the full javascript code, or to guess from the names, to 
correctly interpret them.
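
To illustrate the guessing involved, here is a minimal Python sketch that
walks an already parsed ld+json value and lists every string that looks
like an http(s) URL, together with the key path it was found under.  The
function name iter_urls and the "starts with http" heuristic are
assumptions of this sketch, not anything the format defines.

```python
def iter_urls(node, path=""):
    """Yield (key_path, url) for each string value that looks like a URL."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Build a dotted path such as "interactionStatistic.@type".
            yield from iter_urls(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_urls(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield path, node
```

Run over an item like the one shown, this reports url, each thumbnailUrl
entry and embedUrl, but also the schema.org @type value, which is a
vocabulary identifier rather than a link to follow.  Telling the two apart
still comes down to interpreting the key names.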


The actual HTML doesn't include anything renderable.  Everything is done 
as empty DIVs and relies on styling for any display, so can't be 
considered foreground content.  There is some directly renderable 
content, but it is SVG, with no accessible text fallback.  This is an
image format, so useless for a text-only browser.




Re: [Lynx-dev] Extract links from html with application/ld+json script

2023-12-17 Thread David Woolley


On 17/12/2023 19:31, Super Bonaci via Lynx-dev wrote:

Lynx is not able to extract most html links inside the html file.



There are no HTML links in 9ed7a8bb!  It has no anchor elements, and
every occurrence of href is in a link element, which doesn't generate a
visible inline hyperlink, except for one, which is in JavaScript code.
I think this is a JavaScript application program, not an HTML document.
Lynx doesn't have a JavaScript interpreter and doesn't parse HTML in a
way that creates a document object model in a format that would allow
such an interpreter to do anything non-trivial.
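
A check of this kind can be sketched in Python with the standard library's
html.parser: count which elements carry an href attribute in the HTML as
served.  HrefCounter and count_hrefs are hypothetical names for this
sketch, and it says nothing about links a script would create at run time.

```python
from collections import Counter
from html.parser import HTMLParser


class HrefCounter(HTMLParser):
    """Count href attributes per element name in the static markup."""

    def __init__(self):
        super().__init__()
        self.href_by_tag = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <link ... />.
        if any(name == "href" for name, _ in attrs):
            self.href_by_tag[tag] += 1


def count_hrefs(html_text):
    """Return a dict mapping element name to its number of href attributes."""
    counter = HrefCounter()
    counter.feed(html_text)
    counter.close()
    return dict(counter.href_by_tag)
```

A result with a "link" entry but no "a" entry would match the description
above: href values exist, but none belongs to an anchor that a browser
would show as a hyperlink.  Markup that appears only inside script text is
not counted, since the parser treats script content as plain data.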


Any links are created by manipulating the document in the browser, which 
Lynx can't do.


Supporting javascript applications would require a complete rewrite from 
first principles.  The result would not be Lynx.


I suspect the same is true of the other document.

Since the Lynx version is from 2018 


I don't think there have been major changes in HTML in the last five
years that would break a real HTML document on Lynx.  The problem with
web applications is over a decade old.  It goes back to the original
Netscape, but was solidified when the Web Hypertext Application
Technology Working Group (WHATWG) effectively took over control of HTML
from W3C, leading to the creation of HTML5.  Although that can be used
for pure documents, the name of the working group clearly indicates that
the intention was otherwise.  That happened about 19 years ago.


Commercial artists and marketing managers don't buy into the TBL notion
of HTML and want programs that can be run on the advertising consumer's
machine.  Whilst there are some cases where this is valid, for
technical or privacy reasons, most such applications are written for
marketing reasons.


Some text mode browsers handle some javascript uses, but I'm pretty sure 
they would not cope with your examples.


The only certain way of finding the links in JavaScript code is to run
the program.

[Lynx-dev] Extract links from html with application/ld+json script

2023-12-17 Thread Super Bonaci via Lynx-dev

Version in use: Lynx Version 2.8.9rel.1 (08 Jul 2018)

Some html pages contain
