((sxpath '(// *text*)) doc)

should return all (and only) the text nodes in doc. I'm not so
familiar with the sxml-xexp compatibility stuff, so I don't know if
you can use an xexp here or if you really need an sxml document.
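Something along these lines (an untested sketch, building on your existing
fetch) might be enough for a rough preview; note that the '// *text*' step
also picks up the contents of script and style elements, which you may want
to filter out:

(require net/url
         html-parsing
         sxml
         racket/string)

(define (fetch url)
  (call/input-url url get-pure-port port->string))

;; Collect every text node under the body and join them with newlines.
;; Caveat: text inside <script> and <style> shows up here too.
(define (fetch-string-content url)
  (string-join ((sxpath '(html body // *text*)) (html->xexp (fetch url)))
               "\n"))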

On Tue, May 30, 2017 at 7:08 AM, Erich Rast <er...@snafu.de> wrote:
> Hi all,
>
> I need a function to provide a rough textual preview (without
> formatting except newlines) of the content of a web page.
>
> So far I'm using this:
>
> (require net/url
>          html-parsing
>          sxml)
>
> (provide fetch fetch-string-content)
>
> (define (fetch url)
>   (call/input-url url
>                   get-pure-port
>                   port->string))
>
> (define (fetch-string-content url)
>   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
>
> The sxpath correctly returns the body sexp, but fetch-string-content
> still only returns an empty string or a bunch of "\n\n\n".
>
> I guess the problem is that sxml:text only returns what is immediately
> below the element, and that's not what I want. There are all kinds of
> unknown div and span tags in web pages. I'm looking for a way to get
> a simplified version of the textual content of the html body. If I were
> only on Linux, I'd use "lynx -dump -nolist" in a subprocess, but this
> needs to be cross-platform.
>
> Is there an sxml trick to achieve that? It doesn't need to be perfect.
>
> Best,
>
> Erich
>