Thank you! I wasn't aware of the html-parsing library.
Jon
On 2/25/2016 11:21 AM, Jay McCarthy wrote:
You should double-check against the HTML 4.01 spec:
https://www.w3.org/TR/html4/
Since you mention "in the wild", I think you probably don't want to
use the html library and should instead use html-parsing:
http://docs.racket-lang.org/html-parsing/index.html
Jay
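For reference, a minimal sketch of the html-parsing route suggested above. It assumes the third-party html-parsing package has been installed (for example with raco pkg install html-parsing) and uses its html->xexp function; the input strings are only illustrations, and the exact shape of the result is best checked by running it.

```racket
#lang racket
;; Assumption: the html-parsing package is installed separately;
;; it is not part of the html library discussed in this thread.
(require html-parsing)

;; html->xexp accepts a string (or an input port) and returns an
;; SXML-style x-expression for the parsed document.
(define parsed-ins
  (html->xexp "<p>Hello world <ins>Testing</ins>!</p>"))

;; Stray text directly inside <ul>, as often seen in the wild.
(define parsed-ul
  (html->xexp "<ul>stray text<li>item</li></ul>"))

(printf "~s\n" parsed-ins)
(printf "~s\n" parsed-ul)
```

Because this parser is permissive rather than validating, it may be a better fit for markup that does not follow the HTML 4.01 content model.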
On Thu, Feb 25, 2016 at 1:13 PM, jon stenerson <jonstener...@comcast.net> wrote:
I find that when I use the html library I have to make a few simple changes
to html-spec.rkt. It seems that <ins> and <del> are not treated like <b> and
<i>: in the example below, <b> remains inside the enclosing <p>, but <ins>
does not. I also have to allow pcdata as a child of <ol> and <ul>. I don't
know whether pcdata is "supposed to" appear there, but it often does in the
wild.
Jon
#lang racket
(require (prefix-in h: html) (prefix-in x: xml))

;; Convert parsed XML content into nested lists so it is easy to print.
(define (xml->list x)
  (cond
    [(x:pcdata? x) (x:pcdata-string x)]
    [(x:entity? x) (list)]
    [(x:element? x)
     (list (x:element-name x)
           (map xml->list (x:element-content x)))]
    [(list? x) (map xml->list x)]))

;; <b> stays inside the enclosing <p>.
(printf "~s\n" (xml->list (h:read-html-as-xml
                           (open-input-string "<p>Hello world <b>Testing</b>!</p>"))))
;; <ins> ends up outside the <p>.
(printf "~s\n" (xml->list (h:read-html-as-xml
                           (open-input-string "<p>Hello world <ins>Testing</ins>!</p>"))))
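The same comparison can be made programmatically instead of by reading the printed lists. This is only a sketch: nested-in? is a hypothetical helper, not part of the html or xml libraries, and it assumes element names come back as lowercase symbols such as 'p and 'ins.

```racket
#lang racket
(require (prefix-in h: html) (prefix-in x: xml))

;; Hypothetical helper: does an element named `child` occur anywhere
;; beneath an element named `parent` in the parsed content?
(define (nested-in? content child parent [inside? #f])
  (cond
    [(list? content)
     (ormap (lambda (c) (nested-in? c child parent inside?)) content)]
    [(x:element? content)
     (define name (x:element-name content))
     (or (and inside? (eq? name child))
         (nested-in? (x:element-content content) child parent
                     (or inside? (eq? name parent))))]
    [else #f]))

;; Parse a fragment with the html library.
(define (parse str)
  (h:read-html-as-xml (open-input-string str)))

;; Report whether each inline element stayed inside its <p>.
(for ([tag '(b ins)])
  (define src (format "<p>Hello world <~a>Testing</~a>!</p>" tag tag))
  (printf "~a inside p? ~s\n" tag (nested-in? (parse src) tag 'p)))
```

Running this before and after editing html-spec.rkt gives a quick check that a change to the content model had the intended effect.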
--
You received this message because you are subscribed to the Google Groups
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.