Thank you! I wasn't aware of the html-parsing library.

Jon


On 2/25/2016 11:21 AM, Jay McCarthy wrote:
You should double check against the HTML 4.01 spec

https://www.w3.org/TR/html4/

Since you mention "in the wild", I think you probably don't want to
use the html library but instead want to use

http://docs.racket-lang.org/html-parsing/index.html

Jay

On Thu, Feb 25, 2016 at 1:13 PM, jon stenerson <jonstener...@comcast.net> wrote:
I find that when I use the html library I have to make a few simple changes
to html-spec.rkt. It seems that <ins> and <del> are not treated like <b> and
<i>. You can see in this example that while <b> remains in the enclosing
<p>, <ins> does not. I also find that I have to allow pcdata as a child of
<ol> and <ul>. I don't know whether pcdata is "supposed to" appear there, but
it often does in the wild.

Jon



#lang racket

(require (prefix-in h: html) (prefix-in x: xml))

;; Convert parsed XML content into nested lists so it prints readably.
(define (xml->list x)
  (cond
    [(x:pcdata? x) (x:pcdata-string x)]
    [(x:entity? x) (list)]
    [(x:element? x)
     (list (x:element-name x)
           (map xml->list (x:element-content x)))]
    [(list? x) (map xml->list x)]))

;; <b> case
(printf "~s\n" (xml->list (h:read-html-as-xml
  (open-input-string "<p>Hello world <b>Testing</b>!</p>"))))
;; <ins> case
(printf "~s\n" (xml->list (h:read-html-as-xml
  (open-input-string "<p>Hello world <ins>Testing</ins>!</p>"))))

--
You received this message because you are subscribed to the Google Groups
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


