So right now it looks like the HTML parser only passes script tags through if
they have a src attribute. Is this likely to change, or should I use another
parser for HTML? I could submit a patch for this, of course.

Also, does anyone have an opinion on whether the underlying TagSoup stuff is
tolerant of malformed HTML (in a similar manner to browsers, which will try to
render anything), or does it expect well-formed HTML? I can go look at TagSoup
directly, of course, but I just wondered if anyone has experience of using Tika
to parse HTML.


Cheers,

Jim
