So right now it looks like the HTML parser only passes script tags through if they have a src attribute. Is this likely to change, or should I use another parser for HTML? I could submit a patch for this, of course.
Also, does anyone have an opinion on whether the underlying TagSoup code is tolerant of malformed HTML in a similar manner to browsers (which will try to render anything), or whether it expects well-formed HTML? I can go look at TagSoup directly, of course, but I just wondered if anyone has experience of using Tika to parse HTML. Cheers, Jim
