Some parsers produce glued words
--------------------------------
Key: TIKA-343
URL: https://issues.apache.org/jira/browse/TIKA-343
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.5, 0.6
Reporter: Piotr B.
Some parsers ignore word/line delimiters.
Document:
"<html><head></head><body>test<br>test</body></html>"
is decoded by HtmlParser to "testtest".
I think the HtmlParser.mapSafeElement method should be extended with:
if ("BR".equals(name)) return "br";
if ("DIV".equals(name)) return "div";
if ("HR".equals(name)) return "hr";
if ("ADDRESS".equals(name)) return "address";
if ("FIELDSET".equals(name)) return "fieldset";
if ("FORM".equals(name)) return "form";
if ("NOSCRIPT".equals(name)) return "noscript";
if ("NOFRAMES".equals(name)) return "noframes";
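The proposed additions could also be kept in one lookup table. The sketch below is standalone and only illustrates the idea: the class SafeElementSketch and the PROPOSED set are hypothetical names, and it assumes the method receives an upper-case element name and returns null for anything it does not treat as safe. It is not Tika's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical standalone sketch, not part of Tika.
class SafeElementSketch {
    // The block-level elements proposed in this report (upper-case names assumed).
    private static final Set<String> PROPOSED = new HashSet<String>(Arrays.asList(
            "BR", "DIV", "HR", "ADDRESS", "FIELDSET", "FORM", "NOSCRIPT", "NOFRAMES"));

    // Returns the lower-case element name if it is in the proposed set,
    // or null (assumed to mean "not a safe element") otherwise.
    static String mapSafeElement(String name) {
        return PROPOSED.contains(name) ? name.toLowerCase(Locale.ENGLISH) : null;
    }
}
```

With this, mapSafeElement("BR") gives "br", while a name outside the set, such as "SPAN", gives null.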
Also, application/xml documents are parsed by removing unknown tags instead of
replacing them with spaces.
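To make the difference concrete, here is a small illustration of the two behaviours. It uses a plain regular expression rather than a real XML parser, and the class GluedWordsDemo and its methods are hypothetical names invented for this example, so treat it as a sketch of the expected output and not as how Tika processes documents.

```java
import java.util.regex.Pattern;

// Hypothetical demo, not Tika code: compares dropping tags with
// replacing them by whitespace.
class GluedWordsDemo {
    // Naive tag matcher, good enough for this illustration only.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Reported behaviour: tags are removed, so adjacent words are glued.
    static String stripTags(String markup) {
        return TAG.matcher(markup).replaceAll("");
    }

    // Suggested behaviour: each tag becomes a space, then runs of
    // whitespace are collapsed and the ends are trimmed.
    static String tagsToSpaces(String markup) {
        String spaced = TAG.matcher(markup).replaceAll(" ");
        return spaced.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String doc = "<doc><a>foo</a><b>bar</b></doc>";
        System.out.println(stripTags(doc));    // prints "foobar"
        System.out.println(tagsToSpaces(doc)); // prints "foo bar"
    }
}
```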