[ https://issues.apache.org/jira/browse/TIKA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-343.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.6
Assignee: Jukka Zitting
Good points. I added the semantic <address> to the list of elements that get
passed through by default and used the automatic newline rules from
XHTMLContentHandler for any block level elements (or <br>) that would otherwise
be silently dropped.
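
As a rough illustration of the newline rule described above, here is a minimal standalone sketch. It is not the actual Tika code: the class name DroppedBlockNewlines, the regex-based tag scan, and the BREAKING set (taken from the element list in the report) are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DroppedBlockNewlines {
    // Assumed list, copied from the elements named in the report;
    // not the set that Tika actually uses.
    private static final Set<String> BREAKING = Set.of(
            "br", "div", "hr", "address", "fieldset", "form", "noscript", "noframes");

    // Naive tag matcher, good enough for a sketch (no comments, CDATA, etc.).
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Strips all tags, emitting one newline where a breaking element is dropped. */
    public static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());   // text before this tag
            String name = m.group(1).toLowerCase(Locale.ROOT);
            boolean needsBreak = out.length() > 0 && out.charAt(out.length() - 1) != '\n';
            if (BREAKING.contains(name) && needsBreak) {
                out.append('\n');               // keep adjacent words from gluing
            }
            pos = m.end();
        }
        out.append(html, pos, html.length());   // trailing text after the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><head></head><body>test<br>test</body></html>"));
    }
}
```

With the document from the report, the two words end up on separate lines instead of being glued into "testtest".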
See also TIKA-347 (and TIKA-304) for a way to customize the HTML mappings used
by Tika. By default, the idea is to pass through only those elements that a
generic client (i.e. one that has no domain-specific knowledge) can use to
better understand the semantics of the extracted text.
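
To make the idea of a customizable mapping concrete, here is a hedged sketch of how a conservative default could be combined with client-specific additions. The ElementMapper interface, the DEFAULT mapper, and the extend helper are hypothetical names invented for this example; they only borrow the mapSafeElement method name from the report and are not the API introduced in TIKA-347.

```java
import java.util.Locale;
import java.util.Map;

public class MappingSketch {
    /** Hypothetical interface for illustration only; not Tika's API. */
    public interface ElementMapper {
        /** Returns the output element name, or null if the element is not passed through. */
        String mapSafeElement(String name);
    }

    /** Illustrative default: passes through only a couple of generic elements. */
    public static final ElementMapper DEFAULT = name -> {
        switch (name.toUpperCase(Locale.ROOT)) {
            case "P": return "p";
            case "ADDRESS": return "address";
            default: return null;   // everything else is not passed through
        }
    };

    /** Wraps a base mapper so that a client with domain knowledge can add elements. */
    public static ElementMapper extend(ElementMapper base, Map<String, String> extra) {
        return name -> {
            String mapped = extra.get(name.toUpperCase(Locale.ROOT));
            return mapped != null ? mapped : base.mapSafeElement(name);
        };
    }

    public static void main(String[] args) {
        ElementMapper custom = extend(DEFAULT, Map.of("FORM", "form"));
        System.out.println(DEFAULT.mapSafeElement("FORM"));
        System.out.println(custom.mapSafeElement("FORM"));
    }
}
```

The point of the design is that the default stays generic, while a client that does understand, say, forms can opt in to more elements without changing the parser.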
> some parsers produces glued words
> ---------------------------------
>
> Key: TIKA-343
> URL: https://issues.apache.org/jira/browse/TIKA-343
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.5, 0.6
> Reporter: Piotr B.
> Assignee: Jukka Zitting
> Fix For: 0.6
>
>
> Some parsers ignore word/line delimiters.
> Document:
> "<html><head></head><body>test<br>test</body></html>"
> is decoded by HtmlParser to "testtest".
> I think the HtmlParser.mapSafeElement method should be extended by:
> if ("BR".equals(name)) return "br";
> if ("DIV".equals(name)) return "div";
> if ("HR".equals(name)) return "hr";
> if ("ADDRESS".equals(name)) return "address";
> if ("FIELDSET".equals(name)) return "fieldset";
> if ("FORM".equals(name)) return "form";
> if ("NOSCRIPT".equals(name)) return "noscript";
> if ("NOFRAMES".equals(name)) return "noframes";
> Also, application/xml documents are parsed by removing unknown tags instead of
> replacing them with spaces.