[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated TIKA-985:
-------------------------------
    Attachment: TIKA-985-1.5.patch

Dirty patch for Tika 1.5. This patch allows headings (h1...h6) to be embedded inside other elements, such as anchors. HTML5 permits this, and some pages already use it. Without this patch, the SAX events for such headings are reported out of order.

> Support for HTML5 elements
> --------------------------
>
>                 Key: TIKA-985
>                 URL: https://issues.apache.org/jira/browse/TIKA-985
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch,
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article,
> section). This prevents some custom ContentHandlers from reading expected
> elements and/or attributes.
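
For readers unfamiliar with what "out of order" means for SAX events, here is a minimal sketch of a recording handler. It is not taken from the attached patch and does not use Tika or TagSoup: it assumes only the JDK's built-in SAX parser and a well-formed XHTML fragment, and the names `SaxOrderDemo`, `RecordingHandler`, and `recordEvents` are hypothetical. It shows the nesting a custom ContentHandler would expect to see for a heading inside an anchor, with the h1 events falling between the start and end of the enclosing a element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxOrderDemo {

    /** Records start, text, and end events in the order the parser reports them. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Ignore whitespace-only chunks so the log only shows real text.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Parses a well-formed XML/XHTML fragment with the JDK parser and returns the event log. */
    static List<String> recordEvents(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // A heading nested inside an anchor, as HTML5 allows.
        System.out.println(recordEvents("<a href=\"#\"><h1>Title</h1></a>"));
    }
}
```

Plugging a handler like this into the HTML parser before and after applying the patch would be one way to compare the reported order of the heading and anchor events.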