[ https://issues.apache.org/jira/browse/TIKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-644.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Nick Burch

Good spot! Fixed in r1095429.

> parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-644
>                 URL: https://issues.apache.org/jira/browse/TIKA-644
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: chris hudson
>            Assignee: Nick Burch
>            Priority: Minor
>              Labels: doc, h, parser, tag
>             Fix For: 1.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> org.apache.tika.parser.microsoft.WordExtractor will translate heading styles to "h" tags with a level greater than 6 which means the xhtml is invalid. The xhtml DTD only defines header elements 1 to 6:
> <!ENTITY % heading "h1|h2|h3|h4|h5|h6">
> changing line 380 from:
> tag = "h"+num;
> to
> tag = "h"+Math.min(num, 6);
> will resolve this.
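
For illustration, the clamping proposed in the report can be sketched in isolation as below. This is a minimal sketch, not the code committed in r1095429 and not the actual WordExtractor source: the class name HeadingTagSketch and the method headingTag are hypothetical, and only the "h"+Math.min(num, 6) expression comes from the report.

```java
// Hypothetical stand-alone sketch of the fix suggested in TIKA-644.
// Only the Math.min(num, 6) clamp is taken from the report; the class
// and method names are made up for this example.
class HeadingTagSketch {

    // The XHTML DTD defines heading elements h1 through h6 only.
    static final int MAX_HEADING_LEVEL = 6;

    // Maps a Word "Heading N" level to an XHTML heading tag name,
    // capping any level above 6 at h6.
    static String headingTag(int num) {
        return "h" + Math.min(num, MAX_HEADING_LEVEL);
    }

    public static void main(String[] args) {
        System.out.println(headingTag(3)); // prints "h3"
        System.out.println(headingTag(6)); // prints "h6"
        System.out.println(headingTag(9)); // prints "h6" instead of the invalid "h9"
    }
}
```

With this mapping, "Heading 7" and deeper styles all come out as h6, so the nesting depth is lost but the output stays valid against the DTD. The sketch does not guard against levels below 1, since the report only addresses the upper bound.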