Parsing a Microsoft Word doc with style "Heading X" where X>6 creates invalid 
HTML with tags <h7>, <h8>, etc.
-----------------------------------------------------------------------------------------------------------

                 Key: TIKA-644
                 URL: https://issues.apache.org/jira/browse/TIKA-644
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: chris hudson
            Priority: Minor


org.apache.tika.parser.microsoft.WordExtractor translates heading styles 
directly to "h" tags, so a heading level greater than 6 produces tags such as 
<h7>, which makes the XHTML invalid. The XHTML DTD only defines heading 
elements 1 to 6:
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">

Changing line 380 from:
tag = "h"+num;
to:
tag = "h"+Math.min(num, 6);

will resolve this by capping the heading level at 6.
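
A minimal, standalone sketch of the capping behaviour, for illustration only. 
The class HeadingTags and the method headingTag are hypothetical names, not 
part of Tika, and the lower bound of 1 is an extra assumption not discussed in 
this report. Math.min is what enforces the upper limit: Math.max(num, 6) would 
instead raise every level to at least 6 and leave 7 and above unchanged.

```java
// Hypothetical helper, not Tika code: maps a Word heading level to a valid
// XHTML heading tag name.
final class HeadingTags {

    private HeadingTags() {
    }

    // Returns "h1".."h6". Levels above 6 are capped at 6 because the XHTML
    // DTD defines no h7 or higher. Levels below 1 are raised to 1 (an
    // assumption made for this sketch).
    static String headingTag(int level) {
        int capped = Math.min(level, 6);
        int clamped = Math.max(capped, 1);
        return "h" + clamped;
    }
}
```

With this helper, "Heading 3" still maps to h3, while "Heading 7" and 
"Heading 9" both map to h6.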

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
