[ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432189#comment-13432189 ]
Ken Krugler commented on TIKA-889:
----------------------------------

Hi John - I tried this with trunk, and it works as expected.

Yes, it's true that XHTMLDowngradeHandler will uppercase the element names, but then DefaultHtmlMapper.mapSafeElement() lower-cases them (I know, seems odd to me too). So the comparison works, and I see the expected output.

I'm adding a test case to validate behavior, at least for a simple <ul><li>xxx</li></ul> example.

> XHTMLContentHandler wont emit newline when html element matches ENDLINE set
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-889
>                 URL: https://issues.apache.org/jira/browse/TIKA-889
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: John Conwell
>            Assignee: Ken Krugler
>
> XHTMLContentHandler.endElement checks if the element is in the ENDLINE set to
> see if it should emit a newline. The html elements in ENDLINE are all lower
> case, but the HtmlParser class uses the XHTMLDowngradeHandler handler to
> upper case all html elements. This means that none of the html elements in
> the web page will match the elements in the ENDLINE set.
> This also is a problem with the INDENT set as well
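
For anyone following the thread, the case round-trip described in the comment can be sketched roughly as below. This is not Tika's code: the class name, the downgrade() and mapSafe() helpers, and the four-element ENDLINE subset are hypothetical stand-ins, assumed only to mirror the upper-casing and lower-casing steps mentioned above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CaseRoundTripSketch {

    // Illustrative subset only; not the real contents of Tika's ENDLINE set.
    static final Set<String> ENDLINE = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("p", "ul", "li", "br")));

    // Hypothetical stand-in for the upper-casing attributed to XHTMLDowngradeHandler.
    static String downgrade(String name) {
        return name.toUpperCase(Locale.ENGLISH);
    }

    // Hypothetical stand-in for the lower-casing attributed to
    // DefaultHtmlMapper.mapSafeElement().
    static String mapSafe(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    // True if the end of an element with this exact name should emit a newline.
    // The lookup is case-sensitive, which is the concern raised in the issue.
    static boolean endsLine(String name) {
        return ENDLINE.contains(name);
    }

    public static void main(String[] args) {
        for (String parsed : new String[] {"ul", "li", "span"}) {
            String upper = downgrade(parsed);   // what the downgrade step produces
            String mapped = mapSafe(upper);     // what reaches the ENDLINE check
            System.out.println(parsed + " -> " + upper + " -> " + mapped
                    + " : raw match=" + endsLine(upper)
                    + ", mapped match=" + endsLine(mapped));
        }
    }
}
```

If the lower-casing step were skipped, the upper-cased names would miss the set, which is the behavior the report expected to see. With the step in place, the lookup sees lower-case names again and the newline is emitted.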