[ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495754#comment-16495754 ]
Hudson commented on TIKA-2100:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #262 (See [https://builds.apache.org/job/tika-2.x-windows/262/])
TIKA-2100 -- fix unit test (tallison: rev 198d5ef995532f262c970f2ef76e64b852bed7f4)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java

> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>
> When parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the
> ContentHandler:
> * In the method startElement(String ns, String localName, String name,
>   Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the
>   HtmlMapper.mapSafeAttribute method either.
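
For reference, the behaviour the reporter expects from startElement can be sketched with the JDK's own SAX parser. This is only an illustration under stated assumptions: it does not use Tika's HtmlParser or HtmlMapper, the class name HtmlAttributeProbe and its helpers are hypothetical, and the sample page is given as well-formed XML with the DOCTYPE omitted so that a plain XML parser accepts it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe class; not part of Tika.
public class HtmlAttributeProbe {

    // The sample page from the report, written as well-formed XML (DOCTYPE omitted).
    static final String SAMPLE =
        "<html lang=\"en\"><head><title>Page Title</title></head>"
        + "<body><h1 align=\"left\">My First Heading</h1>"
        + "<p>My first paragraph.</p></body></html>";

    /** Records the attributes passed to startElement, keyed by element name. */
    static class AttributeRecorder extends DefaultHandler {
        final Map<String, Map<String, String>> seen = new LinkedHashMap<>();

        @Override
        public void startElement(String ns, String localName, String name, Attributes atts) {
            // Copy the attributes: the Attributes object is only valid during this callback.
            Map<String, String> copy = new LinkedHashMap<>();
            for (int i = 0; i < atts.getLength(); i++) {
                copy.put(atts.getQName(i), atts.getValue(i));
            }
            seen.put(name, copy);
        }
    }

    /** Parses the given XML with the JDK SAX parser and returns the recorded attributes. */
    static Map<String, Map<String, String>> record(String xml) throws Exception {
        AttributeRecorder recorder = new AttributeRecorder();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.seen;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Map<String, String>> seen = record(SAMPLE);
        System.out.println("html -> " + seen.get("html"));
        System.out.println("h1   -> " + seen.get("h1"));
    }
}
```

With a plain SAX parser the handler sees lang="en" on the html element and align="left" on h1. The report describes atts arriving empty for the html element when the same kind of handler is driven by Tika's HTML parser.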