[ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494082#comment-16494082 ]
Hudson commented on TIKA-2100:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-branch-1x #33 (See [https://builds.apache.org/job/tika-branch-1x/33/])
TIKA-2100 extract content language from html lang attribute (tallison: [https://github.com/apache/tika/commit/8d26096e9d579bee74ac01df9ae773e66e0bfc74])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java


> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>
> When parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you cannot access the html tag's attributes (here lang="en") in the
> ContentHandler:
> * In the method startElement(String ns, String localName, String name,
> Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the
> HtmlMapper.mapSafeAttribute method either.
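
For reference, the expectation in the report can be illustrated with a minimal sketch of the startElement callback. Note the assumptions: this uses the JDK's built-in SAX parser, not Tika's HtmlParser, and it only works because the sample document happens to be well-formed XML. The class name HtmlAttributeProbe and the helper attributesOf are hypothetical names chosen for illustration. It shows what a handler would normally see for the html element; per the report, the affected Tika versions delivered an empty atts for that element instead.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika.
public class HtmlAttributeProbe {

    // The sample document from the issue description.
    static final String SAMPLE = String.join("\n",
            "<!DOCTYPE html>",
            "<html lang=\"en\">",
            "<head>",
            "<title>Page Title</title>",
            "</head>",
            "<body>",
            "<h1 align=\"left\">My First Heading</h1>",
            "<p>My first paragraph.</p>",
            "</body>",
            "</html>");

    // Collects the attributes that startElement receives for the named element.
    static Map<String, String> attributesOf(String markup, String element) throws Exception {
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String ns, String localName, String name, Attributes atts) {
                // The JDK parser is not namespace-aware by default, so match on the qualified name.
                if (name.equals(element)) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        found.put(atts.getQName(i), atts.getValue(i));
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("html: " + attributesOf(SAMPLE, "html"));
        System.out.println("h1:   " + attributesOf(SAMPLE, "h1"));
    }
}
```

Running the same kind of handler against Tika's HTML parser is how the missing lang attribute would show up: the map collected for html would come back empty.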