[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592314#action_12592314 ]
julien nioche commented on TIKA-140:
------------------------------------

I had a closer look at the problem and found that it is due to the HTML element having attributes defined in a namespace (<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">). An element whose attributes have no explicit namespace (<html lang="en">) works fine.

> HTML parser unable to extract text
> -----------------------------------
>
>                 Key: TIKA-140
>                 URL: https://issues.apache.org/jira/browse/TIKA-140
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2-incubating
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
>             Fix For: 0.2-incubating
>
>         Attachments: 1.html
>
>
> At revision 648732
> The file in attachment is not parsed properly by the current HTML parser
> which returns an empty string when calling ParseUtils.getStringContent().
> Saving the same document as .txt from Firefox gives some text.
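
For illustration, here is a minimal Java sketch of how the two forms of the <html> tag quoted in the comment look to a namespace-aware SAX handler. It uses only the JDK's built-in XML parser, not Tika or the HTML parser discussed in this issue, so it does not reproduce the bug; it only shows that xml:lang arrives with a non-empty namespace URI while a plain lang attribute does not. The class name NamespacedAttributes and the helper describeRootAttributes are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NamespacedAttributes {

    // Hypothetical helper (not part of Tika): returns the attributes reported
    // for the <html> element, each formatted as {namespace-uri}local-name=value.
    public static List<String> describeRootAttributes(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // report namespace URIs and local names
        final List<String> seen = new ArrayList<>();

        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!"html".equals(localName)) {
                    return; // only the root element matters for this illustration
                }
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.add("{" + atts.getURI(i) + "}" + atts.getLocalName(i) + "=" + atts.getValue(i));
                }
            }
        });
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // The two variants of the opening tag quoted in the comment.
        String namespaced = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
                + "<body>text</body></html>";
        String plain = "<html lang=\"en\"><body>text</body></html>";

        System.out.println("namespaced: " + describeRootAttributes(namespaced));
        System.out.println("plain:      " + describeRootAttributes(plain));
    }
}
```

Running main prints one line per input, listing the attributes the handler received. A handler that looks attributes up by qualified name only, or that assumes every attribute has an empty namespace URI, could treat the two inputs differently; whether that is what happens inside the HTML parser here is an assumption, not something this sketch establishes.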