[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852095#action_12852095 ]
Julien Nioche commented on NUTCH-794:
-------------------------------------
The issue has not been fixed in Tika. Will refile post 1.1 as you suggested.
Can we update to Tika 0.7 before finalising 1.1?
> Language Identification must check the parse metadata for language values
> -------------------------------------------------------------------------
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document:
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following XHTML by Tika:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the
> tests don't break anymore.
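
To make the symptom easy to reproduce without Nutch or Tika on the classpath, here is a minimal sketch that uses only the JDK XML parser. It reads the lang attribute from the root element of the two documents quoted above. The class name LangAttributeCheck and the helper langOf are hypothetical, chosen for illustration; this is not the code in NUTCH-794.patch, and it works only because both samples happen to be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
public class LangAttributeCheck {

    // The HTML document from the issue description.
    static final String INPUT =
        "<html lang=\"fi\"><head>document 1 title</head>"
        + "<body>jotain suomeksi</body></html>";

    // The XHTML that the issue reports Tika producing for that document.
    static final String TIKA_OUTPUT =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
        + "<body>document 1 titlejotain suomeksi</body></html>";

    // Returns the root element's lang attribute, or "" when it is absent.
    // Assumes the markup is well-formed XML, which holds for both samples.
    static String langOf(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getAttribute("lang");
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("input lang:  '" + langOf(INPUT) + "'");
        System.out.println("output lang: '" + langOf(TIKA_OUTPUT) + "'");
    }
}
```

The first call finds the attribute and the second does not, which is why a language identifier that looks only at the parsed document has nothing to work with here.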