[ 
https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794
 ] 

Julien Nioche commented on TIKA-379:
------------------------------------

This is indeed a more general problem. It also affects HTML elements such as 
*link*, which are commonly used in the head section to specify favicons or 
canonical URLs. These values are not stored in the metadata either, and they 
are vital for a crawler.

Is there a specific reason why these elements are not rendered in the XHTML? I 
agree with Ken that it would be better not only to store this information in 
the metadata but also to be able to retrieve it from the SAX events. 
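
To illustrate what retrieving these values from the SAX events could look 
like on the consumer side, here is a minimal sketch. It assumes the XHTML 
output preserved the lang attribute on html and the link elements in head, 
which is exactly what this issue reports as missing in 0.7. The class name 
HeadLinkCollector and its methods are hypothetical, and the sketch uses the 
JDK's own SAX parser on an XHTML string rather than any Tika API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the html lang attribute and every link element.
public class HeadLinkCollector extends DefaultHandler {
    private String lang;
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("html".equals(localName)) {
            // Un-prefixed attributes live in the empty namespace.
            lang = atts.getValue("", "lang");
        } else if ("link".equals(localName)) {
            links.add(atts.getValue("", "rel") + " -> " + atts.getValue("", "href"));
        }
    }

    public String getLang() {
        return lang;
    }

    public List<String> getLinks() {
        return links;
    }

    // Parses a well-formed XHTML string and returns the populated handler.
    public static HeadLinkCollector collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        HeadLinkCollector handler = new HeadLinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Assumed input: what the XHTML might look like if nothing were dropped.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fi\">"
                + "<head><link rel=\"canonical\" href=\"http://example.com/\"/></head>"
                + "<body>jotain suomeksi</body></html>";
        HeadLinkCollector result = collect(xhtml);
        System.out.println(result.getLang());
        System.out.println(result.getLinks());
    }
}
```

A crawler could plug a handler like this into whatever ContentHandler chain 
it already uses, so the same pass that extracts text would also pick up the 
canonical URL and the document language.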

Any thoughts on this?





> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
> titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata 
> either.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
