[jira] [Commented] (TIKA-1514) http-equiv content-type extraction should pick first parseable content value

Tim Allison (JIRA) Wed, 14 Jan 2015 11:23:51 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277503#comment-14277503
 ]


Tim Allison commented on TIKA-1514:
-----------------------------------

I dug into this a bit.  It will take more effort than it is worth expending to 
fix this particular problem.  The encoding extractor is choosing the correct 
value if there is more than one.

However, I found that our HTMLParser is setting the "Content-Type" to whatever 
the last value of "content" is in an http-equiv header.

So, in this case:

{noformat}
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-6" 
content="application/pdf">
{noformat}

The metadata is:
{noformat}
Content-Encoding:ISO-8859-6
X-Parsed-By:org.apache.tika.parser.DefaultParser
X-Parsed-By:org.apache.tika.parser.html.HtmlParser
Content-Type:application/pdf
{noformat}

Or in this case:
{noformat}
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-6" 
content="blah de blah blah">
{noformat}

The metadata is:
{noformat}
Content-Encoding:ISO-8859-6
X-Parsed-By:org.apache.tika.parser.DefaultParser
X-Parsed-By:org.apache.tika.parser.html.HtmlParser
Content-Type:blah de blah blah
{noformat}

Shouldn't we be setting the Content-Type to "text/html; charset=ISO-8859-6" so 
that malformed or incorrect html won't yield incorrect Content-Type data?  We 
can include what the http-equiv Content-Type meta header alleges under a 
different key ("Content-Type-Meta-HTTP-Equiv"?), but I'd prefer "Content-Type" 
to mean the content type that Tika detected, not whatever the html happened to 
include.
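
For illustration, here is a rough sketch of the "first parseable value" idea, 
assuming the "content" attribute values of the meta element have already been 
collected in document order.  The class name, the pick() method and the regex 
are all hypothetical and are not taken from the Tika code base; in Tika the 
check would presumably rely on its own media type parsing rather than on this 
simplified pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class MetaContentTypePicker {

    // Simplified media type shape: type/subtype plus optional ";name=value"
    // parameters. This is an assumption for the sketch, not a full grammar.
    private static final Pattern MEDIA_TYPE = Pattern.compile(
            "(?i)^\\s*[a-z0-9.+_-]+/[a-z0-9.+_-]+\\s*"
            + "(;\\s*[a-z0-9_-]+\\s*=\\s*\"?[^\";]+\"?\\s*)*$");

    // Returns the first value that looks like a media type, in document order.
    static Optional<String> pick(List<String> contentValues) {
        for (String value : contentValues) {
            if (value != null && MEDIA_TYPE.matcher(value).matches()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "text/html; charset=iso-8859-6", "blah de blah blah");
        // The unparseable second value is ignored.
        System.out.println(pick(values).orElse("(none)"));
    }
}
```

With something like this, a later unparseable "content" value could no longer 
overwrite an earlier valid one, and a tag with no parseable value at all would 
leave "Content-Type" to whatever Tika detected.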

What do others think?


> http-equiv content-type extraction should pick first parseable content value 
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-1514
>                 URL: https://issues.apache.org/jira/browse/TIKA-1514
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 1.8
>
>
> In a handful of files from govdocs1, there are some creative http-equiv 
> content-type headers, including: 
> {noformat}
> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" 
> name="keywords" content="DNRC, division of nutrition">
> {noformat}
> The content type that is going into the metadata for this file is "DNRC, 
> division of nutrition".
> Let's modify our html metaheader charset detector to pick the first parseable 
> charset value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
