[ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023123#comment-15023123
 ] 

Sebastian Nagel commented on NUTCH-2158:
----------------------------------------

We need to pass the rendered HTML, as returned by the server (Jetty) for the 
JSP page, to Tika. This was done by adding a sleep to the unit test so that the 
document could be fetched manually:
{noformat}
% wget -O basic-http.jsp.html -d http://127.0.0.1:47504/basic-http.jsp
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...
Server: Jetty(6.1.26)
...
% cat basic-http.jsp.html 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <base href="http://127.0.0.1:47504/">
    
    <title>HelloWorld</title>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta name="Language" content="en" />
        <meta http-equiv="pragma" content="no-cache">
        <meta http-equiv="cache-control" content="no-cache">
        <meta http-equiv="expires" content="0">    
        <meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
        <meta http-equiv="description" content="This is my page">
        <!--
        <link rel="stylesheet" type="text/css" href="styles.css">
        -->
  </head>
  
  <body>
    Hello World!!! <br>
  </body>
</html>
% java -jar tika-app-1.10.jar -d basic-http.jsp.html 
application/xhtml+xml
% java -jar tika-app-1.11.jar -d basic-http.jsp.html 
text/html
{noformat}

It's definitely a change in Tika, probably introduced by TIKA-1771, which lowers 
the probability of {{application/xhtml+xml}} being detected.

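But we can probably live with the changed behavior. It's more an improvement 
than a bug:
- both the HTTP {{Content-Type}} header and the document's own {{http-equiv}} 
metadata claim {{text/html}}
- the document itself isn't clean XHTML: it declares an HTML 4.01 Transitional 
doctype and leaves {{<meta>}} and {{<br>}} elements unclosed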
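
To illustrate the second point, here is a minimal sketch that feeds an abbreviated copy of the fetched page to a strict XML parser from the JDK. It is not how Tika detects MIME types; it only shows that the page is not well-formed XML. The class name {{XhtmlWellFormedCheck}}, the helper {{isWellFormedXml}} and the "cleaned" variant of the page are hypothetical and used for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlWellFormedCheck {

    /** Hypothetical helper: true if the markup parses as well-formed XML. */
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings and rethrows fatal errors,
            // so nothing is printed to stderr on malformed input.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the markup: not well-formed XML.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("XML parser could not be set up or read input", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated copy of the page from this comment: HTML 4.01 doctype
        // without a system identifier, unclosed <meta> and <br>.
        String fetched =
            "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head><title>HelloWorld</title>\n"
            + "<meta http-equiv=\"pragma\" content=\"no-cache\">\n"
            + "</head>\n"
            + "<body>Hello World!!! <br>\n</body>\n"
            + "</html>";

        // Assumed "clean" variant: no HTML doctype, all elements closed.
        String cleaned =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>HelloWorld</title>"
            + "<meta http-equiv=\"pragma\" content=\"no-cache\" />"
            + "</head>"
            + "<body>Hello World!!! <br /></body>"
            + "</html>";

        System.out.println("fetched page well-formed: " + isWellFormedXml(fetched));
        System.out.println("cleaned page well-formed: " + isWellFormedXml(cleaned));
    }
}
```

The strict parser should reject the fetched page and accept the cleaned variant, which is consistent with treating the page as {{text/html}} rather than {{application/xhtml+xml}}.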

> Upgrade to Tika 1.11
> --------------------
>
>                 Key: NUTCH-2158
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2158
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
