[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023123#comment-15023123 ]
Sebastian Nagel commented on NUTCH-2158:
----------------------------------------

We need to pass the rendered HTML, returned by the server (Jetty) for the JSP page, to Tika. This was done by adding a sleep to the unit test so that the document could be fetched:

{noformat}
% wget -O basic-http.jsp.html -d http://127.0.0.1:47504/basic-http.jsp
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...
Server: Jetty(6.1.26)
...

% cat basic-http.jsp.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<base href="http://127.0.0.1:47504/">
<title>HelloWorld</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<meta name="Language" content="en" />
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
<meta http-equiv="description" content="This is my page">
<!-- <link rel="stylesheet" type="text/css" href="styles.css"> -->
</head>
<body>
Hello World!!! <br>
</body>
</html>

% java -jar tika-app-1.10.jar -d basic-http.jsp.html
application/xhtml+xml

% java -jar tika-app-1.11.jar -d basic-http.jsp.html
text/html
{noformat}

It's definitely a change in Tika, probably caused by TIKA-1771, which lowers the probability of {{application/xhtml+xml}}. But we can probably live with this changed behavior; it's more an improvement than a bug:
- both the HTTP header and the metadata claim {{text/html}}
- the document itself isn't clean XHTML

> Upgrade to Tika 1.11
> --------------------
>
>                 Key: NUTCH-2158
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2158
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)