I'm still interested in what steps I could take to get to the bottom of the 
failing tests.  Is there additional information that I should provide?

Some of the output below got mangled in the email - here are the (hopefully) 
complete lines:

This has a <a shape="rect" href="http://www.apache.org">link</a>. (Tika 0.9)
This has a <a href="http://www.apache.org">link</a>. (Tika 0.4)



________________________________
From: Andreas Kemkes <a5s...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Tue, March 22, 2011 10:30:57 AM
Subject: Solr 1.4.1 and Tika 0.9 - some tests not passing

Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like 
to upgrade it to Tika 0.9, as the issues are not occurring in Tika 0.9.

With the changes we made to Solr 1.4.1, we can successfully index the 
previously failing PDF documents.

Unfortunately we cannot get the HTML-related tests to pass.

The following asserts in ExtractingRequestHandlerTest.java are failing:

assertQ(req("title:Welcome"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertQ(req("t_href:http"), "//*[@numFound='2']");
assertQ(req("t_href:http"), "//doc[1]/str[.='simple3']");
assertQ(req("+id:simple4 +t_content:Solr"), "//*[@numFound='1']");
assertQ(req("defaultExtr:http\\://www.apache.org"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertTrue(val + " is not equal to " + "linkNews", val.equals("linkNews") == true); // there are two <a> tags, and they get collapsed

Below are the differences in output from Tika 0.4 and Tika 0.9 for simple.html.

Tika 0.9 has additional meta tags, a shape attribute, and some additional 
whitespace.  Is this what throws the tests off?

What do we need to consider so that Solr 1.4.1 will process the Tika 0.9 output 
correctly?

Do we need to configure different filters and tokenizers?  Which ones?

Or is it something else entirely?

Thanks in advance for any help,

Andreas

$ java -jar tika-app-0.4.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html


<?xml version="1.0" encoding="UTF-8"?>
<head>
<title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>

Here is some text in a div
This has a link'>http://www.apache.org";>link.


</body>
</html>

$ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
 

<?xml version="1.0" encoding="UTF-8"?>
<head>
<meta name="Content-Length" content="209"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="Content-Type" content="text/html"/>
<meta name="resourceName" content="simple.html"/>
<title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>

Here is some text in a div

This has a link'>http://www.apache.org";>link.

</body>
</html>
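
To compare the two anchor lines structurally rather than by eye, a minimal 
sketch along the following lines might help.  It uses only the JDK's SAX 
parser, not Tika or Solr, and simply lists every attribute a content handler 
would receive.  The class name AnchorAttributeDump, the method attributesOf, 
and the <p> wrapper around each line are assumptions made for illustration; 
they are not part of the Solr or Tika code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Solr or Tika.
public class AnchorAttributeDump {

    // Collects "element@attribute=value" for every attribute in the fragment.
    static List<String> attributesOf(String xhtml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // The two anchor lines from the Tika output, wrapped in <p> (an
        // assumption) so that each fragment is well-formed XML.
        String tika04 = "<p>This has a <a href=\"http://www.apache.org\">link</a>.</p>";
        String tika09 = "<p>This has a <a shape=\"rect\" href=\"http://www.apache.org\">link</a>.</p>";
        System.out.println("Tika 0.4: " + attributesOf(tika04));
        System.out.println("Tika 0.9: " + attributesOf(tika09));
    }
}
```

This only confirms that a handler sees one extra attribute (shape) on the <a> 
element with the Tika 0.9 output.  Whether that extra attribute is what makes 
the t_href asserts fail is not established by this sketch.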
