Thank you. That is valuable guidance. In light of the recent release of Solr 3.1, I decided to first try that distribution, as it already uses Tika 0.8, which is much closer to my target.
Out of the box (i.e., without replacing the Tika and PDFBox libraries) the tests pass, yet I see the error below. When I change

  ignoreException("unknown field 'a'");

to

  ignoreException("unknown field 'meta'");

in the testDefaultField test, the error output goes away. I am wondering whether that particular error is expected, or whether the error should in fact be "unknown field 'a'" and I am only masking an issue with the change.

All extraction tests also pass after I replace the Tika and PDFBox libraries with the newer versions.

-- Andreas

test:
    [junit] Testsuite: org.apache.solr.handler.ExtractingRequestHandlerTest
    [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 6.424 sec
    [junit]
    [junit] ------------- Standard Error -----------------
    [junit] 01/04/2011 22:49:59 org.apache.solr.common.SolrException log
    [junit] SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'meta'
    [junit]     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:321)
    [junit]     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    [junit]     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    [junit]     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    [junit]     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
    [junit]     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    [junit]     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    [junit]     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    [junit]     at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:337)
    [junit]     at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:373)
    [junit]     at org.apache.solr.handler.ExtractingRequestHandlerTest.testDefaultField(ExtractingRequestHandlerTest.java:156)

________________________________
From: Chris Hostetter <hossman_luc...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thu, March 31, 2011 7:19:05 PM
Subject: Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

: I'm still interested on what steps I could take to get to the bottom of the
: failing tests. Is there additional information that I should provide?

i'm not really up to speed on what might have changed in Tika 0.9 to cause this, but the best thing to do would probably be to look at what *does* work compared to what doesn't work.

if *none* of the asserts for dealing with an html doc work, that suggests that fundamentally something is just completely broken about the html parsing.

Consider this first assertion failure...

: assertQ(req("title:Welcome"), "//*[@numFound='1']");

...in the context of what you said tika 0.9 gives you for that doc on the command line...

: $ java -jar tika-app-0.9.jar
: ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
 ...
: <title>Welcome to Solr</title>

...if that basic little bit of info can't be extracted, then i'm guessing nothing is being extracted.

I would suggest you run the example (with the 0.9 tika jars) and manually attempt to index one document, and then use the schema browser to see exactly what gets indexed. you may need to experiment with tweaking the config options for the extraction handler.

-Hoss
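
P.S. On the ignoreException question at the top of this message: the sketch below is a toy model, not Solr code. It assumes the string passed to ignoreException is treated as a regular expression that is searched for inside the logged message; the class name IgnorePatternSketch and the method isSuppressed are made up for illustration. Under that assumption, the pattern "unknown field 'a'" cannot match a message that complains about 'meta', which would explain why the SEVERE output only disappears once the pattern names 'meta'. It does not answer whether 'meta' or 'a' is the field the test is supposed to trip over.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Toy model of pattern-based log suppression. The names here are
// hypothetical and only mimic the assumed behaviour described above.
public class IgnorePatternSketch {
    private final Set<String> ignorePatterns = new LinkedHashSet<>();

    // Register a regex; messages containing a match count as "expected".
    public void ignoreException(String pattern) {
        ignorePatterns.add(pattern);
    }

    // True if any registered regex is found somewhere in the message.
    public boolean isSuppressed(String message) {
        for (String regex : ignorePatterns) {
            if (Pattern.compile(regex).matcher(message).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String logged = "ERROR:unknown field 'meta'";

        IgnorePatternSketch original = new IgnorePatternSketch();
        original.ignoreException("unknown field 'a'");
        System.out.println(original.isSuppressed(logged)); // prints false

        IgnorePatternSketch changed = new IgnorePatternSketch();
        changed.ignoreException("unknown field 'meta'");
        System.out.println(changed.isSuppressed(logged));  // prints true
    }
}
```

If the assumption holds, a looser pattern such as "unknown field" would hide both variants, so keeping the pattern specific is what makes the difference between 'a' and 'meta' visible at all.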