: I'm still interested in what steps I could take to get to the bottom of the 
: failing tests.  Is there additional information that I should provide?

I'm not really up to speed on what might have changed in Tika 0.9 to cause 
this, but the best thing to do would probably be to compare what *does* 
work against what doesn't.

If *none* of the asserts dealing with an html doc work, that suggests 
that something is fundamentally and completely broken in the html 
parsing.

Consider this first assertion failure...

: assertQ(req("title:Welcome"), "//*[@numFound='1']");

...in the context of what you said Tika 0.9 gives you for that doc on the 
command line...

: $ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
:         ...
: <title>Welcome to Solr</title>

...if that basic little bit of info can't be extracted, then I'm guessing 
nothing is being extracted.

I would suggest you run the example (with the 0.9 tika jars) and manually 
attempt to index one document, and then use the schema browser to see 
exactly what gets indexed.
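
For what it's worth, here is a rough sketch of that manual indexing step in Python. It assumes the example server is running on the default http://localhost:8983/solr with the extraction handler registered at /update/extract; adjust both if your setup differs. The helper names (build_extract_url, post_file) are made up for illustration. Passing extractOnly=true should make the handler return what Tika extracted instead of indexing it, which is a quick way to see whether the title comes through at all.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: default example server and handler path; change if yours differ.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, extract_only=False, base=EXTRACT_URL):
    """Build the request URL for the extraction handler (hypothetical helper)."""
    params = [("literal.id", doc_id), ("commit", "true")]
    if extract_only:
        # Ask the handler to return the Tika output rather than index it.
        params.append(("extractOnly", "true"))
    return base + "?" + urlencode(params)


def post_file(path, url, content_type="text/html"):
    """POST the raw file body to the handler and return the response text."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Needs a running Solr; simple.html is the test resource mentioned above.
    print(post_file("simple.html", build_extract_url("doc1", extract_only=True)))
```

If the extractOnly response has no title in it, the problem is on the Tika side of the handler rather than in the field mapping.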

You may need to experiment with tweaking the config options for the 
extraction handler.
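
One way to experiment without editing solrconfig.xml each time is to override the handler's options as request parameters. A small sketch, assuming the uprefix and fmap.content parameters of the extraction handler and a schema with a dynamic attr_* field and a "text" field (check your own schema; these field names are guesses). extract_params is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def extract_params(doc_id, overrides=None):
    """Build a query string for the extraction handler (hypothetical helper).

    Entries in `overrides` are sent as request parameters, which should take
    precedence over the handler's configured defaults for that one request.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    params.update(overrides or {})
    return urlencode(params)


# Assumed field names: send unknown Tika fields to attr_*, body text to "text".
query = extract_params("doc1", {"uprefix": "attr_", "fmap.content": "text"})
print(query)
```

Append the resulting query string to the handler URL, re-index the same doc, and compare what shows up in the schema browser after each change.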

-Hoss
