OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF, etc.), as well as text files, but Solr won't index the contents of files inside a zip.

As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@doc1.txt"

and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@solr1.zip"

and commit, the file names are indexed, but not their contents.
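
For anyone wanting to reproduce this, the test setup above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the helper names (build_extract_url, make_zip, read_zip) and the sample file contents are made up for the example, and the script does not talk to Solr at all. It just builds the same request URL as the curl commands and confirms that the zip really holds both files with their text.

```python
import io
import zipfile
from urllib.parse import urlencode


# Hypothetical helper: rebuilds the extract URL used in the curl commands.
# The defaults mirror the values in this message (core0, docid 74, type 5).
def build_extract_url(base="http://localhost:8983/solr/core0", docid=74, doc_type=5):
    params = {
        "literal.docid": docid,
        "fmap.content": "text",
        "literal.type": doc_type,
    }
    return f"{base}/update/extract?{urlencode(params)}"


# Hypothetical helper: packs in-memory text documents into a zip archive,
# standing in for solr1.zip.
def make_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


# Hypothetical helper: reads the archive back so the entry names and
# contents can be checked before the file is sent anywhere.
def read_zip(data):
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


if __name__ == "__main__":
    # Sample contents are invented for the example.
    archive = make_zip({"doc1.txt": "first document", "doc2.txt": "second document"})
    print(build_extract_url())
    print(read_zip(archive))
```

The upload itself is still done with curl as shown; the sketch only rules out a malformed URL or an empty archive as the cause.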

I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar: it outputs both the file names and their contents. Should I be able to index the contents of files stored in a zip by using /update/extract?

Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:
Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk code.

Now I'm getting an error when trying to access the admin page (via Jetty). I specify HTMLStripStandardTokenizerFactory in my schema.xml, but it appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML?

Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:
On 25.01.11 11.30, Erlend Garåsen wrote:

Tika version 0.8 is not included in the latest release/trunk from SVN.

Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk.

Erlend



