Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Gary Taylor Tue, 25 Jan 2011 08:09:32 -0800

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF, etc.) as well as text files, but it won't index the content of files inside a zip.

As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@doc1.txt"


and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@solr1.zip"


and commit, the file names are indexed, but not their contents.
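The two requests above differ only in the uploaded file, so they can be wrapped in a small script. This is a minimal sketch, not a tested recipe: the function names (extract_url, index_file, commit) are hypothetical, and the commit step assumes the core's standard XML update handler is reachable at /update.

```shell
#!/bin/sh
# Base URL of the core used in the commands above.
SOLR_URL="http://localhost:8983/solr/core0"

# Hypothetical helper: build the extract URL for a given docid ($1) and type ($2).
extract_url() {
  printf '%s/update/extract?literal.docid=%s&fmap.content=text&literal.type=%s' \
    "$SOLR_URL" "$1" "$2"
}

# Hypothetical helper: post one file ($1) to the extract handler with docid ($2) and type ($3).
index_file() {
  curl "$(extract_url "$2" "$3")" -F "file=@$1"
}

# Assumption: an XML <commit/> posted to /update makes the new document searchable.
commit() {
  curl "$SOLR_URL/update" -H "Content-Type: text/xml" --data-binary "<commit/>"
}

# Example usage (not run automatically):
#   index_file doc1.txt 74 5 && commit
#   index_file solr1.zip 74 5 && commit
```

Running the two example lines in turn should reproduce the behaviour described: the first makes the file contents searchable, the second only the file names.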

I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar - it outputs both the filenames and contents. Should I be able to index the contents of files stored in a zip by using extract?
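A standalone check of this kind can be sketched as follows. The jar file name (tika-app-0.8.jar) and the contains_all helper are assumptions for illustration; --text is tika-app's plain-text output option.

```shell
#!/bin/sh
# Assumption: the tika-app jar is in the current directory under this name.
TIKA_JAR="${TIKA_JAR:-tika-app-0.8.jar}"

# Print the plain-text extraction of a file ($1) using tika-app.
tika_text() {
  java -jar "$TIKA_JAR" --text "$1"
}

# Hypothetical helper: read text on stdin and succeed only if
# every string given as an argument occurs somewhere in it.
contains_all() {
  text=$(cat)
  for needle in "$@"; do
    printf '%s\n' "$text" | grep -qF -- "$needle" || return 1
  done
}

# Example usage (not run automatically):
#   tika_text solr1.zip | contains_all doc1.txt doc2.txt && echo "both entries extracted"
```

If both file names and a known phrase from each file show up in the output, Tika itself is handling the zip, which points at the Solr side of the extraction.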


Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:

Thanks Erlend.
Not used SVN before, but have managed to download and build latest trunk code.
Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build so I get an exception cos it can't find that class. I've checked the CHANGES.txt and found the following in the change list to 1.4.0 (!?):
66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)
Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML?
Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:
On 25.01.11 11.30, Erlend Garåsen wrote:
Tika version 0.8 is not included in the latest release/trunk from SVN.
Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk.
Erlend
