Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Jayendra Patil Tue, 25 Jan 2011 08:48:56 -0800

Hi Gary,

The latest Solr trunk can extract and index the contents of a zip file
using the ExtractingRequestHandler.
The trunk snapshot we worked with included the Tika 0.8 snapshot jars and
worked well.


I tested again with a sample URL and it works fine:
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
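To repeat the request above for more than one archive, it can be wrapped in a small shell helper. This is only a sketch: the function name build_extract_url is hypothetical, the values are the sample ones from this message, and it assumes the file path needs no URL-encoding (no spaces or special characters).

```shell
# Hypothetical helper: prints the extract URL for one file.
#   $1 = base URL of the core
#   $2 = file path as seen by the Solr server (stream.file is read server-side)
#   $3 = value for literal.id
build_extract_url() {
  printf '%s/update/extract?stream.file=%s&literal.id=%s&literal.title=Test&commit=true\n' \
    "$1" "$2" "$3"
}

# Usage (needs a running Solr, so left commented out):
#   curl "$(build_extract_url "http://localhost:8080/solr/core0" "C:/temp/extract/777045.zip" "777045")"
```

Note that stream.file only works when remote streaming is enabled in solrconfig.xml, and the path must exist on the machine running Solr, not on the client.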

You will probably need to look closely at the Tika jars and
the apache-solr-cell-4.0-dev.jar used for rich document indexing.
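One way to check which of these jars an installation actually contains is to search its directory tree. A minimal sketch, assuming the jars follow the tika-*.jar and apache-solr-cell-*.jar naming; the function name list_extraction_jars and the directory passed to it are hypothetical and depend on your setup.

```shell
# Hypothetical helper: lists Tika and Solr Cell jars under a directory.
#   $1 = directory to search (for example the Solr home or the webapp lib directory)
list_extraction_jars() {
  find "$1" -type f \( -name 'tika-*.jar' -o -name 'apache-solr-cell-*.jar' \) | sort
}

# Usage: list_extraction_jars /path/to/solr
```

If the listing shows an older Tika version alongside the newer one, or no Solr Cell jar at all, that would be worth ruling out first.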

Regards,
Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor <g...@inovem.com> wrote:

> OK, got past the schema.xml problem, but now I'm back to square one.
>
> I can index the contents of binary files (Word, PDF etc...), as well as
> text files, but it won't index the content of files inside a zip.
>
> As an example, I have two txt files - doc1.txt and doc2.txt.  If I index
> either of them individually using:
>
> curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@doc1.txt"
>
> and commit, Solr will index the contents and searches will match.
>
> If I zip those two files up into solr1.zip, and index that using:
>
> curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@solr1.zip"
>
> and commit, the file names are indexed, but not their contents.
>
> I have checked that Tika can correctly process the zip file when used
> standalone with the tika-app jar - it outputs both the filenames and
> contents.  Should I be able to index the contents of files stored in a zip
> by using extract?
>
>
> Thanks and kind regards,
> Gary.
>
>
> On 25/01/2011 15:32, Gary Taylor wrote:
>
>> Thanks Erlend.
>>
>> Not used SVN before, but have managed to download and build latest trunk
>> code.
>>
>> Now I'm getting an error when trying to access the admin page (via Jetty)
>> because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but
>> this appears to be no longer supplied as part of the build, so I get an
>> exception because it can't find that class.  I've checked the CHANGES.txt and
>> found the following in the change list to 1.4.0 (!?):
>>
>> 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
>> HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory
>> deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an
>> arbitrary Tokenizer. (koji)
>>
>> Unfortunately, I can't seem to get that to work correctly.  Does anyone
>> have an example fieldType stanza (for schema.xml) for stripping out HTML?
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>>
>> On 25/01/2011 14:17, Erlend Garåsen wrote:
>>
>>> On 25.01.11 11.30, Erlend Garåsen wrote:
>>>
>>>> Tika version 0.8 is not included in the latest release/trunk from SVN.
>>>
>>> Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
>>>
>>> And to clarify, by "content" I mean the main content of a Word file.
>>> Title and other kinds of metadata are successfully extracted by the old 0.4
>>> version of Tika, but you need a newer Tika version (0.8) in order to fetch
>>> the main content as well. So try the newest Solr version from trunk.
>>>
>>> Erlend
>>>
>>>
>>
>>
>
