Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Gary Taylor Mon, 31 Jan 2011 05:28:27 -0800

Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.

When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading):


SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]
        },
    ignored_=ignored_(1.0)={
        [package-entry, package-entry]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
    ignored_stream_size=ignored_stream_size(1.0)={260},
    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
    ignored_content_type=ignored_content_type(1.0)={application/zip},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={                  doc2.txt    doc1.txt    }
    }
]

So, the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the CURL string, but that makes no difference. If I specify an invalid stream.type then I get an exception response, so I know it's being used.

When I send one of those txt files individually to the ExtractingRequestHandler, I get:


SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},

    ignored_stream_size=ignored_stream_size(1.0)={30},
    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={                The quick brown fox  }
    }
]

and we see the file contents in the "text" field.
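
Since single files extract correctly, one possible client-side workaround is to unpack the archive locally and post each entry to the handler on its own. The following is only a minimal sketch under stated assumptions: the endpoint URL, the helper names (zip_entries, extract_url, post_entry) and the choice of passing the entry name via resource.name are illustrative and not taken from this setup, and each entry would need its own docid value to avoid overwriting the previous one.

```python
import io
import urllib.request
import zipfile
from urllib.parse import urlencode

# Assumed endpoint: host, port and path are placeholders for illustration.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def zip_entries(zip_bytes):
    """Return (name, content) pairs for every regular file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            (info.filename, archive.read(info))
            for info in archive.infolist()
            if not info.is_dir()
        ]


def extract_url(base_url, doc_id, entry_name):
    """Build the extract URL for one entry.

    literal.docid sets the docid field directly; resource.name passes the
    entry name as a hint for content-type detection.
    """
    params = {
        "literal.docid": doc_id,
        "resource.name": entry_name,
        "commit": "true",
    }
    return base_url + "?" + urlencode(params)


def post_entry(url, content):
    """POST one entry's raw bytes to the handler and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=content,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be along the lines of looping over zip_entries(open("solr1.zip", "rb").read()) and calling post_entry(extract_url(EXTRACT_URL, some_id, name), content) for each pair. This sidesteps the ZIP parsing question rather than answering it, so it is a stopgap, not a fix.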

I'm using the following requestHandler definition in solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">

<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>

<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help me work out why it's only returning the file names and not the file contents when parsing a ZIP file?


Thanks and kind regards,
Gary.



On 25/01/2011 16:48, Jayendra Patil wrote:

Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with a sample URL and it works fine:
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra
