Can anyone shed light on this, or say whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.

When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading):

SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]
        },
    ignored_=ignored_(1.0)={
        [package-entry, package-entry]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
    ignored_stream_size=ignored_stream_size(1.0)={260},
    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
    ignored_content_type=ignored_content_type(1.0)={application/zip},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={                  doc2.txt    doc1.txt    }
    }
]

So, the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the curl command, but that makes no difference. If I specify an invalid stream.type I get an exception response, so I know the parameter is being used.
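
For anyone who wants to reproduce this, here is a minimal Python sketch that builds a comparable two-file ZIP and assembles an extract URL with stream.type forced. The host, port and path, the literal.* parameters, and the text of doc2.txt are assumptions for illustration only; they are not taken from my actual request.

```python
import io
import zipfile
from urllib.parse import urlencode


def build_test_zip(entries):
    """Return the bytes of a ZIP archive holding the given name -> text entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return buf.getvalue()


def build_extract_url(base_url, params):
    """Append URL-encoded request parameters to the extract handler URL."""
    return base_url + "?" + urlencode(params)


# Assumed file contents: only the doc1.txt text is visible in the logs.
zip_bytes = build_test_zip({
    "doc1.txt": "The quick brown fox",
    "doc2.txt": "placeholder text for doc2",
})

# Assumed endpoint and literal.* fields; adjust to your own installation.
url = build_extract_url(
    "http://localhost:8983/solr/update/extract",
    {
        "stream.type": "application/zip",
        "literal.docid": "74",
        "literal.type": "5",
        "commit": "true",
    },
)

print(len(zip_bytes), "bytes")
print(url)
```

The archive bytes can be written to disk and posted to the handler with curl or any HTTP client; the sketch only prepares the inputs and does not contact Solr.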

When I send one of those txt files individually to the ExtractingRequestHandler, I get:

SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
    ignored_stream_size=ignored_stream_size(1.0)={30},
    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={                The quick brown fox  }
    }
]

and we see the file contents in the "text" field.

I'm using the following requestHandler definition in solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Is there any further debug or diagnostic output I can get from Tika to help me work out why it returns only the file names, and not the file contents, when parsing a ZIP file?
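
One check that can be made independently of Tika is to confirm that the archive itself holds readable text. The Python sketch below lists each entry's name, uncompressed size and decoded text. It inspects only the ZIP file, not Tika's behaviour, and the describe_zip helper and the demo contents are hypothetical names and data chosen for illustration.

```python
import io
import zipfile


def describe_zip(data, encoding="ISO-8859-1"):
    """Return (name, uncompressed size, decoded text) for every entry in a ZIP given as bytes."""
    report = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            text = zf.read(info.filename).decode(encoding)
            report.append((info.filename, info.file_size, text))
    return report


# Self-contained demo archive; the doc2.txt text is an assumption.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("doc1.txt", "The quick brown fox")
    zf.writestr("doc2.txt", "placeholder text for doc2")

report = describe_zip(buf.getvalue())
for name, size, text in report:
    print(name, size, repr(text))
```

To inspect a real file, read it in binary mode and pass the bytes, for example describe_zip(open("solr1.zip", "rb").read()). If the entries decode cleanly here but the "text" field still contains only file names, the archive can be ruled out and the problem lies in the extraction step.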

Thanks and kind regards,
Gary.



On 25/01/2011 16:48, Jayendra Patil wrote:
Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

I tested again with a sample URL and it works fine:
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need to check the Tika jars and
the apache-solr-cell-4.0-dev.jar used for rich-document indexing.

Regards,
Jayendra

