Can anyone shed any light on this, and whether it could be a config
issue? I'm now using the latest SVN trunk, which includes the Tika 0.8
jars.
When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt)
to the ExtractingRequestHandler, I get the following log entry
(formatted for ease of reading):
SolrInputDocument[
  {
    ignored_meta=ignored_meta(1.0)={
      [stream_source_info, file, stream_content_type,
      application/octet-stream, stream_size, 260, stream_name, solr1.zip,
      Content-Type, application/zip]
    },
    ignored_=ignored_(1.0)={
      [package-entry, package-entry]
    },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
    ignored_stream_size=ignored_stream_size(1.0)={260},
    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
    ignored_content_type=ignored_content_type(1.0)={application/zip},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={ doc2.txt doc1.txt }
  }
]
So the data coming back from Tika when parsing a ZIP file does not
include the file contents, only the names of the files contained
therein. I've tried forcing stream.type=application/zip in the curl
command, but that makes no difference. If I specify an invalid
stream.type I get an exception response, so I know the parameter is
being used.
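To rule out the archive itself, here is a minimal sketch (Python, standard library only) that builds an equivalent two-entry ZIP and reads each entry back as text. The doc1.txt text is the one shown in the log further down; the doc2.txt text is a made-up placeholder, and the function names are only illustrative.

```python
import io
import zipfile


def build_test_zip(entries):
    """Return the bytes of a ZIP archive holding the given name -> text entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_zip_entries(data):
    """Return a name -> text mapping for every entry in the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


# doc1.txt uses the text seen in the log; doc2.txt is a placeholder (assumption).
SAMPLE = {"doc1.txt": "The quick brown fox", "doc2.txt": "placeholder text"}

if __name__ == "__main__":
    archive = build_test_zip(SAMPLE)
    for name, text in read_zip_entries(archive).items():
        print(name, "->", text)
```

If the entries read back correctly here, the archive is fine and the missing content is more likely down to how the ZIP is being parsed on the Solr/Tika side.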
When I send one of those txt files individually to the
ExtractingRequestHandler, I get:
SolrInputDocument[
  {
    ignored_meta=ignored_meta(1.0)={
      [stream_source_info, file, stream_content_type, text/plain,
      stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
    },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
    ignored_stream_size=ignored_stream_size(1.0)={30},
    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={ The quick brown fox }
  }
]
and we see the file contents in the "text" field.
I'm using the following requestHandler definition in solrconfig.xml:
<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
Is there any further debug or diagnostic output I can get from Tika to
help me work out why it returns only the file names, and not the file
contents, when parsing a ZIP file?
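One diagnostic that may be worth trying is the handler's extractOnly=true parameter, which returns what was extracted in the response instead of indexing it. Below is a minimal sketch in Python that builds such a request URL. The host, port and file path are assumptions rather than values from this setup, the function name is illustrative, and stream.file only works if remote streaming is enabled in solrconfig.xml.

```python
from urllib.parse import urlencode


def extract_only_url(base_url, stream_file):
    """Build an /update/extract URL that returns extracted content without indexing."""
    params = {
        "stream.file": stream_file,  # server-side path to the file to parse
        "extractOnly": "true",       # return the extraction result, do not index
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host, port and path; adjust to the local installation.
    print(extract_only_url("http://localhost:8983/solr", "/tmp/solr1.zip"))
```

Comparing the extractOnly response for the ZIP with the response for a single txt file should show whether the entry contents ever leave the parser.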
Thanks and kind regards,
Gary.
On 25/01/2011 16:48, Jayendra Patil wrote:
Hi Gary,
The latest Solr trunk was able to extract and index the contents of the
ZIP file using the ExtractingRequestHandler.
The snapshot of trunk we worked on had the Tika 0.8 snapshot jars and
worked pretty well.
I tested again with a sample URL and it works fine:
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
You will probably need to look more closely at the Tika jars and
the apache-solr-cell-4.0-dev.jar used for rich document indexing.
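As a rough way to check which of those jars a deployment actually has, here is a minimal Python sketch that walks a directory and lists jar files whose names mention Tika or Solr Cell. The directory to scan and the function name are assumptions; point it at wherever the Solr webapp or contrib libraries live.

```python
import re
import sys
from pathlib import Path

# Matches names such as tika-core-0.8.jar or apache-solr-cell-4.0-dev.jar
JAR_PATTERN = re.compile(r"(tika|solr-cell)", re.IGNORECASE)


def find_extraction_jars(lib_dir):
    """Return sorted names of jar files under lib_dir that look Tika or Solr Cell related."""
    root = Path(lib_dir)
    return sorted(
        p.name for p in root.rglob("*.jar") if JAR_PATTERN.search(p.name)
    )


if __name__ == "__main__":
    # The directory is a command-line argument; "." is only a fallback guess.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in find_extraction_jars(target):
        print(name)
```

Duplicate or mismatched Tika versions in the output would be one thing to look for.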
Regards,
Jayendra