Piero,
it sounds like you're looking for an integration of Solr Cell and Solr's
DIH facility -- a feature that isn't implemented yet (but it is already
tracked as SOLR-1358).
As a workaround, you could store the extracted contents in plain-text
files (either by using Solr Cell, or by using Apache Tika directly,
which is what Solr Cell uses under the hood). Afterwards, you could use
DIH's XPathEntityProcessor (to read the metadata from your XML files) in
conjunction with DIH's PlainTextEntityProcessor (to read the previously
created text files).
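A DIH data-config for that setup could look roughly like the sketch
below. All file paths, XPath expressions, and field names (id,
myfield-1) are assumptions based on your example -- adjust them to your
actual metadata layout:

```xml
<!-- data-config.xml: hypothetical sketch, not a drop-in config -->
<dataConfig>
  <dataSource name="files" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- read the metadata from an XML file -->
    <entity name="meta" processor="XPathEntityProcessor"
            dataSource="files"
            url="/path/to/metadata/doc1.xml"
            forEach="/doc">
      <field column="id" xpath="/doc/id"/>
      <field column="myfield-1" xpath="/doc/myfield-1"/>
      <!-- pull in the previously extracted plain-text content;
           PlainTextEntityProcessor exposes it in the "plainText" column -->
      <entity name="text" processor="PlainTextEntityProcessor"
              dataSource="files"
              url="/path/to/text/${meta.id}.txt">
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

To index many documents you would typically wrap this in an outer
FileListEntityProcessor entity that iterates over the metadata files.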
Another workaround would be to pass the metadata content as literal
parameters along with the /update/extract request, as described in [1].
This would require you to write a small program that parses your XML
metadata files and then constructs and sends the appropriate POST
requests.
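A minimal sketch of building such a request URL in plain Java follows.
The host, core path, and field names are assumptions; the actual file
upload (the POST body) is left to whatever HTTP client you prefer, e.g.
SolrJ or HttpURLConnection:

```java
import java.net.URLEncoder;

public class ExtractRequestUrl {

    // Build an /update/extract URL that carries metadata as literal.* params.
    // Field names ("id", "myfield-1") are taken from the example schema.
    public static String buildUrl(String id, String myFieldValue) throws Exception {
        StringBuilder url = new StringBuilder("http://localhost:8983/solr/update/extract");
        url.append("?literal.id=").append(URLEncoder.encode(id, "UTF-8"));
        url.append("&literal.myfield-1=").append(URLEncoder.encode(myFieldValue, "UTF-8"));
        url.append("&commit=true");
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // POST the binary file (doc, pdf, ...) to this URL;
        // Tika extracts the content, the literals become regular fields.
        System.out.println(buildUrl("doc-1", "some metadata value"));
    }
}
```

With SolrJ you would do the same thing via a ContentStreamUpdateRequest
against "/update/extract", adding the literal.* parameters with
setParam() and the file as a content stream.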
Best,
Sascha
[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
Rodolico Piero wrote:
Hi,
I need to index the contents of a file (doc, pdf, etc.) together with a
set of custom metadata specified in XML, like a standard Solr request.
From the documentation, I can extract the contents of a file with the
"/update/extract" request (Tika) and index metadata with a second
"/update" request by passing the XML. How do I do it all in a single
request (without using curl, but using a Java HTTP library or SolrJ)?
For example (although I know this is not correct):
<add>
  <doc>
    <field name="id"></field>
    <field name="myfield-1"></field>
    <field name="myfield-n"></field>
    <field name="content">content of the extracted file (text)</field>
  </doc>
</add>
Then I could search either by metadata or by full text on the content.
Sorry for my English ...
Thanks a lot.
Piero