Piero,

it sounds like you're looking for an integration of Solr Cell and Solr's DIH (DataImportHandler) facility -- a feature that isn't implemented yet (but the issue is already tracked as SOLR-1358).

As a workaround, you could store the extracted contents in plain text files (either by using Solr Cell, or by using Apache Tika directly, which is what Solr Cell uses under the hood). Afterwards, you could use DIH's XPathEntityProcessor (to read the metadata from your XML files) in conjunction with DIH's PlainTextEntityProcessor (to read the previously created text files).
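A data-config.xml for that could look roughly like the sketch below. This is only an illustration: the file paths, the XPath expressions, and the field names (id, myfield-1, content) are placeholders you'd replace with your own, and the exact layout of your metadata XML determines the forEach/xpath values.

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Read the metadata fields from the XML file -->
    <entity name="meta"
            processor="XPathEntityProcessor"
            url="/path/to/metadata/doc1.xml"
            forEach="/doc">
      <field column="id"        xpath="/doc/id"/>
      <field column="myfield-1" xpath="/doc/myfield-1"/>

      <!-- Nested entity: read the previously extracted plain text.
           PlainTextEntityProcessor puts the whole file content into
           the implicit "plainText" column, mapped here to "content". -->
      <entity name="body"
              processor="PlainTextEntityProcessor"
              url="/path/to/extracted/doc1.txt">
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In practice you'd probably drive the nested entity's url from a field of the outer entity (e.g. url="${meta.textfile}") so one DIH run covers all documents.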

Another workaround would be to pass the metadata content as literal parameters along with the /update/extract request, as described in [1]. This would require you to write a small program that constructs and sends appropriate POST requests by parsing your XML metadata files.
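Since Piero asked for a Java solution, here is a minimal sketch of the second workaround using only the JDK. It builds the literal.<field> query parameters for an /update/extract request; the host/port, core path, and field names are assumptions, and the values would come from parsing your XML metadata files. (If you use SolrJ instead, ContentStreamUpdateRequest with setParam("literal.id", ...) does the same thing.)

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractWithLiterals {

    // Builds the query string for /update/extract, passing each metadata
    // field as a literal.<field> parameter, as described in [1].
    static String buildQuery(Map<String, String> literals) {
        try {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : literals.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append("literal.")
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Hypothetical metadata, as it would come out of your XML files.
        Map<String, String> literals = new LinkedHashMap<String, String>();
        literals.put("id", "doc-1");
        literals.put("myfield-1", "value from the XML metadata file");

        // Assumed Solr location; adjust to your setup. You would then POST
        // the raw file bytes (e.g. the PDF) to this URL, with the file's
        // Content-Type set on the request.
        String url = "http://localhost:8983/solr/update/extract?"
                   + buildQuery(literals) + "&commit=true";
        System.out.println(url);
    }
}
```

One request per document then both extracts the file content (via Tika) and indexes your metadata fields, which is exactly the single-request behaviour asked for below.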

Best,
Sascha

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals

Rodolico Piero wrote:
Hi,

I need to index the contents of a file (doc, pdf, etc.) together with a set of
custom metadata specified in XML, like a standard Solr request.
From the documentation I can extract the contents of a file with the
"/update/extract" request (Tika) and index metadata with a second
"/update" request by passing the XML. How do I do it all in a single
request? (without using curl, but using a Java HTTP library or SolrJ). For
example (although I know that this is not correct):

<add>
  <doc>
    <field name="id"></field>
    <field name="myfield-1"></field>
    <field name="myfield-n"></field>
    <field name="content">content of the extracted file (text)</field>
  </doc>
</add>

So I can search either by metadata or by full text on the content.
Sorry for my English ...

Thanks a lot.

Piero
