The Tika integration with the DataImportHandler gives you control over
many aspects of what goes into the index, and it can solve this
problem:

http://wiki.apache.org/solr/TikaEntityProcessor

(Tika is the extraction library, and ExtractingRequestHandler and the
TikaEntityProcessor both use it.)
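
If you would rather stay with ExtractingRequestHandler, one possible
client-side workaround (a rough sketch, not something the handler does
for you) is to send the file system timestamp under a separate field so
it can never collide with the last_modified value Tika extracts. The
field name "fs_last_modified" and the class below are hypothetical; the
field would have to be added to your schema, and whatever reads the
index would fall back to it when last_modified is missing.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the literal.* parameter for an extract request.
// "fs_last_modified" is a hypothetical field name, not one that Solr
// or Tika defines; it must exist in your own schema.
final class LastModifiedParams {

    // Assumed schema field holding the file system timestamp.
    static final String FS_FIELD = "fs_last_modified";

    // Formats epoch milliseconds as a UTC timestamp with second
    // precision, in the yyyy-MM-dd'T'HH:mm:ss'Z' shape.
    static String toSolrDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.SECONDS)
                .toString();
    }

    // Returns the request parameters carrying the file timestamp under
    // the separate field, leaving last_modified to Tika alone.
    static Map<String, String> literalParams(long fileLastModifiedMillis) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal." + FS_FIELD, toSolrDate(fileLastModifiedMillis));
        return params;
    }

    public static void main(String[] args) {
        // Prints the parameter for the current time as a quick check.
        literalParams(System.currentTimeMillis())
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry of the returned map can be handed to up.setParam(key, value)
on the ContentStreamUpdateRequest from your snippet, in place of the
literal.last_modified parameter.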

On Thu, Feb 4, 2010 at 7:04 AM, Christoph Brill
<christoph.br...@chamaeleon.de> wrote:
> Hi list,
>
> I'm using the ExtractingRequestHandler to extract content from
> documents. It extracts the "last_modified" field just fine, but of
> course only for documents where this field is set. If this field is not
> set, I want to pass the file system timestamp of the file instead.
>
> I'm doing:
>
> final ContentStreamUpdateRequest up =
>   new ContentStreamUpdateRequest("/update/extract");
>
> up.setParam("literal.last_modified",
>   format.format(new Date(file.lastModified())));
>
> This works fine, but only for documents that don't contain a last
> modified field themselves. When they do (as many PDFs do), I get
>
> "multiple values encountered for non multiValued field last_modified"
>
> Is it possible to make ExtractingRequestHandler overwrite the
> last_modified I passed as parameter with the one Tika extracted?
>
> Thanks,
>  Chris
>



-- 
Lance Norskog
goks...@gmail.com
