The Tika integration with the DataImportHandler allows you to control many aspects of what goes into the index, including solving this problem:
http://wiki.apache.org/solr/TikaEntityProcessor

(Tika is the extraction library, and ExtractingRequestHandler and the TikaEntityProcessor both use it.)

On Thu, Feb 4, 2010 at 7:04 AM, Christoph Brill <christoph.br...@chamaeleon.de> wrote:
> Hi list,
>
> I'm using the ExtractingRequestHandler to extract content from
> documents. It's extracting the "last_modified" field quite fine, but of
> course only for documents where this field is set. If this field is not
> set I want to pass the file system timestamp of the file.
>
> I'm doing:
>
> final ContentStreamUpdateRequest up =
>     new ContentStreamUpdateRequest("/update/extract");
>
> up.setParam("literal.last_modified",
>     format.format(new Date(file.lastModified())));
>
> This works fine but only for documents that don't have a last modified
> field inside (like many PDFs have). Then I get
>
> "multiple values encountered for non multiValued field last_modified"
>
> Is it possible to make ExtractingRequestHandler overwrite the
> last_modified I passed as parameter with the one Tika extracted?
>
> Thanks,
> Chris
>

--
Lance Norskog
goks...@gmail.com
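
For illustration only, the fallback rule asked about in the quoted message (use the date embedded in the document when there is one, otherwise the file system timestamp) can be sketched in plain Java, independent of Solr. The class name LastModifiedFallback and both method names are hypothetical and are not part of Solr, SolrJ, or Tika. The sketch assumes the caller already knows the embedded date, if any, and it only picks one value and formats the file timestamp as an ISO 8601 UTC string.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Solr or Tika: chooses a single
// last_modified value so that only one value is ever sent for the field.
public final class LastModifiedFallback {

    private LastModifiedFallback() {
    }

    // Formats epoch milliseconds as an ISO 8601 UTC timestamp with
    // second precision, e.g. 1970-01-01T00:00:00Z.
    public static String toSolrDate(long epochMillis) {
        Instant instant = Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.SECONDS);
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }

    // Returns the extracted date if one is present (non-null, non-blank),
    // otherwise falls back to the file system timestamp.
    public static String choose(String extractedDate, long fileLastModifiedMillis) {
        if (extractedDate != null && !extractedDate.trim().isEmpty()) {
            return extractedDate.trim();
        }
        return toSolrDate(fileLastModifiedMillis);
    }
}
```

With a helper like this, literal.last_modified would be set only when no embedded date is found, for example with choose(null, file.lastModified()). How the client learns whether a document carries its own date is left open here.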