Hi,

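You can create a custom update request processor [1] to strip unwanted input
before the document is added to the index. The processor modifies the
document itself, so the stored value is cleaned as well as the indexed one.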

[1]: http://wiki.apache.org/solr/UpdateRequestProcessor
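
As a rough illustration of the cleaning step such a processor could perform, here is a minimal, self-contained sketch in plain Java. The class name StoredFieldCleaner, the regular expressions and the small entity table are assumptions made for this example and are not part of Solr. In a real setup you would call something like clean() on the relevant field values from the processAdd() method of your own UpdateRequestProcessor subclass, and register its factory in an update processor chain in solrconfig.xml. Check the wiki page above for the exact API of your Solr version.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Solr class: strips tags and decodes a few named entities.
public class StoredFieldCleaner {
    // Deliberately tiny entity table (an assumption for this sketch); extend as needed.
    private static final Map<String, String> ENTITIES = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
        "nbsp", " ", "eacute", "\u00e9");

    // Naive tag matcher: does not handle comments, scripts or '>' inside attributes.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    public static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        // Remove tags first, then decode entities, so a decoded "&lt;" is never taken for a tag.
        String noTags = TAG.matcher(raw).replaceAll("");
        Matcher m = ENTITY.matcher(noTags);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown entities are left exactly as they were.
            String replacement = ENTITIES.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>Caf&eacute; &amp; bar</p>"));
    }
}
```

A regex-based stripper like this is only good enough for simple markup. For messy HTML, a real parser (or the same stripping logic Solr uses at analysis time) would be a safer choice.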

Cheers,

On Monday 06 December 2010 17:36:09 Emmanuel Bégué wrote:
> Hello,
> 
> Is it possible to manipulate the value of a field before it is stored?
> 
> I'm indexing a database where some fields contain raw HTML, including
> named character entities.
> 
> Using solr.HTMLStripCharFilterFactory on the index analyzer, results
> in this HTML being correctly stripped, and named character entities
> replaced by the corresponding characters, in the index (as verified
> when searching, and with Luke).
> 
> But, the stored values of the documents are stored unmodified, so the
> result sets, including highlights, contain HTML tags (that are
> escaped) and "entities" (where the leading '&' is also escaped) which
> make handling the results quite difficult.
> 
> So, is it possible to apply some filters to the data before it is
> stored in the non-indexed fields?
> 
> I couldn't find a part of the documentation that said whether it was
> possible or not; I did find this message in the archives of this list:
> 
>     > From: Noble Paul
>     > Sent: Tuesday, March 31, 2009 5:41 PM
>     > Subject: Re: indexed fields vs stored fields
>     > 
>     > indexed = can be searched (mean you can use this to query). This
>     > undergoes tokenization filter etc
>     > 
>     > stored = can be retrieved. No modification to the data. This is
>     > stored verbatim
> 
> which seems to say that it is not possible; but maybe things have
> changed since then?
> 
> Any other ideas? Given that:
> - I have zero control over what is stored in the database
> - using the Solr XML update protocol I could probably transform the
> data before sending it
> - ... but I'd much rather continue using DataImportHandler to access
> the database
> 
> Thanks,
> Regards,
> EB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
