A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.

If it's just straight HTML, would the HTMLStripCharFilter be good enough?

http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2

On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <mark.robe...@red-gate.com> wrote:
> I'm uploading .htm files to be extracted - some of these files are "include" 
> files that have snippets of HTML rather than fully formed html documents.
>
> solr-cell stores the raw HTML for these items, rather than extracting the 
> text. Is there any way I can get solr to encode this content prior to storing 
> it?
>
> At the moment, I have the problem that when the highlighted snippets are  
> retrieved via search, I need to parse the snippet and HTML encode the bits of 
> HTML that where indexed, whilst *not* encoding the bits that where added by 
> the highlighter, which is messy and time consuming.
>
> Thanks! Mark,
>



-- 
Lance Norskog
goks...@gmail.com

Reply via email to