Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Lance Norskog Sat, 13 Mar 2010 14:55:53 -0800

HTMLStripCharFilter runs only inside the analyzer: it creates searchable
terms from the HTML input. The stored field value is still the raw HTML,
and that is what a search fetches.


There are some bugs in term positions and highlighting. An
EntityProcessor wrapping the HTMLStripCharFilter would be really
useful.
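
Until something like that exists, one workaround is to strip the markup
on the client before the document is sent to Solr, so that the stored
value is already plain text. The following is only a minimal sketch of
that idea, assuming a Python client; strip_html, the "id" and "content"
field names, and the sample snippet are all made up for illustration and
are not part of Solr or solr-cell. It does not reproduce
HTMLStripCharFilter's exact behaviour (for example, text inside
<script> or <style> is kept).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (illustrative helper, not a Solr API)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not glue
        # words together, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Return the plain text of an HTML fragment; partial snippets are fine."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Hypothetical "include" snippet and document; field names are assumptions.
raw = '<div class="inc"><p>Hello &amp; <b>welcome</b></p></div>'
doc = {"id": "1", "content": strip_html(raw)}
```

With the stored field holding plain text, the only markup in a returned
snippet should be what the highlighter added.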

On Tue, Mar 9, 2010 at 5:31 AM, Mark Roberts <mark.robe...@red-gate.com> wrote:
> Sounds like "solr.HTMLStripCharFilter" may work... except, I'm getting a 
> couple of problems:
>
> 1) HTML still seems to be getting into my content field
>
> All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" /> to 
> the index analyzer for my "text" fieldType.
>
>
> 2) Somehow it seems to have broken my highlighting; I get this error:
>
> 'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong 
> exceeds length of provided text sized 3862'
>
>
>
> Any ideas how I can fix this?
>
>
>
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: 09 March 2010 04:36
> To: solr-user@lucene.apache.org
> Subject: Re: HTML encode extracted docs
>
> A Tika integration with the DataImportHandler is in the Solr trunk.
> With this, you can copy the raw HTML into different fields and process
> one copy with Tika.
>
> If it's just straight HTML, would the HTMLStripCharFilter be good enough?
>
> http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2
>
> On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <mark.robe...@red-gate.com> 
> wrote:
>> I'm uploading .htm files to be extracted - some of these files are "include" 
>> files that have snippets of HTML rather than fully formed html documents.
>>
>> solr-cell stores the raw HTML for these items, rather than extracting the 
>> text. Is there any way I can get solr to encode this content prior to 
>> storing it?
>>
>> At the moment, I have the problem that when the highlighted snippets are 
>> retrieved via search, I need to parse the snippet and HTML encode the bits 
>> of HTML that were indexed, whilst *not* encoding the bits that were added 
>> by the highlighter, which is messy and time consuming.
>>
>> Thanks! Mark,
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com
