Thank you, Jack.

So it's not possible to search and highlight keywords within a
field that stores the raw formatted HTML, while stripping out the HTML
tags during analysis, so that a user would get back nothing if they
searched for a tag (e.g. <p>)?

On Mon, Apr 30, 2012 at 5:17 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I was thinking that you wanted to index the actual text from the HTML
> page, but have the stored field value still have the raw HTML with tags. If
> you just want to store only the raw HTML, a simple string field is
> sufficient, but then you can't easily do a text search on it.
>
> Or, you can have two fields: one string field for the raw HTML (stored,
> but not indexed), and then use a copyField to a text field that has the
> HTMLStripCharFilter to strip the HTML tags and index only the text
> (indexed, but not stored).
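That two-field setup might be sketched in schema.xml roughly like this (the field names html_raw and html_text, and the type name text_html, are illustrative, not from the thread):

```xml
<!-- Sketch only: illustrative field/type names. Assumes a text_html
     field type that uses HTMLStripCharFilter, defined elsewhere in
     schema.xml. -->
<field name="html_raw"  type="string"    indexed="false" stored="true"/>
<field name="html_text" type="text_html" indexed="true"  stored="false"/>

<!-- Copy the raw HTML into the analyzed field at index time -->
<copyField source="html_raw" dest="html_text"/>
```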
>
> -- Jack Krupansky
>
> -----Original Message----- From: okayndc
> Sent: Monday, April 30, 2012 5:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr: extracting/indexing HTML via cURL
>
> Great, thank you for the input. My understanding of HTMLStripCharFilter is
> that it strips HTML tags, which is not what I want. Is this correct? I
> want to keep the HTML tags intact.
>
> On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>
>> If by "extracting HTML content via cURL" you mean using SolrCell to parse
>> html files, this seems to make sense. The sequence is that regardless of
>> the file type, each file extraction "parser" will strip off all formatting
>> and produce a raw text stream. Office, PDF, and HTML files are all treated
>> the same in that way. Then, the unformatted text stream is sent through
>> the
>> field type analyzers to be tokenized into terms that Lucene can index. The
>> input string to the field type analyzer is what gets stored for the field,
>> but this occurs after the extraction file parser has already removed
>> formatting.
>>
>> There is no way for the formatting to be preserved in that case, other
>> than to go back to the original input document before extraction parsing.
>>
>> If you really do want to preserve full HTML formatted text, you would need
>> to define a field whose field type uses the HTMLStripCharFilter and then
>> directly add documents that direct the raw HTML to that field.
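A field type along those lines might look like this in schema.xml (a sketch; the type name text_html and the tokenizer/filter choices are illustrative):

```xml
<!-- Sketch: strip HTML tags at index time, then tokenize what remains -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analysis chain, a search for a tag such as <p> would match nothing, because the tags never become index terms.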
>>
>> There may be some other way to hook into the update processing chain, but
>> that may be too much effort compared to the HTML strip filter.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: okayndc
>> Sent: Monday, April 30, 2012 10:07 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr: extracting/indexing HTML via cURL
>>
>>
>> Hello,
>>
>> Over the weekend I experimented with extracting HTML content via cURL,
>> and I am wondering why the extraction/indexing process does not include
>> the HTML tags. It seems as though the HTML tags are either being ignored
>> or stripped somewhere in the pipeline. If this is the case, is it
>> possible to include the HTML tags, as I would like to keep the formatted
>> HTML intact?
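For reference, the extraction request was along these lines (a sketch; the URL, document id, and file name are assumptions about my setup):

```shell
# Sketch only: assumes a local Solr with the ExtractingRequestHandler
# (SolrCell) mounted at /update/extract, and an HTML file page.html
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@page.html;type=text/html"
```

SolrCell runs the file through Tika, which is where the tags get stripped before the field analyzers ever see the text.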
>>
>> Any help is greatly appreciated.
>>
>>
>
