The PatternReplace and HTMLStrip tokenizers might be the right bet.
The easiest way to go about this is to make a bunch of text fields
with different analysis stacks and investigate them in the Schema
Browser. You can paste an HTML document into the text box and see
exactly how the words & markup get torn apart.
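
If you want to prototype the idea outside Solr first, here is a rough
sketch in Python of the same two steps: a pattern replace that swaps each
opening anchor tag for its href value, then a strip of whatever markup is
left. The function and pattern names are made up for illustration, this is
not Solr code, and regexes like these will not cope with badly malformed
HTML.

```python
import re
from html import unescape

# Hypothetical names for illustration only; none of this is Solr API.
# Matches an opening <a ...> tag and captures the quoted href value.
ANCHOR_OPEN = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any remaining tag, opening or closing.
ANY_TAG = re.compile(r'<[^>]+>')


def keep_link_targets(fragment):
    """Strip HTML from a fragment but keep each link target as plain text."""
    # Step 1 (pattern replace): turn <a href="URL"> into " URL ".
    text = ANCHOR_OPEN.sub(lambda m: ' ' + m.group(2) + ' ', fragment)
    # Step 2 (HTML strip): drop every other tag, including </a>.
    text = ANY_TAG.sub(' ', text)
    # Step 3: decode entities such as &amp; and collapse runs of whitespace.
    return ' '.join(unescape(text).split())


if __name__ == '__main__':
    sample = ('This is the entire content of my field, but '
              '<a href="http://example.com/">some of the words</a> '
              'are a hyperlink.')
    print(keep_link_targets(sample))
```

The order matters: if the markup were stripped first, the href would be
gone before the pattern had a chance to see it. Whether the URL then stays
a single token depends on the tokenizer you put after it, which is what
the analysis experiment above will show you.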

On 6/6/10, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>
> Hi Solr gurus,
>
> I'm wondering if there is an easy way to keep the targets of hyperlinks from
> a field which may contain HTML fragments, while stripping the HTML.
>
> e.g. if I had a field that looked like this:
>
> "This is the entire content of my field, but <a
> href="http://example.com/">some of the words</a> are a hyperlink."
>
> Then I'd like to keep "http://example.com/" as a single token (along with
> all of the actual words) but not the "a" and "href", giving me:
>
> "This is the entire content of my field but http://example.com/ some of the
> words are a hyperlink"
>
> I'm thinking that since we're dealing with individual fragments rather than
> entire HTML pages, Tika/SolrCell may be poorly suited and/or too heavyweight
> -- but please correct me if I'm wrong.
>
> Maybe something using regular expressions? Does anyone have a code snippet
> they could share?
>
> Many thanks,
>
> Andrew.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p874547.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


-- 
Lance Norskog
goks...@gmail.com
