Could you tell us the schema you use for indexing? In my opinion,
StandardAnalyzer or SnowballAnalyzer would work best. They will not break
the URLs. Add href and the other related HTML tag names to the stop words,
and they will be removed while indexing.
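
If it turns out to be easier to clean the field before it reaches Solr, here is a minimal sketch of that alternative, assuming the fragment can be pre-processed in Python on the client side. It is not Solr configuration and not part of any analyzer; it only uses the standard library's html.parser, and the names LinkTextExtractor and extract_text_and_links are made up for illustration. It keeps the visible words and the href targets and drops the rest of the markup.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collects visible text and <a href> targets; all other markup is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []  # raw text chunks in document order
        self.links = []       # href values in document order

    def handle_starttag(self, tag, attrs):
        # Only anchors are of interest; every other tag is ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_text_and_links(fragment):
    """Hypothetical helper: returns (plain_text, [link_targets]) for an HTML fragment."""
    parser = LinkTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed tags.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.links


if __name__ == "__main__":
    sample = ('This is the entire content of my field, but '
              '<a href="http://example.com/">some of the words</a> are a hyperlink.')
    text, links = extract_text_and_links(sample)
    print(text)
    print(links)
```

The two results could then be sent to separate fields (say, one for the text and one for the link targets), each with its own analysis chain.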

Regards
Aditya
www.findbestopensource.com


On Mon, Jun 7, 2010 at 12:20 PM, Andrew Clegg <andrew.cl...@gmail.com> wrote:

>
>
> Lance Norskog-2 wrote:
> >
> > The PatternReplace and HTMLStrip tokenizers might be the right bet.
> > The easiest way to go about this is to make a bunch of text fields
> > with different analysis stacks and investigate them in the Schema
> > Browser. You can paste an HTML document into the text box and see
> > exactly how the words & markup get torn apart.
> >
>
> Thanks Lance, I'll experiment.
>
> For reference, for anyone else who comes across this thread -- the HTML in
> my original post might have got munged on the way into or out of the list
> server. It was supposed to look like this:
>
> This is the entire content of my field, but [a
> href="http://example.com/"]some of the words[/a] are a hyperlink.
>
> (but with real HTML tags instead of the square brackets)
>
> and I am just trying to extract the words and the link target but lose the
> rest of the markup.
>
> Cheers,
>
> Andrew.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p875503.html
>  Sent from the Solr - User mailing list archive at Nabble.com.
>
