Re: Indexing link targets in HTML fragments
Lance Norskog-2 wrote:
> The PatternReplace and HTMLStrip tokenizers might be the right bet. The easiest way to go about this is to make a bunch of text fields with different analysis stacks and investigate them in the Schema Browser. You can paste an HTML document into the text box and see exactly how the words and markup get torn apart.

Thanks Lance, I'll experiment.

For reference, for anyone else who comes across this thread -- the HTML in my original post might have got munged on the way into or out of the list server. It was supposed to look like this:

This is the entire content of my field, but [a href=http://example.com/]some of the words[/a] are a hyperlink.

(but with real HTML tags instead of the square brackets), and I am just trying to extract the words and the link target but lose the rest of the markup.

Cheers,

Andrew.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p875503.html
Sent from the Solr - User mailing list archive at Nabble.com.
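To make the goal in the message above concrete, here is a minimal sketch of the desired result, written in Python with the standard library's html.parser rather than with any Solr analysis component. The class name LinkTextExtractor and the helper extract() are made up for illustration, and the sample fragment restores the real tags that the list server stripped. It only shows what "keep the words and the link target, lose the rest of the markup" could mean; it is not how Solr would do it.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Hypothetical helper: keeps visible text and href values, drops other markup."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # text found between tags
        self.link_targets = []  # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        # Only the href attribute of anchor tags is kept.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.link_targets.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract(fragment):
    """Return (plain_text, link_targets) for an HTML fragment."""
    parser = LinkTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace left behind by removed markup.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.link_targets


fragment = (
    "This is the entire content of my field, but "
    '<a href="http://example.com/">some of the words</a> are a hyperlink.'
)
text, links = extract(fragment)
print(text)
print(links)
```

The two return values could then be sent to separate fields (one for the words, one for the link targets), though that split is an assumption about the eventual schema, which the thread does not specify.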
Re: Indexing link targets in HTML fragments
Could you tell us the schema you use for indexing?

In my opinion, using StandardAnalyzer or SnowballAnalyzer will work best. They will not break the URLs. Add href and other related HTML tag names to the stop words, and they will be removed while indexing.

Regards
Aditya
www.findbestopensource.com

On Mon, Jun 7, 2010 at 12:20 PM, Andrew Clegg andrew.cl...@gmail.com wrote:
> Lance Norskog-2 wrote:
>> The PatternReplace and HTMLStrip tokenizers might be the right bet. The easiest way to go about this is to make a bunch of text fields with different analysis stacks and investigate them in the Schema Browser. You can paste an HTML document into the text box and see exactly how the words and markup get torn apart.
>
> Thanks Lance, I'll experiment.
>
> For reference, for anyone else who comes across this thread -- the HTML in my original post might have got munged on the way into or out of the list server. It was supposed to look like this:
>
> This is the entire content of my field, but [a href=http://example.com/]some of the words[/a] are a hyperlink.
>
> (but with real HTML tags instead of the square brackets), and I am just trying to extract the words and the link target but lose the rest of the markup.
>
> Cheers,
>
> Andrew.
Re: Indexing link targets in HTML fragments
findbestopensource wrote:
> Could you tell us the schema you use for indexing?
>
> In my opinion, using StandardAnalyzer or SnowballAnalyzer will work best. They will not break the URLs. Add href and other related HTML tag names to the stop words, and they will be removed while indexing.

This project's still in the planning stages -- I haven't designed the pipeline yet.

But you're right, maybe starting with everything and just stopping out the tag and attribute names is the most fail-safe approach. Then at least if I get something wrong I won't miss anything. Worst case scenario, I just end up with some extra terms in the index.

Thanks,

Andrew.
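The "index everything, then stop out the tag and attribute names" idea discussed above can be sketched roughly as follows. This is plain Python, not Solr: the regular-expression tokenizer is only a simplified stand-in for a real analyzer (it is not StandardAnalyzer and makes no claim about how that analyzer treats URLs), and both MARKUP_STOP_WORDS and analyze() are hypothetical names chosen for the example.

```python
import re

# Hypothetical stop list of tag and attribute names (an assumption, not from the thread).
# Note that stopping "a" also removes the English article "a".
MARKUP_STOP_WORDS = {"a", "href", "p", "br", "div", "span"}

# Stand-in tokenizer: keep http(s) URLs whole, otherwise split into word tokens.
TOKEN_PATTERN = re.compile(r"https?://[^\s\"'<>]+|\w+")


def analyze(raw_html):
    """Lowercase, tokenize the raw markup, then drop markup stop words."""
    tokens = TOKEN_PATTERN.findall(raw_html.lower())
    return [token for token in tokens if token not in MARKUP_STOP_WORDS]


fragment = (
    "This is the entire content of my field, but "
    '<a href="http://example.com/">some of the words</a> are a hyperlink.'
)
tokens = analyze(fragment)
print(tokens)
```

As noted in the reply above, the worst case with this approach is some extra terms in the index: any tag or attribute name missing from the stop list simply survives as an ordinary token.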