Re: Indexing link targets in HTML fragments

2010-06-07 Thread Andrew Clegg


Lance Norskog-2 wrote:
 
 The PatternReplace and HTMLStrip tokenizers might be the right bet.
 The easiest way to go about this is to make a bunch of text fields
 with different analysis stacks and investigate them in the Schema
 Browser. You can paste an HTML document into the text box and see
 exactly how the words and markup get torn apart.
 

Thanks Lance, I'll experiment.
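
One way to try the replace-then-strip idea outside Solr is a small script. This is only a sketch in plain Python, not Solr configuration: the two regular expressions and the function name replace_then_strip are made up for illustration, and regexes will not cope with every kind of malformed HTML.

```python
import re

# Hypothetical patterns, not Solr's own: rewrite each anchor so the link
# target survives as plain text, then drop every remaining tag.
ANCHOR = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def replace_then_strip(html: str) -> str:
    # Step 1 (pattern replace): "<a href=URL>text</a>" becomes "text URL".
    with_targets = ANCHOR.sub(r"\2 \1", html)
    # Step 2 (strip): remove whatever markup is left.
    text = TAG.sub(" ", with_targets)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(text.split())
```

The order matters here: the anchor rewrite has to run before the generic tag removal, otherwise the href value is gone by the time you look for it.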

For reference, for anyone else who comes across this thread -- the html in
my original post might have got munged on the way into or out of the list
server. It was supposed to look like this:

This is the entire content of my field, but [a
href=http://example.com/]some of the words[/a] are a hyperlink.

(but with real HTML tags instead of the square brackets)

I am just trying to extract the words and the link target and discard the
rest of the markup.
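
To make the goal concrete, here is a rough sketch of that extraction done with Python's standard html.parser module before the text ever reaches the indexer. The class name TextAndLinks and the helper extract are illustrative names only, and doing this as a preprocessing step is an assumption, not something settled in this thread.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collects the text between tags plus the href of every <a> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []    # text chunks in document order
        self.targets = []  # href values in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)

    def handle_data(self, data):
        # Everything between tags lands here; the tags themselves are dropped.
        self.words.append(data)


def extract(fragment):
    """Return (plain_text, list_of_link_targets) for an HTML fragment."""
    parser = TextAndLinks()
    parser.feed(fragment)
    parser.close()
    text = " ".join("".join(parser.words).split())
    return text, parser.targets
```

The text and the link targets come back separately, so they could go into one field or two, whichever suits the schema.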

Cheers,

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p875503.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing link targets in HTML fragments

2010-06-07 Thread findbestopensource
Could you tell us the schema you use for indexing? In my opinion, using
StandardAnalyzer or the Snowball analyzer will work best. They will not break
the URLs. Add href and other related HTML tag names to the stop words and
they will be removed while indexing.

Regards
Aditya
www.findbestopensource.com





Re: Indexing link targets in HTML fragments

2010-06-07 Thread Andrew Clegg


findbestopensource wrote:
 
 Could you tell us the schema you use for indexing? In my opinion, using
 StandardAnalyzer or the Snowball analyzer will work best. They will not break
 the URLs. Add href and other related HTML tag names to the stop words and
 they will be removed while indexing.
 

This project's still in the planning stages -- I haven't designed the
pipeline yet.

But you're right, maybe starting with everything and just stopping out the
tag and attribute names is the most fail-safe approach.

Then at least if I get something wrong I won't miss anything. Worst case
scenario, I just end up with some extra terms in the index.
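
A quick way to see what "stopping out" the markup names would leave behind is to simulate it. The sketch below is plain Python with a stand-in tokenizer, not a reimplementation of any Lucene analyzer; the stop list MARKUP_STOP_WORDS and the function index_terms are hypothetical.

```python
import re

# Hypothetical stop list of tag and attribute names; extend as needed.
MARKUP_STOP_WORDS = {"a", "href", "b", "i", "p", "br", "div", "span", "img", "src"}

# A URL (kept whole) or a run of word characters. This is a stand-in
# tokenizer, not a reimplementation of any Lucene analyzer.
TOKEN = re.compile(r"https?://[^\s<>\"']+|\w+")


def index_terms(fragment):
    """Tokenize raw HTML and drop tokens that are tag or attribute names."""
    terms = []
    for match in TOKEN.finditer(fragment):
        token = match.group(0)
        # URLs keep their case; ordinary words are lowercased.
        term = token if "://" in token else token.lower()
        if term not in MARKUP_STOP_WORDS:
            terms.append(term)
    return terms
```

One thing this shows: a stop word such as "a" removes the English article as well as the tag name. Any tag or attribute name missing from the list simply stays in the index as an extra term, which is the worst case described above.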

Thanks,

Andrew.
