Hi,

That approach is workable if you know you will crawl and index only a
selected set of web pages that all follow the same design. Otherwise it
becomes a never-ending process of working out the sections, frames, divs,
spans, CSS classes and the like for each of the web pages, and it would
not scale.
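
To make the trade-off concrete, here is a minimal sketch of the stripping
idea in plain Python (standard library only). It is not Nutch code and does
not use the Nutch plugin API. The tag and class names in SKIP_TAGS and
SKIP_CLASSES are assumptions for one hypothetical page template, and
extract_main_text is a name made up for this illustration. A list like this
would have to be maintained per site design, which is exactly the
scalability problem.

```python
from html.parser import HTMLParser

# Assumed, site-specific rules: valid only for pages sharing one template.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
SKIP_CLASSES = {"sidebar", "related-links"}  # hypothetical class names

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainBodyExtractor(HTMLParser):
    """Collects text that lies outside the elements marked for skipping."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: just track nesting.
            self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if tag in SKIP_TAGS or classes & SKIP_CLASSES:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with the configured sections removed."""
    parser = MainBodyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter assumes reasonably well-formed markup. Unclosed tags
inside a skipped section would throw it off, which is one more reason the
approach only holds up on a known, uniform set of pages.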


dealmaker wrote:
> 
> Most webpages have sections like navigation, header, left column for
> related links, footer, etc.  How can I prevent Nutch from returning search
> results that contain keywords only in the non-main body of the page?  e.g.
> keywords can appear in navigation bar or footer, but they may not appear
> in the main body of the webpage, so this webpage may not be relevant.
> 
> Maybe I can:
> 
> a) specify sections to index?
> b) specify sections to not index?
> c) build a parse filter that strips out the content?
> 
> Thanks.
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-ignore-search-results-that-don%27t-have-related-keywords-in-main-body--tp22654668p22661252.html
Sent from the Nutch - User mailing list archive at Nabble.com.
