Hi,

You can do that if you know you will crawl and index only a selected set of web pages that all follow the same design. Otherwise it becomes a never-ending process: you would have to work out the sections, frames, divs, spans, CSS classes and the like for each web page individually. Scalability would obviously be an issue.
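To make the "same design" case concrete, here is a rough sketch of stripping known boilerplate sections before the text is indexed. It is plain Python using only the standard library and is not Nutch's parse-filter API; the tag names, the class/id markers and the function name `extract_main_text` are assumptions made up for illustration, and you would have to adapt them to the one design you are crawling.

```python
from html.parser import HTMLParser

# Assumed boilerplate markers for ONE known site design (hypothetical values).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
SKIP_MARKERS = {"nav", "navigation", "header", "footer", "sidebar", "related"}
# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not inside a boilerplate element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # (tag, starts_skipped_region) per open element
        self.skip_depth = 0  # > 0 while inside any skipped region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        markers = set((attrs.get("class") or "").split())
        if attrs.get("id"):
            markers.add(attrs["id"])
        skipped = tag in SKIP_TAGS or bool(markers & SKIP_MARKERS)
        self.stack.append((tag, skipped))
        if skipped:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for _, skipped in self.stack[i:]:
                    if skipped:
                        self.skip_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Return the page text with the assumed boilerplate sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The marker sets are the part that does not scale: every new site design needs its own list, which is the maintenance problem described above.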

dealmaker wrote:
>
> Most webpages have sections like navigation, header, left column for
> related links, footer, etc. How can I prevent Nutch from returning search
> results that contain keywords only in the non-main body of the page? e.g.
> keywords can appear in navigation bar or footer, but they may not appear
> in the main body of the webpage, so this webpage may not be relevant.
>
> Maybe I can:
>
> a) specify sections to index?
> b) specify sections to not index?
> c) build a parse filter that strips out the content?
>
> Thanks.
>

--
View this message in context: http://www.nabble.com/How-to-ignore-search-results-that-don%27t-have-related-keywords-in-main-body--tp22654668p22661252.html
Sent from the Nutch - User mailing list archive at Nabble.com.