Hi guys,

This is exactly what I was asking about in my post 'indexing just certain
content'. It may be worth reading; it might help you.

I was asking how to get rid of the garbage sections of a document and parse
only the important data. So I guess you will end up writing your own parser
and indexer, but the problem remains: how can we delete those garbage sections
from an HTML page? Have a look at my post. Maybe we can merge our two threads,
though I don't know whether threads can be merged on this mailing list, so
that there is only one to keep track of.
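
For what it's worth, the element-removal idea from the quoted reply can be
sketched roughly as follows. This is only a sketch under assumptions: it uses
the plain JDK DOM API and does not touch Nutch's HtmlParseFilter or
DOMContentUtils, the class name TemplateStripper is made up, and the ids
"menu" and "footer" are hypothetical stand-ins for whatever the site template
really uses. It also assumes well-formed XHTML input; real pages would need a
tolerant HTML parser first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class TemplateStripper {

    // Detach every element whose id attribute is listed in skipIds.
    static void removeTemplateElements(Node root, Set<String> skipIds) {
        List<Node> toRemove = new ArrayList<>();
        collect(root, skipIds, toRemove);
        // Remove after the walk so the tree is not modified while traversing it.
        for (Node n : toRemove) {
            n.getParentNode().removeChild(n);
        }
    }

    private static void collect(Node node, Set<String> skipIds, List<Node> out) {
        if (node instanceof Element) {
            // getAttribute returns "" when the attribute is absent.
            String id = ((Element) node).getAttribute("id");
            if (skipIds.contains(id)) {
                out.add(node);
                return; // the whole subtree goes away, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), skipIds, out);
        }
    }

    // Parse well-formed XHTML, drop the template elements, return the remaining text.
    static String extractText(String xhtml, Set<String> skipIds) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        removeTemplateElements(doc, skipIds);
        // Collapse whitespace runs so the output is a single clean line.
        return doc.getDocumentElement().getTextContent()
                .replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"menu\">Home About</div>"
                + "<div id=\"content\">Real article text</div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extractText(page, Set.of("menu", "footer")));
    }
}
```

In Nutch itself the same kind of tree walk would presumably live in a parse
filter plugin or in a patched DOMContentUtils, as suggested in the reply
below, but I have not tried that.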

Best regards




> Date: Sat, 10 Oct 2009 17:31:57 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: How to ignore search results that don't have related keywords in 
> main body?
> 
> winz wrote:
> > 
> > Venkateshprasanna wrote:
> >> Hi,
> >>
> >> You can very well think of doing that if you know that you would crawl and
> >> index only a selected set of web pages, which follow the same design.
> >> Otherwise, it would turn out to be a never ending process - i.e., finding
> >> out the sections, frames, divs, spans, css classes and the likes - from
> >> each of the web pages. Scalability would obviously be an issue.
> >>
> > 
> > Hi,
> > Could I please know how we can ignore template items like header, footer and
> > menu/navigations while crawling and indexing pages which follow the same
> > design??
> > I'm using a content management system called Infoglue to develop my website.
> > A standard template is applied for all the pages on the website.
> > 
> > The search results from Nutch shows content from menu/navigation bar
> > multiple times.
> > I need to get rid of menu/navigation content from the search result.
> 
> If all you index is this particular site, then you know the positions of 
> navigation items, right? Then you can remove these elements in your 
> HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
> elements.
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
                                          