As I explained in my post, the sections I don't want to index are the headers, top menus, right menus, and left menus; this is what I mean by garbage:

<div id='header'> bla bla </div>
<div id='top_menu'> bla bla </div>
<div id='left_menu'> bla bla </div>
<div id='right_menu'> bla bla </div>

Each page contains the same header and menu sections, and I don't want to index them because they are identical everywhere. In each page I still want to parse those sections to extract outlinks, but not index them, so I have to build filtered content (without those sections). But how do I construct this content, given that I don't know all the blocks and tags the pages will contain, and I don't even know whether they are well formed (it's just HTML)? The only thing I am sure about is that there is one template that applies to all pages, and that template is the div sections described above (menus, left menus, etc.). So I guess the easiest solution is to find a Java class that takes an HTML file and certain sections (<div id='header'>, ...) as parameters, deletes those sections from the HTML file, and produces the new, cleaned HTML.
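For illustration, here is a minimal sketch of that idea. It assumes an HTML parser that tolerates malformed markup, such as jsoup; the class and method names below are placeholders, not an existing Nutch API:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TemplateStripper {

    // Parse the raw (possibly malformed) HTML, remove every element
    // matched by the given CSS selectors, and return the cleaned markup.
    public static String strip(String rawHtml, String... selectors) {
        Document doc = Jsoup.parse(rawHtml);
        for (String selector : selectors) {
            doc.select(selector).remove();
        }
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div id='header'>bla bla</div><p>real content</p>";
        System.out.println(strip(html, "#header", "#top_menu",
                                 "#left_menu", "#right_menu"));
    }
}

You would index the string returned by strip(), while still parsing the original page for outlinks.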
> Date: Sat, 10 Oct 2009 18:21:47 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: How to ignore search results that don't have related keywords in main body?
>
> BELLINI ADAM wrote:
> > Hi guys, this is just what I am talking about in my post 'indexing
> > just certain content'; you can read it, maybe it could help you. I was
> > asking how to get rid of the garbage sections in a document and to
> > parse only the important data, so I guess you will create your own
> > parser and indexer. But the problem is how we can delete those garbage
> > sections from an HTML page. Try to read my post; maybe we can merge
> > our two threads (I don't know if that is possible on this mailing
> > list) to keep tracking only one post.
>
> What is garbage? Can you define it in terms of a regex pattern or an
> XPath expression that points to specific elements in the DOM tree? If
> you crawl a single site (or a few sites) with well-defined templates,
> then you can hardcode some rules for removing unwanted parts of the page.
>
> If you can't do this, then there are some heuristic methods to solve
> this. There are two groups of methods:
>
> * page at a time (local): this group of methods considers only the
> current page that you analyze. The quality of filtering is usually
> limited.
>
> * groups of pages (e.g. per site): these methods consider many pages at
> a time and try to find a recurring theme among them. Since you first
> need to accumulate some pages, this can't be done on the fly, i.e. it
> requires a separate post-processing step.
>
> The easiest to implement in Nutch is the first approach (page at a
> time). There are many possible implementations, e.g. based on text
> patterns, on the visual position of elements, on DOM tree patterns, on
> "block of content" characteristics, etc.
>
> Here's, for example, a simple method:
>
> * collect text from the page in blocks, where each block fits within
> structural tags (div and table tags). Also collect the number of <a>
> links in each block.
>
> * remove a percentage of the smallest blocks where the link count is
> high; these are likely navigational elements.
>
> * reconstruct the whole page from the remaining blocks.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web | Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
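For what it's worth, here is a rough sketch of the simple page-at-a-time method described above. This is my own reading of it, not Andrzej's implementation; it again assumes jsoup, and the thresholds are invented for the example and would need tuning:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BlockFilter {

    // Collect one text block per structural element (div or table), count
    // the <a> links inside it, drop small link-heavy blocks (likely
    // navigation), and rebuild the page text from what remains. Nested
    // blocks are counted twice here, which a real implementation would
    // have to handle.
    public static String filter(String rawHtml) {
        Document doc = Jsoup.parse(rawHtml);
        StringBuilder page = new StringBuilder();
        for (Element el : doc.select("div, table")) {
            String text = el.text();
            int links = el.select("a").size();
            // Invented thresholds: a block under 100 characters with 3 or
            // more links is treated as a navigational element.
            boolean smallAndLinky = text.length() < 100 && links >= 3;
            if (!smallAndLinky) {
                page.append(text).append('\n');
            }
        }
        return page.toString();
    }
}

The output of filter() would then be the "cleaned" content that gets indexed, in place of the raw page text.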