As I explained in my post, the sections I don't want to index are headers,
top menus, right menus, and left menus:
this is what I mean by garbage.
<div id='header'>     bla bla </div>
<div id='top_menu'>   bla bla </div>
<div id='left_menu'>  bla bla </div>
<div id='right_menu'> bla bla </div>

Each page contains the same header and menu sections, and I don't want to
index them because they are identical everywhere. So on each page I just
want to parse those sections to get the outlinks, but I don't want to index
them. That means I have to create a filtered version of the content (without
those sections). But how do I construct this content, since I don't know all
the blocks and tags these pages will contain, and I don't even know whether
they are well formed (it's just HTML)?
The only thing I'm sure about is that there is a template which applies to
all pages; this template consists of the div sections described above
(header, menus, left menu, etc.).
So I guess the easiest solution is to find a Java class which takes an HTML
file and certain sections, such as

<div id='header'>...

as parameters, deletes those sections from the HTML file, and produces the
new, cleaned HTML.
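
Here is a minimal sketch of such a class, assuming the jsoup library (which
builds a DOM even from malformed HTML) and the template ids shown above;
the class name TemplateStripper is mine, not from any existing code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TemplateStripper {

    /**
     * Parses raw (possibly malformed) HTML, removes the elements with
     * the given ids, and returns the cleaned markup.
     */
    public static String strip(String html, String... idsToRemove) {
        Document doc = Jsoup.parse(html);
        for (String id : idsToRemove) {
            // select the element by id and detach it (with its subtree)
            doc.select("#" + id).remove();
        }
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div id='header'>bla bla</div><p>real content</p>";
        System.out.println(
            strip(html, "header", "top_menu", "left_menu", "right_menu"));
    }
}

Since the outlinks should still be extracted from the menus, the link
extractor would run on the full document first, and only the stripped
text would be passed to the indexer.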



> Date: Sat, 10 Oct 2009 18:21:47 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: How to ignore search results that don't have related keywords in 
> main body?
> 
> BELLINI ADAM wrote:
> > Hi guys... it's just what I'm talking about in my post 'indexing
> > just certain content'... you can read it, maybe it could help you... I
> > was asking how to get rid of the garbage sections in a document and
> > parse only the important data... so I guess you will create your
> > own parser and indexer... but the problem is how we could delete those
> > garbage sections from an HTML page... try to read my post... maybe we
> > can merge our two posts... I don't know if we can merge posts on this
> > mailing list... to keep track of only one post...
> 
> What is garbage? Can you define it in terms of a regex pattern or an
> XPath expression that points to specific elements in the DOM tree? If
> you crawl a single site (or a few) with well-defined templates, then you
> can hardcode some rules for removing unwanted parts of the page.
> 
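As a sketch of such a hardcoded rule, here is one way to delete template
elements with an XPath expression, using the standard javax.xml.xpath API
on a W3C DOM; a real page would first need an HTML-tolerant parser (e.g.
NekoHTML, which Nutch's parse-html plugin can use) to produce that DOM,
and the class name and ids here are only illustrative:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCleaner {

    /** Removes every node matched by the XPath expression from the DOM. */
    public static void removeNodes(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        for (int i = nodes.getLength() - 1; i >= 0; i--) {
            Node n = nodes.item(i);
            n.getParentNode().removeChild(n);
        }
    }

    public static void main(String[] args) throws Exception {
        // well-formed sample for demonstration; a real page would come
        // from an HTML-tolerant parser such as NekoHTML
        String page = "<html><body><div id='header'>menu</div>"
                    + "<p>real content</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
        removeNodes(doc, "//div[@id='header' or @id='top_menu']");
    }
}
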
> If you can't do this, then there are some heuristic methods to solve
> the problem. They fall into two groups:
> 
> * page at a time (local): this group of methods considers only the 
> current page that you analyze. The quality of filtering is usually limited.
> 
> * groups of pages (e.g. per site): these methods consider many pages at
> a time, and try to find recurring themes among them. Since you first need
> to accumulate some pages, this can't be done on the fly, i.e. it requires
> a separate post-processing step.
> 
> The easiest to implement in Nutch is the first approach (page at a 
> time). There are many possible implementations - e.g. based on text 
> patterns, on visual position of elements, on DOM tree patterns, on 
> "block of content" characteristics, etc.
> 
> Here's, for example, a simple method:
> 
> * collect text from the page in blocks, where each block fits within
> structural tags (div and table tags). Also collect the number of <a>
> links in each block.
>
> * remove a percentage of the smallest blocks where the number of links
> is high - these are likely navigational elements.
> 
> * reconstruct the whole page from the remaining blocks (a rough code
> sketch follows below).
> 
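A minimal sketch of that method, again assuming jsoup for parsing; for
brevity it uses fixed, arbitrary thresholds instead of cutting a
percentage of the smallest blocks, and all names and numbers here are
illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BlockFilter {

    /**
     * Rebuilds the page text from div/table blocks, dropping small,
     * link-dense blocks that are likely navigational elements.
     */
    public static String filteredText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        for (Element block : doc.select("div, table")) {
            String text = block.ownText();   // text directly in this block
            int links = 0;
            for (Element child : block.children()) {
                if ("a".equals(child.tagName())) {
                    links++;                 // count direct <a> children
                }
            }
            boolean small = text.length() < 200; // arbitrary threshold
            boolean linkHeavy = links > 3;       // arbitrary threshold
            if (!text.isEmpty() && !(small && linkHeavy)) {
                out.append(text).append('\n');
            }
        }
        return out.toString();
    }
}
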
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
                                          