Re: [Nutch-general] The ranking is wrong

Naess, Ronny Fri, 29 Jun 2007 07:16:38 -0700

Thanks both of you.

I think this might be something I must do.

I have played around with the parse-html plugin, but I haven't found the correct 
place to hook into yet. I can print out the text (even the menu text), but it 
has already been stripped of its HTML markup, so I must be in the wrong place. I 
printed the text out from getTextHelper(...) in DOMContentUtils.

Any pointers to where I should start? Actually, I would have liked to specify 
a regex matching the class or id of a div to skip; that would have been really 
nice. Of course that is not something everyone would want, but for intranet 
searching, where you normally know what to index, it might be helpful.

This is the kind of thing I would like to go away:

<div class="menuContainer">
...lots of unwanted content...
</div>

-Ronny
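
To make the request concrete, here is a minimal sketch of the "skip a div by
class regex" idea. It is written in Python with the standard html.parser
module, not against the Nutch parse-html plugin, so it only illustrates the
logic; SkippingTextExtractor and extract_text are hypothetical names, and the
sketch assumes that div tags are properly closed.

```python
import re
from html.parser import HTMLParser


class SkippingTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring divs whose class matches a regex."""

    def __init__(self, skip_pattern):
        super().__init__()
        self.skip_re = re.compile(skip_pattern)
        self.skip_depth = 0   # > 0 while inside a skipped div (counts nested divs)
        self.chunks = []      # text fragments that were kept

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            self.skip_depth += 1      # nested div inside a skipped div
        elif self.skip_re.search(dict(attrs).get("class") or ""):
            self.skip_depth = 1       # start skipping at this div

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skip_pattern):
    """Return the page text with all matching divs removed."""
    parser = SkippingTextExtractor(skip_pattern)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<div class="menuContainer"><a href="/">Home</a><div>Sub</div></div>'
    '<div class="content"><p>Real article text.</p></div>'
)
print(extract_text(page, r"^menu"))
```

Only div nesting is counted, so an unclosed inline tag inside the menu does not
confuse the depth tracking. An equivalent hook in Nutch would have to run
before the markup is discarded, which is presumably earlier than the point
where getTextHelper(...) sees the text.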


-----Original Message-----
From: Doğacan Güney [mailto:[EMAIL PROTECTED] 
Sent: 27 June 2007 13:06
To: [EMAIL PROTECTED]
Subject: Re: The ranking is wrong

On 6/27/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Naess, Ronny wrote:
> > Thanks, Ann.
> >
> > You gave me some good pointers.
> >
> > I see that the navigation menu is giving me all the trouble with 
> > ranking. Does somebody know a way to make the parser skip some content?
> > I would like the parser to skip the global header and navigation menu so 
> > the content contains only the unique stuff, not everything. I guess this is 
> > not a simple thing.
>
>
> No, it's not. Do a Google search for "template detection".
>
> A crude approach, which still might be sufficient in your case, is to 
> do the following:
>
> * remove all font/color/style formatting elements, and coalesce their 
> text children with their parents. E.g.
>
>         this is <span style="abc">a text</span>
>         <b>with bold</b> fragment
>
> becomes:
>         this is a text with bold fragment
>
> * do the same with all non-divisional (structural) tags, i.e. any 
> formatting tags except for div-s, tables and iframe-s.
>
> * sort the remaining text blocks by size
>
> * drop a certain number (or percentage) of the smallest of the text blocks.
>
> * put the blocks back in order, and extract only their text content.
> This is the "main body" text.
>
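
The steps quoted above can be sketched roughly as follows. This is an
illustration in Python using the standard html.parser module, not Nutch code.
BlockCollector, main_body and the drop_fraction parameter are hypothetical
names, and treating every div/table/iframe boundary as a block boundary is a
simplifying assumption.

```python
from html.parser import HTMLParser

# Assumption: only these tags delimit text blocks; all other tags are
# treated as inline and their text is merged into the enclosing block.
BLOCK_TAGS = {"div", "table", "iframe"}


class BlockCollector(HTMLParser):
    """Hypothetical helper: splits a page into text blocks at block-tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks, in document order
        self.current = []   # raw text fragments of the block being built

    def _flush(self):
        # Join the fragments and normalize all whitespace runs to one space.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.current.append(data)

    def close(self):
        super().close()
        self._flush()       # keep any trailing text after the last block tag


def main_body(html, drop_fraction=0.5):
    """Drop the smallest blocks and return the rest in document order."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = collector.blocks
    n_drop = int(len(blocks) * drop_fraction)
    # Block indices sorted by text length, smallest first.
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    dropped = set(by_size[:n_drop])
    return "\n".join(b for i, b in enumerate(blocks) if i not in dropped)
```

The right drop_fraction depends on the site template, so it would need tuning
per intranet; dropping by a fixed number of blocks instead of a percentage is
the other variant mentioned above.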

Alternatively, for any given divisional tag, you might measure the amount of 
anchor text versus non-anchor text. If a table/div/... contains mostly anchor 
text (and all anchor texts consist of a couple of words), you can assume that 
it is a menu and not relevant content.
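
A rough sketch of this anchor-text heuristic, again in plain Python with
html.parser rather than Nutch code. MenuDetector and classify_blocks are
hypothetical names, the thresholds (60% anchor words, at most 3 words per
anchor) are arbitrary assumptions, and only top-level divs are examined to
keep the example short.

```python
from html.parser import HTMLParser


class MenuDetector(HTMLParser):
    """Hypothetical helper: flags top-level divs that consist mostly of short links."""

    def __init__(self, min_density=0.6, max_anchor_words=3):
        super().__init__()
        self.min_density = min_density            # assumed threshold
        self.max_anchor_words = max_anchor_words  # assumed "a couple of words"
        self.div_depth = 0
        self.results = []   # (block text, looks_like_menu) per top-level div
        self._reset()

    def _reset(self):
        self.words = []           # all words in the current block
        self.anchor_lengths = []  # word count of each anchor in the block
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1
        elif tag == "a" and self.div_depth:
            self.in_anchor = True
            self.anchor_lengths.append(0)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        elif tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:
                self._finish()

    def handle_data(self, data):
        if not self.div_depth:
            return
        words = data.split()
        self.words.extend(words)
        if self.in_anchor:
            self.anchor_lengths[-1] += len(words)

    def _finish(self):
        total = len(self.words)
        density = sum(self.anchor_lengths) / total if total else 0.0
        is_menu = density >= self.min_density and all(
            n <= self.max_anchor_words for n in self.anchor_lengths
        )
        self.results.append((" ".join(self.words), is_menu))
        self._reset()


def classify_blocks(html):
    """Return (text, looks_like_menu) for every top-level div in the page."""
    detector = MenuDetector()
    detector.feed(html)
    detector.close()
    return detector.results
```

Blocks flagged as menus would then be left out of the indexed content, while
a paragraph that merely contains a link or two stays below the density
threshold and is kept.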

>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

--
Doğacan Güney

