On 6/27/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Naess, Ronny wrote:
> > Thanks, Ann.
> >
> > You gave me some good pointers.
> >
> > I see that the navigation menu is giving me all the trouble with
> > ranking. Does somebody know a way to make the parser skip some content?
> > I would like the parser to skip the global header and navigation menu so
> > the content contains only the unique stuff, not everything. Guess this is
> > not a simple thing.
>
>
> No, it's not. Do a Google search for "template detection".
>
> A crude approach, which still might be sufficient in your case, is to do
> the following:
>
> * remove all font/color/style formatting elements, and coalesce their
> text children with their parents. E.g.
>
>         this is <span style="abc">a text</span>
>         <b>with bold</b> fragment
>
> becomes:
>         this is a text with bold fragment
>
> * do the same with all non-divisional (structural) tags, i.e. any
> formatting tags except for div-s, tables and iframe-s.
>
> * sort the remaining text blocks by size
>
> * drop a certain number (or percentage) of the smallest of the text blocks.
>
> * put the blocks back in order, and extract only their text content.
> This is the "main body" text.
>
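
The steps quoted above could be sketched roughly as follows. This is only a
minimal illustration in Python using the standard library's html.parser, not
Nutch code. The names BlockCollector, main_body and drop_fraction are
hypothetical, and the exact set of "divisional" tags (table rows and cells
are included here) is an assumption.

```python
from html.parser import HTMLParser

# Assumption: tags treated as block boundaries. The post names div, table
# and iframe; table rows and cells are added here so cells become blocks.
DIVISIONAL = {"div", "table", "tr", "td", "th", "iframe"}
SKIPPED = {"script", "style"}  # their text is never page content


class BlockCollector(HTMLParser):
    """Collects text blocks, coalescing inline markup into its parent."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in DIVISIONAL:
            self._flush()
        # Any other tag (span, b, font, ...) is ignored, so its text
        # simply joins the surrounding block.

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in DIVISIONAL:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def main_body(html, drop_fraction=0.5):
    """Drop the smallest blocks, then return the rest in document order."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = collector.blocks
    n_drop = int(len(blocks) * drop_fraction)
    # Block indices sorted by text length, smallest first.
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    keep = sorted(by_size[n_drop:])  # back to original order
    return "\n".join(blocks[i] for i in keep)
```

How many blocks to drop depends on the site's templates, so drop_fraction
would need tuning per site.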

Alternatively, for any given divisional tag, you might measure the
amount of anchor text versus non-anchor text. If a table/div/...
contains mostly anchor text (and all anchor texts consist of a couple
of words), you can assume that it is a menu and not relevant content.
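
A rough sketch of that check, again in plain Python with html.parser rather
than anything Nutch-specific. The names LinkDensityCollector, is_menu and
content_blocks are made up for illustration, and the thresholds min_ratio
and max_words are guesses that would need tuning. For simplicity it treats
the text between divisional tags as flat blocks instead of walking nested
elements.

```python
from html.parser import HTMLParser

# Assumption: same set of block-boundary tags as a simple block splitter.
DIVISIONAL = {"div", "table", "tr", "td", "th", "iframe"}


class LinkDensityCollector(HTMLParser):
    """Collects (block_text, [anchor_texts]) pairs per text block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._text = []
        self._anchors = []
        self._current_anchor = None  # list of pieces while inside <a>

    def _end_anchor(self):
        if self._current_anchor is not None:
            anchor = " ".join("".join(self._current_anchor).split())
            if anchor:
                self._anchors.append(anchor)
            self._current_anchor = None

    def _flush(self):
        self._end_anchor()
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._anchors))
        self._text, self._anchors = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._end_anchor()  # tolerate unclosed anchors
            self._current_anchor = []
        elif tag in DIVISIONAL:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._end_anchor()
        elif tag in DIVISIONAL:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._current_anchor is not None:
            self._current_anchor.append(data)

    def close(self):
        super().close()
        self._flush()


def is_menu(text, anchors, min_ratio=0.6, max_words=3):
    """True if the block is mostly anchor text and every anchor is short."""
    anchor_chars = sum(len(a) for a in anchors)
    ratio = anchor_chars / len(text)
    return ratio >= min_ratio and all(
        len(a.split()) <= max_words for a in anchors
    )


def content_blocks(html):
    """Return the text of all blocks that do not look like menus."""
    collector = LinkDensityCollector()
    collector.feed(html)
    collector.close()
    return [t for t, anchors in collector.blocks if not is_menu(t, anchors)]
```

A block of body text that merely contains one or two links stays well below
the ratio threshold, so it is kept.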

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general