Naess, Ronny wrote:
> Thanks, Ann.
> 
> You gave me some good pointers.
> 
> I see that the navigation menu is giving me all the trouble with
> ranking. Does somebody know a way to make the parser skip some content?
> I would like the parser to skip global header and navigation menu so the
> content contains the unique stuff not everything. Guess this is not a
> simple thing.


No, it's not. Do a Google search for "template detection".

A crude approach, which still might be sufficient in your case, is to do 
the following:

* remove all font/color/style formatting elements, and coalesce their 
text children with their parents. E.g.

        this is <span style="abc">a text</span>
        <b>with bold</b> fragment

becomes:
        this is a text with bold fragment

* do the same with all tags that are not divisional (structural), i.e. any 
tags except for div-s, tables and iframe-s.

* sort the remaining text blocks by size

* drop a certain number (or percentage) of the smallest of the text blocks.

* put the blocks back in order, and extract only their text content. 
This is the "main body" text.
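
The steps above could be sketched roughly as follows. This is only an 
illustration using Python's standard html.parser, not something Nutch 
provides: the names BlockCollector, main_body and drop_fraction are made 
up for this example, and skipping script/style content is an extra 
assumption that is not part of the steps listed.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; every other tag is flattened into text.
BLOCK_TAGS = {"div", "table", "iframe"}
# Assumption beyond the listed steps: script/style content is never body text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks, coalescing inline tags into their parent block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join the buffered fragments and normalize whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def main_body(html, drop_fraction=0.5):
    """Drop the smallest blocks and return the rest in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    # Indices of the blocks, ordered by text length (smallest first).
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    n_drop = int(len(blocks) * drop_fraction)
    keep = sorted(by_size[n_drop:])  # put the survivors back in order
    return "\n".join(blocks[i] for i in keep)


page = (
    '<div>Home | About</div>'
    '<div>this is <span style="abc">a text</span>\n'
    '<b>with bold</b> fragment</div>'
    '<div>(c) 2007</div>'
)
print(main_body(page, drop_fraction=0.7))
```

The right value for drop_fraction depends heavily on the site, so it 
would need tuning against a few sample pages.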


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general