Not sure about Google and others... Yes, my suggestion was not to extract
just the full plain text (which includes OPTION groups, repeated menus,
footers, headers, ...)
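
To illustrate the idea of skipping repeated menus, footers and headers, here is a minimal sketch in Python (not Nutch code). It assumes the text of several pages from one site is already available as a list of strings, and it drops every line that shows up on more than half of them. The function name `drop_repeated_lines` and the `max_share` threshold are hypothetical, chosen only for this example.

```python
from collections import Counter

def drop_repeated_lines(pages, max_share=0.5):
    """Remove text lines that occur on more than max_share of the pages.

    pages: list of page texts from one site (assumed input format).
    """
    # Count each distinct line once per page, so a menu item repeated
    # inside a single page does not inflate its count.
    seen_on = Counter()
    for text in pages:
        seen_on.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    limit = max_share * len(pages)
    cleaned = []
    for text in pages:
        kept = [ln.strip() for ln in text.splitlines()
                if ln.strip() and seen_on[ln.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Home\nAbout\nNutch parses HTML.\nCopyright 2006",
    "Home\nAbout\nLucene builds the index.\nCopyright 2006",
    "Home\nAbout\nPlugins extend the parser.\nCopyright 2006",
]
for page in drop_repeated_lines(pages):
    print(page)
```

A real crawler would probably work on tokens or sentences rather than whole lines, and would compute the statistics per site; the line-based version only shows the principle.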

HTML is not XML, and putting Creative Commons XML inside HTML comments ;-)
is a very good example!
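
As a rough sketch of how such commented-out license metadata could be pulled from raw HTML without building a DOM, the Python below scans comments with a regular expression. The sample markup, the `license_urls` name and the patterns are assumptions made for this example; license blocks found in the wild may be laid out differently.

```python
import re

# Any HTML comment, across line breaks.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# A license element pointing at a license URL (assumed attribute layout).
LICENSE = re.compile(r'<license\s+rdf:resource="([^"]+)"')

def license_urls(html):
    """Return license URLs found in RDF blocks hidden inside HTML comments."""
    urls = []
    for body in COMMENT.findall(html):
        if "<rdf:RDF" in body:
            urls.extend(LICENSE.findall(body))
    return urls

sample = """<html><body><p>Some text</p>
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.5/" />
  </Work>
</rdf:RDF>
-->
</body></html>"""

print(license_urls(sample))
```

The comment is never treated as part of a document tree here, which is the point: the RDF lives outside the HTML structure anyway.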

And somebody suggests constructing a DOM! What for? To construct IE's
buttons and dropdowns, and to format text for presentation?

Another suggests using XSLT to possibly find a Creatively commented DOM
element.



-----Original Message-----
From: yoursoft [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 25, 2006 1:23 PM
To: [email protected]
Subject: Re: Nutch Improvement - HTML Parser


Google and other big search engines do not extract only plain text.
E.g., when you search Google for 'anything', Google ranks higher the
pages where 'anything' appears inside <h1..6></h1..6> or <b></b>.
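
A minimal sketch of this kind of tag-aware weighting, using Python's standard `html.parser`. The weights are invented for illustration (search engines do not publish theirs), and `score_terms` is a hypothetical helper, not part of Nutch.

```python
from html.parser import HTMLParser

# Hypothetical weights; plain text outside these tags counts 1.0.
WEIGHTS = {"h1": 5.0, "h2": 4.0, "h3": 3.0,
           "h4": 2.5, "h5": 2.0, "h6": 2.0, "b": 1.5}

class WeightedTerms(HTMLParser):
    """Sum a per-word score, boosted when the word sits in h1..h6 or b."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # weighted tags currently open
        self.scores = {}

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost open tag with this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        weight = max((WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in data.lower().split():
            self.scores[word] = self.scores.get(word, 0.0) + weight

def score_terms(html):
    parser = WeightedTerms()
    parser.feed(html)
    parser.close()   # flush any buffered text
    return parser.scores

scores = score_terms("<h1>Nutch parser</h1><p>The <b>parser</b> reads text</p>")
print(scores["parser"], scores["text"])
```

Note that this uses an event-style tokenizer, not a DOM: only the stack of currently open weighted tags is kept in memory.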


Fuad Efendi wrote:
> But we do not need 'better parsing of malformed html', we need only to
> extract plain text... Yes, meta-information such as Creative Commons
> embedded XML in HTML comments is important too, and the plugin technique
> does the job very well.
>
> I am only trying to focus on a specific task, such as removal of repeated
> tokens (menu items, options, ...), automatic web-tree building using
> anchors and some statistics, calculating rank for repeated tokens and
> indexing only specific sentences with low rank. I simply ignore DOM/SAX,
> I don't need it.
>
>
> -----Original Message-----
> From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, February 25, 2006 4:05 AM
> To: [email protected]
> Subject: Re: Nutch Improvement - HTML Parser
>
>
>   
>> It's not a tool,
>> IT IS stupidness of Nutch, it uses DOM just to extract plain text and
>> Outlink[]...
>> It's very easy to design specific routine to 'parse' byte[], we can
>> improve everything 100 times... At Least!
>>     
>
> Yes sure. I think everybody has already done such things at school...
> Building a DOM provides:
> 1. better parsing of malformed html documents (and there are a lot of
> malformed docs on the web)
> 2. the ability to extract meta-information such as creative commons
> license
> 3. a high degree of extensibility (HtmlParser extension point) to
> extract some specific information without parsing the document many times
> (for instance extracting technorati like tags, ...) by just providing a
> simple plugin.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>
>
>   





_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
