Re: Parsing extra fields from an html page in the web. ....

Marcin Okraszewski Thu, 27 Sep 2007 12:30:03 -0700

In brief: you need to write an HtmlParseFilter, then an IndexingFilter and a 
QueryFilter, and register them through extension points. Search the USER (not 
dev) list; there are answers there already.
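
To make the division of labour concrete, here is a rough sketch of what each of the three pieces is responsible for. It is plain, self-contained Java and deliberately does NOT use the Nutch API: the class ExtraFieldsSketch, its method names, the "unknown" default and the choice to read <meta name="..."> tags are all assumptions made for illustration. In a real plugin the same logic would live in classes implementing the Nutch filter interfaces and be declared in the plugin's plugin.xml.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the three filter roles; not Nutch code.
final class ExtraFieldsSketch {

    // Role of the parse filter: pull extra values out of the page and
    // fall back to a default when the page does not carry them.
    static Map<String, String> parseExtraFields(String html) {
        Map<String, String> meta = new HashMap<>();
        meta.put("author", metaContent(html, "author", "unknown"));
        meta.put("language", metaContent(html, "language", "unknown"));
        return meta;
    }

    // Assumption: the values sit in <meta name="..." content="..."> tags.
    static String metaContent(String html, String name, String fallback) {
        Pattern p = Pattern.compile(
            "<meta\\s+name=[\"']" + Pattern.quote(name)
                + "[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    // Role of the indexing filter: copy the parse metadata into the
    // document that gets indexed, one index field per extra value.
    static Map<String, String> toIndexDocument(String url,
                                               Map<String, String> parseMeta) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        doc.put("author", parseMeta.get("author"));
        doc.put("language", parseMeta.get("language"));
        return doc;
    }

    // Role of the query filter: map a "field:value" query term onto the
    // matching index field; bare terms go to a default "content" field.
    static String[] toFieldClause(String rawTerm) {
        int colon = rawTerm.indexOf(':');
        if (colon <= 0) {
            return new String[] {"content", rawTerm};
        }
        return new String[] {rawTerm.substring(0, colon),
                             rawTerm.substring(colon + 1)};
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"author\" content=\"A. Writer\">"
            + "</head><body>text</body></html>";
        Map<String, String> meta = parseExtraFields(html);
        Map<String, String> doc = toIndexDocument("http://example.com/", meta);
        String[] clause = toFieldClause("author:writer");
        System.out.println(doc);
        System.out.println(clause[0] + " -> " + clause[1]);
    }
}
```

The point of the split is that each stage only sees the output of the previous one: a field that is parsed but never copied into the index cannot be searched, and a field that is indexed but not handled by a query filter cannot be addressed as "author:..." in a query.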


BTW, this question is asked over and over. It seems to be a good subject to 
write up on the wiki.

Marcin

> Hi,
> We are working on an Indian-language search engine and are using
> nutch-0.9 as the basic framework.
> 
> However when the html pages are parsed during the fetching phase, the
> htmlParser which runs on the page extracts the title text and metatags and
> the outlinks.
> What do I need to do if I need to add more fields, like <author>,
> <language>, <script>, to the segments extracted from the web page? In case
> the data is unavailable in the page, we can load some default values.
> 
> Do I need to touch the actual parser code (the parser used here is a neko-html
> parser, if I am not wrong), or can the additions be done from within the
> nutch code?
> 
> It would be of great help if you could guide me through this.
> 
> -- 
> Pratyush Banerjee
> SPO, CLIA
> IIT Kharagpur
