Re: Separately indexing headings of the content

Markus Jelsma Mon, 12 Sep 2011 02:22:39 -0700

https://issues.apache.org/jira/browse/NUTCH-1005


> Hi,
> 
> Since I'm relatively new to Nutch/Solr, I was wondering if the following
> would make sense:
> 
> Headings in web pages (h1, h2, h3) should be more important than any
> other content of the page, so if a query matches a heading, the
> document should rank higher. In order to boost a field, I would need
> to index it separately. This would mean that, when parsing the crawled
> pages, I would need to extract the h1, h2 and h3 headings, index them
> in separate fields, and remove them from the content field. I presume
> I would have to modify the HTML Parser and Index Basic plugins for
> this, or is there an easier solution?
> 
> Any input appreciated,
> Elisabeth
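
For illustration, the splitting step described in the quoted question can be
sketched outside Nutch. This is only a rough sketch using the Python standard
library, not the approach taken in NUTCH-1005 and not Nutch plugin code. The
field names h1, h2, h3 and content, and the helper split_headings, are
assumptions made up for this example.

```python
from html.parser import HTMLParser

# Heading levels to pull out into their own fields (assumed field names).
HEADING_TAGS = ("h1", "h2", "h3")


class HeadingSplitter(HTMLParser):
    """Collects heading text per level and keeps all other text as content."""

    def __init__(self):
        super().__init__()
        self.fields = {tag: [] for tag in HEADING_TAGS}
        self.content = []
        self._current = None  # heading tag currently open, if any
        self._buffer = []     # text pieces seen inside the open heading

    def handle_starttag(self, tag, attrs):
        # Only track the outermost heading; nested inline tags are ignored.
        if tag in HEADING_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fields[tag].append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)   # heading text goes to its own field
        else:
            self.content.append(data)   # everything else stays in content


def split_headings(html):
    """Return a dict with one list per heading level plus a content string."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    doc = dict(parser.fields)
    doc["content"] = " ".join(" ".join(parser.content).split())
    return doc


if __name__ == "__main__":
    page = "<h1>Nutch <em>Guide</em></h1><p>Intro text.</p><h2>Setup</h2>"
    print(split_headings(page))
```

This sketch does not skip script or style text, so a real parser would need
more filtering. With the headings in separate fields, the boost itself could
then be applied at query time, for example through the qf parameter of Solr's
DisMax query parser, assuming the schema defines fields with these names.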
