Hi Nick,

Yes, it sounds like you need either custom Nutch parsing code or a custom HTML 
parser that implements the logic you described and feeds Solr with docs 
constructed based on that logic.
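
As a rough illustration of the custom-HTML-parser route (not Nutch plugin code), 
here is a minimal Python sketch. It assumes the shared navigation divs can be 
recognized by their id attribute; the ids "nav-top" and "nav-side" and the Solr 
field names "id" and "content" are hypothetical and would have to match your 
pages and your schema. It drops everything inside those divs and wraps the 
remaining text in a Solr XML <add> message.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical ids of the common navigation divs; adjust to the real pages.
SKIP_IDS = {"nav-top", "nav-side"}


class ContentExtractor(HTMLParser):
    """Collects page text, ignoring everything inside the skipped divs."""

    def __init__(self, skip_ids):
        super().__init__()
        self.skip_ids = skip_ids
        self.skip_depth = 0  # > 0 while we are inside a skipped div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested div inside a skipped div
        elif dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1   # entering a skipped div

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skip_ids=SKIP_IDS):
    """Return the page text with the skipped divs removed."""
    parser = ContentExtractor(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def solr_add_xml(doc_id, text):
    """Build a Solr <add> message; field names are assumed, not from a real schema."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("content", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

The resulting XML string would then be POSTed to Solr's update handler. The 
sketch only tracks div nesting and does not filter script or style content, so 
treat it as a starting point rather than a finished parser.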

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Nick Tkach <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, August 13, 2008 12:44:58 PM
> Subject: Indexing Only Parts of HTML Pages
> 
> I'm wondering, is there some way ("out of the box") to tell Solr that 
> we're only interested in indexing certain parts of a page?  For example, 
> let's say I have a bunch of pages in my site that contain some common 
> navigation elements, roughly like this:
> 
> 
>   <div>
>      Stuff here about parts of my site
>   </div>
>   <div>
>      More stuff about other parts of the site
>   </div>
>   ....A bunch of stuff particular to each individual page...
> 
> 
> Is there some way to either tell Solr to not index what's in the two 
> divs whenever it encounters them (and it will, in nearly every page) or, 
> failing that, to somehow easily give content in those areas a large 
> negative score in order to get the same effect?
> 
> FWIW, we are using Nutch to do the crawling, but as I understand it 
> there's no way to get Nutch to skip only parts of pages without writing 
> custom code, right?
