Re: Indexing Only Parts of HTML Pages

Otis Gospodnetic Fri, 15 Aug 2008 08:52:24 -0700

Hi Nick,

Yes, it sounds like you need either custom Nutch parsing code or a custom HTML 
parser that implements the logic you described and feeds Solr with docs 
constructed based on that logic.
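
As a rough illustration of the second option (a standalone HTML parser run before documents reach Solr), here is a minimal sketch using only Python's standard library. The div ids `nav-top` and `nav-side` and the field names `id` and `content` are assumptions made for the example; the original message does not name them, and a real schema may differ. It tracks only div nesting, so it relies on the div tags themselves being balanced.

```python
from html.parser import HTMLParser

# Hypothetical ids of the shared navigation divs (not given in the original post).
SKIP_IDS = {"nav-top", "nav-side"}


class ContentExtractor(HTMLParser):
    """Collects page text while ignoring everything inside the skipped divs."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested div inside a skipped region
        elif dict(attrs).get("id") in SKIP_IDS:
            self.skip_depth = 1  # entering a skipped region

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_doc(url, html):
    """Builds a dict for one page; the field names are assumed, not Solr defaults."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {"id": url, "content": " ".join(parser.chunks)}
```

The resulting dict would then be sent to Solr through whatever update mechanism the installation already uses. The same filtering logic could instead live in a custom Nutch parse plugin, which is the first option above.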


 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Nick Tkach <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, August 13, 2008 12:44:58 PM
> Subject: Indexing Only Parts of HTML Pages
> 
> I'm wondering, is there some way ("out of the box") to tell Solr that 
> we're only interested in indexing certain parts of a page?  For example, 
> let's say I have a bunch of pages in my site that contain some common 
> navigation elements, roughly like this:
> 
> <html>
>   <body>
>     <div>
>        Stuff here about parts of my site
>     </div>
>     <div>
>        More stuff about other parts of the site
>     </div>
>      ....A bunch of stuff particular to each individual page...
>   </body>
> </html>
> 
> Is there some way to either tell Solr to not index what's in the two 
> divs whenever it encounters them (and it will, in nearly every page) or, 
> failing that, to somehow easily give content in those areas a large 
> negative score in order to get the same effect?
> 
> FWIW, we are using Nutch to do the crawling, but as I understand it 
> there's no way to get Nutch to skip only parts of pages without writing 
> custom code, right?
