Hi Nick,

Yes, it sounds like you need either custom Nutch parsing code or a custom HTML parser that implements the logic you described and feeds Solr with documents constructed from the filtered content.
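
As a rough illustration of the custom HTML parser route, here is a minimal Python sketch that drops the text inside selected divs before a document would be sent to Solr. It uses only the standard library `html.parser` module. The div ids `nav-top` and `nav-side` are placeholders, not the ids from Nick's pages, and `ContentExtractor` and `extract_text` are hypothetical names chosen for this example. Posting the extracted text to Solr is left out.

```python
from html.parser import HTMLParser

# Placeholder ids for the navigation divs (assumption, adjust to the real markup).
SKIP_IDS = {"nav-top", "nav-side"}


class ContentExtractor(HTMLParser):
    """Collects page text, ignoring everything inside <div> elements
    whose id is listed in skip_ids."""

    def __init__(self, skip_ids):
        super().__init__()
        self.skip_ids = skip_ids
        self.skip_depth = 0  # > 0 while inside a skipped div; counts nested divs
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            # Nested div inside a skipped region: track it so the matching
            # close tag does not end the skipped region too early.
            self.skip_depth += 1
        elif dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, skip_ids=SKIP_IDS):
    """Return the text of `html` with the skipped divs removed."""
    parser = ContentExtractor(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<div id="nav-top">Stuff here about parts of my site</div>'
        '<div id="nav-side">More stuff about other parts of the site</div>'
        '<p>A bunch of stuff particular to each individual page</p>'
        '</body></html>'
    )
    print(extract_text(page))
```

The same idea would carry over to a Nutch parse plugin written in Java: remove the unwanted nodes from the parsed page before the text is extracted, so only the page-specific content ends up in the index.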

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Nick Tkach <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, August 13, 2008 12:44:58 PM
> Subject: Indexing Only Parts of HTML Pages
>
> I'm wondering, is there some way ("out of the box") to tell Solr that
> we're only interested in indexing certain parts of a page?  For example,
> let's say I have a bunch of pages in my site that contain some common
> navigation elements, roughly like this:
>
>
>
>
>
> Stuff here about parts of my site
>
>
> More stuff about other parts of the site
>
> ....A bunch of stuff particular to each individual page...
>
>
>
> Is there some way to either tell Solr to not index what's in the two
> divs whenever it encounters them (and it will-in nearly every page) or,
> failing that, to somehow easily give content in those areas a large
> negative score in order to get the same effect?
>
> FWIW, we are using Nutch to do the crawling, but as I understand it
> there's no way to get Nutch to skip only parts of pages without writing
> custom code, right?