Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

Israel Ekpo Tue, 19 Oct 2010 02:14:28 -0700

Thanks Otis and Markus for your input.

I will check it out today.


On Tue, Oct 19, 2010 at 4:45 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Unfortunately, Nutch still uses Tika 0.7 in 1.2 and trunk. Nutch needs to be
> upgraded to Tika 0.8 (when it's released, or just to the current trunk). Also,
> the Boilerpipe API needs to be exposed through Nutch configuration: which
> extractor to use, which parameters to set, etc.
>
> Upgrading to Tika's trunk might be relatively easy but exposing Boilerpipe
> surely isn't.
>
> On Tuesday, October 19, 2010 06:47:43 am Otis Gospodnetic wrote:
> > Hi Israel,
> >
> > You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
> > Not sure if it's built into Nutch, though...
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > ----- Original Message ----
> >
> > > From: Israel Ekpo <israele...@gmail.com>
> > > To: solr-user@lucene.apache.org; u...@nutch.apache.org
> > > Sent: Mon, October 18, 2010 9:01:50 PM
> > > Subject: Removing Common Web Page Header and Footer from All Content
> > > Fetched by Nutch
> > >
> > > Hi All,
> > >
> > > I am indexing a web application with approximately 9500 distinct URLs
> > > and their contents using Nutch and Solr.
> > >
> > > I use Nutch to fetch the URLs and links and then crawl the entire web
> > > application to extract the content of all pages.
> > >
> > > Then I run the solrindex command to send the content to Solr.
> > >
> > > The problem that I have now is that the first 1000 or so characters of
> > > some pages and the last 400 characters of the pages are showing up in
> > > the search results.
> > >
> > > These are the contents of the common header and footer, respectively,
> > > used throughout the site.
> > >
> > > The only workaround that I have now is to index everything and then go
> > > through each document one at a time, removing the first 1000 characters
> > > if the Levenshtein distance between the first 1000 characters of the
> > > page and the common header is less than a certain value. The same applies
> > > to the footer content common to all pages.
> > >
> > > Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
> > > configuration based on Levenshtein distance or Jaro-Winkler distance, so
> > > that the parts of the fetched data that match these stop phrases
> > > will not be parsed?
> > >
> > > Any useful pointers would be highly appreciated.
> > >
> > > Thanks in advance.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/
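
For reference, the post-processing workaround described in the original
question (compare the start and end of each page against the known header and
footer, and cut them when the Levenshtein distance is below a threshold) could
be sketched roughly as follows. This is only an illustration, not anything
provided by Nutch or Solr: the function names, the sample header and footer
strings, and the threshold values are all assumptions made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_common_parts(text: str, header: str, footer: str,
                       max_distance: int) -> str:
    """Hypothetical helper: cut the leading/trailing span of `text` when it
    is within `max_distance` edits of the known header/footer."""
    if header and levenshtein(text[:len(header)], header) < max_distance:
        text = text[len(header):]
    if footer and levenshtein(text[-len(footer):], footer) < max_distance:
        text = text[:-len(footer)]
    return text.strip()


# Made-up sample data: the page header differs slightly from the template.
HEADER = "Welcome, guest | Home | Help"
FOOTER = "Copyright 2010 Example Site"
page = ("Welcome, admin | Home | Help"
        " Body text about widgets. "
        "Copyright 2010 Example Site")

print(strip_common_parts(page, HEADER, FOOTER, max_distance=10))
```

A fixed-length cut like this only works when the header and footer have
roughly constant length across pages, which is the assumption the original
question makes (about 1000 and 400 characters).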
