Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-19 Thread Israel Ekpo
> > ext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > - Original Message
> > > From: Israel Ekpo
> > > To: solr-user@lucene.apache.org; u...@nutch.apache.org
> > > Sent: Mon, Oc

Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-19 Thread Markus Jelsma
> > ; u...@nutch.apache.org
> > Sent: Mon, October 18, 2010 9:01:50 PM
> > Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch
> >
> > Hi All,
> >
> > I am indexing a web application with approximately 9

Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-18 Thread Otis Gospodnetic
> From: Israel Ekpo
> To: solr-user@lucene.apache.org; u...@nutch.apache.org
> Sent: Mon, October 18, 2010 9:01:50 PM
> Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch
>
> Hi All,
>
> I am indexing a web application with approxim

Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-18 Thread Israel Ekpo
Hi All, I am indexing a web application with approximately 9500 distinct URLs and their contents using Nutch and Solr. I use Nutch to fetch the URLs and links and then crawl the entire web application to extract the content of all pages. Then I run the solrindex command to send the content to Solr. T
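
For readers skimming this thread, here is a minimal sketch of one way to approach the problem named in the subject line. It is not taken from any of the messages above and does not use the Nutch or Solr APIs. It assumes the extracted text of each page is already available as a plain string, and that the shared header and footer appear as identical leading and trailing lines on every page. The function names are hypothetical.

```python
from typing import List


def common_prefix_len(pages: List[List[str]]) -> int:
    """Return how many leading lines are identical across every page."""
    count = 0
    for lines in zip(*pages):
        if len(set(lines)) != 1:
            break
        count += 1
    return count


def strip_common_header_footer(texts: List[str]) -> List[str]:
    """Drop the leading and trailing lines that all pages share.

    Assumption: the header and footer are line-for-line identical on
    every page. Pages with varying boilerplate need a fuzzier approach.
    """
    pages = [t.splitlines() for t in texts]
    if len(pages) < 2:
        return list(texts)  # nothing to compare against

    head = common_prefix_len(pages)
    # Remove the header first so header and footer cannot overlap.
    bodies = [p[head:] for p in pages]

    # The shared footer is the shared prefix of the reversed bodies.
    tail = common_prefix_len([b[::-1] for b in bodies])
    return ["\n".join(b[: len(b) - tail]) for b in bodies]


if __name__ == "__main__":
    sample = [
        "Site Nav\nHome | About\nPage one body\nCopyright 2010",
        "Site Nav\nHome | About\nPage two body\nmore\nCopyright 2010",
    ]
    for body in strip_common_header_footer(sample):
        print(repr(body))
```

A step like this would have to run on the page text before it is sent to Solr. Where it hooks into the crawl or indexing pipeline depends on the setup and is left open here.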