Thanks Otis and Markus for your input. I will check it out today.
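
For reference, the post-indexing workaround described in the original message quoted below can be sketched roughly as follows. This is only an illustration in Python, not Nutch or Solr code. The function names (levenshtein, strip_boilerplate), the max_ratio threshold of 0.2, and the sample header and footer strings are all assumptions made for the example. A real header of about 1000 characters would need a threshold tuned against actual pages.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_boilerplate(text: str, header: str, footer: str,
                      max_ratio: float = 0.2) -> str:
    """Remove the leading/trailing span when it is close to the known
    header/footer. max_ratio is a hypothetical tolerance: the largest
    allowed edit distance as a fraction of the reference length."""
    if header:
        head = text[:len(header)]
        if levenshtein(head, header) <= max_ratio * len(header):
            text = text[len(header):]
    if footer:
        tail = text[-len(footer):]
        if levenshtein(tail, footer) <= max_ratio * len(footer):
            text = text[:-len(footer)]
    return text.strip()


if __name__ == "__main__":
    # Hypothetical header/footer; the real ones come from the crawled site.
    header = "Home | About | Contact"
    footer = "Copyright 2010 Example Inc."
    page = "Home | About | Kontact" + " Actual page body. " + footer
    print(strip_boilerplate(page, header, footer))
```

The comparison uses a window exactly as long as the reference header or footer, so it tolerates small substitutions but not headers whose length varies much from page to page.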

On Tue, Oct 19, 2010 at 4:45 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Unfortunately, Nutch still uses Tika 0.7 in 1.2 and trunk. Nutch needs to be
> upgraded to Tika 0.8 (when it's released or just the current trunk). Also, the
> Boilerpipe API needs to be exposed through Nutch configuration, which extractor
> can be used, which parameters need to be set etc.
>
> Upgrading to Tika's trunk might be relatively easy but exposing Boilerpipe
> surely isn't.
>
> On Tuesday, October 19, 2010 06:47:43 am Otis Gospodnetic wrote:
> > Hi Israel,
> >
> > You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
> > Not sure if it's built into Nutch, though...
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > ----- Original Message ----
> >
> > > From: Israel Ekpo <israele...@gmail.com>
> > > To: solr-user@lucene.apache.org; u...@nutch.apache.org
> > > Sent: Mon, October 18, 2010 9:01:50 PM
> > > Subject: Removing Common Web Page Header and Footer from All Content
> > > Fetched by Nutch
> > >
> > > Hi All,
> > >
> > > I am indexing a web application with approximately 9500 distinct URL and
> > > contents using Nutch and Solr.
> > >
> > > I use Nutch to fetch the urls, links and the crawl the entire web
> > > application to extract all the content for all pages.
> > >
> > > Then I run the solrindex command to send the content to Solr.
> > >
> > > The problem that I have now is that the first 1000 or so characters of
> > > some pages and the last 400 characters of the pages are showing up in
> > > the search results.
> > >
> > > These are contents of the common header and footer used in the site
> > > respectively.
> > >
> > > The only work around that I have now is to index everything and then go
> > > through each document one at a time to remove the first 1000 characters
> > > if the levenshtein distance between the first 1000 characters of the
> > > page and the common header is less than a certain value. Same applies
> > > to the footer content common to all pages.
> > >
> > > Is there a way to ignore certain "stop phrase" so to speak in the Nutch
> > > configuration based on levenshtein distance or jaro winkler distance so
> > > that certain parts of the fetched data that matches this stop phrases
> > > will not be parsed?
> > >
> > > Any useful pointers would be highly appreciated.
> > >
> > > Thanks in advance.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350

--
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/