URLNormalizers can have a scope; see http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER. Using the indexer scope should help to normalise only at indexing time.
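
To make the rewrite discussed in the quoted thread concrete, here is a minimal sketch in Python of the kind of regex substitution involved. It is an illustration only: the function name rewrite_for_index and the exact pattern are assumptions made for this example, not part of Nutch. Nutch's RegexURLNormalizer takes Java regular expressions, where the backreferences would be written $1 and $2 rather than \1 and \2.

```python
import re

# Hypothetical pattern for the example site in the thread: group 1 captures
# the scheme and host, group 2 captures the "article=..." part of the path.
ARTICLE_URL = re.compile(r"^(https?://www\.example\.com)/(article=.*)$")


def rewrite_for_index(url: str) -> str:
    """Return the URL to store in the index; non-matching URLs pass through unchanged."""
    return ARTICLE_URL.sub(r"\1/kb#\2", url)


if __name__ == "__main__":
    print(rewrite_for_index("http://www.example.com/article=xxx"))
```

Trying a pattern outside Nutch like this is a cheap way to check it against sample URLs before it goes into a normalizer config that is used only at indexing time.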

On 22 April 2013 16:56, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> The 1.x indexer takes a -normalize parameter and there you can rewrite
> your URLs. Judging from your patterns, the RegexURLNormalizer should be
> sufficient. Make sure you use the config file containing that pattern only
> when indexing, otherwise they'll end up in the CrawlDB and segments. Use
> urlnormalizer.regex.file to specify the file, or pass patterns directly
> using urlnormalizer.regex.rules.
>
> Cheers,
> Markus
>
> -----Original message-----
> > From: Niels Boldt <nielsbo...@gmail.com>
> > Sent: Mon 22-Apr-2013 15:56
> > To: user@nutch.apache.org
> > Subject: rewriting urls that are index
> >
> > Hi,
> >
> > We are crawling a site using nutch 1.6 and indexing into solr.
> >
> > However, we need to rewrite the urls that are indexed in the
> > following way.
> >
> > For instance, nutch crawls a page http://www.example.com/article=xxx but
> > when moving data to the index we would like to use the url
> >
> > http://www.example.com/kb#article=xxx
> >
> > instead. So when we get data from solr it will show links to
> > http://www.example.com/kb#article=xxx instead
> > of http://www.example.com/article=xxx
> >
> > Is that possible to do by creating a plugin that extends the
> > UrlNormalizer, e.g.
> >
> > http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
> >
> > Or is it better to add a new indexed property that we use?
> >
> > Best Regards
> > Niels
> >

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble