Markus, Julien

Seems to be exactly what I was looking for.

Thanks
Niels

On Mon, Apr 22, 2013 at 6:19 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> URLNormalizers can have a scope, see
> http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER
> Should help to normalise only at indexing time.
>
> On 22 April 2013 16:56, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Hi,
> >
> > The 1.x indexer takes a -normalize parameter and there you can rewrite
> > your URLs. Judging from your patterns the RegexURLNormalizer should be
> > sufficient. Make sure you use the config file containing that pattern only
> > when indexing, otherwise they'll end up in the CrawlDB and segments. Use
> > urlnormalizer.regex.file to specify the file or pass patterns directly
> > using urlnormalizer.regex.rules.
> >
> > Cheers,
> > Markus
> >
> > -----Original message-----
> > > From: Niels Boldt <nielsbo...@gmail.com>
> > > Sent: Mon 22-Apr-2013 15:56
> > > To: user@nutch.apache.org
> > > Subject: rewriting urls that are index
> > >
> > > Hi,
> > >
> > > We are crawling a site using nutch 1.6 and indexing into solr.
> > >
> > > However, we need to rewrite the urls that are indexed in the following way
> > >
> > > For instance, nutch crawls a page http://www.example.com/article=xxx but
> > > when moving data to the index we would like to use the url
> > >
> > > http://www.example.com/kb#article=xxx
> > >
> > > Instead. So when we get data from solr it will show links to
> > > http://www.example.com/kb#article=xxx instead
> > > of http://www.example.com/article=xxx
> > >
> > > Is that possible to do by creating a plugin that extends the UrlNormalizer, eg
> > >
> > > http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
> > >
> > > Or is it better to add a new indexed property that we use.
> > >
> > > Best Regards
> > > Niels
> > >
> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4529722259
web: http://www.binaryconstructors.dk
mail: n...@binaryconstructors.dk
skype: nielsboldt
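
For reference, a minimal Java sketch of the rewrite asked about in the original message. It uses plain java.util.regex rather than any Nutch class, so it only illustrates the kind of pattern and substitution one might place in the rules file named by urlnormalizer.regex.file. The class name, the method name, and the exact regular expression are assumptions made for illustration and do not come from the thread.

```java
import java.util.regex.Pattern;

public class IndexUrlRewriteSketch {

    // Hypothetical pattern: captures the host part and the "article=..." part
    // of URLs such as http://www.example.com/article=xxx
    private static final Pattern ARTICLE_URL =
            Pattern.compile("^(https?://www\\.example\\.com)/(article=.*)$");

    // Rewrites matching URLs to the /kb#article=... form.
    // URLs that do not match the pattern are returned unchanged.
    public static String rewriteForIndex(String url) {
        return ARTICLE_URL.matcher(url).replaceFirst("$1/kb#$2");
    }

    public static void main(String[] args) {
        // An article URL is rewritten to the kb# form
        System.out.println(rewriteForIndex("http://www.example.com/article=xxx"));
        // Any other URL passes through untouched
        System.out.println(rewriteForIndex("http://www.example.com/about"));
    }
}
```

As Markus notes, a rule like this should only be active when indexing, otherwise the rewritten URLs would also end up in the CrawlDB and segments.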