URLNormalizers can have a scope, see
http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER
Using that scope should help to apply the normalisation only at indexing time.
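
To make the rewrite asked about below concrete, here is a minimal standalone Java sketch of the kind of regex substitution that a rule for the RegexURLNormalizer would express. It deliberately does not use the Nutch API: the class name, method name and pattern are assumptions made up for illustration, and in Nutch the pattern and substitution would live in the normalizer's rules file rather than in code.

```java
import java.util.regex.Pattern;

// Standalone illustration only; not part of Nutch.
class IndexUrlRewriteSketch {

    // Hypothetical pattern for the crawled form http://www.example.com/article=xxx
    // Group 1 is the site prefix, group 2 is the "article=..." part.
    private static final Pattern CRAWLED =
            Pattern.compile("^(http://www\\.example\\.com/)(article=.*)$");

    // Returns the URL to store in the index. URLs that do not match
    // the pattern are returned unchanged.
    static String rewrite(String url) {
        return CRAWLED.matcher(url).replaceFirst("$1kb#$2");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/kb#article=xxx
        System.out.println(rewrite("http://www.example.com/article=xxx"));
    }
}
```

If a rule like this is only active for the indexing step, the CrawlDB and segments keep the crawled form of the URL and only the indexed documents carry the rewritten one.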


On 22 April 2013 16:56, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> The 1.x indexer takes a -normalize parameter and there you can rewrite
> your URLs. Judging from your patterns the RegexURLNormalizer should be
> sufficient. Make sure you use the config file containing that pattern only
> when indexing, otherwise the rewritten URLs will end up in the CrawlDB and
> segments. Use urlnormalizer.regex.file to specify the file or pass patterns
> directly using urlnormalizer.regex.rules.
>
> Cheers,
> Markus
>
>
> -----Original message-----
> > From:Niels Boldt <nielsbo...@gmail.com>
> > Sent: Mon 22-Apr-2013 15:56
> > To: user@nutch.apache.org
> > Subject: rewriting urls that are index
> >
> > Hi,
> >
> > We are crawling a site using nutch 1.6 and indexing into solr.
> >
> > However, we need to rewrite the urls that are indexed in the following
> > way.
> >
> > For instance, nutch crawls a page http://www.example.com/article=xxx but
> > when moving data to the index we would like to use the url
> >
> > http://www.example.com/kb#article=xxx
> >
> > instead. So when we get data from solr it will show links to
> > http://www.example.com/kb#article=xxx instead
> > of http://www.example.com/article=xxx
> >
> > Is it possible to do that by creating a plugin that extends the
> > URLNormalizer, e.g.
> >
> > http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
> >
> > Or is it better to add a new indexed property that we use?
> >
> > Best Regards
> > Niels
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
