Markus, Julien

Seems to be exactly what I was looking for.
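
For the archive, here is a rough sketch of the rewrite discussed in the quoted messages below, tried out with plain java.util.regex before putting an equivalent pattern/substitution pair into the normalizer config. The class name, method name and the exact pattern are assumptions made up for illustration. This is not Nutch API code, and the pattern may need adjusting for the real site.

```java
import java.util.regex.Pattern;

// Hypothetical helper, only for trying out the regex outside of Nutch.
public class IndexUrlRewrite {

    // Assumed pattern: captures the host part and the "article=..." part.
    private static final Pattern ARTICLE =
        Pattern.compile("^(http://www\\.example\\.com)/(article=.*)$");

    // Moves the article part behind "/kb#"; other URLs are returned unchanged.
    static String rewrite(String url) {
        return ARTICLE.matcher(url).replaceFirst("$1/kb#$2");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://www.example.com/article=xxx"));
    }
}
```

If a rule like this is only configured for the indexing step, as suggested below, the CrawlDB and segments should keep the original URLs.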

Thanks
Niels


On Mon, Apr 22, 2013 at 6:19 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> URLNormalizers can have a scope; see
>
> http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER
>
> That should help to normalise only at indexing time.
>
>
> On 22 April 2013 16:56, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Hi,
> >
> > The 1.x indexer takes a -normalize parameter, and there you can rewrite
> > your URLs. Judging from your patterns, the RegexURLNormalizer should be
> > sufficient. Make sure you use the config file containing that pattern only
> > when indexing; otherwise the rewritten URLs will end up in the CrawlDB and
> > segments. Use urlnormalizer.regex.file to specify the file, or pass
> > patterns directly using urlnormalizer.regex.rules.
> >
> > Cheers,
> > Markus
> >
> >
> > -----Original message-----
> > > From:Niels Boldt <nielsbo...@gmail.com>
> > > Sent: Mon 22-Apr-2013 15:56
> > > To: user@nutch.apache.org
> > > Subject: rewriting urls that are index
> > >
> > > Hi,
> > >
> > > We are crawling a site using nutch 1.6 and indexing into solr.
> > >
> > > However, we need to rewrite the URLs that are indexed, in the following
> > > way.
> > >
> > > For instance, nutch crawls a page http://www.example.com/article=xxx, but
> > > when moving data to the index we would like to use the URL
> > >
> > > http://www.example.com/kb#article=xxx
> > >
> > > instead. So when we get data from solr, it will show links to
> > > http://www.example.com/kb#article=xxx instead of
> > > http://www.example.com/article=xxx
> > >
> > > Is it possible to do that by creating a plugin that extends the
> > > URLNormalizer, e.g.
> > >
> > > http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
> > >
> > > Or is it better to add a new indexed property that we use?
> > >
> > > Best Regards
> > > Niels
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4529722259
web: http://www.binaryconstructors.dk
mail: n...@binaryconstructors.dk
skype: nielsboldt
