Excellent.

Thank you.

Per.


Marcin Okraszewski skrev:
There is a regex-normalize.xml file in the conf dir, which lets you rewrite URLs (e.g. 
remove everything after '#'). Remember to have urlnormalizer-regex in the 
plugin.includes property (nutch-site.xml).
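As a sketch, a rule along these lines in conf/regex-normalize.xml would strip the fragment; this assumes the stock regex-normalize.xml format (a <regex-normalize> root holding <regex> entries), so check the file that ships with your checkout, which may already contain a similar "remove anchors" rule:

```
<!-- conf/regex-normalize.xml: drop everything from '#' to the end of the URL,
     so http://host/page.html#section normalizes to http://host/page.html -->
<regex>
  <pattern>#.*</pattern>
  <substitution></substitution>
</regex>
```

With this in place, links that differ only in their anchor collapse to a single URL, so the page should be fetched once instead of twenty times.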

Marcin


On 26 January 2008 at 9:36, Prafulla <[EMAIL PROTECTED]> wrote:

Hi,

The crawl-urlfilter.txt file in the conf directory can be used to supply regular
expressions that control which URLs are crawled. However, that will only let you
skip URLs containing '#' entirely. I don't think you can ask the crawler to
discard just the part of the URL after the hash sign by configuring properties;
you may have to write some code to achieve that.
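For completeness, a filter rule of this shape would skip fragment-bearing URLs outright (note it drops the whole URL rather than trimming the fragment); this is a sketch assuming the usual crawl-urlfilter.txt syntax of '-' (reject) or '+' (accept) followed by a regex, similar to the stock rule that skips URLs containing characters like '?' or '=':

```
# conf/crawl-urlfilter.txt
# skip any URL that contains a '#' (rejects the entire URL)
-[#]
```

Rules are applied top to bottom, so this line needs to come before any catch-all '+' accept rule.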

Regards,
Prafulla

On Jan 26, 2008 1:41 PM, Per Andreas Buer  wrote:

Hi.

I'm indexing an intranet and I see that some pages are fetched twenty times.
Anchors are used heavily, so there are a lot of links like the ones in the
subject.

Is there some way I can instruct the crawler to discard the part of the
url which is after the hash sign? I'm using a Nutch trunk build from a few
months back.

TIA,


Per.
