Hi Remi,

Thank you so much for your reply. We have decided not to take any further
action on this matter, as it is no longer necessary.
Still, I would like to thank you for your time!

Kind regards,
Roberto Gardenier


-----Original message-----
From: remi tassing [mailto:tassingr...@gmail.com]
Sent: Wednesday, 2 May 2012 2:21
To: user@nutch.apache.org
Subject: Re: Crawl sites with hashtags in url

Hi Roberto,

If you're getting an invalid URI error, then this might help you:
http://lucene.472066.n3.nabble.com/Invalid-uri-td3742047.html

Remi

On Tue, May 1, 2012 at 7:25 PM, Roberto Gardenier
<r.garden...@simgroep.nl> wrote:

> Hello,
>
>
>
> I'm currently trying to crawl a site which uses hashtags in the URLs. I don't
> seem to get any results and I'm hoping I'm just overlooking something.
>
> I have created a JIRA bug report because I was not aware of the existence of
> this mailing list. It's my first time using such channels, so I hope I am
> sending this message correctly.
>
> Link: https://issues.apache.org/jira/browse/NUTCH-1343
>
>
>
> The site structure that I'm trying to index is as follows:
>
> http://domain.com (landing page)
>
> http://domain.com/#/page1
>
> http://domain.com/#/page1/subpage1
>
> http://domain.com/#/page2
>
> http://domain.com/#/page2/subpage1
>
> and so on.
>
>
>
> I've pointed Nutch to http://domain.com as the start URL, and in my filter I've
> placed all kinds of rules.
>
> First I thought this would be sufficient:
>
> +http\://domain\.com\/#
>
> But then I realised that # is used for comments, so I escaped it:
>
> +http\://domain\.com\/\#
>
>
>
> Still no results. So I thought I could use the asterisk for it:
>
> +http\://domain\.com\/*
>
> Still no luck. So I started trying various regular expressions, but without
> success.
>
>
>
> I noticed the following messages in hadoop.log:
>
> INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>
> I've researched this setting, but I don't know for sure whether it affects my
> problem in any way. This property is set to false in my configs.
>
>
>
> I don't know if this is even related to the situation above, but maybe it
> helps.
>
>
>
> Any help is very much appreciated! I've tried googling the problem, but I
> couldn't find documentation or anyone else with this problem.
>
>
>
> Many thanks in advance.
>
>
>
> With kind regards,
>
> Roberto Gardenier
>
>
>
>
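
For anyone finding this thread later, here is a minimal sketch of two things that may explain the missing results. It is written in Python rather than Nutch's Java, and Python's regex engine only approximates what the URL filter does, so treat it as an illustration and not as Nutch's actual behaviour. The URLs are the placeholder ones from the message above. The first part shows that the part after '#' (the fragment) is not part of the HTTP request, so every listed URL resolves to the same resource unless the crawler executes the site's JavaScript. The second part shows that '\/*' in a regex means "zero or more slashes", not "anything after the slash".

```python
import re
from urllib.parse import urldefrag

# Placeholder URLs mirroring the site structure in the quoted message.
urls = [
    "http://domain.com/",
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
]

# Strip the fragment from each URL; what is left is what an HTTP client
# would actually request from the server.
requested = {urldefrag(u).url for u in urls}
print(requested)  # a single distinct URL remains

# '\/*' repeats the slash zero or more times; '\/.*' is "slash, then anything".
slash_star = re.compile(r"http\://domain\.com\/*")
slash_any = re.compile(r"http\://domain\.com\/.*")

print(slash_star.match(urls[1]).group(0))  # stops right after the slash
print(slash_any.match(urls[1]).group(0))   # covers the whole URL
```

If the pages are only distinguishable by their fragment, a filter rule alone is unlikely to help, because the fetched document is the same for all of them.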
