Hi Remi,

Thank you so much for your reply. We have decided not to take any further action on this matter, as it is no longer necessary. Still, I would like to thank you for your time!

Kind regards,
Roberto Gardenier

-----Original Message-----
From: remi tassing [mailto:tassingr...@gmail.com]
Sent: Wednesday, 2 May 2012 2:21
To: user@nutch.apache.org
Subject: Re: Crawl sites with hashtags in url

Hi Roberto,

If you're getting an invalid URI error, then this might help you:
http://lucene.472066.n3.nabble.com/Invalid-uri-td3742047.html

Remi

On Tue, May 1, 2012 at 7:25 PM, Roberto Gardenier <r.garden...@simgroep.nl> wrote:

> Hello,
>
> I'm currently trying to crawl a site which uses hashtags in the URLs. I
> don't seem to get any results and I'm hoping I'm just overlooking something.
>
> I have created a JIRA bug report because I was not aware of the existence
> of this mailing list. It's my first time using such channels, so I hope I
> am sending this message correctly.
>
> Link: https://issues.apache.org/jira/browse/NUTCH-1343
>
> The structure of the site that I'm trying to index is as follows:
>
> http://domain.com (landing page)
> http://domain.com/#/page1
> http://domain.com/#/page1/subpage1
> http://domain.com/#/page2
> http://domain.com/#/page2/subpage1
> and so on.
>
> I've pointed Nutch to http://domain.com as the start URL and in my filter
> I've placed all kinds of rules.
>
> First I thought this would be sufficient:
>
> +http\://domain\.com\/#
>
> But then I realised that # is used for comments, so I escaped it:
>
> +http\://domain\.com\/#
>
> Still no results. So I thought I could use the asterisk for it:
>
> +http\://domain\.com\/*
>
> Still no luck. So I started trying various regular expressions, but
> without success.
>
> I noticed the following message in hadoop.log:
>
> INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>
> I've researched this setting, but I don't know for sure whether it affects
> my problem in any way. This property is set to false in my configs.
>
> I don't know if this is even related to the situation above, but maybe it
> helps.
>
> Any help is very much appreciated!
> I've tried googling the problem, but I couldn't find documentation or
> anyone else with this problem.
>
> Many thanks in advance.
>
> With kind regards,
>
> Roberto Gardenier
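
A note for anyone who finds this thread later: one possible reason the filter rules above make no difference is that everything after # in a URL is a fragment, and HTTP clients do not send the fragment to the server. The sketch below uses Python's standard urllib.parse (it is not Nutch code) to show that the example URLs from the original message all reduce to one address once the fragment is dropped. The helper name strip_fragment is made up for illustration, and whether Nutch's URL normalizers behave the same way in a given setup is an assumption, not something confirmed in this thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Example URLs taken from the original message.
urls = [
    "http://domain.com",
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def strip_fragment(url: str) -> str:
    """Hypothetical helper: drop the fragment and use '/' for an empty path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Rebuild the URL with an empty fragment component.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


# Collect the distinct addresses a fragment-stripping client would request.
unique = {strip_fragment(u) for u in urls}
print(sorted(unique))
```

If that assumption holds, a crawler sees only the landing page no matter what the URL filter allows, and the #/page content would have to be reached some other way, for example through whatever URLs the site's JavaScript requests in the background.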