Crawl sites with hashtags in url
--------------------------------

                 Key: NUTCH-1343
                 URL: https://issues.apache.org/jira/browse/NUTCH-1343
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
            Reporter: Roberto Gardenier
            Priority: Blocker

Hello,

I'm currently trying to crawl a site which uses hashtags in its URLs. I don't seem to get any results, and I'm hoping I'm just overlooking something.

The site structure is as follows:

http://domain.com (landing page)
http://domain.com/#/page1
http://domain.com/#/page1/subpage1
http://domain.com/#/page2
http://domain.com/#/page2/subpage1
and so on.

I've pointed Nutch to http://domain.com as the start URL, and in my filter I've placed all kinds of rules. First I thought this would be sufficient:

+http\://domain\.com\/#

But then I realised that # is used for comments, so I escaped it:

+http\://domain\.com\/\#

Still no results. So I thought I could use the asterisk for it:

+http\://domain\.com\/*

Still no luck. So I started trying various other regular expressions, but without success.

I noticed the following message in hadoop.log:

INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off

I've researched this setting, but I don't know for sure whether it affects my problem. The property is set to false in my configs. I don't know if it is even related to the situation above, but maybe it helps.

Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation or anyone else with this problem.

Many thanks in advance.

With kind regards,
Roberto Gardenier
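
A note on the URLs listed in this report: in a URL, everything after the # is the fragment, which an HTTP client handles locally and does not send to the server. The sketch below is plain Python, not Nutch code, and fetch_target is a hypothetical helper name chosen for illustration. It only shows that all of the example URLs reduce to the same request target, which may explain why filter rules alone yield no additional pages. How Nutch 1.4 itself treats the fragment during normalization and filtering is not confirmed here.

```python
from urllib.parse import urlsplit, urlunsplit

# Example URLs from the report (domain.com is the reporter's placeholder).
urls = [
    "http://domain.com",
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def fetch_target(url: str) -> str:
    """Return the URL an HTTP client would actually request.

    Hypothetical helper: drops the fragment and uses "/" for an empty path.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    # The last tuple element is the fragment; leaving it empty removes it.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


# Collect the distinct request targets for all example URLs.
distinct = {fetch_target(u) for u in urls}
print(distinct)
```

If the page content behind each #/... route is loaded by JavaScript in the browser, a crawler that only fetches the landing page would see none of it, regardless of which filter rule matches.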