[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13265751#comment-13265751 ]

Roberto Gardenier commented on NUTCH-1343:
--

Markus Jelsma,

I was notified that you have closed my Jira ticket, changing its resolution status to Invalid. Why was the ticket closed and marked invalid, given that I did not commit any changes or solutions?

With kind regards,
Roberto Gardenier

-----Original message-----
From: Markus Jelsma (JIRA) [mailto:j...@apache.org]
Sent: Tuesday, 1 May 2012 13:40
To: r.garden...@simgroep.nl
Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url

[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1343.

Resolution: Invalid

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Crawl sites with hashtags in url

Key: NUTCH-1343
URL: https://issues.apache.org/jira/browse/NUTCH-1343
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Roberto Gardenier
Priority: Blocker

Hello,

I'm currently trying to crawl a site which uses hashtags in its URLs. I don't get any results, and I'm hoping I'm just overlooking something.

The site structure is as follows:

http://domain.com (landing page)
http://domain.com/#/page1
http://domain.com/#/page1/subpage1
http://domain.com/#/page2
http://domain.com/#/page2/subpage1
and so on.

I've pointed Nutch to http://domain.com as the start URL, and in my filter I've placed all kinds of rules. First I thought this would be sufficient:

+http\://domain\.com\/#

But then I realised that # is used for comments, so I escaped it:

+http\://domain\.com\/\#

Still no results. So I thought I could use the asterisk for it:

+http\://domain\.com\/*

Still no luck. So I started trying various other regular expressions, but without success.

I noticed the following message in hadoop.log:

INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off

I've researched this setting, but I don't know for sure whether it affects my problem. This property is set to false in my configs. I don't know if this is even related to the situation above, but maybe it helps.

Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation or anyone else with this problem.

Many thanks in advance.

With kind regards,
Roberto Gardenier
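
A note on the URLs in the report: the part of a URL after `#` is a fragment, which a client resolves locally and does not send to the server in an HTTP request (RFC 3986). Crawlers commonly drop fragments when normalizing URLs, so filter rules that contain `#` may never see one. Whether that is the cause in this Nutch 1.4 setup is an assumption and is not confirmed anywhere in this thread. The sketch below uses Python's standard library, not Nutch code, and `strip_fragment` is a hypothetical helper name; it only illustrates that the example URLs all collapse to the same fetchable address once the fragment is removed.

```python
from urllib.parse import urldefrag

# URLs taken from the report; domain.com is the reporter's placeholder host.
urls = [
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def strip_fragment(url: str) -> str:
    """Return the URL without its '#...' fragment (hypothetical helper)."""
    return urldefrag(url).url


# Collect the distinct addresses a fetcher would actually request.
distinct = {strip_fragment(u) for u in urls}
print(distinct)
```

If the fragment is indeed being dropped, every "page" on such a site is the same document from the crawler's point of view, and the per-page content is presumably produced by client-side JavaScript that a plain HTTP fetch does not execute.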
[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13265758#comment-13265758 ]

Markus Jelsma commented on NUTCH-1343:
--

Questions should be asked on the mailing list, as you just did. Concrete bugs and changes can be filed in Jira. Please check the mailinglist for replies to your inquiry.
[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13265759#comment-13265759 ]

Roberto Gardenier commented on NUTCH-1343:
--

Thank you for your response. I will check the mailinglist for any possible reactions. Thank you very much.