Hi Folks,

I have a question on crawling URLs with query strings. I am crawling about 10,000 sites. Some of the sites use query strings to serve their content while others use plain URLs. For example, I have the following cases:

Case 1: site1.com/article1, site1.com/article2
Case 2: site2.com/?pid=123, site2.com/?pid=124

The only way to crawl and fetch the webpages/articles in Case 2 is to fetch URLs containing "?", while for Case 1 I can configure the crawler NOT to fetch URLs with "?". So currently in my regex-urlfilter.txt I have commented out the following lines so the crawler fetches URLs with query strings:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

This setting causes the crawler to fetch all URLs, including those with query strings, so pages such as downloads, logins, comments, search results, printer-friendly views, zoomed views and other low-value pages are being fetched. In practice, the crawler is going into the deep web. The undesirable consequences are:

1. Duplicate pages are fetched, bloating the crawl DB (printer-friendly and zoomed views), e.g.:
   site1.com/article1
   site1.com/article1/?view=printerfriendly
   site1.com/article1/?zoom=large
   site1.com/article1/?zoom=extralarge

2. Download pages are fetched, making the segments too large, e.g.:
   site1.com/getcontentID?id=1&format=pdf
   site1.com/getcontentID?id=1&format=doc

3. Crawling takes a very long time (10 days for depth 5) since it is going into the deep web.

My current solution is to add additional regexes to regex-urlfilter.txt to prevent the crawler from fetching the undesired pages. Now I have two further problems:

1. The regexes to exclude undesired URL patterns can never be exhaustive, since there are many sites and many patterns. So the crawler still goes into the deep web.
2. The exclusion list is getting too long; so far it has about 50 regexes.

I hope I am not the only one with this problem and that someone knows a smarter way to solve it. Does anybody have a solution or suggestion? Any tips or pointers would be very much appreciated.

Btw, I am using Nutch 1.2, but I believe the crawler principle is pretty much the same.
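One direction that may help (a sketch, not a complete answer): Nutch's RegexURLFilter applies the rules in regex-urlfilter.txt top to bottom, and the first matching rule decides. So instead of enumerating every bad pattern, you can whitelist the few query-string patterns you actually need above the general "?" exclusion, then let the default rule drop everything else. The hostnames and the pid parameter below are illustrative only; substitute whatever patterns your Case 2 sites really use.

```
# regex-urlfilter.txt sketch -- rules are evaluated top to bottom,
# first match wins; hostnames and parameter names here are examples

# allow the query-string URLs we actually need (Case 2 sites)
+^https?://(www\.)?site2\.com/\?pid=[0-9]+$

# restore the default exclusion for all other probable-query URLs
-[?*!@=]

# accept everything else (plain URLs from Case 1 sites)
+.
```

With this ordering the Case 2 article URLs pass the first rule, while printer-friendly, zoom, download and other query-string URLs fall through to the "-[?*!@=]" rule and are skipped, so the exclusion list stays short.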
Warm Regards,
Ye

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
Sent from the Nutch - User mailing list archive at Nabble.com.

