One correction in red below.

On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]> wrote:
> I think that what you have done till now is logical. Typically in Nutch crawls people don't want URLs with query strings, but nowadays things have changed. For instance, category #2 you pointed out may capture some vital pages. I once ran into a similar issue. A crawler can't be made intelligent beyond a certain point, and I had to go through the crawl logs to check which URLs were being fetched and later redefine my regex rules.
>
> Some things that I had considered doing:
> 1. Start off with rules which are less restrictive and observe the logs to see which URLs are visited. This will give you an idea of the bad URLs and the good ones. As you have already crawled for 10 days, you are (just!!) left with studying the logs.
> 2. After #1 is done, launch crawls with accept rules for the good URLs and put a "-." at the end to avoid the bad URLs.
> 3. Having a huge list of regexes is a bad thing, because comparing URLs against regexes is a costly operation and is done for every URL. A URL getting a match early saves this time, so put the patterns which capture a huge set of URLs at the top of the regex urlfilter file.
> 4. Sometimes you don't want the parser to extract URLs from certain areas of the page because you know they are not going to yield anything good for you. Let's say the "print" or "zoom" URLs are coming from some specific tags of the HTML source. It's better not to parse those parts and thus not have those URLs in the first place. The profit here is that the regex rules to be defined are now fewer.
> 5. An improvement over *#4* is that, if you know the nature of the pages being crawled, you can tweak the parsers to extract URLs from specific tags only. This reduces noise and gives a much cleaner fetch list.
>
> As far as I feel, this problem won't have an automated solution like modifying some config/setting. There is a decent amount of human intervention required to get things right. Knowing the nature of the pages you plan to crawl is vital for making smart decisions.
>
> Thanks,
> Tejas Patil
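To make points #2 and #3 concrete, a minimal regex-urlfilter.txt along those lines could look like the sketch below. The hosts and patterns are only placeholders built from the example sites in the quoted question further down, not rules from this thread. Because the filter stops at the first matching rule, the broad accept rules for the known-good URL shapes go first so most URLs match early, and the final "-." rejects everything else:

    # accept the known-good URL shapes first, so most URLs match early
    +^https?://(www\.)?site1\.com/article[0-9]+$
    +^https?://(www\.)?site2\.com/\?pid=[0-9]+$
    # anything that falls through to here is rejected
    -.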
> On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:
>
>> Hi Folks,
>>
>> I have a question on crawling URLs with query strings. I am crawling about 10,000 sites. Some of the sites use query strings to serve their content while some use simple URLs. For example, I have the following two cases:
>>
>> Case 1:
>> site1.com/article1
>> site1.com/article2
>>
>> Case 2:
>> site2.com/?pid=123
>> site2.com/?pid=124
>>
>> The only way to crawl and fetch webpages/articles in case 2 is to fetch URLs with the query string "?", while for case 1 I can set the crawler NOT to fetch "?" in URLs. Thus, in my regex-urlfilter.txt I have commented out the following lines so that my crawler fetches URLs with query strings:
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[?*!@=]
>>
>> This setting causes the crawler to fetch all URLs, including URLs with query strings, so pages such as download, login, comment, search-query, printer-friendly and zoom-view pages and other non-valuable pages are being fetched. Practically, the crawler is going into the deep web. The undesirable consequences are the following:
>>
>> 1. Duplicate pages are being fetched, causing the crawl DB to be bloated
>> - printer-friendly view, zoom view
>> e.g. site1.com/article1
>> e.g. site1.com/article1/?view=printerfriendly
>> e.g. site1.com/article1/?zoom=large
>> e.g. site1.com/article1/?zoom=extralarge
>>
>> 2. Download pages are being fetched, causing the segments to be too large
>> e.g. site1.com/getcontentID?id=1&format=pdf
>> e.g. site1.com/getcontentID?id=1&format=doc
>>
>> 3. Crawling takes a very long time (10 days for depth 5) since it is going into the deep web.
>>
>> My current solution is to add additional regexes to regex-urlfilter.txt to prevent the crawler from fetching the undesired pages. Now I have two other problems:
>> 1. The regexes to exclude undesired URL patterns are not exhaustive, for there are many sites and many patterns, so the crawler is still going into the deep web.
>> 2. The list of exclude regexes is getting too long; so far there are 50 regexes just to exclude URL patterns.
>>
>> I hope I am not the only one with this problem and that someone knows a smarter way to solve it. Does anybody have a solution or suggestion on how to solve this? Some tips or direction would be very much appreciated.
>>
>> Btw, I am using Nutch 1.2, but I believe the crawler principle is pretty much the same.
>>
>> Warm Regards,
>>
>> Ye
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
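For the specific noise listed in the quoted question (printer-friendly and zoom views, download formats, login/search/comment pages), an alternative to a long per-site blacklist is a short set of targeted reject rules placed before a catch-all accept. Because the regex urlfilter stops at the first matching rule, the order matters. The parameter and path names below are only the ones taken from the examples in the question and would need adjusting per site:

    # reject the known noise before the catch-all accept
    -[?&](view=printerfriendly|zoom=)
    -getcontentID\?.*format=(pdf|doc)
    -/(login|search|comment)
    # accept everything else, including the ?pid= style article URLs
    +.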

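On points *#4*/#5 (extracting outlinks only from specific parts of a page), the general idea, sketched here outside of Nutch's own parse plugins, looks roughly like the snippet below. It uses jsoup purely as an illustration, and the "article-body"/"toolbar" selectors are made-up placeholders for whatever containers hold the real content and the print/zoom widgets on a given site:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.ArrayList;
    import java.util.List;

    public class ArticleLinkExtractor {
        // Collect outlinks only from the main content area, skipping the
        // regions that typically hold "print", "zoom" or download links.
        public static List<String> extractLinks(String html, String baseUri) {
            Document doc = Jsoup.parse(html, baseUri);
            // Drop the noisy regions first (placeholder selectors).
            doc.select("div.toolbar, div.print-tools").remove();
            List<String> links = new ArrayList<String>();
            // Only follow anchors inside the main article container (placeholder selector).
            for (Element a : doc.select("div.article-body a[href]")) {
                links.add(a.attr("abs:href")); // resolve relative URLs against baseUri
            }
            return links;
        }
    }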
