Hi Ye,

Can you add this pattern to the regex-normalize.xml configuration file for
the RegexURLNormalizer class?

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

It removes session ids (such as jsessionid and PHPSESSID) as well as view
and zoom parameters from URLs, so that e.g.

site1.com/article1/?view=printerfriendly
site1.com/article1/?zoom=large
site1.com/article1/?zoom=extralarge

all normalize to

site1.com/article1
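For anyone who wants to sanity-check the rule outside of Nutch, below is a
minimal standalone sketch using plain java.util.regex with the same pattern
and $4 substitution. It is not the RegexURLNormalizer itself, and the
trailing-"?" cleanup at the end is an extra assumption here (the stock
regex-normalize.xml ships with similar cleanup rules for leftover "?" and
"&", but verify against your copy):

import java.util.regex.Pattern;

public class SessionParamNormalizeDemo {

    // Same pattern and substitution as the proposed <regex> entry above.
    private static final Pattern SESSION_PARAM = Pattern.compile(
            "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\\?|&|#|$)");

    // Assumed cleanup for a "?" or "&" left dangling once its parameter is removed.
    private static final Pattern TRAILING_QUERY = Pattern.compile("[?&]+$");

    static String normalize(String url) {
        // Keep only the delimiter that followed the removed parameter ($4).
        String out = SESSION_PARAM.matcher(url).replaceAll("$4");
        return TRAILING_QUERY.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("site1.com/article1/?view=printerfriendly")); // site1.com/article1/
        System.out.println(normalize("site1.com/article1/?zoom=large"));            // site1.com/article1/
        System.out.println(normalize("site1.com/article1;jsessionid=ABC123?x=1"));  // site1.com/article1?x=1
    }
}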
On Sun, Feb 24, 2013 at 9:48 PM, Ye T Thet <[email protected]> wrote:

> Tejas,
>
> Thanks for your pointers. They are really helpful. As of now my approach
> follows your directions 1, 2 and 3. Since my sites are around 10k in
> number, I hope it will be manageable for the near future.
>
> I might need to apply your directions 4 and 5 in the future as well, but
> I believe it might be out of my league to get them right.
>
> Some extra information on my approach: most of my target sites use a CMS,
> and quite a number of them DO NOT use pretty URLs. I have been grepping
> the logs to identify the patterns of redundant or unimportant URLs and
> adding regex rules to regex-urlfilter. 2 million URLs is quite hard to
> process for one man though. Phew!
>
> I will share if I find an approach that could benefit us all.
>
> Regards,
>
> Ye
>
> On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil <[email protected]> wrote:
>
> > One correction in red below.
> >
> > On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]> wrote:
> >
> > > I think that what you have done till now is logical. Typically in
> > > Nutch crawls people don't want URLs with query strings, but nowadays
> > > things have changed. For instance, category #2 you pointed out may
> > > capture some vital pages. I once ran into a similar issue. A crawler
> > > can't be made intelligent beyond a certain point, and I had to go
> > > through the crawl logs to check which URLs were being fetched and
> > > later redefine my regex rules.
> > >
> > > Some things that I had considered doing:
> > > 1. Start off with rules which are less restrictive and observe the
> > > logs to see which URLs are visited. This will give you an idea about
> > > the bad URLs and the good ones. As you have already crawled for 10
> > > days, you are (just!!) left with studying the logs.
> > > 2. After #1 is done, launch crawls with accept rules for the good
> > > URLs and put a "-." at the end to avoid the bad URLs.
> > > 3. Having a huge list of regexes is a bad thing because comparing
> > > URLs against regexes is a costly operation and is done for every URL.
> > > A URL getting a match early saves this time. So put patterns which
> > > capture a huge set of URLs at the top of the regex urlfilter file.
> > > 4. Sometimes you don't want the parser to extract URLs from certain
> > > areas of a page because you know they are not going to yield anything
> > > good. Let's say the "print" or "zoom" URLs come from some specific
> > > tags of the HTML source. It is better not to parse those things and
> > > thus not have those URLs in the first place. The profit here is that
> > > the regex rules to be defined are reduced.
> > > 5. An improvement over #4: if you know the nature of the pages being
> > > crawled, you can tweak the parsers to extract URLs from specific tags
> > > only. This reduces noise and gives a much cleaner fetch list.
> > >
> > > As far as I feel, this problem won't have an automated solution like
> > > modifying some config/setting. There is a decent amount of human
> > > intervention required to get things right. Knowing the nature of the
> > > pages you plan to crawl is vital for making smart decisions.
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > > On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:
> > >
> > >> Hi Folks,
> > >>
> > >> I have a question on crawling URLs with query strings. I am crawling
> > >> about 10,000 sites. Some of the sites use query strings to serve
> > >> their content while some use simple URLs. For example, I have the
> > >> following cases:
> > >>
> > >> Case 1:
> > >> site1.com/article1
> > >> site1.com/article2
> > >>
> > >> Case 2:
> > >> site2.com/?pid=123
> > >> site2.com/?pid=124
> > >>
> > >> The only way to crawl and fetch webpages/articles in case 2 is to
> > >> fetch URLs with a query string "?", while for case 1 I can choose
> > >> NOT to fetch "?" in URLs. Thus, in my regex-urlfilter.txt I
> > >> commented out the following lines so that my crawler fetches URLs
> > >> with query strings:
> > >>
> > >> # skip URLs containing certain characters as probable queries, etc.
> > >> #-[?*!@=]
> > >>
> > >> The above setting causes the crawler to fetch all URLs, including
> > >> URLs with query strings, so pages such as downloads, logins,
> > >> comments, search queries, printer-friendly pages, zoomed-in views
> > >> and other non-valuable pages are being fetched. Practically, the
> > >> crawler is going into the deep web. The undesirable consequences
> > >> are as follows:
> > >>
> > >> 1. Duplicate pages are being fetched, bloating the crawl DB
> > >> (printer-friendly views, zoomed-in views), e.g.
> > >> site1.com/article1
> > >> site1.com/article1/?view=printerfriendly
> > >> site1.com/article1/?zoom=large
> > >> site1.com/article1/?zoom=extralarge
> > >>
> > >> 2. Download pages are being fetched, making the segments too large,
> > >> e.g.
> > >> site1.com/getcontentID?id=1&format=pdf
> > >> site1.com/getcontentID?id=1&format=doc
> > >>
> > >> 3. Crawling takes a very long time (10 days for depth 5) since it
> > >> is going into the deep web.
> > >>
> > >> My current solution is to add additional regexes to
> > >> regex-urlfilter.txt to prevent the crawler from fetching the
> > >> undesired pages. Now I have other problems:
> > >> 1. The regexes to exclude undesired URL patterns are not
> > >> exhaustive, for there are many sites and many patterns. Thus the
> > >> crawler is still going into the deep web.
> > >> 2. The list of exclusion regexes is getting too long: so far 50
> > >> regexes to exclude URL patterns.
> > >>
> > >> I hope I am not the only one with this problem and that someone
> > >> knows a smarter way to solve it. Does anybody have a solution or
> > >> suggestion? Some tips or direction would be very much appreciated.
> > >>
> > >> By the way, I am using Nutch 1.2, but I believe the crawler
> > >> principle is pretty much the same.
> > >>
> > >> Warm Regards,
> > >>
> > >> Ye
> > >>
> > >> --
> > >> View this message in context:
> > >> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
> > >> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
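On the filtering side, here is a rough sketch of the regex-urlfilter.txt
layout Tejas suggests in the thread above: cheap, broad reject rules first
(rules are tried top-down for every URL), explicit accept rules for the
known-good URL shapes, and a final "-." to drop everything else. The
hostnames and parameter names are only placeholders taken from the examples
in this thread, not a drop-in file:

# Reject the obvious noise early; the broadest, cheapest patterns go first.
-(?i)[?&](view|zoom|format|print)=
-\.(gif|jpg|png|css|js|ico|pdf|doc|zip)$

# Accept only the URL shapes we actually want (placeholders from this thread).
+^https?://(www\.)?site1\.com/article[0-9]+/?$
+^https?://(www\.)?site2\.com/\?pid=[0-9]+$

# Anything that falls through is rejected (Tejas's point 2 above).
-.

With the normalization rule at the top of this message in place, many of the
duplicate view/zoom URLs should collapse before they ever reach the filter,
which also helps keep the exclusion list short.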

