Re: Microsoft flooding sites with fake traffic
Paul Lussier wrote:
> Ben Scott [EMAIL PROTECTED] writes:
>> In particular, remember Hanlon's razor.
>
> Does that have more or less blades than the competing Occam's Razor? And how do either of those compare with Vipul's Razor? And are *any* of those better than the Motorola Razr?

Or Grey's Law: "Any sufficiently advanced incompetence is indistinguishable from malice."

Interesting discussion at http://en.wikipedia.org/wiki/Hanlon's_razor

Years ago, a friend and I were on the Redmond campus for a couple days of indoctrination^H^H^H er, brainwash^H^H^H^H er, meetings, and at lunch break, he walked the length and breadth of the campus (it was a lot smaller back then). There is no One Microsoft Way, no street by that name, no postal stop. That's not their address, it's their statement of philosophy. But the sad truth is there is no Microsoft Way. That's worth contemplating a bit, as it says more than Microsoft probably intended.

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
Re: Microsoft flooding sites with fake traffic
On Feb 21, 2008, at 10:00, Arc Riley wrote:
> msnbot accesses robots.txt more than any other search engine (seconded by Yahoo! Slurp).

I had an e-commerce client DoS'ed by MSNBot during the holiday season. It was downloading 40GB of dynamic pages per day, for a site with 4GB of possible data (I crawled it myself to measure). The site as-idle could handle that kind of traffic but during peak shopping it was the proverbial straw.

I wound up counting up the total number of possible URIs on the site and dividing it into the number of seconds in a month, and gave MSNBot:

    Crawl-delay: 320

in robots.txt to give it one copy per month. It seems to have worked.

I found a webpage describing this problem that dated from Summer of '06. Raise your hand if you're shocked...

-Bill

-
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
[EMAIL PROTECTED]               Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf
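Bill's arithmetic can be sketched roughly as follows. This is only an illustration: the URI count of 8100 is a hypothetical figure chosen so that the result matches the 320 he quotes, the 30-day month is an assumption, and `Crawl-delay` is an extension that only some crawlers honor.

```python
# Assumption: a "month" is taken as 30 days (2,592,000 seconds).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def crawl_delay(total_uris, period_seconds=SECONDS_PER_MONTH):
    """Seconds to wait between requests so that fetching every URI
    once takes about one period."""
    if total_uris <= 0:
        raise ValueError("total_uris must be positive")
    return period_seconds // total_uris


def robots_txt_stanza(user_agent, delay):
    """Render a robots.txt stanza carrying the computed delay."""
    return f"User-agent: {user_agent}\nCrawl-delay: {delay}\n"


# 8100 is a made-up URI count, not a figure from the thread.
delay = crawl_delay(8100)
print(robots_txt_stanza("msnbot", delay))
```

Integer division rounds the delay down, so a full pass finishes slightly inside the month rather than slightly over it.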
Re: Microsoft flooding sites with fake traffic
On Wed, 20 Feb 2008 21:34:53 -0500 Ben Scott [EMAIL PROTECTED] wrote:
> Those two both seem rather unlikely. In particular, remember Hanlon's razor. My guess is some kind of crawler robot.
>
> You block Google from indexing your site, too, then, right?

FWIW I know nothing from the technical side of this, but I mentioned this to someone who works at MSFT and their first comment was that it was likely Live Search crawling to build an index.

Ed Lawson
Re: Microsoft flooding sites with fake traffic
Ed Lawson wrote:
> I know nothing from the technical side of this, but I mentioned this to someone who works at MSFT and their first comment was that it was likely Live Search crawling to build an index.

Except:
- the referrer is a single-word search at search.live.com, e.g. http://search.live.com/results.aspx?q=marketing&mrt=en-us&FORM=LIVSOP
- The client acts like a browser, in that it fetches CSS and JavaScript files as well as the primary page, and the User-Agent seems to be MSIE 7:
  Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

Here is a complete sequence from my logs:

65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] "GET /category/Web-Marketing/ HTTP/1.1" 200 15810 "http://search.live.com/results.aspx?q=marketing&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.51 - - [20/Feb/2008:02:22:18 -0500] "GET /media/public/css/blogcosm.css HTTP/1.1" 200 8114 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/category_detail.css HTTP/1.1" 200 2952 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/toc.css HTTP/1.1" 200 399 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/one-liners.css HTTP/1.1" 200 223 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /css/colors.css HTTP/1.1" 200 4410 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

I seem to have one of these roughly every 1/2 hour though the interval varies widely.

Kent
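A rough sketch of how entries like Kent's could be picked out of a log automatically. It assumes the server writes Apache's combined log format with the request, referer, and user agent in double quotes and the query string's `&` separators intact; the archived copy of the log may not preserve those characters. The function name `live_search_term` is made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

# Apache "combined" format: ip ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def live_search_term(line):
    """Return (ip, path, search term) when the referer is a
    search.live.com results page, otherwise None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    referer = urlparse(m.group("referer"))
    if referer.hostname != "search.live.com":
        return None
    term = parse_qs(referer.query).get("q", [""])[0]
    parts = m.group("request").split()
    path = parts[1] if len(parts) > 1 else ""
    return m.group("ip"), path, term


sample = (
    '65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] '
    '"GET /category/Web-Marketing/ HTTP/1.1" 200 15810 '
    '"http://search.live.com/results.aspx?q=marketing&mrt=en-us&FORM=LIVSOP" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"'
)
print(live_search_term(sample))
```

Grouping the returned tuples by IP and timestamp would show the "page, then CSS" pattern Kent describes, since only the first request in each burst carries the search referer.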
Re: Microsoft flooding sites with fake traffic
Kent Johnson wrote:
> Except:
> - the referrer is a single-word search at search.live.com [...]
> - The client acts like a browser, in that it fetches CSS and JavaScript files as well as the primary page, and the User-Agent seems to be MSIE 7
> [...]
> I seem to have one of these roughly every 1/2 hour though the interval varies widely.

It's not really out of the realm of reality that Microsoft could be using a farm of Windows machines running IE7 to gather the data...

It's also not necessarily out of the realm of reality that their indexing algorithm is trying to find single keyword results. Maybe they perform the union/intersection of multiple search terms on their end.

--
Coleman Kane
Re: Microsoft flooding sites with fake traffic
On Thu, 2008-02-21 at 08:56 -0500, Kent Johnson wrote:
> Except:
> - The client acts like a browser, in that it fetches CSS and JavaScript files as well as the primary page, and the User-Agent seems to be MSIE 7:
>   Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

This *could* be explained by wanting to be able to display a thumbnail version of the website. Just a thought.

--
Cole Tuininga [EMAIL PROTECTED]
Code Energy (http://www.code-energy.com)
Re: Microsoft flooding sites with fake traffic
Cole Tuininga wrote:
> On Thu, 2008-02-21 at 08:56 -0500, Kent Johnson wrote:
>> Except:
>> - The client acts like a browser, in that it fetches CSS and JavaScript files as well as the primary page, and the User-Agent seems to be MSIE 7:
>>   Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
>
> This *could* be explained by wanting to be able to display a thumbnail version of the website. Just a thought.

Just an aside: It is extremely amusing to me that MSIE still identifies as Mozilla/4.0...

--
Coleman Kane
Re: Microsoft flooding sites with fake traffic
> What's your robots.txt look like? Does it forbid this kind of behavior?

The IPs in question never accessed robots.txt. Only MSNBot and family, on different IP blocks, did.

You will see the bad behavior if you log referers. The bots in question claim to be arriving on your site by various search terms via search.live.com and claim to be a normal web browser (MSIE 6 or 7).

If this was an honesty checker, which Google does, verifying that the pages being sent to the crawlers is the same as being sent to normal web browsers, they wouldn't claim to be arriving via search.live.com.

Many of the hits were also to pages specifically forbidden to * User-agent, such as Disallow: /*?

My logs show 95.7% of our traffic via search engines is from Google. MSN, once you take all the hits from Microsoft's networks out of the equation, only results in 0.3%.

> What's the rate like, in requests/time and bytes/time? Are they flooding your site, or slowly crawling it over time?

It changes per day, but the day before yesterday 28.3% of the pagehits were from those subnets. The pages they seemed to be targeting were some of the highest CPU load. Apache was using roughly 60% of the CPU, which dropped to 2% when Microsoft was firewalled. The server in question is an Athlon XP 2200+ with a gig of RAM.

>> ... please join me in blocking them ...
>
> You block Google from indexing your site, too, then, right?

This is not a cry against crawlers. This is a cry against deception and bots behaving badly.

On Thu, Feb 21, 2008 at 9:02 AM, Coleman Kane [EMAIL PROTECTED] wrote:
> It's also not necessarily out of the realm of reality that their indexing algorithm is trying to find single keyword results. Maybe they perform the union/intersection of multiple search terms on their end.

That could be true, if the reported search terms had anything to do with the content on the sites. I could not find a single instance of any of the search terms on our site in the earlier searches, much of which were pornographic or sexual in nature.

The bot that generates thumbnails of the sites and grabs images is msnbot-media
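To make the "Disallow: /*?" rule concrete, here is a minimal sketch of how a wildcard robots.txt rule can be checked against a requested path. It assumes the common convention in which `*` matches any run of characters and a trailing `$` anchors the end of the URL; the example paths are hypothetical and are not taken from anyone's logs.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule with '*' wildcards (and an
    optional trailing '$') into an anchored regular expression."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_disallowed(rule, path_and_query):
    """True when the path (including any query string) matches the rule."""
    return rule_to_regex(rule).match(path_and_query) is not None


# Hypothetical request paths, for illustration only.
print(is_disallowed("/*?", "/changeset/42?format=diff"))
print(is_disallowed("/*?", "/wiki/WikiStart"))
```

Under this reading, the rule covers every URL that carries a query string, which would explain why hits on dynamically generated pages stand out as violations.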
Re: Microsoft flooding sites with fake traffic
On Thu, Feb 21, 2008 at 9:15 AM, Coleman Kane [EMAIL PROTECTED] wrote:
> It is extremely amusing to me that MSIE still identifies as Mozilla/4.0...

All user agents are Mozilla, thanks to brain-damage perpetrated back in the mid-1990's by dumb web developers who assumed that Netscape was the web and used User-Agent to block all other browsers, even compatible ones. Now, of course, we have the same problem in the other direction; one sometimes has to tell Firefox/Opera/etc. to claim to be MSIE.

I hate dumb web developers.

-- Ben
Re: Microsoft flooding sites with fake traffic
On 02/21/2008 09:30 AM, Ben Scott wrote:
> All user agents are Mozilla, thanks to brain-damage perpetrated back in the mid-1990's by dumb web developers who assumed that Netscape was the web and used User-Agent to block all other browsers, even compatible ones. Now, of course, we have the same problem in the other direction; one sometimes has to tell Firefox/Opera/etc. to claim to be MSIE.
>
> I hate dumb web developers.

This e-mail best viewed at 800x600.

-Mark
Re: Microsoft flooding sites with fake traffic
On Thu, Feb 21, 2008 at 9:18 AM, Arc Riley [EMAIL PROTECTED] wrote:
>> What's your robots.txt look like? Does it forbid this kind of behavior?
>
> The IPs in question never accessed robots.txt.
> [...]
> Many of the hits were also to pages specifically forbidden to * User-agent, such as Disallow: /*?

Interesting.

liberty$ find -name access_log\* | xargs egrep -h '^(131\.107|65.5[2-5])' | fgrep robots.txt | wc -l
1453

It appears your server is seeing different behavior than GNHLUG's server. I suppose that could be a malfunction on Microsoft's end, but I can't think of why a cluster of crawlers would malfunction for just some sites. I suppose it could be intentional differentiation, but what would be the point of that?

Have you tried contacting the help desk for Microsoft's crawler?

> It changes per day, but the day before yesterday 28.3% of the pagehits were from those subnets.

Yikes!

> If this was an honesty checker, which Google does, verifying that the pages being sent to the crawlers is the same as being sent to normal web browsers, they wouldn't claim to be arriving via search.live.com.

Why not? I can think of a few scenarios where that might be legit. Following links of saved searches, or some kind of follow-on driven by search results of users.

The high request rate, and the apparent ignorance of robots.txt, on the other hand...

> That could be true, if the reported search terms had anything to do with the content on the sites. I could not find a single instance of any of the search terms on our site in the earlier searches, much of which were pornographic or sexual in nature.

Are you sure you don't have a wiki or tag cloud or comment board or file share or similar application that's been hijacked? Scam artists like to use such to host their content, or crank up their page rank, or spam others. I know they keep trying to hit GNHLUG (we watch it fairly closely and remove any such attempts). I know PySIG had to shut down their wiki, it got nailed so often.

-- Ben
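For anyone who prefers exact network matching over a prefix regex, a rough Python equivalent of the robots.txt count above might look like this. The five /16 networks come from the thread; the sample log lines and the function names are invented for illustration, and the client IP is assumed to be the first whitespace-separated field of each line.

```python
import ipaddress

# The five /16 networks named in the thread.
LISTED_NETS = [
    ipaddress.ip_network(n)
    for n in ("131.107.0.0/16", "65.52.0.0/16", "65.53.0.0/16",
              "65.54.0.0/16", "65.55.0.0/16")
]


def from_listed_nets(ip_text):
    """True when ip_text parses as an address inside one of the networks."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return any(ip in net for net in LISTED_NETS)


def count_robots_hits(lines):
    """Count log lines from the listed networks that mention robots.txt."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields and from_listed_nets(fields[0]) and "robots.txt" in line:
            count += 1
    return count


# Made-up sample lines, not real log data.
sample_lines = [
    '65.55.208.10 - - [20/Feb/2008:01:00:00 -0500] "GET /robots.txt HTTP/1.1" 200 120',
    '65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] "GET /category/ HTTP/1.1" 200 15810',
    '192.0.2.7 - - [20/Feb/2008:03:00:00 -0500] "GET /robots.txt HTTP/1.1" 200 120',
]
print(count_robots_hits(sample_lines))
```

Splitting the count per source address rather than per network would answer the follow-up question of whether the hosts sending the search-referred requests are the same ones that fetch robots.txt.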
Re: Microsoft flooding sites with fake traffic
> liberty$ find -name access_log\* | xargs egrep -h '^(131\.107|65.5[2-5])' | fgrep robots.txt | wc -l
> 1453

Look specifically at the IPs the faked live.com search results are hit from vs those running msnbot.

msnbot accesses robots.txt more than any other search engine (seconded by Yahoo! Slurp).

> Are you sure you don't have a wiki or tag cloud or comment board or file share or similar application that's been hijacked? Scam artists like to use such to host their content, or crank up their page rank, or spam others. I know they keep trying to hit GNHLUG (we watch it fairly closely and remove any such attempts). I know PySIG had to shut down their wiki it got nailed so often.

We have a fairly aggressive anti-spam system set up; no such spam appears on our site, certainly not on the landing pages of these faked live.com search hits, which contain svn changelog diffs.

The only place we've had spam is the ticket system, and none of the hits in question are for /ticket/*. We require reg to file a ticket and monitor regs to isolate spammers before they can hit the site; so far it works fairly well.
Re: Microsoft flooding sites with fake traffic
> Date: Wed, 20 Feb 2008 16:46:37 -0500
> From: Arc Riley [EMAIL PROTECTED]
>
> Do yourselves a favor and search your logs for connections from 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*
>
> All of it, well 97.2%, from the above two subnets, belonging to Microsoft.

I don't seem to have this problem. CF my robots.txt file:

http://peapod.podzone.net:1234/robots.txt

which might explain why. :)

Interestingly, for me, hits from '^(131\.107|65.5[2-5])' seem to fall into two categories:

(1) requests for robots.txt (makes sense)
(2) requests for content pages with *google* as referrer (hm...)

#2 suggests that msnbot is actually crawling *google* in order to populate its *own* search database. Maybe, knowing this, google could give a good schticking to microsoft... by, say, returning pseudo-random results to searches from '^(131\.107|65.5[2-5])' :)
Re: Microsoft flooding sites with fake traffic
Ben Scott [EMAIL PROTECTED] writes:
> In particular, remember Hanlon's razor.

Does that have more or less blades than the competing Occam's Razor? And how do either of those compare with Vipul's Razor? And are *any* of those better than the Motorola Razr?

--
Seeya,
Paul
Microsoft flooding sites with fake traffic
Hey guys

Do yourselves a favor and search your logs for connections from 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*

I found a good % of traffic we got, not reported to Google Analytics so I didn't see it sooner, was referred from http://search.live.com/ for search queries involving pornography, cars, drugs, and random gibberish. The landing pages from these searches were subversion changesets, source code in the Trac browser, and other places those search queries certainly don't exist in.

All of it, well 97.2%, from the above two subnets, belonging to Microsoft. It'd be humorous if I didn't just purchase a new colo server to handle the large volume of traffic pysoy.org gets.

I can't tell if MS is trying to skew the statistics in favor of MSIE/Live/etc or if it's conducting a denial of service attack against free software project sites, perhaps both (two birds with one stone?).

If you see the similar childish behavior in your logs, please join me in blocking them and being very vocal as to why.
Re: Microsoft flooding sites with fake traffic
Arc Riley wrote:
> Do yourselves a favor and search your logs for connections from 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*
> [...]
> If you see the similar childish behavior in your logs, please join me in blocking them and being very vocal as to why.

An interesting find. I just checked my sites and I see the same thing, however most of the search queries seem to be pretty pertinent to the content of the pages that they reference. It is almost like there's some script running on a farm of Windows computers that just performs single-word searches on their Windows LiveSearch database, and visits the results (posting, of course, the LiveSearch referral in the request).

Here's my distribution:

cat apachelogs/* | grep live.com | cut -d' ' -f1 | cut -d. -f1,2 | sort | uniq -c | sort -rn
    308 65.55
     10 131.107
      4 85.159
      3 142.161
      2 71.164
      2 68.95
      2 4.246
      2 207.224
      1 86.144
      1 84.202

There are many, many more with single visits, but I left them off the list because they probably represent normal livesearch users.

--
Coleman Kane
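The same tally can be sketched in Python for anyone without the shell tools handy. This mirrors the pipeline's logic (keep lines mentioning live.com, take the first field, keep the first two octets, count); the sample lines are invented and the function name is made up for illustration.

```python
from collections import Counter


def prefix_counts(lines, needle="live.com"):
    """Count lines containing needle, grouped by the first two octets
    of the client IP (the first space-separated field)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        ip = line.split(" ", 1)[0]
        counts[".".join(ip.split(".")[:2])] += 1
    # Highest counts first, like `sort -rn`.
    return counts.most_common()


# Made-up sample lines, not real log data.
sample_lines = [
    '65.55.165.51 - - [...] "GET /a HTTP/1.1" 200 1 "http://search.live.com/results.aspx?q=x"',
    '65.55.165.87 - - [...] "GET /b HTTP/1.1" 200 1 "http://search.live.com/results.aspx?q=y"',
    '131.107.0.95 - - [...] "GET /c HTTP/1.1" 200 1 "http://search.live.com/results.aspx?q=z"',
    '192.0.2.7 - - [...] "GET /d HTTP/1.1" 200 1 "http://example.org/"',
]
print(prefix_counts(sample_lines))
```

Like the grep it imitates, this matches "live.com" anywhere in the line, so a request path or user agent containing that string would be counted too.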
Re: Microsoft flooding sites with fake traffic
Coleman Kane wrote:
> Here's my distribution:
> [...]
> There are many, many more with single visits, but I left them off the list because they probably represent normal livesearch users.

Went a little further and found that all my 65.55 traffic comes from the 65.55.165 class C. I decided to pass all the visitors to the host program and found that all of the visitors have PTR records like this: livebot-65-55-165-87.search.live.com.

The 131.107 traffic was all from two machines: tide525.microsoft.com and tide526.microsoft.com

Maybe some others could look at their logs and pull information on the other subnets?

--
Coleman Kane
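A small sketch of the reverse-lookup step, in case others want to repeat it over a list of addresses. It uses the standard library's `socket.gethostbyaddr` in place of the host program; the helper names are hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import re
import socket

# Matches names shaped like the PTR record quoted above.
LIVEBOT_RE = re.compile(
    r"^livebot-(\d+)-(\d+)-(\d+)-(\d+)\.search\.live\.com\.?$"
)


def ptr_name(ip, resolver=socket.gethostbyaddr):
    """Return the reverse-DNS name for ip, or None if the lookup fails."""
    try:
        return resolver(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return None


def ptr_matches_ip(ip, name):
    """True when name is a livebot-style PTR whose embedded octets equal ip."""
    m = LIVEBOT_RE.match(name or "")
    return bool(m) and ".".join(m.groups()) == ip


print(ptr_matches_ip("65.55.165.87", "livebot-65-55-165-87.search.live.com."))
```

A PTR record is controlled by whoever runs the reverse zone, so a matching name is suggestive rather than conclusive; resolving the name forward again and comparing it to the address would be a stronger check.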
Re: Microsoft flooding sites with fake traffic
Do you happen to be running google analytics on your site?

On Wed, Feb 20, 2008 at 6:08 PM, Coleman Kane [EMAIL PROTECTED] wrote:
> [...]
> Went a little further and found that all my 65.55 traffic comes from the 65.55.165 class C.
> [...]
Re: Microsoft flooding sites with fake traffic
On Wed, Feb 20, 2008 at 4:46 PM, Arc Riley [EMAIL PROTECTED] wrote:
> Do yourselves a favor and search your logs for connections from 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*

On the GNHLUG web server in /var/log/httpd/ ...

liberty$ find -name access_log\* | xargs egrep '^(131\.107|65.5[2-5])' | wc -l
14293
liberty$ find -name access_log\* | xargs cat | wc -l
185492

We keep logs going back a month, rotated weekly.

We don't log referrals or user agents on the GNHLUG server. Maybe we should.

> All of it, well 97.2%, from the above two subnets, belonging to Microsoft.

Interesting. And a relatively small number of unique hosts (152), given that there are five /16's in question (327680, give or take).

liberty$ find -name access_log\* | xargs egrep -h '^(131\.107|65.5[2-5])' | awk '{ print $1 }' | sort -u > /tmp/hostlist
liberty$ wc -l /tmp/hostlist
152 /tmp/hostlist

The IP addresses DNS-reverse to a few different things. Matching regexp patterns would be:

NXDOMAIN
tide[0-9]+.microsoft.com
b[ly]1sch[0-9]+.phx.gbl
livebot-65-55-[0-9]+-[0-9]+.search.live.com

I also took a look at unique URLs:

liberty$ find -name access_log\* | xargs egrep -h '^(131\.107|65.5[2-5])' | awk -F\" '{ print $2 }' | awk '{ print $2 }' | sort -u > /tmp/urls
liberty$ wc -l /tmp/urls
7237 /tmp/urls

The URLs themselves... hmmm, hard to know for sure with our site, but it looks to me like something is walking the entire site, following every link, including TWiki's search, history, and edit links. Think wget -r.

> I can't tell if MS is trying to skew the statistics in favor of MSIE/Live/etc or if it's conducting a denial of service attack against free software project sites ...

Those two both seem rather unlikely. In particular, remember Hanlon's razor.

My guess is some kind of crawler robot. Possibly a malfunctioning and/or poorly-designed one. (That was my guess before I started digging into logs, by the way. I also guessed maybe a botnet, but the small number of requesting hosts makes that less likely.)

Our numbers show that as roughly 8% of our traffic, by object request count. In the distant past, when we did some traffic analysis, the bulk of the traffic hitting the GNHLUG site was crawler robots. So that doesn't seem out of line.

What's your robots.txt look like? Does it forbid this kind of behavior?

What's the rate like, in requests/time and bytes/time? Are they flooding your site, or slowly crawling it over time?

> ... please join me in blocking them ...

You block Google from indexing your site, too, then, right?

-- Ben
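The figures in Ben's message can be cross-checked with a few lines of arithmetic. The counts below are the ones reported above; the variable names are just for illustration.

```python
import ipaddress

# The five /16 networks under discussion.
NETS = [
    ipaddress.ip_network(n)
    for n in ("131.107.0.0/16", "65.52.0.0/16", "65.53.0.0/16",
              "65.54.0.0/16", "65.55.0.0/16")
]

total_addresses = sum(net.num_addresses for net in NETS)

# Counts as reported from the GNHLUG logs.
matching_lines = 14293
all_lines = 185492
unique_hosts = 152

traffic_share = matching_lines / all_lines
host_fraction = unique_hosts / total_addresses

print(total_addresses)           # 327680
print(f"{traffic_share:.1%}")    # 7.7%
print(f"{host_fraction:.3%}")
```

So the "roughly 8%" is about 7.7% of logged requests, coming from a tiny fraction of the address space in those five networks.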
Re: Microsoft flooding sites with fake traffic
Arc Riley wrote:
> Do you happen to be running google analytics on your site?

No, I'm just parsing the logs. I use awstats (http://awstats.sourceforge.net) for collecting stats from my logs. I'm not really familiar with many of google.com's services.

--
Coleman Kane