Re: Microsoft flooding sites with fake traffic

2008-02-22 Thread Ted Roche
Paul Lussier wrote:
 Ben Scott [EMAIL PROTECTED] writes:
 
 In particular, remember Hanlon's razor.
 
 Does that have more or less blades than the competing Occam's Razor?
 And how do either of those compare with Vipul's Razor?
 
 And are *any* of those better than the Motorola Razr?

Or Grey's Law: Any sufficiently advanced incompetence is 
indistinguishable from malice

Interesting discussion at http://en.wikipedia.org/wiki/Hanlon's_razor

Years ago, a friend and I were on the Redmond campus for a couple days 
of indoctrination^H^H^H er, brainwash^H^H^H^H er, meetings, and at lunch 
break, he walked the length and breadth of the campus (it was a lot 
smaller back then). There is no One Microsoft Way, no street by that 
name, no postal stop. That's not their address, it's their statement of 
philosophy. But the sad truth is there is no Microsoft Way. That's 
worth contemplating a bit, as it says more than Microsoft probably intended.
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: Microsoft flooding sites with fake traffic

2008-02-22 Thread Bill McGonigle
On Feb 21, 2008, at 10:00, Arc Riley wrote:

 msnbot accesses robots.txt more than any other
 search engine (seconded by Yahoo! Slurp).

I had an e-commerce client DoS'ed by MSNBot during the holiday
season.  It was downloading 40GB of dynamic pages per day, for a site
with 4GB of possible data (I crawled it myself to measure).  The site,
when idle, could handle that kind of traffic, but during peak shopping
it was the proverbial straw.

I wound up counting the total number of possible URIs on the site,
dividing the number of seconds in a month by that count, and giving MSNBot:

   Crawl-delay: 320

in robots.txt to give it one copy per month.  It seems to have worked.
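
For anyone wanting to redo that arithmetic for their own site, here is a minimal sketch.  It assumes a 30-day month; the function name crawl_delay and the figure of 8100 URIs are made up for illustration (8100 simply reproduces the 320 above; the client's real count isn't stated).  Keep in mind that Crawl-delay is a non-standard robots.txt extension, so not every crawler honors it.

```shell
#!/bin/sh
# Hypothetical helper: seconds in a month divided by the number of URIs.
# Assumes a 30-day month (2,592,000 seconds); integer division rounds down.
crawl_delay() {
    uri_count=$1
    seconds_per_month=$((30 * 24 * 60 * 60))
    echo $((seconds_per_month / uri_count))
}

# 8100 URIs is an illustrative figure, not a measured one.
crawl_delay 8100    # prints: 320
```

The result is the value to put after "Crawl-delay:" in the stanza for the crawler you want to slow down.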

I found a webpage describing this problem that dated from Summer of  
'06.  Raise your hand if you're shocked...

-Bill

-
Bill McGonigle, Owner   Work: 603.448.4440
BFC Computing, LLC  Home: 603.448.1668
[EMAIL PROTECTED]   Cell: 603.252.2606
http://www.bfccomputing.com/Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf



Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Ed lawson
On Wed, 20 Feb 2008 21:34:53 -0500
Ben Scott [EMAIL PROTECTED] wrote:



   Those two both seem rather unlikely.  In particular, remember
 Hanlon's razor.   My guess is some kind of crawler robot.

   You block Google from indexing your site, too, then, right?
 

FWIW

I know nothing from the technical side of this, but I mentioned this to
someone who works at MSFT and their first comment was that it was
likely Live Search crawling to build an index.

Ed Lawson


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Kent Johnson
Ed lawson wrote:

 I know nothing from the technical side of this, but I mentioned this to
 someone who works at MSFT and their first comment was that it was
 likely Live Search crawling to build an index.

Except:
- the referrer is a single-word search at search.live.com, e.g.
http://search.live.com/results.aspx?q=marketing&mrt=en-us&FORM=LIVSOP

- The client acts like a browser, in that it fetches CSS and JavaScript 
files as well as the primary page, and the User-Agent seems to be MSIE 7:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

Here is a complete sequence from my logs:
65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] "GET /category/Web-Marketing/ HTTP/1.1" 200 15810 "http://search.live.com/results.aspx?q=marketing&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

65.55.165.51 - - [20/Feb/2008:02:22:18 -0500] "GET /media/public/css/blogcosm.css HTTP/1.1" 200 8114 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/category_detail.css HTTP/1.1" 200 2952 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/toc.css HTTP/1.1" 200 399 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /media/public/css/one-liners.css HTTP/1.1" 200 223 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] "GET /css/colors.css HTTP/1.1" 200 4410 "http://blogcosm.com/category/Web-Marketing/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"


I seem to have one of these roughly every 1/2 hour though the interval 
varies widely.

Kent


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Coleman Kane
Kent Johnson wrote:
 Ed lawson wrote:

   
 I know nothing from the technical side of this, but I mentioned this to
 someone who works at MSFT and their first comment was that it was
 likely Live Search crawling to build an index.
 

 Except:
 - the referrer is a single-word search at search.live.com, e.g.
 http://search.live.com/results.aspx?q=marketingmrt=en-usFORM=LIVSOP

 - The client acts like a browser, in that it fetches CSS and JavaScript 
 files as well as the primary page, and the User-Agent seems to be MSIE 7:
 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 Here is a complete sequence from my logs:
 65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] GET 
 /category/Web-Marketing/ HTTP/1.1 200 15810 
 http://search.live.com/results.aspx?q=marketingmrt=en-usFORM=LIVSOP; 
 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 65.55.165.51 - - [20/Feb/2008:02:22:18 -0500] GET 
 /media/public/css/blogcosm.css HTTP/1.1 200 8114 
 http://blogcosm.com/category/Web-Marketing/; Mozilla/4.0 (compatible; 
 MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] GET 
 /media/public/css/category_detail.css HTTP/1.1 200 2952 
 http://blogcosm.com/category/Web-Marketing/; Mozilla/4.0 (compatible; 
 MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] GET 
 /media/public/css/toc.css HTTP/1.1 200 399 
 http://blogcosm.com/category/Web-Marketing/; Mozilla/4.0 (compatible; 
 MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] GET 
 /media/public/css/one-liners.css HTTP/1.1 200 223 
 http://blogcosm.com/category/Web-Marketing/; Mozilla/4.0 (compatible; 
 MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

 65.55.165.51 - - [20/Feb/2008:02:22:19 -0500] GET /css/colors.css 
 HTTP/1.1 200 4410 http://blogcosm.com/category/Web-Marketing/; 
 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)


 I seem to have one of these roughly every 1/2 hour though the interval 
 varies widely.

 Kent
   
It's not really out of the realm of reality that Microsoft could be
using a farm of Windows machines running IE7 to gather the data... It's
also not necessarily out of the realm of reality that their indexing
algorithm is trying to find single keyword results. Maybe they perform
the union/intersection of multiple search terms on their end.

--
Coleman Kane




Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Cole Tuininga
On Thu, 2008-02-21 at 08:56 -0500, Kent Johnson wrote:
 Except:
 - The client acts like a browser, in that it fetches CSS and 
 JavaScript 
 files as well as the primary page, and the User-Agent seems to be MSIE 7:
 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

This *could* be explained by wanting to be able to display a thumbnail 
version of the website.  Just a thought.

-- 
Cole Tuininga [EMAIL PROTECTED]
Code Energy (http://www.code-energy.com)



Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Coleman Kane
Cole Tuininga wrote:
 On Thu, 2008-02-21 at 08:56 -0500, Kent Johnson wrote:
   
 Except:
 - The client acts like a browser, in that it fetches CSS and 
 JavaScript 
 files as well as the primary page, and the User-Agent seems to be MSIE 7:
 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
 

 This *could* be explained by wanting to be able to display a thumbnail 
 version of the website.  Just a thought.
   
Just an aside:

It is extremely amusing to me that MSIE still identifies as Mozilla/4.0...

--
Coleman Kane



Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Arc Riley
   What's your robots.txt look like?  Does it forbid this kind of behavior?


The IPs in question never accessed robots.txt.  Only MSNBot and family, on
different IP blocks, did.

You will see the bad behavior if you log referers.  The bots in question
claim to be arriving on your site by various search terms via
search.live.com and claim to be a normal web browser (MSIE 6 or 7).
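
A rough way to pull such hits out of a log, assuming Apache's combined log format (quoted request, referrer, and user agent, in that order).  The function name suspect_hits and the sample lines are illustrative only; the address prefixes are the ones named in this thread.

```shell
#!/bin/sh
# Print combined-format log lines (read from stdin) that come from the
# 131.107.* or 65.52-55.* ranges AND claim a search.live.com referrer.
# Splitting on '"' makes field 1 start with the client IP and field 4
# the referrer.
suspect_hits() {
    awk -F'"' '
        $1 ~ /^(131\.107|65\.5[2-5])\./ && $4 ~ /^http:\/\/search\.live\.com\// { print }
    '
}

# Made-up sample lines; in practice something like: suspect_hits < access_log
# Only the first line matches both conditions.
suspect_hits <<'EOF'
65.55.165.51 - - [20/Feb/2008:02:22:16 -0500] "GET /a/ HTTP/1.1" 200 100 "http://search.live.com/results.aspx?q=marketing" "Mozilla/4.0 (compatible; MSIE 7.0)"
192.0.2.10 - - [20/Feb/2008:02:23:00 -0500] "GET /b/ HTTP/1.1" 200 100 "http://search.live.com/results.aspx?q=linux" "Mozilla/5.0"
65.55.165.51 - - [20/Feb/2008:02:24:00 -0500] "GET /robots.txt HTTP/1.1" 200 64 "-" "msnbot/1.0"
EOF
```

This only works if the server actually logs referrers; with the plain common log format there is no fourth quoted field to match.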

If this were an honesty check, which Google does, verifying that the pages
being sent to the crawlers are the same as those sent to normal web browsers,
they wouldn't claim to be arriving via search.live.com.

Many of the hits were also to pages specifically forbidden to User-agent: *
by rules such as Disallow: /*?

My logs show 95.7% of our traffic via search engines is from Google.  MSN,
once you take all the hits from Microsoft's networks out of the equation,
accounts for only 0.3%.


 What's the rate like, in requests/time and bytes/time?  Are they
 flooding your site, or slowly crawling it over time?


It changes per day, but the day before yesterday 28.3%  of the pagehits were
from those subnets.  The pages they seemed to be targeting were some of the
highest CPU load.  Apache was using roughly 60% of the CPU which dropped to
2% when Microsoft was firewalled.  The server in question is an Athlon XP
2200+ with a gig of ram.


 ... please join me in blocking them ...

  You block Google from indexing your site, too, then, right?


This is not a cry against crawlers.  This is a cry against deception and
bots behaving badly.



On Thu, Feb 21, 2008 at 9:02 AM, Coleman Kane [EMAIL PROTECTED] wrote:


 It's also not necessarily out of the realm of reality that their indexing
 algorithm is trying to find single keyword results. Maybe they perform
 the union/intersection of multiple search terms on their end.


That could be true, if the reported search terms had anything to do with the
content on the sites.  I could not find a single instance of any of the
search terms on our site in the earlier searches, many of which were
pornographic or sexual in nature.

The bot that generates thumbnails of the sites and grabs images is
msnbot-media


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Ben Scott
On Thu, Feb 21, 2008 at 9:15 AM, Coleman Kane [EMAIL PROTECTED] wrote:
  It is extremely amusing to me that MSIE still identifies as Mozilla/4.0...

  All user agents are Mozilla, thanks to brain-damage perpetrated
back in the mid-1990's by dumb web developers who assumed that
Netscape was the web and used User-Agent to block all other
browsers, even compatible ones.  Now, of course, we have the same
problem in the other direction; one sometimes has to tell
Firefox/Opera/etc. to claim to be MSIE.  I hate dumb web developers.

-- Ben


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Mark Komarinski
On 02/21/2008 09:30 AM, Ben Scott wrote:
   All user agents are Mozilla, thanks to brain-damage perpetrated
 back in the mid-1990's by dumb web developers who assumed that
 Netscape was the web and used User-Agent to block all other
 browsers, even compatible ones.  Now, of course, we have the same
 problem in the other direction; one sometimes has to tell
 Firefox/Opera/etc. to claim to be MSIE.  I hate dumb web developers.
   
This e-mail best viewed at 800x600.

-Mark


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Ben Scott
On Thu, Feb 21, 2008 at 9:18 AM, Arc Riley [EMAIL PROTECTED] wrote:
   What's your robots.txt look like?  Does it forbid this kind of behavior?

  The IPs in question never accessed robots.txt.
[...]
  Many of the hits were also to pages specifically forbidden to * User-agent,
 such as Disallow: /*?

  Interesting.

liberty$ find -name access_log\* | xargs egrep -h \
    '^(131\.107|65.5[2-5])' | fgrep robots.txt | wc -l
1453

  It appears your server is seeing different behavior than GNHLUG's
server.  I suppose that could be a malfunction on Microsoft's end, but
I can't think of why a cluster of crawlers would malfunction for just
some sites.  I suppose it could be intentional differentiation, but
what would be the point of that?

  Have you tried contacting the help desk for Microsoft's crawler?

 It changes per day, but the day before yesterday 28.3%  of the pagehits
 were from those subnets.

  Yikes!

  If this was an honesty checker, which Google does, verifying that the pages
 being sent to the crawlers is the same as being sent to normal web browsers,
 they wouldn't claim to be arriving via search.live.com.

  Why not?  I can think of a few scenarios where that might be legit.
Following links of saved searches, or some kind of follow-on driven by
search results of users.  The high request rate, and the apparent
ignorance of robots.txt, on the other hand...

 That could be true, if the reported search terms had anything to do with the
 content on the sites.  I could not find a single instance of any of the
 search terms on our site in the earlier searches, much of which were
 pornographic or sexual in nature.

  Are you sure you don't have a wiki or tag cloud or comment board or
file share or similar application that's been hijacked?  Scam artists
like to use such to host their content, or crank up their page rank,
or spam others.  I know they keep trying to hit GNHLUG (we watch it
fairly closely and remove any such attempts).  I know PySIG had to
shut down their wiki because it got nailed so often.

-- Ben


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Arc Riley
 liberty$ find -name access_log\* | xargs egrep -h
 '^(131\.107|65.5[2-5])' | fgrep robots.txt  | wc -l
 1453


Look specifically at the IPs the faked live.com search hits come from, versus
those running msnbot.  msnbot accesses robots.txt more than any other
search engine (seconded by Yahoo! Slurp).


 Are you sure you don't have a wiki or tag cloud or comment board or
 file share or similar application that's been hijacked?  Scam artists
 like to use such to host their content, or crank up their page rank,
 or spam others.  I know they keep trying to hit GNHLUG (we watch it
 fairly closely and remove any such attempts).  I know PySIG had to
 shut down their wiki it got nailed so often.


We have a fairly aggressive anti-spam system set up; no such spam appears on
our site, certainly not on the landing pages of these faked live.com search
hits, which contain svn changelog diffs.

The only place we've had spam is the ticket system, and none of the hits in
question are for /ticket/*.  We require registration to file a ticket and
monitor registrations to isolate spammers before they can hit the site; so
far it works fairly well.


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread VirginSnow
 Date: Wed, 20 Feb 2008 16:46:37 -0500
 From: Arc Riley [EMAIL PROTECTED]

 Do yourselves a favor and search your logs for connections from 131.107.*
 65.52.* 65.53.* 65.54.* and 65.55.*

 All of it, well 97.2%, from the above two subnets, belonging to Microsoft.

I don't seem to have this problem.  Cf. my robots.txt file:

  http://peapod.podzone.net:1234/robots.txt

which might explain why. :)

Interestingly, for me, hits from '^(131\.107|65.5[2-5])' seem to fall
into two categories:

 (1) requests for robots.txt (makes sense)
 (2) requests for content pages with *google* as referrer (hm...)

#2 suggests that msnbot is actually crawling *google* in order to
populate its *own* search database.  Maybe, knowing this, google could
give a good schticking to microsoft... by, say, returning
pseudo-random results to searches from '^(131\.107|65.5[2-5])' :)


Re: Microsoft flooding sites with fake traffic

2008-02-21 Thread Paul Lussier
Ben Scott [EMAIL PROTECTED] writes:

 In particular, remember Hanlon's razor.

Does that have more or less blades than the competing Occam's Razor?
And how do either of those compare with Vipul's Razor?

And are *any* of those better than the Motorola Razr?
-- 
Seeya,
Paul


Microsoft flooding sites with fake traffic

2008-02-20 Thread Arc Riley
Hey guys

Do yourselves a favor and search your logs for connections from 131.107.*
65.52.* 65.53.* 65.54.* and 65.55.*

I found that a good percentage of the traffic we got (not reported by Google
Analytics, so I didn't see it sooner) was referred from
http://search.live.com/ for search queries involving pornography, cars,
drugs, and random gibberish.  The landing pages from these searches were
Subversion changesets, source code in the Trac browser, and other places
where those search terms certainly don't appear.

All of it (well, 97.2%) came from the above two subnets, which belong to
Microsoft.  It'd be humorous if I hadn't just purchased a new colo server to
handle the large volume of traffic pysoy.org gets.  I can't tell if MS is
trying to skew the statistics in favor of MSIE/Live/etc or if it's conducting
a denial of service attack against free software project sites, perhaps both
(two birds with one stone?).

If you see similar childish behavior in your logs, please join me in
blocking them and being very vocal as to why.


Re: Microsoft flooding sites with fake traffic

2008-02-20 Thread Coleman Kane
Arc Riley wrote:
 Hey guys

 Do yourselves a favor and search your logs for connections from
 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*

 I found a good % of traffic we got, not reported to Google Analytics
 so I didn't see it sooner, was referred from http://search.live.com/
 for search queries involving pornography, cars, drugs, and random
 gibberish.  The landing pages from these searches were subversion
 changesets, source code in the Trac browser, and other places those
 search queries certainly don't exist in.

 All of it, well 97.2%, from the above two subnets, belonging to
 Microsoft.  It'd be humorous if I didn't just purchase a new colo
 server to handle the large volume of traffic pysoy.org
 http://pysoy.org gets.  I can't tell if MS is trying to skew the
 statistics in favor of MSIE/Live/etc or if it's conducting a denial of
 service attack against free software project sites, perhaps both (two
 birds with one stone?).

 If you see the similar childish behavior in your logs, please join me
 in blocking them and being very vocal as to why.

An interesting find. I just checked my sites and I see the same thing;
however, most of the search queries seem to be pretty pertinent to the
content of the pages that they reference. It is almost like there's some
script running on a farm of Windows computers that just performs
single-word searches on their Windows LiveSearch database, and visits
the results (posting, of course, the LiveSearch referral in the request).

Here's my distribution:

cat apachelogs/* | grep live.com | cut -d' ' -f1 | cut -d. -f1,2 \
    | sort | uniq -c | sort -rn

308 65.55
 10 131.107
  4 85.159
  3 142.161
  2 71.164
  2 68.95
  2 4.246
  2 207.224
  1 86.144
  1 84.202

There are many, many more with single visits, but I left them off the
list because they probably represent normal livesearch users.

--
Coleman Kane



Re: Microsoft flooding sites with fake traffic

2008-02-20 Thread Coleman Kane
Coleman Kane wrote:
 Arc Riley wrote:
   
 Hey guys

 Do yourselves a favor and search your logs for connections from
 131.107.* 65.52.* 65.53.* 65.54.* and 65.55.*

 I found a good % of traffic we got, not reported to Google Analytics
 so I didn't see it sooner, was referred from http://search.live.com/
 for search queries involving pornography, cars, drugs, and random
 gibberish.  The landing pages from these searches were subversion
 changesets, source code in the Trac browser, and other places those
 search queries certainly don't exist in.

 All of it, well 97.2%, from the above two subnets, belonging to
 Microsoft.  It'd be humorous if I didn't just purchase a new colo
 server to handle the large volume of traffic pysoy.org
 http://pysoy.org gets.  I can't tell if MS is trying to skew the
 statistics in favor of MSIE/Live/etc or if it's conducting a denial of
 service attack against free software project sites, perhaps both (two
 birds with one stone?).

 If you see the similar childish behavior in your logs, please join me
 in blocking them and being very vocal as to why.

 
 An interesting find. I just checked my sites and I see the same thing,
 however most of the search queries seem to be pretty pertinent to the
 content of the pages that they reference. It is almost like theres some
 script running on a farm of windows computers that just performs
 single-word searches on their Windows LiveSearch database, and visits
 the results (posting, of course, the LiveSearch referral in the request).

 Here's my distribution:

 cat apachelogs/*  | grep live.com  | cut -d\  -f1 | cut -d. -f1,2 | sort
 | uniq -c | sort -rn

 308 65.55
  10 131.107
   4 85.159
   3 142.161
   2 71.164
   2 68.95
   2 4.246
   2 207.224
   1 86.144
   1 84.202

 There are many, many more with single visits, but I left them off the
 list because they probably represent normal livesearch users.

 --
 Coleman Kane
   
Went a little further and found that all my 65.55 traffic comes from the
65.55.165 class C. I decided to pass all the visitors to the host
program and found that all of the visitors have PTR records like this:
livebot-65-55-165-87.search.live.com. The 131.107 traffic was all from
two machines: tide525.microsoft.com and tide526.microsoft.com

Maybe some others could look at their logs and pull information on the
other subnets?

--
Coleman Kane



Re: Microsoft flooding sites with fake traffic

2008-02-20 Thread Arc Riley
Do you happen to be running google analytics on your site?





Re: Microsoft flooding sites with fake traffic

2008-02-20 Thread Ben Scott
On Wed, Feb 20, 2008 at 4:46 PM, Arc Riley [EMAIL PROTECTED] wrote:
 Do yourselves a favor and search your logs for connections from 131.107.*
 65.52.* 65.53.* 65.54.* and 65.55.*

  On the GNHLUG web server in /var/log/httpd/ ...

liberty$ find -name access_log\* | xargs egrep '^(131\.107|65.5[2-5])' | wc -l
14293
liberty$ find -name access_log\* | xargs cat | wc -l
185492

  We keep logs going back a month, rotated weekly.

  We don't log referrals or user agents on the GNHLUG server.  Maybe we should.

 All of it, well 97.2%, from the above two subnets, belonging to Microsoft.

  Interesting.  And a relatively small number of unique hosts (152),
given that there are five /16's in question (327680, give or take).

liberty$ find -name access_log\* | xargs egrep -h \
    '^(131\.107|65.5[2-5])' | awk '{ print $1 }' | sort -u > /tmp/hostlist
liberty$ wc -l /tmp/hostlist
152 /tmp/hostlist

  The IP addresses reverse-resolve in DNS to a few different things.
Matching regexp patterns would be:

NXDOMAIN
tide[0-9]+.microsoft.com
b[ly]1sch[0-9]+.phx.gbl
livebot-65-55-[0-9]+-[0-9]+.search.live.com
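
  To sort a list of reverse lookups into those buckets, something like the following sketch would do.  The helper names matches and classify_ptr and the bucket labels are invented for illustration; the patterns are the four above, with the dots escaped and both ends anchored.  It works on names you have already looked up, so it needs no network access itself.

```shell
#!/bin/sh
# Classify a reverse-DNS name using the four patterns listed above.
# Pass the literal string NXDOMAIN when the lookup returned nothing.
# The labels printed here are arbitrary, chosen for illustration.
matches() { printf '%s\n' "$1" | grep -Eq "$2"; }

classify_ptr() {
    name=${1%.}    # drop the trailing dot that tools like host(1) print
    if [ "$name" = "NXDOMAIN" ]; then echo "none"
    elif matches "$name" '^tide[0-9]+\.microsoft\.com$'; then echo "tide"
    elif matches "$name" '^b[ly]1sch[0-9]+\.phx\.gbl$'; then echo "phx"
    elif matches "$name" '^livebot-65-55-[0-9]+-[0-9]+\.search\.live\.com$'; then echo "livebot"
    else echo "other"
    fi
}

classify_ptr livebot-65-55-165-87.search.live.com   # prints: livebot
classify_ptr tide525.microsoft.com.                 # prints: tide
```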

  I also took a look at unique URLs:

liberty$ find -name access_log\* | xargs egrep -h \
    '^(131\.107|65.5[2-5])' | awk -F\" '{ print $2 }' | awk '{ print $2 }' \
    | sort -u > /tmp/urls
liberty$ wc -l /tmp/urls
7237 /tmp/urls

  The URLs themselves... hmmm, hard to know for sure with our site,
but it looks to me like something is walking the entire site,
following every link, including TWiki's search, history, and edit
links.  Think wget -r.

 I can't tell if MS is trying to skew the statistics in favor of MSIE/Live/etc
 or if it's conducting a denial of service attack against free software
 project sites ...

  Those two both seem rather unlikely.  In particular, remember
Hanlon's razor.   My guess is some kind of crawler robot.  Possibly a
malfunctioning and/or poorly-designed one.  (That was my guess before
I started digging into logs, by the way.  I also guessed maybe a
botnet, but the small number of requesting hosts make that less
likely.)

  Our numbers show that as roughly 8% of our traffic, by object
request count.  In the distant past, when we did some traffic
analysis, the bulk of the traffic hitting the GNHLUG site was crawler
robots.  So that doesn't seem out of line.
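
  The 8% figure follows from the two wc -l counts earlier in this message.  A quick sketch of the division; share_percent is a hypothetical helper name, and awk is used only because plain shell arithmetic is integer-only.

```shell
#!/bin/sh
# Hypothetical helper: what percentage of `total` is `part`?
# Prints the result with one decimal place.
share_percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Matching lines vs. all lines, from the counts above.
share_percent 14293 185492    # prints: 7.7
```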

  What's your robots.txt look like?  Does it forbid this kind of behavior?

  What's the rate like, in requests/time and bytes/time?  Are they
flooding your site, or slowly crawling it over time?

 ... please join me in blocking them ...

  You block Google from indexing your site, too, then, right?

-- Ben


Re: Microsoft flooding sites with fake traffic

2008-02-20 Thread Coleman Kane
Arc Riley wrote:
 Do you happen to be running google analytics on your site?
No, I'm just parsing the logs. I use awstats
(http://awstats.sourceforge.net) for collecting stats from my logs. I'm
not really familiar with many of google.com's services.

--
Coleman Kane



