Re: www.gigablast.com

2006-07-17 Thread Stephane Bortzmeyer

On Wed, Jul 12, 2006 at 06:24:08PM -0400,
 Jim Popovitch [EMAIL PROTECTED] wrote 
 a message of 32 lines which said:

 The strangeness is that some of their crawling is looking for URLs
 with multiple exclamation points, those URLs never existed. This may
 be indicative of a character translation on my system or theirs.

From my experience (and I talked with people - or at least intelligent
bots - at Gigablast), their HTML parser is seriously broken and it
generates nonexistent URLs quite often. For instance,
<a href="http://www.example.fr/Cafe%20au%20lait"> will make their crawler
ask for /Cafe.
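
One way a parser could end up asking for /Cafe is by decoding the percent-escapes first and only then cutting the URL at the first whitespace. That is a guess at the failure mode, not something Gigablast confirmed; the function names below are made up for illustration, and the sketch only shows how the truncation could arise and how leaving the escapes alone avoids it.

```python
from urllib.parse import unquote, urlsplit


def naive_extract_path(href: str) -> str:
    """Hypothetical buggy extraction: decode %XX escapes, then cut at whitespace."""
    decoded = unquote(href)           # "%20" becomes a literal space
    truncated = decoded.split()[0]    # everything after the first space is lost
    return urlsplit(truncated).path


def careful_extract_path(href: str) -> str:
    """Keep the percent-escapes intact so the path survives unchanged."""
    return urlsplit(href.strip()).path


if __name__ == "__main__":
    href = "http://www.example.fr/Cafe%20au%20lait"
    print(naive_extract_path(href))    # truncated path
    print(careful_extract_path(href))  # full, still-escaped path
```

If this is roughly what happens, every link containing an escaped space would turn into a request for a shorter path that never existed on the server.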

I reported the problem months ago but I got nothing except a standard
"Thanks for telling us."



RE: www.gigablast.com

2006-07-13 Thread Bill Woodcock

   What gigablast seems to be doing, on the other hand, is trying to open
 every window in a house in the hopes that it will find one that's open.

Just looking at the text strings in the URLs, my off-the-top-of-my-head 
guess was that those were URLs it saw in email spam.  They looked very 
similar to a lot of the ASCII garbage that gets generated by spammers 
trying to get through Bayesian filters.  It seemed plausible to me (not a 
good idea, of course, but the sort of thing that happens) that they might 
have been grepping web pages for URLs, and run across an archive of spam.

-Bill



www.gigablast.com

2006-07-12 Thread Jim Popovitch


Feel free to clue me in on this please... ;-)

What is www.gigablast.com?  And why is it constantly performing 
questionable queries (mostly HTTP) across every IP that I have access 
to check?


I get a couple of thousand hits (mostly requests for nonexistent URLs) 
from that IP (66.154.103.75).  Anyone else seeing/questioning this?
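
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that tallies the 404 responses served to one client address. It assumes an Apache-style Common Log Format; the sample lines (timestamps, sizes, the 192.0.2.10 client) are made up for illustration, and only the IP 66.154.103.75 and the paths come from this thread.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format, e.g.
# 66.154.103.75 - - [date] "GET /path HTTP/1.0" 404 208
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})\b'
)

# Made-up sample data, not taken from a real log.
SAMPLE_LOG = [
    '66.154.103.75 - - [12/Jul/2006:18:00:01 -0400] "GET /WikiHomePage HTTP/1.0" 404 208',
    '66.154.103.75 - - [12/Jul/2006:18:00:02 -0400] "GET /Xslt HTTP/1.0" 404 208',
    '66.154.103.75 - - [12/Jul/2006:18:00:03 -0400] "GET / HTTP/1.0" 200 1024',
    '192.0.2.10 - - [12/Jul/2006:18:00:04 -0400] "GET /WikiHomePage HTTP/1.0" 404 208',
    '66.154.103.75 - - [12/Jul/2006:18:00:05 -0400] "GET /WikiHomePage HTTP/1.0" 404 208',
]


def missing_paths_for(lines, client_ip):
    """Count, per requested path, the 404 responses sent to client_ip."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like Common Log Format
        if match.group("ip") == client_ip and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts


if __name__ == "__main__":
    for path, hits in missing_paths_for(SAMPLE_LOG, "66.154.103.75").most_common():
        print(hits, path)
```

To run it against a real file, pass an open file object instead of SAMPLE_LOG; the function only needs an iterable of lines.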


Completewhois shows some listings in some RBLs, but not the more popular 
ones.


-Jim P.


Re: www.gigablast.com

2006-07-12 Thread Jim Popovitch


:-) Let me add something before everyone on NANOG reminds me that 
gigablast is a search engine. I know what they do, but what I don't 
understand is why they are searching my systems for URLs that have 
never existed there.  It's as though they are doing random word 
searches in hopes of getting lucky.  They are crawling for URLs like 
this:  (unfortunately most people won't see these because their spam 
blockers will block all the exclamation points)


/Hj!!lpMall
/BuscaP!!gina
/!!-!!
/P!!ginasAbandonadas
/HilfeIndex
/CategoryCategory
/Aktuelle!!nderungen
/EfterladteSider
/SystemPagesInDanishGroup
/!!rvaLapok
/ForSide
/
/!!-!!!
/StartSeite
/!!
/Hj!!lpTilHenvisninger
/!!-
/ExplorerCeWiki
/Xslt
/P!!ginaInicial
/SenesteRettelser
/!!
/Pr!!f!!rencesUtilisateur
/WikiHomePage
/HilfeZuParsern
/AiutoModello
/GewenstePaginas
/HilfeZu!!berschriften

-Jim P.




Re: www.gigablast.com

2006-07-12 Thread Malcolm Staudinger


Google is your friend?
They're a search engine. robots.txt and forget it.

Malcolm







Re: www.gigablast.com

2006-07-12 Thread Payam Tarverdyan Chychi

That’s exactly it... they are doing site indexing .. if you like google...
you'll need to like them! =P

I personally wouldn’t worry about anything in the logs unless you start
seeing attempts to search and exploit .cgi and executable files...

-Payam



 Google is your friend?
 They're a search engine. robots.txt and forget it.

 Malcolm







-- 
Payam Tarverdyan Chychi
Network Analyst




RE: www.gigablast.com

2006-07-12 Thread David Schwartz


 :-) Let me add something before everyone on NANOG reminds me that
 gigablast is a search engine. I know what they do, but what I don't
 understand is why are they searching my systems for URLs that haven't
 ever existed there before.  It's as though they are doing random word
 searches in hopes of striking lucky.  They are crawling for URLs like
 this:  (unfortunately most people won't see these because their spam
 blockers will block all the exclamation points)

[list of random path names snipped]

This seems to be a very wrong and bad thing to do. Google searches URLs
because a human gives it permission to do so, for example by linking to that
URL. (What purpose does a link have other than to be something to click on?)

What gigablast seems to be doing, on the other hand, is trying to open
every window in a house in the hopes that it will find one that's open. It
has no invitation or permission to do this, and I would consider such
behavior inappropriate.

You do not have the right to make requests of other people's computers
without their permission. You can certainly argue implied permission in many
cases -- for example, if Ford registers the domain ford.com, and assigns an
IP address to 'www.ford.com', you can certainly argue that they have invited
the public to access that URL because that's the normal reason people create
such things. However, you have no implied permission to try numerous
combinations of random paths on the end of that in the hopes that you'll
find something Ford did not invite you into.

DS




Re: www.gigablast.com

2006-07-12 Thread Jeremy Chadwick

On Wed, Jul 12, 2006 at 02:50:54PM -0700, Malcolm Staudinger wrote:
 Google is your friend?
 They're a search engine. robots.txt and forget it.
 
 Malcolm

That's assuming whoever designed their software actually adheres
to robots.txt.  RFCs recommend people adhere to it, but there are
some who don't; it's operationally optional.
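
To make the "robots.txt and forget it" suggestion concrete, here is a minimal sketch of what a well-behaved crawler would do with such a file, using Python's standard urllib.robotparser. The user-agent token "Gigabot" and the example.com URL are assumptions for illustration. The check runs entirely on the crawler's side, so the server has no way to enforce it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler, allow everyone else.
# "Gigabot" is assumed to be the user-agent token; check your own logs.
ROBOTS_TXT = """\
User-agent: Gigabot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/WikiHomePage"
    print(is_allowed("Gigabot", url))       # the disallowed crawler
    print(is_allowed("SomeOtherBot", url))  # any other crawler
```

A crawler that never runs this kind of check will keep sending requests regardless of what the file says.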

I can't find a single reference to what standards GigaBlast
adheres to, or any technical data about how their engine works.
The way their site is designed, it looks like a total fly-by-night
operation.

If GigaBlast is supposedly indexing his site, they have to be
basing their GET requests on something (the equivalent of a normal
browser's Referer header; but again, who knows if they pass that
along?).  The requests Jim is seeing appear to be garbage, similar
to spam composition, not based on actual references/indexes.  I
could be outright wrong here.

Additionally, how does this solve the issue of Jim's bandwidth,
CPU, memory, if not his time, being wasted for HTTP requests which
shouldn't necessarily even be arriving at his boxes (which is what
he's essentially complaining about)?  So filter upstream, or on
the machine itself.  Okay, that's a solution, but it doesn't address
incoming traffic (just responses).

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: www.gigablast.com

2006-07-12 Thread Jim Popovitch


It appears that some of the queries are valid for an older site that 
existed in the past. That site was a wiki and some of the Giga hits are 
for internationalized versions of the default help/support pages.  This 
is fine and acceptable behavior by them (IMHO).  The fact that they are 
querying something that no longer exists is something I can deal with. 
The strangeness is that some of their crawling is looking for URLs with 
multiple exclamation points; those URLs never existed. This may be 
indicative of a character translation on my system or theirs.  BUT, the 
net net is that I no longer feel a need to be concerned about them.
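
The character-translation theory is easy to test. A minimal sketch, assuming the old wiki's page names contained accented letters encoded as UTF-8 and that something in the logging or mail path replaced every non-ASCII byte with "!": a two-byte character would then show up as "!!". The original spellings used below (such as "HjälpMall") are guesses, not names confirmed from the old site.

```python
def mask_non_ascii_bytes(path: str, mask: str = "!") -> str:
    """Replace each non-ASCII byte of the UTF-8 encoding of path with mask."""
    out = []
    for byte in path.encode("utf-8"):
        out.append(chr(byte) if byte < 0x80 else mask)
    return "".join(out)


if __name__ == "__main__":
    # Guessed originals, written with escapes so the source stays ASCII.
    guesses = [
        "/Hj\u00e4lpMall",                   # a-umlaut
        "/BuscaP\u00e1gina",                 # a-acute
        "/Pr\u00e9f\u00e9rencesUtilisateur", # e-acute, twice
    ]
    for original in guesses:
        print(mask_non_ascii_bytes(original))
```

If the masked output matches the logged paths, the requests are probably for internationalized wiki pages that did exist, not random strings.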


Thanks all,

-Jim P.
