[Robots] Re: Looksmart's robots.txt file

2002-05-30 Thread Rasmus Mohr


Richard,

You're absolutely right! But why does the Acme.Spider use that user-agent
ID? I can't find any reference to it in the source, I wonder...


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of [EMAIL PROTECTED]
> Sent: Thursday, May 30, 2002 3:58 PM
> To: [EMAIL PROTECTED]
> Subject: [Robots] Re: Looksmart's robots.txt file
> 
> 
> 
> Rasmus Mohr writes:
>  > Yes, that would be the case. For some unknown reason 
> Looksmart allows
>  > recognized robots/crawlers/spider and other non-standard 
> user-agents
>  > unlimited access according to the the robots.txt - all 
> others are excluded.
>  > I'd guess the weird looking "java" user-agent originates 
> from an Java
>  > application running on a platform/JVM unable to set the 
> user-agent property.
>  > The guys at Looksmart probably detected it in their logfiles...
> 
> I don't think so. I think they just processed the web robots list
> automatically. In fact, that's what it says at the top of the
> robots.txt file. If you look at
>
> http://www.robotstxt.org/wc/active/html/contact.html
> you'll see where it comes from.

>  > eh...beef?
>
> Gripes, wrath, criticism, complaints, etc. A general feeling of displeasure
> directed at some person or thing.
>
> Richard

--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--




[Robots] Re: Looksmart's robots.txt file

2002-05-30 Thread Rasmus Mohr


> Are you suggesting a robot is checking that string against 
> its UA???  I find
> that hard to believe, but assuming that is the case such a 
> robot would be
> allowed unrestricted access to looksmart.com, including all 
> their Pay Per
> Click (PPC) URLs.  I'm thinking that many robots are reading 
> looksmart.com,
> some with permission from robots.txt, and some without.

Yes, that would be the case. For some unknown reason Looksmart allows
recognized robots/crawlers/spiders and other non-standard user-agents
unlimited access according to the robots.txt - all others are excluded.
I'd guess the weird-looking "java" user-agent originates from a Java
application running on a platform/JVM unable to set the user-agent property.
The guys at Looksmart probably detected it in their logfiles...
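
Schematically (not Looksmart's actual file, just the pattern described
above, with a placeholder robot name), a robots.txt that behaves like that
looks as follows:

# named, recognized robots: an empty Disallow means unrestricted access
User-agent: SomeKnownRobot
Disallow:

# everybody else: excluded from the whole site
User-agent: *
Disallow: /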


> I'm looking for reasons why advertisers can't reconcile 
> clickthroughs with
> the figures provided by Looksmart.
> 
> One suggestion is that if Looksmart aren't checking the User Agents of
> clients accessing the PPC URLs, that would be a reason why 
> many more clicks
> were being seen by Looksmart than by advertisers.  The robot 
> either may not
> follow the redirect, or may follow without providing a 
> referrer, or may be
> silently filtered by an advertiser's stats package because it 
> is a robot not
> a human visitor.
> 
> Another suggestion is that some robots will be masquerading 
> as browsers, but
> still may not follow redirects or send a referrer allowing 
> the clicks to be
> reconciled.

That would be true, unless some sort of server-side mechanism ensures that
these well-known (probably non-human) users are served different content
than "normal" users, i.e. pages without PPC URLs. You could check this
theory by creating a simple "robot" that uses one of the user-agents in the
robots.txt file and comparing the server output with the output given to a
normal user (IE, Mozilla, Opera...).
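
A minimal sketch of such a check in Perl/LWP (the user-agent strings and the
URL below are placeholders - substitute an agent name actually listed in
www.looksmart.com/robots.txt and a page that carries PPC URLs):

#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTTP::Request;

# placeholder URL: any looksmart.com page that carries PPC URLs
my $url = 'http://www.looksmart.com/';

# first agent is a placeholder for a name listed in robots.txt,
# the second one looks like an ordinary browser
for my $agent ('SomeKnownRobot/1.0',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)') {
    my $ua = LWP::UserAgent->new;
    $ua->agent($agent);
    my $res = $ua->request(HTTP::Request->new(GET => $url));
    printf "%-55s %s, %d bytes\n",
           $agent, $res->status_line, length($res->content);
}

If the two responses differ in size or content, the server is treating
recognized robots differently; if they are identical, robots are being
handed the same PPC URLs as everyone else.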
 
> I'm trying to gather likelihood on the possibility of each 
> scenario, so I'm
> looking for
> 
> a) how many robots, given www.looksmart.com/robots.txt, would 
> read those
> looksmart.com PPC URLs?
> b) how many of those robots would be recognisable as robots, 
> i.e. use a
> unique User Agent?
> 
> Do we all agree that if a robot masquerades as a browser, 
> ignores robots.txt
> and incurs clickthrough fees for advertisers, then the 
> advertisers' beef (if
> any) should be with the robot owner rather than the PPC 
> provider?  But if a
> robot sends a recognisable UA, complies with robots.txt and 
> advertisers
> still incur clickthrough fees, the advertisers' beef (if any) 
> should be with
> the PPC provider?

eh...beef?
 
> Alan Perkins
> CTO, e-Brand Management Limited
> http://www.ebrandmanagement.com/
> 
> 
> 
--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--




[Robots] Re: Looksmart's robots.txt file

2002-05-29 Thread Rasmus Mohr


It seems to me that Looksmart is doing the right thing. Excluding
user-agents named "Due to a deficiency in Java it's not currently possible
to set the User-Agent." will exclude all Java-based "browsers" unable to set
the user-agent property using the java.net.URLConnection.setRequestProperty
method.

--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Alan Perkins
> Sent: Tuesday, May 28, 2002 12:41 PM
> To: [EMAIL PROTECTED]
> Subject: [Robots] Looksmart's robots.txt file
> 
> 
> 
> Hi there
> 
> I'm sure most of you are aware of the furore following looksmart.com's
> recent shift to pay-per-click (PPC).  One of the issues 
> reported by many
> people is the number of "false clicks" reported by Looksmart, i.e.
> advertisers just cannot reconcile Looksmart's reported 
> clickthroughs with
> clickthroughs derived from their Web logs.  These same advertisers can
> reconcile clickthroughs from other PPC providers such as 
> Overture or Google
> so the problem doesn't appear to lie with the advertiser.
> 
> I've been looking at Looksmart's robots.txt file and it is - 
> well, shall we
> say unusual?
> 
> www.looksmart.com/robots.txt
> 
> In my opinion this file demonstrates a lack of understanding 
> of robots in
> several different respects, e.g. lines like:
> 
> 
> User-agent: Due to a deficiency in Java it's not currently 
> possible to set
> the User-Agent.
> Disallow:
> 
> 
> I'm wondering if this lack of understanding permeates through 
> to Looksmart's
> PPC-accounting department.
> 
> In other words, I'm wondering how many of the false clicks seen by
> advertisers are from robots (particularly robots masquerading 
> as a Mozilla
> browser).  Looksmart's robots.txt does not prevent robots 
> from reading the
> URLs that cause advertisers to incur a fee.  So if Looksmart cannot
> recognise the robot as a robot (and especially if they aren't 
> even checking
> for robots) advertisers could be incurring fees from 
> robot-clickthroughs.
> Most robots do not send a referrer in their HTTP request so this would
> explain why advertisers could not reconcile clickthroughs.
> 
> Looksmart's URLs are featured in the SERPs (search engine 
> results pages) of
> its search engine partners, as well as throughout 
> looksmart.com itself.  So
> any robot that crawls SERPs and/or the web could cause these false
> clickthroughs.  I know of at least two robots that crawl out 
> from SERPs
> masquerading as browsers to analyse why pages rank well.
> 
> So your thoughts please on
> 
> a) how many robots, given www.looksmart.com/robots.txt, would 
> read those
> looksmart.com PPC URLs?
> b) how many of those robots would be recognisable as robots, 
> i.e. use a
> unique User Agent?
> 
> Alan Perkins
> CTO, e-Brand Management Limited
> http://www.ebrandmanagement.com/
> 
> 
> 
> 
> 
> 




[Robots] SV: Re: SV: matching and "User-Agent:" in robots.txt

2002-03-18 Thread Rasmus Mohr


import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("User-Agent", "user agent description");

Where url is an instance of java.net.URL.

--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Thomas Huber
Sent: March 15, 2002 00:00
To: [EMAIL PROTECTED]
Subject: [Robots] Re: SV: matching and "User-Agent:" in robots.txt




How can the UA be set in Java?

> Create a user-agent object thus:
> 
> "$ua = LWP::RobotUA->new('Banjo/1.1','http://nowhere.int/banjo.html
> [EMAIL PROTECTED]')





--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of
a message to "[EMAIL PROTECTED]".




[Robots] SV: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Rasmus Mohr


Any idea how widespread the use of this library is? We've observed some
weird behaviors from some of the major search engines' spiders (basically
ignoring robots.txt sections) - maybe this is the explanation?

--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Sean M. Burke
Sent: March 14, 2002 11:08
To: [EMAIL PROTECTED]
Subject: [Robots] matching and "UserAgent:" in robots.txt



I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
implements a certain part of the Robots Exclusion Standard correctly.  So 
forgive me if this seems a simple question, but my reading of the Robots 
Exclusion Standard hasn't really cleared it up in my mind yet.


Basically the current WWW::RobotRules logic is this:
As a WWW::RobotRules object is parsing the lines in the robots.txt file, 
if it sees a line that says "User-Agent: ...foo...", it extracts the foo, 
and if the name of the current user-agent is a substring of "...foo...", 
then it considers this line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt line 
being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the 
library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', 
so this rule is talking to me!"

However, the substring matching currently goes only one way.  So if the 
user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html 
[EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: 
Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 
[http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 
'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"

I have the feeling that that's not right -- notably because that means that 
every robot ID string has to appear in toto on the "User-Agent" robots.txt 
line, which is clearly a bad thing.
But before I submit a patch, I'm tempted to ask... what /is/ the proper 
behavior?

Maybe shave the current user-agent's name at the first slash or space 
(getting just "Banjo"), and then see if /that/ is a substring of a given 
robots.txt "User-Agent:" line?
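
For concreteness, here is a tiny sketch of the two behaviours discussed
above (hypothetical strings, not the actual WWW::RobotRules code):

use strict;

my $full_id = 'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]';
my $ua_line = 'Thing, Woozle, Banjo, Stuff';  # from a robots.txt "User-Agent:" line

# current behaviour as described: is the full agent ID a substring of the line?
print "current:  ", (index($ua_line, $full_id) >= 0 ? "match" : "no match"), "\n";

# proposed behaviour: shave the ID at the first slash or space, then test
(my $short_name = $full_id) =~ s{[/ ].*$}{};  # leaves just "Banjo"
print "proposed: ", (index($ua_line, $short_name) >= 0 ? "match" : "no match"), "\n";

The first test never matches (the full ID never appears verbatim in the
robots.txt line), while the shaved name does.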

--
Sean M. Burke[EMAIL PROTECTED]http://www.spinn.net/~sburke/


--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of
a message to "[EMAIL PROTECTED]".
