[Robots] Re: Looksmart's robots.txt file
Richard,

You're absolutely right! But why does the Acme.Spider use that
user-agent ID? I can't find any reference to it in the source, I
wonder...

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
> Sent: Thursday, May 30, 2002 3:58 PM
> To: [EMAIL PROTECTED]
> Subject: [Robots] Re: Looksmart's robots.txt file
>
> Rasmus Mohr writes:
> > Yes, that would be the case. For some unknown reason Looksmart
> > allows recognized robots/crawlers/spiders and other non-standard
> > user-agents unlimited access according to the robots.txt - all
> > others are excluded. I'd guess the weird-looking "java" user-agent
> > originates from a Java application running on a platform/JVM unable
> > to set the user-agent property. The guys at Looksmart probably
> > detected it in their logfiles...
>
> I don't think so. I think they just processed the web robots list
> automatically. In fact, that's what it says at the top of the
> robots.txt file. If you look at
> http://www.robotstxt.org/wc/active/html/contact.html you'll see where
> it comes from.
>
> > eh...beef?
>
> Gripes, wrath, criticism, complaints, etc. A general feeling of
> displeasure directed at some person or thing.
>
> Richard

--
Rasmus T. Mohr          Direct : +45 36 910 122
Application Developer   Mobile : +45 28 731 827
Netpointers Intl. ApS   Phone  : +45 70 117 117
Vestergade 18 B         Fax    : +45 70 115 115
1456 Copenhagen K       Email  : [EMAIL PROTECTED]
Denmark                 Website: http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
[Robots] Re: Looksmart's robots.txt file
> Are you suggesting a robot is checking that string against its UA???
> I find that hard to believe, but assuming that is the case such a
> robot would be allowed unrestricted access to looksmart.com, including
> all their Pay Per Click (PPC) URLs. I'm thinking that many robots are
> reading looksmart.com, some with permission from robots.txt, and some
> without.

Yes, that would be the case. For some unknown reason Looksmart allows
recognized robots/crawlers/spiders and other non-standard user-agents
unlimited access according to the robots.txt - all others are excluded.
I'd guess the weird-looking "java" user-agent originates from a Java
application running on a platform/JVM unable to set the user-agent
property. The guys at Looksmart probably detected it in their
logfiles...

> I'm looking for reasons why advertisers can't reconcile clickthroughs
> with the figures provided by Looksmart.
>
> One suggestion is that if Looksmart aren't checking the User Agents
> of clients accessing the PPC URLs, that would be a reason why many
> more clicks were being seen by Looksmart than by advertisers. The
> robot either may not follow the redirect, or may follow without
> providing a referrer, or may be silently filtered by an advertiser's
> stats package because it is a robot, not a human visitor.
>
> Another suggestion is that some robots will be masquerading as
> browsers, but still may not follow redirects or send a referrer
> allowing the clicks to be reconciled.

That would be true, unless some sort of server-side mechanism ensures
that these well-known (probably non-human) users are served different
content than "normal" users, i.e. pages without PPC URLs. You could
check this theory by creating a simple "robot", using one of the
user-agents in the robots.txt file, and comparing the server output
with the output given to a normal user (IE, Mozilla, Opera...).

> I'm trying to gather likelihood on the possibility of each scenario,
> so I'm looking for
>
> a) how many robots, given www.looksmart.com/robots.txt, would read
> those looksmart.com PPC URLs?
> b) how many of those robots would be recognisable as robots, i.e. use
> a unique User Agent?
>
> Do we all agree that if a robot masquerades as a browser, ignores
> robots.txt and incurs clickthrough fees for advertisers, then the
> advertisers' beef (if any) should be with the robot owner rather than
> the PPC provider? But if a robot sends a recognisable UA, complies
> with robots.txt and advertisers still incur clickthrough fees, the
> advertisers' beef (if any) should be with the PPC provider?

eh...beef?

> Alan Perkins
> CTO, e-Brand Management Limited
> http://www.ebrandmanagement.com/

--
Rasmus T. Mohr, Netpointers Intl. ApS

"Remember that there are no bugs, only undocumented features."
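That comparison is easy to script. A minimal sketch in Java, with
"Scooter/3.2" standing in for a whitelisted robot UA (both user-agent
strings and the page URL are placeholders, not taken from Looksmart's
actual robots.txt):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CloakCheck {

        // Fetch a page while presenting the given User-Agent header.
        static String fetch(String address, String userAgent) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        public static void main(String[] args) throws Exception {
            String page = "http://www.looksmart.com/"; // or any page carrying PPC URLs
            // Hypothetical whitelisted robot UA vs. a typical browser UA.
            String robotView = fetch(page, "Scooter/3.2");
            String browserView = fetch(page, "Mozilla/4.0 (compatible; MSIE 6.0)");
            System.out.println(robotView.equals(browserView)
                ? "Same content served to both user-agents."
                : "Different content served - possible UA-based cloaking.");
        }
    }

If the two responses differ systematically, that would support the
server-side theory; if they are identical, robots are being served the
same PPC URLs as everyone else.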
[Robots] Re: Looksmart's robots.txt file
It seems to me that Looksmart is doing the right thing. Excluding
user-agents named "Due to a deficiency in Java it's not currently
possible to set the User-Agent." will exclude all Java-based "browsers"
unable to set the user-agent property using the
java.net.URLConnection.setRequestProperty method.

--
Rasmus T. Mohr, Netpointers Intl. ApS

"Remember that there are no bugs, only undocumented features."

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alan Perkins
> Sent: Tuesday, May 28, 2002 12:41 PM
> To: [EMAIL PROTECTED]
> Subject: [Robots] Looksmart's robots.txt file
>
> Hi there
>
> I'm sure most of you are aware of the furore following looksmart.com's
> recent shift to pay-per-click (PPC). One of the issues reported by
> many people is the number of "false clicks" reported by Looksmart,
> i.e. advertisers just cannot reconcile Looksmart's reported
> clickthroughs with clickthroughs derived from their Web logs. These
> same advertisers can reconcile clickthroughs from other PPC providers
> such as Overture or Google, so the problem doesn't appear to lie with
> the advertiser.
>
> I've been looking at Looksmart's robots.txt file and it is - well,
> shall we say, unusual?
>
> www.looksmart.com/robots.txt
>
> In my opinion this file demonstrates a lack of understanding of robots
> in several different respects, e.g. lines like:
>
> User-agent: Due to a deficiency in Java it's not currently possible to set the User-Agent.
> Disallow:
>
> I'm wondering if this lack of understanding permeates through to
> Looksmart's PPC-accounting department.
>
> In other words, I'm wondering how many of the false clicks seen by
> advertisers are from robots (particularly robots masquerading as a
> Mozilla browser). Looksmart's robots.txt does not prevent robots from
> reading the URLs that cause advertisers to incur a fee. So if
> Looksmart cannot recognise the robot as a robot (and especially if
> they aren't even checking for robots) advertisers could be incurring
> fees from robot clickthroughs. Most robots do not send a referrer in
> their HTTP request, so this would explain why advertisers could not
> reconcile clickthroughs.
>
> Looksmart's URLs are featured in the SERPs (search engine results
> pages) of its search engine partners, as well as throughout
> looksmart.com itself. So any robot that crawls SERPs and/or the web
> could cause these false clickthroughs. I know of at least two robots
> that crawl out from SERPs masquerading as browsers to analyse why
> pages rank well.
>
> So your thoughts please on
>
> a) how many robots, given www.looksmart.com/robots.txt, would read
> those looksmart.com PPC URLs?
> b) how many of those robots would be recognisable as robots, i.e. use
> a unique User Agent?
>
> Alan Perkins
> CTO, e-Brand Management Limited
> http://www.ebrandmanagement.com/
[Robots] Re: matching and "User-Agent:" in robots.txt
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("User-Agent", "user agent description");

Where url is an instance of java.net.URL.

--
Rasmus T. Mohr, Netpointers Intl. ApS

"Remember that there are no bugs, only undocumented features."

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Thomas Huber
Sent: March 15, 2002 00:00
To: [EMAIL PROTECTED]
Subject: [Robots] Re: matching and "User-Agent:" in robots.txt

How can the UA be set in Java?

> Create a user-agent object thus:
>
> $ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html [EMAIL PROTECTED]')

--
This message was sent by the Internet robots and spiders discussion
list ([EMAIL PROTECTED]). For list server commands, send "help" in the
body of a message to "[EMAIL PROTECTED]".
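For completeness, a self-contained version of that snippet (the URL and
the UA string are placeholders; note that setRequestProperty must be
called before the request is actually sent):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SetUserAgent {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute any http resource.
            URL url = new URL("http://www.example.com/robots.txt");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            // Set the UA header before the connection is opened for reading.
            connection.setRequestProperty("User-Agent",
                    "Banjo/1.1 (http://nowhere.int/banjo.html)");
            System.out.println("HTTP " + connection.getResponseCode());
            try (InputStream in = connection.getInputStream()) {
                // Drain the body; contents ignored in this sketch.
                while (in.read() != -1) { /* discard */ }
            }
        }
    }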
[Robots] Re: matching and "User-Agent:" in robots.txt
Any idea how widespread the use of this library is? We've observed some
weird behavior from some of the major search engines' spiders
(basically ignoring robots.txt sections) - maybe this is the
explanation?

--
Rasmus T. Mohr, Netpointers Intl. ApS

"Remember that there are no bugs, only undocumented features."

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Sean M. Burke
Sent: March 14, 2002 11:08
To: [EMAIL PROTECTED]
Subject: [Robots] matching and "User-Agent:" in robots.txt

I'm a bit perplexed over whether the current Perl library
WWW::RobotRules implements a certain part of the Robots Exclusion
Standard correctly. So forgive me if this seems a simple question, but
my reading of the Robots Exclusion Standard hasn't really cleared it up
in my mind yet.

Basically the current WWW::RobotRules logic is this: as a
WWW::RobotRules object is parsing the lines in the robots.txt file, if
it sees a line that says "User-Agent: ...foo...", it extracts the foo,
and if the name of the current user-agent is a substring of
"...foo...", then it considers this line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt
line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then
the library says "OK, 'Banjo' is a substring of 'Thing, Woozle, Banjo,
Stuff', so this rule is talking to me!"

However, the substring matching currently goes only one way. So if the
user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html
[EMAIL PROTECTED]]" and the robots.txt line being parsed says
"User-Agent: Thing, Woozle, Banjo, Stuff", then the library says
"'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking
to me!"

I have the feeling that that's not right -- notably because it means
that every robot ID string would have to appear in toto on the
"User-Agent" robots.txt line, which is clearly a bad thing. But before
I submit a patch, I'm tempted to ask... what /is/ the proper behavior?
Maybe shave the current user-agent's name at the first slash or space
(getting just "Banjo"), and then see if /that/ is a substring of a
given robots.txt "User-Agent:" line?

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/

--
This message was sent by the Internet robots and spiders discussion
list ([EMAIL PROTECTED]). For list server commands, send "help" in the
body of a message to "[EMAIL PROTECTED]".
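The two behaviors Sean describes are easy to contrast in code. A sketch
in Java (WWW::RobotRules itself is Perl; the helper names here are made
up purely for illustration, and case handling is omitted for brevity):

    public class UaMatch {

        // The current WWW::RobotRules behavior as described above: a rule
        // applies when the robot's name is a substring of the robots.txt
        // "User-Agent:" line.
        static boolean currentMatch(String agentName, String uaLine) {
            return uaLine.contains(agentName);
        }

        // The proposed fix: shave the agent string at the first slash or
        // space, then test the shortened name against the line.
        static boolean shavedMatch(String agentString, String uaLine) {
            String shortName = agentString.split("[/ ]", 2)[0];
            return uaLine.contains(shortName);
        }

        public static void main(String[] args) {
            String uaLine = "Thing, Woozle, Banjo, Stuff";
            String fullId = "Banjo/1.1 [http://nowhere.int/banjo.html]";

            System.out.println(currentMatch("Banjo", uaLine)); // true
            System.out.println(currentMatch(fullId, uaLine));  // false - full ID never matches
            System.out.println(shavedMatch(fullId, uaLine));   // true - "Banjo" is shaved off first
        }
    }

The second behavior would explain the "spiders ignoring robots.txt
sections" observation: a robot with a long ID string silently fails to
match any User-Agent line and falls back to the default record.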