Re: Just what *does* robots.txt mean for a LOD site?
On 11 Aug 2014, at 15:49, Sarven Capadisli i...@csarven.ca wrote:

I briefly brought up something like this to Henry Story for WebIDs. That is, it'd be cool to encourage the use of WebIDs for crawlers, so that the server logs would show them in place of User-Agents. That URI could also say something like "we are crawling these domains, so, yes, it is really us if you see us in your logs (and not someone pretending)". I don't know what the state of that stuff is with WebID. Maybe Kingsley or Henry can comment further.

Yes, that seemed like a good idea for the long term. For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain info about the type of agent. WebID-TLS auth could allow the robot to authenticate. (Other WebID-based authentication methods, yet to be developed, could be used too; you can easily work out from WebID-TLS what another system of authentication could look like.) This would then allow one to create Web Access Control rules that grant any robot read-only access to a certain type of resource. One could then also attach usage rules (to be developed) to the document.

Henry

On 2014-08-11 12:11, Hugh Glaser wrote:

So should we have a class of agents for Linked Data access? Can you have a class of agents, or rather, can an agent have more than one ID? (In particular, can a spider identify as both class and spider instance?) Actually, ldspider is quite a good class ID :-)

On 10 Aug 2014, at 21:31, Sarven Capadisli i...@csarven.ca wrote:

Hi Hugh, just a side discussion. Currently I let all the bots have a field day, unless they are clearly abusive or some student didn't get the memo about keeping request rates reasonable. If I were to start blocking the bigger crawlers, Google would be first to go; that's beside the fact that it is possible to control their crawl rate through Webmaster tools. The main reason for me is that I simply don't see a return from them. They don't mind hammering the site if you let them, but try checking all those resources in Google search results - it is a gamble. I have a lot of resources which are statistical observations that don't really differ much from one document to another (at least as most humans, or Google, would see it). So, anyway, I would give SW/LD crawlers the VIP line if I can, because they tend to hit sporadically, which is something I can live with.

-Sarven

On 2014-08-09 14:17, Hugh Glaser wrote:

Hi Tobias, I have also done the same in http://sameas.org/robots.txt (well, Kingsley said "Yes" when I asked if I should :-) I know it is past time for the spider, but it will happen next time, I guess. And it will also open up all the sub-stores (http://www.sameas.org/store/), such as Sarven's 270a. I'm not sure how the sameas.org URIs will work in fact - it may be that the linkage won't make it happen, but it will be interesting to see. Have at them whenever you like :-)

Very best
Hugh

On 6 Aug 2014, at 00:01, Tobias Käfer tobias.kae...@kit.edu wrote:

:-) I thought I had done what you suggested:

    User-agent: ldspider
    Disallow:
    Allow: /

Which should allow ldspider to crawl the site.

OK, then I got your "No, thank you." line wrong. But the robots.txt is fine then :) and ldspider will not refrain from crawling the site any more. Btw, one of the two lines (Allow: / and Disallow:) is sufficient. The Disallow: line is the older way of putting it, so you might want to remove the Allow: / line again.
Cheers,
Tobias

On 5 Aug 2014, at 18:06, Tobias Käfer tobias.kae...@kit.edu wrote:

Hi Hugh, sorry for misunderstanding you, but I still do not get what behaviour you want. What you are saying looks different from the robots.txt. If you tell me how you want it, I can help with the robots.txt (hopefully).

Cheers,
Tobias

On 05.08.2014 at 19:01, Hugh Glaser wrote:

Hi Tobias,

On 5 Aug 2014, at 17:33, Tobias Käfer tobias.kae...@kit.edu wrote:

Hi Hugh,

By the way, have I got my robots.txt right? In particular, is the "User-agent: LDSpider" correct? Should I worry about case-sensitivity?

The library (norbert) that is employed in LDspider is case-insensitive for the user agent. The user agent that is sent is ldspider. I suppose you want ldspider to crawl your site (highly appreciated),

No, thank you.

so you should change the line in your robots.txt for LDspider to:

    a) Disallow:
    b) Allow: /

And not leave it with:

    c) Allow: *

The star there does not bring the desired behaviour (and I have not found it in the spec for the path either); in fact, it keeps LDspider from crawling the folders you specified for exclusion for the other crawlers.

Hopefully it is OK now: http://ibm.rkbexplorer.com/robots.txt

Cheers

Cheers,
Tobias

Social Web Architect
http://bblfish.net/
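[Putting Tobias's advice from this exchange together: a minimal robots.txt that lets ldspider in while keeping other crawlers out of selected areas could look like the sketch below. The /private/ and /tmp/ paths are placeholders, not the actual exclusions on ibm.rkbexplorer.com:

    # ldspider may fetch everything; an empty Disallow: permits all paths
    User-agent: ldspider
    Disallow:

    # all other robots are kept out of these (placeholder) folders
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

A robot obeys only the record whose User-agent line matches it, so the empty Disallow: in the ldspider record overrides the * record for that crawler.]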
Re: Just what *does* robots.txt mean for a LOD site?
On 8/11/14 3:20 PM, henry.st...@bblfish.net wrote:

On 11 Aug 2014, at 15:49, Sarven Capadisli i...@csarven.ca wrote:

I briefly brought up something like this to Henry Story for WebIDs. That is, it'd be cool to encourage the use of WebIDs for crawlers, so that the server logs would show them in place of User-Agents. That URI could also say something like "we are crawling these domains, so, yes, it is really us if you see us in your logs (and not someone pretending)". I don't know what the state of that stuff is with WebID. Maybe Kingsley or Henry can comment further.

Yes, that seemed like a good idea for the long term. For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain info about the type of agent. WebID-TLS auth could allow the robot to authenticate. (Other WebID-based authentication methods, yet to be developed, could be used too; you can easily work out from WebID-TLS what another system of authentication could look like.) This would then allow one to create Web Access Control rules that grant any robot read-only access to a certain type of resource. One could then also attach usage rules (to be developed) to the document.

Henry

As stated in an earlier post, with slight modification: the following could be issued by an HTTP user agent as part of an HTTP request [1]:

    Slug: UserAgent
    Link: <http://kingsley.idehen.net/dataspace/person/kidehen#this>; rel="http://example.org/action#onBehalfOf"

Enabling a server to discern the following from a request:

    <#UserAgentID> <http://example.org/action#onBehalfOf> <http://kingsley.idehen.net/dataspace/person/kidehen#this> .

A protected-resource access ACL on a server can be built around the relation above, using WebID-TLS, basic PKI, or some other HTTP-based authentication protocol. The one requirement is that the server in question has the ability to comprehend the nature and form of relations represented using RDF statements.

Links:

[1] http://lists.w3.org/Archives/Public/public-webpayments/2014Jul/0112.html -- issue opened in regards to Link: and HTTP requests.

--
Regards,
Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
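[For concreteness, a server-side rule of the kind Henry and Kingsley describe might be expressed with the W3C Web Access Control vocabulary roughly as follows. This is only a sketch: the #robotRead fragment and /data/stats resource path are placeholders, and the onBehalfOf relation is Kingsley's example.org illustration, not an agreed vocabulary term:

    @prefix acl: <http://www.w3.org/ns/auth/acl#> .

    # read-only access to a (placeholder) resource for one authenticated WebID
    <#robotRead>
        a acl:Authorization ;
        acl:accessTo </data/stats> ;
        acl:mode acl:Read ;
        acl:agent <http://kingsley.idehen.net/dataspace/person/kidehen#this> .

Henry's "any robot, read-only" rule would presumably use acl:agentClass pointing at an agreed class of crawler agents instead of acl:agent; no such class URI exists yet, so that part remains to be developed.]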
Re: Just what *does* robots.txt mean for a LOD site?
Hi Hugh,

By the way, have I got my robots.txt right? In particular, is the "User-agent: LDSpider" correct? Should I worry about case-sensitivity?

The library (norbert) that is employed in LDspider is case-insensitive for the user agent. The user agent that is sent is ldspider. I suppose you want ldspider to crawl your site (highly appreciated), so you should change the line in your robots.txt for LDspider to:

    a) Disallow:
    b) Allow: /

And not leave it with:

    c) Allow: *

The star there does not bring the desired behaviour (and I have not found it in the spec for the path either); in fact, it keeps LDspider from crawling the folders you specified for exclusion for the other crawlers.

Cheers,
Tobias
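[Tobias's point about case-insensitive agent matching can be checked from the client side. As an illustrative sketch (this is not LDspider's norbert library; it uses Python's standard-library parser, which also lowercases the agent token before matching, and the resource path is a placeholder):

    from urllib import robotparser

    # fetch and parse the live robots.txt
    rp = robotparser.RobotFileParser()
    rp.set_url("http://ibm.rkbexplorer.com/robots.txt")
    rp.read()

    # agent matching is case-insensitive, so all three should agree
    for agent in ("ldspider", "LDSpider", "LDSPIDER"):
        print(agent, rp.can_fetch(agent, "http://ibm.rkbexplorer.com/some/resource"))
]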
Re: Just what *does* robots.txt mean for a LOD site?
Thanks all. OK, I can live with that. So things like Tabulator, Sig.ma and SemWeb browsers can be expected to go through a general robots.txt Disallow, which is what I was hoping.

Yes, thanks Aidan, I know I can do various User-agents, but I really just wanted to stop anything like googlebot.

By the way, have I got my robots.txt right? http://ibm.rkbexplorer.com/robots.txt In particular, is the "User-agent: LDSpider" correct? Should I worry about case-sensitivity?

Thanks again, all.
Hugh

On 27 Jul 2014, at 19:23, Gannon Dick gannon_d...@yahoo.com wrote:

On Sat, 7/26/14, aho...@dcc.uchile.cl wrote:

The difference in opinion remains to what extent Linked Data agents need to pay attention to the robots.txt file. As many others have suggested, I buy into the idea of any agent not relying document-wise on user input being subject to robots.txt.

= +1

Just a comment. Somewhere, sometime, somebody with Yahoo Mail decided that public-lod mail was spam, so every morning I dig it out because I value the content. Of course, I could wish for a Linked Data Agent which does that for me, but that would be to complete a banal or vicious cycle, depending on the circle classification scheme in use. I'm looking for virtuous cycles, and in the case of robots.txt, "The lady doth protest too much, methinks."

--Gannon

--
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
Re: Just what *does* robots.txt mean for a LOD site?
On Sat, 7/26/14, aho...@dcc.uchile.cl wrote:

The difference in opinion remains to what extent Linked Data agents need to pay attention to the robots.txt file. As many others have suggested, I buy into the idea of any agent not relying document-wise on user input being subject to robots.txt.

= +1

Just a comment. Somewhere, sometime, somebody with Yahoo Mail decided that public-lod mail was spam, so every morning I dig it out because I value the content. Of course, I could wish for a Linked Data Agent which does that for me, but that would be to complete a banal or vicious cycle, depending on the circle classification scheme in use. I'm looking for virtuous cycles, and in the case of robots.txt, "The lady doth protest too much, methinks."

--Gannon
Just what *does* robots.txt mean for a LOD site?
Hi. I'm pretty sure this discussion suggests that we (the LD community) should try to come to some consensus on policy about exactly what it means if an agent finds a robots.txt on a Linked Data site. So I have changed the subject line - sorry Chris, it should have been changed earlier. Not an easy thing to come to, I suspect, but it seems to have become significant. Is there a more official forum for this sort of thing?

On 26 Jul 2014, at 00:55, Luca Matteis lmatt...@gmail.com wrote:

On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser h...@glasers.org wrote:

That sort of sums up what I want.

Indeed. So I agree that robots.txt should probably not establish whether something is a linked dataset or not. To me your data is still linked data even though robots.txt is blocking access by specific types of agents, such as crawlers.

Aidan:

*) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.

Isn't that a bit harsh? That would be the case if the only type of agent is a crawler. But as Hugh mentioned, linked datasets can be useful simply by treating URIs as dereferenceable identifiers without following links.

In Aidan's view (I hope I am right here), it is perfectly sensible. If you start from the premise that robots.txt is intended to prohibit access by anything other than a browser with a human at it, then only humans could fetch the RDF documents. Which means that the RDF document is completely useless as machine-interpretable semantics for the resource, since it would need a human to do some cut and paste or something to get it into a processor. It isn't really a question of harsh - it is perfectly logical from that view of robots.txt (which isn't our view, because we think that robots.txt is about "specific types of agents", as you say).

Cheers
Hugh

--
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
Re: Just what *does* robots.txt mean for a LOD site?
Thanks Hugh for the subject change and the reasonable summary.

@Luca, per my previous emails, I think that a robots.txt blacklist should affect a broad range of Linked Data agents, so much so that I would no longer consider the affected URIs dereferenceable, and thus I would no longer call the affected data Linked Data. I don't feel that "harsh" is applicable ... but I guess there is room for discussion. :)

The difference in opinion remains to what extent Linked Data agents need to pay attention to the robots.txt file. As many others have suggested, I buy into the idea of any agent not relying document-wise on user input being subject to robots.txt.

I should add that in your case, Hugh, you can avoid problems by using more fine-grained controls in your robots.txt file. For example, you can specifically ban the Google/Yahoo!/Yandex/Bing agents, etc., from parts of your site using robots.txt. Likewise, if you are concerned about the use of resources, you can throttle agents using Crawl-delay (a non-standard extension, but one that should be respected by the big agents). You can set a crawl delay with respect to the costs you foresee per request and the number of agents you see competing for resources.

Note also that even the big spiders like Google, Yahoo!, etc. are unlikely to crawl very deep into your dataset unless you have a lot of incoming links. Essentially, your site as you describe it sounds like part of the Deep Web.

Best,
Aidan

On 26/07/2014 07:16, Hugh Glaser wrote:

Hi. I'm pretty sure this discussion suggests that we (the LD community) should try to come to some consensus on policy about exactly what it means if an agent finds a robots.txt on a Linked Data site. So I have changed the subject line - sorry Chris, it should have been changed earlier. Not an easy thing to come to, I suspect, but it seems to have become significant. Is there a more official forum for this sort of thing?

On 26 Jul 2014, at 00:55, Luca Matteis lmatt...@gmail.com wrote:

On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser h...@glasers.org wrote:

That sort of sums up what I want.

Indeed. So I agree that robots.txt should probably not establish whether something is a linked dataset or not. To me your data is still linked data even though robots.txt is blocking access by specific types of agents, such as crawlers.

Aidan:

*) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.

Isn't that a bit harsh? That would be the case if the only type of agent is a crawler. But as Hugh mentioned, linked datasets can be useful simply by treating URIs as dereferenceable identifiers without following links.

In Aidan's view (I hope I am right here), it is perfectly sensible. If you start from the premise that robots.txt is intended to prohibit access by anything other than a browser with a human at it, then only humans could fetch the RDF documents. Which means that the RDF document is completely useless as machine-interpretable semantics for the resource, since it would need a human to do some cut and paste or something to get it into a processor. It isn't really a question of harsh - it is perfectly logical from that view of robots.txt (which isn't our view, because we think that robots.txt is about "specific types of agents", as you say).

Cheers
Hugh
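[A sketch of the fine-grained controls Aidan describes, using the user-agent tokens each engine commonly documents; the 10-second delay is an arbitrary illustration:

    # keep the big search engines out entirely
    User-agent: Googlebot
    Disallow: /

    User-agent: Slurp
    Disallow: /

    User-agent: Yandex
    Disallow: /

    User-agent: bingbot
    Disallow: /

    # everyone else may crawl, but no faster than one request per 10 seconds
    # (Crawl-delay is non-standard; Google in particular ignores it and is
    # throttled via Webmaster Tools instead, as Sarven noted earlier)
    User-agent: *
    Crawl-delay: 10
    Disallow:
]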