Re: Just what *does* robots.txt mean for a LOD site?

2014-08-11 Thread henry.st...@bblfish.net

On 11 Aug 2014, at 15:49, Sarven Capadisli i...@csarven.ca wrote:

 I briefly brought up something like this to Henry Story for WebIDs. That is, 
 it'd be cool to encourage the use of WebIDs for crawlers, so that the 
 server logs would show them in place of User-Agents. That URI could also say 
 something like "we are crawling these domains, so, yes, it is really us if 
 you see us in your logs (and not someone pretending)".
 
 I don't know what the state of that stuff is with WebID. Maybe Kingsley or 
 Henry can comment further.

Yes, that seemed like a good idea for the long term.

For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain info about the type of Agent.
The WebID-TLS auth could allow the robot to authenticate.
( Other WebID-based authentication methods, still to be developed, could be
  used too; from WebID-TLS you can easily work out what another system of
  authentication could look like. )

This would then allow one to create Web Access Control rules where
one can allow any robot read-only access to a certain type of resource.
One could then also attach usage rules ( to be developed ) to the document.
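
For illustration, a minimal sketch of what such a setup might look like, assuming
a made-up ex:Crawler class and example URIs (only the standard acl: vocabulary is
taken from Web Access Control; everything else here is hypothetical):

@prefix acl:  <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/ns#> .

# Hypothetical WebID profile fragment for a crawler, stating what type of
# agent it is.
<https://crawler.example.org/webid#bot>
    a foaf:Agent, ex:Crawler ;
    foaf:name "Example LD crawler" .

# Hypothetical Web Access Control rule: read-only access for agents of the
# crawler class on a given resource.
<#robots-read-only>
    a acl:Authorization ;
    acl:agentClass ex:Crawler ;
    acl:accessTo   <https://data.example.org/dataset/doc1> ;
    acl:mode       acl:Read .

The robot would prove it controls the WebID (e.g. via WebID-TLS), and the server
would then match its profile against a rule like the one above.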

Henry


 On 2014-08-11 12:11, Hugh Glaser wrote:
 So should we have a class of agents for Linked Data access?
 Can you have a class of agents, or rather can an agent have more than one ID?
 (In particular, can a spider identify as both class and spider instance?)
 Actually ldspider is quite a good class ID :-)
 
 On 10 Aug 2014, at 21:31, Sarven Capadisli i...@csarven.ca wrote:
 
 Hi Hugh,  just a side discussion
 
 Currently I let all the bots have a field day, unless they are clearly 
 abusive or some student didn't get the memo about making requests at a 
 reasonable rate.
 
 If I were to start blocking the bigger crawlers, Google would be first to 
 go. That's beside the fact that it is possible to control their crawl rate 
 through Webmaster Tools. The main reason for me is that I simply don't see 
 a return from them. They don't mind hammering the site if you let them, 
 but try checking all those resources in Google search results - it is a 
 gamble. I have a lot of resources which are statistical observations that 
 don't really differ much from one document to another (at least in the eyes 
 of most humans or Google). So, anyway, I would give SW/LD crawlers 
 the VIP line if I can, because they tend to hit sporadically, which is 
 something I can live with.
 
 -Sarven
 
 On 2014-08-09 14:17, Hugh Glaser wrote:
 Hi Tobias,
 I have also done the same in
 http://sameas.org/robots.txt
 (Well, Kingsley said “Yes”, when I asked if I should :-)
 
 I know it is past time for the spider, but it will happen next time, I 
 guess.
 And it will also open up all the sub-stores 
 (http://www.sameas.org/store/), such as Sarven’s 270a.
 I’m not sure how the sameas.org URIs will work in fact - it may be that 
 the linkage won’t make it happen, but it will be interesting to see.
 Have at them whenever you like :-)
 
 Very best
 Hugh
 
 On 6 Aug 2014, at 00:01, Tobias Käfer tobias.kae...@kit.edu wrote:
 
 :-)
 I thought I had done what you suggested:
 
 User-agent: ldspider
 Disallow:
 Allow: /
 
 Which should allow ldspider to crawl the site.
 
 OK, then I got your "No, thank you." line wrong.
 
 But the robots.txt is fine then :) and ldspider will not refrain from 
 crawling the site any more.
 
 Btw, one of the two lines ("Allow: /" and "Disallow:") is sufficient. The 
 "Disallow:" line is the older way of putting it, so you might want to remove 
 the "Allow: /" line again.
 
 Cheers,
 
 Tobias
 
 On 5 Aug 2014, at 18:06, Tobias Käfer tobias.kae...@kit.edu wrote:
 
 Hi Hugh,
 
 sorry for misunderstanding you, but I still do not get what behaviour you 
 want. What you are saying looks different from the robots.txt. If you 
 tell me how you want it, I can help with the robots.txt (hopefully).
 
 Cheers,
 
 Tobias
 
 Am 05.08.2014 um 19:01 schrieb Hugh Glaser:
 Hi Tobias,
 On 5 Aug 2014, at 17:33, Tobias Käfer tobias.kae...@kit.edu wrote:
 
 Hi Hugh,
 
 By the way, have I got my robots.txt right?
 In particular, is the
 User-agent: LDSpider
 correct?
 Should I worry about case-sensitivity?
 
 The library (norbert) that is employed in LDspider is 
 case-insensitive for the user agent. The user agent that is sent is 
 ldspider.
 
 I suppose you want ldspider to crawl your site (highly appreciated),
 No, thank you.
 so you should change the line in your robots.txt for LDspider to:
 a) Disallow:
 b) Allow: /
 And not leave it with:
 c) Allow: *
 The star there does not bring the desired behaviour (and I have not 
 found it in the spec for the path either); in fact, it keeps LDspider 
 from crawling the folders you specified for exclusion for the other 
 crawlers.
 Hopefully it is OK now:
 http://ibm.rkbexplorer.com/robots.txt
 
 Cheers
 
 Cheers,
 
 Tobias
 
 
 
 
 
 
 
 
 

Social Web Architect
http://bblfish.net/




Re: Just what *does* robots.txt mean for a LOD site?

2014-08-11 Thread Kingsley Idehen

On 8/11/14 3:20 PM, henry.st...@bblfish.net wrote:

On 11 Aug 2014, at 15:49, Sarven Capadisli i...@csarven.ca wrote:


I briefly brought up something like this to Henry Story for WebIDs. That is, it'd be 
cool to encourage the use of WebIDs for crawlers, so that the server logs would show them 
in place of User-Agents. That URI could also say something like "we are crawling these 
domains, so, yes, it is really us if you see us in your logs (and not someone 
pretending)".

I don't know what the state of that stuff is with WebID. Maybe Kingsley or 
Henry can comment further.

Yes, that seemed like a good idea for the long term.

For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain info about the type of Agent.
The WebID-TLS auth could allow the robot to authenticate.
( Other WebID-based authentication methods, still to be developed, could be
  used too; from WebID-TLS you can easily work out what another system of
  authentication could look like. )

This would then allow one to create Web Access Control rules where
one can allow any robot read-only access to a certain type of resource.
One could then also attach usage rules ( to be developed ) to the document.

Henry




As stated in an earlier post, with slight modification:

The following could be issued via an HTTP user agent, as part of an HTTP 
request [1]:


Slug: UserAgent
Link: <http://kingsley.idehen.net/dataspace/person/kidehen#this>; rel="http://example.org/action#onBehalfOf"


Enabling a server to discern the following from a request:

<#UserAgentID> <http://example.org/action#onBehalfOf> 
<http://kingsley.idehen.net/dataspace/person/kidehen#this> .
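
For illustration, such a request might look like this on the wire (a sketch only;
the host, path, and User-Agent values are made up, and sending Link: on requests
is the subject of the open issue referenced below):

GET /dataset/observation-42 HTTP/1.1
Host: data.example.org
User-Agent: ExampleLDCrawler/1.0
Slug: UserAgent
Link: <http://kingsley.idehen.net/dataspace/person/kidehen#this>; rel="http://example.org/action#onBehalfOf"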


A protected resource access ACL on a server can be built around the 
relation above, using WebID-TLS, basic PKI, or some other HTTP-based 
authentication protocol. The one requirement is that the server in 
question has the ability to comprehend the nature and form of relations 
represented using RDF statements.


Link:

[1] 
http://lists.w3.org/Archives/Public/public-webpayments/2014Jul/0112.html 
-- issue opened regarding Link: and HTTP requests.


--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this






Re: Just what *does* robots.txt mean for a LOD site?

2014-08-05 Thread Tobias Käfer

Hi Hugh,

 By the way, have I got my robots.txt right?

In particular, is the
User-agent: LDSpider
correct?
Should I worry about case-sensitivity?


The library (norbert) that is employed in LDspider is case-insensitive 
for the user agent. The user agent that is sent is ldspider.


I suppose you want ldspider to crawl your site (highly appreciated), so 
you should change the line in your robots.txt for LDspider to:

a) Disallow:
b) Allow: /
And not leave it with:
c) Allow: *
The star there does not bring the desired behaviour (and I have not 
found it in the spec for the path either); in fact, it keeps LDspider 
from crawling the folders you specified for exclusion for the other 
crawlers.
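
For reference, a minimal robots.txt following option a) might look like this
(a sketch; the /private/ path is just a stand-in for whatever you exclude for
the other crawlers):

# ldspider may fetch everything: an empty Disallow means nothing is disallowed.
User-agent: ldspider
Disallow:

# All other crawlers are kept out of a particular area.
User-agent: *
Disallow: /private/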


Cheers,

Tobias



Re: Just what *does* robots.txt mean for a LOD site?

2014-07-30 Thread Hugh Glaser
Thanks all.
OK, I can live with that.

So things like Tabulator, Sig.ma and SemWeb Browsers can be expected to go 
through a general robots.txt Disallow, which is what I was hoping.

Yes, thanks Aidan, I know I can do various User-agents, but I really just 
wanted to stop anything like googlebot.

By the way, have I got my robots.txt right?
http://ibm.rkbexplorer.com/robots.txt
In particular, is the
User-agent: LDSpider
correct?
Should I worry about case-sensitivity?

Thanks again, all.
Hugh


On 27 Jul 2014, at 19:23, Gannon Dick gannon_d...@yahoo.com wrote:

 
 
 On Sat, 7/26/14, aho...@dcc.uchile.cl aho...@dcc.uchile.cl wrote:
 
 The difference in opinion remains to what extent Linked Data
 agents need to pay attention to the robots.txt file.
 
 As many others have suggested, I buy into the idea of any
 agent not relying document-wise on user input being subject to
 robots.txt.
 
 =
 +1
 Just a comment.
 
 Somewhere, sometime, somebody with Yahoo Mail decided that public-lod mail 
 was spam, so every morning I dig it out because I value the content.
 
 Of course, I could wish for a Linked Data Agent which does that for me, but 
 that would be to complete a banal or vicious cycle, depending on the circle 
 classification scheme in use.  I'm looking for virtuous cycles and in the 
 case of robots.txt, "The lady doth protest too much, methinks."
 --Gannon
 
 
 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652





Re: Just what *does* robots.txt mean for a LOD site?

2014-07-27 Thread Gannon Dick


On Sat, 7/26/14, aho...@dcc.uchile.cl aho...@dcc.uchile.cl wrote:

 The difference in opinion remains to what extent Linked Data
 agents need to pay attention to the robots.txt file.
 
 As many others have suggested, I buy into the idea of any
 agent not relying document-wise on user input being subject to
 robots.txt.

=
+1
Just a comment.

Somewhere, sometime, somebody with Yahoo Mail decided that public-lod mail was 
spam, so every morning I dig it out because I value the content.

Of course, I could wish for a Linked Data Agent which does that for me, but 
that would be to complete a banal or vicious cycle, depending on the circle 
classification scheme in use.  I'm looking for virtuous cycles and in the case 
of robots.txt, "The lady doth protest too much, methinks."
--Gannon





Just what *does* robots.txt mean for a LOD site?

2014-07-26 Thread Hugh Glaser
Hi.

I’m pretty sure this discussion suggests that we (the LD community) should 
try to come to some consensus on policy about exactly what it means if an agent 
finds a robots.txt on a Linked Data site.

So I have changed the subject line - sorry Chris, it should have been changed 
earlier.

Not an easy thing to come to, I suspect, but it seems to have become 
significant.
Is there a more official forum for this sort of thing?

On 26 Jul 2014, at 00:55, Luca Matteis lmatt...@gmail.com wrote:

 On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser h...@glasers.org wrote:
 That sort of sums up what I want.
 
 Indeed. So I agree that robots.txt should probably not establish
 whether something is a linked dataset or not. To me your data is still
 linked data even though robots.txt is blocking access by specific
 types of agents, such as crawlers.
 
 Aidan,
 
 *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
 
 Isn't that a bit harsh? That would be the case if the only type of
 agent is a crawler. But as Hugh mentioned, linked datasets can be
 useful simply by treating URIs as dereferenceable identifiers without
 following links.
In Aidan’s view (I hope I am right here), it is perfectly sensible.
If you start from the premise that robots.txt is intended to prohibit access by 
anything other than a browser with a human at it, then only humans could fetch 
the RDF documents.
Which means that the RDF document is completely useless as a 
machine-interpretable semantics for the resource, since it would need a human 
to do some cut and paste or something to get it into a processor.

It isn’t really a question of “harsh” - it is perfectly logical from that view of 
robots.txt (which isn’t our view, because we think that robots.txt is about 
“specific types of agents”, as you say).

Cheers
Hugh

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652





Re: Just what *does* robots.txt mean for a LOD site?

2014-07-26 Thread ahogan
Thanks Hugh for the subject change and the reasonable summary.

@Luca, per my previous emails, I think that a robots.txt blacklist should
affect a broad range of Linked Data agents, so much so that I would no
longer consider the URIs affected dereferenceable, and thus I would no
longer call the affected data Linked Data. I don't feel that "harsh" is
applicable ... but I guess there is room for discussion. :)

The difference in opinion remains to what extent Linked Data agents need
to pay attention to the robots.txt file.

As many others have suggested, I buy into the idea of any agent not
relying document-wise on user input being subject to robots.txt.


I should add that in your case Hugh, you can avoid problems while
considering more fine-grained controls in your robots.txt file. For
example, you can specifically ban Google/Yahoo!/Yandex/Bing agents, etc.,
from parts of your site using robots.txt. Likewise, if you are concerned
about the use of resources, you can throttle agents using Crawl-delay (a
non-standard extension, but one that should be respected by the big
agents). You can set a crawl delay with respect to the costs you foresee
per request and the number of agents you see competing for resources.
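
For example, a robots.txt along those lines might look like this (a sketch;
the agent names and delay value are illustrative, and Crawl-delay support
varies between crawlers):

# Throttle the big search-engine crawlers (Crawl-delay is non-standard;
# it is commonly interpreted as seconds to wait between requests).
User-agent: bingbot
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

# Linked Data crawlers may fetch everything.
User-agent: ldspider
Disallow: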

Note also that even the big spiders like Google, Yahoo!, etc., are
unlikely to actually crawl very deep into your dataset unless you've a lot
of incoming links. Essentially your site as you describe sounds like a
part of the Deep Web.

Best,
Aidan

On 26/07/2014 07:16, Hugh Glaser wrote:
 Hi.

 I’m pretty sure this discussion suggests that we (the LD community) should
 try to come to some consensus on policy about exactly what it means if
 an agent finds a robots.txt on a Linked Data site.

 So I have changed the subject line - sorry Chris, it should have been
 changed earlier.

 Not an easy thing to come to, I suspect, but it seems to have become
 significant.
 Is there a more official forum for this sort of thing?

 On 26 Jul 2014, at 00:55, Luca Matteis lmatt...@gmail.com wrote:

 On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser h...@glasers.org wrote:
 That sort of sums up what I want.

 Indeed. So I agree that robots.txt should probably not establish
 whether something is a linked dataset or not. To me your data is still
 linked data even though robots.txt is blocking access by specific
 types of agents, such as crawlers.

 Aidan,

 *) a Linked Dataset behind a robots.txt blacklist is not a Linked
 Dataset.

 Isn't that a bit harsh? That would be the case if the only type of
 agent is a crawler. But as Hugh mentioned, linked datasets can be
 useful simply by treating URIs as dereferenceable identifiers without
 following links.
 In Aidan’s view (I hope I am right here), it is perfectly sensible.
 If you start from the premise that robots.txt is intended to prohibit
 access by anything other than a browser with a human at it, then only
 humans could fetch the RDF documents.
 Which means that the RDF document is completely useless as a machine-
 interpretable semantics for the resource, since it would need a human
 to do some cut and paste or something to get it into a processor.

 It isn’t really a question of “harsh” - it is perfectly logical from that
 view of robots.txt (which isn’t our view, because we think that robots.txt
 is about “specific types of agents”, as you say).

 Cheers
 Hugh