date:20110622

On 23 Jun 2011, at 00:11, Alexandre Passant wrote:

> 
> On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
> 
>> On 21 Jun 2011, at 10:44, Martin Hepp wrote:
>>> PS: I will not release the IP ranges from which the trouble originated, but 
>>> rest assured, there were top research institutions among them.
>> 
>> The right answer is: name and shame. That is the way to teach them.
> 
> You may have find the right word: teach.
> We've (as academic) given tutorials on how to publish and consume LOD, lots 
> of things about best practices for publishing, but not much about consuming.
> Why not simply coming with reasonable guidelines for this, that should also 
> be taught in institutes / universities where people use LOD, and in tutorials 
> given in various conferences.

That is of course a good idea. But longer term you don't want to teach that 
way. It's too time consuming. You need the machines to do the teaching. 

Think about Facebook. How did 500 million people go to use it? Because they 
were introduced by friends, by using it, but not by doing tutorials and going 
to courses. The system itself teaches people how to use it. 

So the same way, if you want to teach people linked data, get the social web 
going and they will learn the rest by themselves. If you want to teach crawlers 
to behave, make bad behaviour uninteresting. Create a game and rules where good 
behaviour are rewarded and bad behaviour has the opposite effect.

This is why I think using WebID can help. You can use the information to build 
lists and rankings of good and bad crawlers, people with good crawlers get to 
present papers and crawling confs, bad crawlers get throttled out of crawling.  
Make it so that the system can grow beyond academic and teaching settings, into 
the world of billions of users spread across the world, living in different 
political institutions and speaking different languages. We have had good 
crawling practices since the beginning of the web, but you need to make them 
evident and self teaching.

EG. A crawler that crawls to much will get slowed down, and redirected to pages 
on crawling behavior, written and translated into every single language on the 
planet.

Henry

> 
> m2c
> 
> Alex.
> 
>> 
>> Like Karl said, we should collect information about abusive crawlers so that 
>> site operators can defend themselves. It won't be *that* hard to research 
>> and collect the IP ranges of offending universities.
>> 
>> I started a list here:
>> http://www.w3.org/wiki/Bad_Crawlers
>> 
>> The list is currently empty. I hope it stays that way.
>> 
>> Thank you all,
>> Richard
> 
> --
> Dr. Alexandre Passant, 
> Social Software Unit Leader
> Digital Enterprise Research Institute, 
> National University of Ireland, Galway
> 
> 
> 
> 

Social Web Architect
http://bblfish.net/

Re: Think before you write Semantic Web crawlers


On 6/22/11 4:51 PM, Dave Challis wrote:

On 22/06/11 16:05, Kingsley Idehen wrote:

On 6/22/11 3:57 PM, Steve Harris wrote:

Yes, exactly.

I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.

From the point of view of the crawler they're trying to get as much
data as possible in a short a time as possible, but don't realise that
the poor guy at the other end just got his 95th percentile shot
through the roof, and now has a several thousand dollar bandwidth bill
heading his way.

You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.


Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)

Kingsley


There are plenty of these around when it comes to web traffic in 
general.  For apache, I can think of ModSecurity 
(http://www.modsecurity.org/) and mod_evasive 
(http://www.zdziarski.com/blog/?page_id=442).


Both of these will look at traffic patterns and dynamically blacklist 
as needed.


How do they deal with "Who" without throwing baby out with bath water 
re. Linked Data ?


Innocent Linked Data consumer triggers a transitive crawl, all other 
visitors from that IP get on a blacklist? Nobody meant any harm. In 
RDBMS realm would it be reasonable to take any of the following actions:


1. Cut off marketing because someone triggered: SELECT * from Customers 
, as part of MS Query or MS Access usage
2. Cut off sales and/or marketing because as part of trying to grok SQL 
joins they generated a lot of Cartesian products.


You need granularity within the data access technology itself. WebID 
offers that to Linked Data. Linked Data is the evolution hitting the Web 
and redefining crawling in the process.




ModSecurity also allows for custom rules to be written depending on 
get/post content, so it should be perfectly feasible to set up rules 
based on estimated/actual query cost (e.g. blacklist if client makes > 
X requests per Y mins which return > Z triples).


How does it know about: 
http://kingsley.idehen.net/dataspace/person#this,  For better or for 
worse re. QoS?




Can't see any reason why a hybrid approach couldn't be used, e.g. 
apply rules to unauthenticated traffic, and auto-whitelist clients 
identifying themselves via WebID.


Of course a hybrid system is how it has to work. WebID isn't a silver 
bullet, nothing is. Hence the need for heuristics and algorithms. WebID 
is just a critical factor, ditto Trust Logic.



--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Alexandre Passant

On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:

> On 21 Jun 2011, at 10:44, Martin Hepp wrote:
>> PS: I will not release the IP ranges from which the trouble originated, but 
>> rest assured, there were top research institutions among them.
> 
> The right answer is: name and shame. That is the way to teach them.

You may have find the right word: teach.
We've (as academic) given tutorials on how to publish and consume LOD, lots of 
things about best practices for publishing, but not much about consuming.
Why not simply coming with reasonable guidelines for this, that should also be 
taught in institutes / universities where people use LOD, and in tutorials 
given in various conferences.

m2c

Alex.

> 
> Like Karl said, we should collect information about abusive crawlers so that 
> site operators can defend themselves. It won't be *that* hard to research and 
> collect the IP ranges of offending universities.
> 
> I started a list here:
> http://www.w3.org/wiki/Bad_Crawlers
> 
> The list is currently empty. I hope it stays that way.
> 
> Thank you all,
> Richard

--
Dr. Alexandre Passant, 
Social Software Unit Leader
Digital Enterprise Research Institute, 
National University of Ireland, Galway

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Richard Cyganiak

On 21 Jun 2011, at 10:44, Martin Hepp wrote:
> PS: I will not release the IP ranges from which the trouble originated, but 
> rest assured, there were top research institutions among them.

The right answer is: name and shame. That is the way to teach them.

Like Karl said, we should collect information about abusive crawlers so that 
site operators can defend themselves. It won't be *that* hard to research and 
collect the IP ranges of offending universities.

I started a list here:
http://www.w3.org/wiki/Bad_Crawlers

The list is currently empty. I hope it stays that way.

Thank you all,
Richard

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Lin Clark

On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert <
sebastian.schaff...@salzburgresearch.at> wrote:

>
> Your complaint sounds to me a bit like "help, too many clients access my
> data".

I'm sure that Martin is really tired of saying this, so I will reiterate for
him: It wasn't his data, they weren't his servers. He's speaking on behalf
of people who aren't part of our insular community... people who don't have
a compelling reason to subsidize a PhD student's Best Paper award with their
own dollars and bandwidth.

Agents can use Linked Data just fine without firing 150 requests per second
at a server. There are TONs of use cases that do not require that kind of
server load.

-- 
Lin Clark
DERI, NUI Galway 

lin-clark.com
twitter.com/linclark

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Dave Challis


On 22/06/11 16:05, Kingsley Idehen wrote:

On 6/22/11 3:57 PM, Steve Harris wrote:

Yes, exactly.

I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.

From the point of view of the crawler they're trying to get as much
data as possible in a short a time as possible, but don't realise that
the poor guy at the other end just got his 95th percentile shot
through the roof, and now has a several thousand dollar bandwidth bill
heading his way.

You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.


Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)

Kingsley


There are plenty of these around when it comes to web traffic in 
general.  For apache, I can think of ModSecurity 
(http://www.modsecurity.org/) and mod_evasive 
(http://www.zdziarski.com/blog/?page_id=442).


Both of these will look at traffic patterns and dynamically blacklist as 
needed.


ModSecurity also allows for custom rules to be written depending on 
get/post content, so it should be perfectly feasible to set up rules 
based on estimated/actual query cost (e.g. blacklist if client makes > X 
requests per Y mins which return > Z triples).


Can't see any reason why a hybrid approach couldn't be used, e.g. apply 
rules to unauthenticated traffic, and auto-whitelist clients identifying 
themselves via WebID.


--
Dave Challis
d...@ecs.soton.ac.uk

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Sebastian Schaffert

Martin,

I followed the thread a bit, and I have just a small and maybe naive question: 
what use is a Linked Data Web that does not even scale to the access of 
crawlers? And how to we expect agents to use Linked Data if we cannot provide 
technology that scales?

Your complaint sounds to me a bit like "help, too many clients access my data". 
I think worse things could happen. What we need to do is improve our 
technology, and not whining about people trying to use our data. Even though it 
is not good if people stop *providing* Linked Data, it is also not good if 
people stop *using* Linked Data. And I find your approach of stopping to send 
pings counter-productive.

My 2 cents to the discussion ... :-)

Greetings,

Sebastian


Am 22.06.2011 um 20:57 schrieb Martin Hepp:

> Jiri:
> The crawlers causing problems were run by Universities, mostly in the context 
> of ISWC submissions. No need to cast any doubt on that.
> 
> All:
> As a consequence to those events, I will not publish sitemaps etc. of future 
> GoodRelations datasets on these lists, but just inform non-toy consumers.
> If you consider yourself a non-toy consumer of e-commerce data, please send 
> me an e-mail, and we will add you to out ping chain.
> 
> We will also stop sending pings to PTSW, Watson, Swoogle, et al., because 
> they will just expose sites adopting GoodRelations and related technology to 
> academic crawling.
> 
> In the meantime, I recommend the LOD bubble diagram sources for 
> self-referential research.
> 
> Best
> M. Hepp
> 
> 
> 
> On Jun 22, 2011, at 4:03 PM, Jiří Procházka wrote:
> 
>> I understand that, but I doubt your conclusion, that those crawlers are
>> targeting semantic web, since like you said they don't even properly
>> identify themselves and as far as I know, on Universities also regular
>> web search and crawling is researched. Maybe lot of them are targeting
>> semantic web, but we should look at all measures to conserve bandwidth,
>> from avoiding regular web crawler interest, aiding infrastructure like
>> Ping the Semantic Web to optimizing delivery and even distribution of
>> the data among resouces.
>> 
>> Best,
>> Jiri
>> 
>> On 06/22/2011 03:21 PM, Martin Hepp wrote:
>>> Thanks, Jiri, but the load comes from academic crawler prototypes firing 
>>> from broad University infrastructures.
>>> Best
>>> Martin
>>> 
>>> 
>>> On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
>>> 
 I wonder, are ways to link RDF data so that convential crawlers do not
 crawl it, but only the semantic web aware ones do?
 I am not sure how the current practice of linking by link tag in the
 html headers could cause this, but it may be case that those heavy loads
 come from a crawlers having nothing to do with semantic web...
 Maybe we should start linking to our rdf/xml, turtle, ntriples files and
 publishing sitemap info in RDFa...
 
 Best,
 Jiri
 
 On 06/22/2011 09:00 AM, Steve Harris wrote:
> While I don't agree with Andreas exactly that it's the site owners fault, 
> this is something that publishers of non-semantic data have to deal with.
> 
> If you publish a large collection of interlinked data which looks 
> interesting to conventional crawlers and is expensive to generate, 
> conventional web crawlers will be all over it. The main difference is 
> that a greater percentage of those are written properly, to follow 
> robots.txt and the guidelines about hit frequency (maximum 1 request per 
> second per domain, no parallel crawling).
> 
> Has someone published similar guidelines for semantic web crawlers?
> 
> The ones that don't behave themselves get banned, either in robots.txt, 
> or explicitly by the server. 
> 
> - Steve
> 
> On 2011-06-22, at 06:07, Martin Hepp wrote:
> 
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>> 
>> However, I still challenge Andreas' statement that the site-owners are 
>> to blame for publishing large amounts of data on small servers.
>> 
>> One can publish 10,000 PDF documents on a tiny server without being hit 
>> by DoS-style crazy crawlers. Why should the same not hold if I publish 
>> RDF?
>> 
>> But for sure, it is necessary to advise all publishers of large RDF 
>> datasets to protect themselves against hungry crawlers and actual DoS 
>> attacks.
>> 
>> Imagine if a large site was brought down by a botnet that is exploiting 
>> Semantic Sitemap information for DoS attacks, focussing on the large 
>> dump files. 
>> This could end LOD experiments for that site.
>> 
>> 
>> Best
>> 
>> Martin
>> 
>> 
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>> 
>>> 
>>> Hi Martin,
>>> 
>>> Have you tried to put a Squid [1]  as reverse proxy in front of your 
>>> servers and use delay pools [2] to catch

Re: Think before you write Semantic Web crawlers


On 22 Jun 2011, at 21:05, Martin Hepp wrote:

> Glenn:
> 
>> If there isn't, why not? We're the Semantic Web, dammit. If we aren't the 
>> masters of data interoperability, what are we?
> The main question is: Is the Semantic Web an evolutionary improvement of the 
> Web, the Web understood as an ecosystem comprising protocols, data models, 
> people, and economics - or is it a tiny special interest branch.
> 
> As said: I bet a bottle of champagne that the academic Semantic Web 
> community's technical proposals will never gain more than 10 % market share 
> among "real" site-owners, because of

I worked for AltaVista and Sun Microsystems, so I am not an academic.  And it 
would be difficult to get back to academia, as salaries are so low there. So we 
should be thankful at how much good work these people are putting into this for 
love of the subject. 

> - unnecessary complexity (think of the simplicity of publishing an HTML page 
> vs. following LOD publishing principles),

Well, data manipulation is more difficult of course than simple web pages. But 
there are large benefits to be gained from more structured data. In the 
academic/buisness nonsense, you should look at how much IBM and co put into 
SOAP, and where that got them. Pretty much nowhere. The semantic web seems a 
lot more fruitful than SOAP to me, and has a lot more potential. It is not that 
difficult, it's just that people - in mass - are slow learners. But you know 
there is time.

> - bad design decisions (e.g explicit datatyping of data instances in RDFa),
> - poor documentation for non-geeks, and
> - a lack of understanding of the economics of technology diffusion.

Technology diffuses a lot slower than people think. But in aggregate it 
diffuses a lot faster than we can cope with.
  - "history of technology adoption" http://bblfish.net/blog/page1.html#14
  - "was moore's law inevitable" 
http://www.kk.org/thetechnium/archives/2009/07/was_moores_law.php
  

In any case WebID is so mindbogglingly simple, it falsifies all the points 
above. You have a problem and there is a solution to it. Of course we need to 
stop bad crawling. But also we should start showing how the web can protect 
itself, without asking just for good will. 

Henry


> 
> Never ever.
> 
> Best
> 
> Martin
> 
> On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:
> 
>>> From my perspective as the designer of a system that both consumes and 
>>> publishes data, the load/burden issue here is not at all particular to the 
>>> semantic web. Needle obeys robots.txt rules, but that's a small deal 
>>> compared to the difficulty of extracting whole data from sites set up to 
>>> deliver it only in tiny pieces. I'd say about 98% of the time I can 
>>> describe the data I want from a site with a single conceptual query. 
>>> Indeed, once I've got the data into Needle I can almost always actually 
>>> produce that query. But on the source site, I usually can't, and thus we 
>>> are forced to waste everybody's time navigating the machines through 
>>> superfluous presentation rendering designed for people. 10-at-a-time 
>>> results lists, interminable AJAX refreshes, animated DIV reveals, grafting 
>>> back together the splintered bits of tree-traversals, etc. This is all 
>>> absurdly unnecessary. Why is anybody having to "crawl" an open semantic-web 
>>> dataset? Isn't there a "download" link, and/or a SPARQL endpoint? If there 
>>> isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of 
>>> data interoperability, what are we?
>> 
>> glenn
>> (www.needlebase.com)
> 
> 

Social Web Architect
http://bblfish.net/

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Yves Raimond

On Wed, Jun 22, 2011 at 8:29 PM, Andreas Harth  wrote:
> Hi Martin,
>
> On 06/22/2011 09:08 PM, Martin Hepp wrote:
>>
>> Please make a survey among typical Web site owners on how many of them
>> have
>>
>> 1. access to this level of server configuration and
>
>> 2. the skills necessary to implement these recommendations.
>
> d'accord .
>
> But the case we're discussing there's also:
>
> 3. publishes millions of pages
>
> I am glad you brought up the issue, as there are several data providers
> out there (some with quite prominent names) with hundreds of millions of
> triples, but unable to sustain lookups every couple of seconds or so.

Very funny :-) At peak times, a single crawler was hitting us with 150
rq/s. Quite far from "every couple of seconds or so".

Best,
y

>
> I am very much in favour of amateur web enthusiasts (I would like to claim
> I've started as one).  Unfortunately, you get them on both ends, publishers
> and consumers.  Postel's law applies to both, I guess.
>
> Best regards,
> Andreas.
>
>

Re: Think before you write Semantic Web crawlers


On 6/22/11 8:29 PM, Andreas Harth wrote:

I am glad you brought up the issue, as there are several data providers
out there (some with quite prominent names) with hundreds of millions of
triples, but unable to sustain lookups every couple of seconds or so. 

But that's quite general a statement you make.

What about the fact that the data providers have configured their 
systems such that lookups aren't performed by anyone every couple of 
seconds or so? That pinhole is enough fooder for DoS in Web context :-)


--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth


Hi Martin,

On 06/22/2011 09:08 PM, Martin Hepp wrote:

Please make a survey among typical Web site owners on how many of them have

1. access to this level of server configuration and

> 2. the skills necessary to implement these recommendations.

d'accord .

But the case we're discussing there's also:

3. publishes millions of pages

I am glad you brought up the issue, as there are several data providers
out there (some with quite prominent names) with hundreds of millions of
triples, but unable to sustain lookups every couple of seconds or so.

I am very much in favour of amateur web enthusiasts (I would like to claim
I've started as one).  Unfortunately, you get them on both ends, publishers
and consumers.  Postel's law applies to both, I guess.

Best regards,
Andreas.

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Melvin Carvalho

On 22 June 2011 16:41, William Waites  wrote:
> What does WebID have to do with JSON? They're somehow representative
> of two competing trends.
>
> The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
> easier to work with RDF for your average programmer, to remove the
> need for complex parsers, etc. and generally to lower the barriers.
>
> The WebID arrangement is about raising barriers. Not intended to be
> the same kind of barriers, certainly the intent isn't to make
> programmer's lives more difficult, rather to provide a good way to do
> distributed authentication without falling into the traps of PKI and
> such.
>
> While I like WebID, and I think it is very elegant, the fact is that I
> can use just about any HTTP client to retrieve a document whereas to
> get rdf processing clients, agents, whatever, to do it will require
> quite a lot of work [1]. This is one reason why, for example, 4store's
> arrangement of /sparql/ for read operations and /data/ and /update/
> for write operations is *so* much easier to work with than Virtuoso's
> OAuth and WebID arrangement - I can just restrict access using all of
> the normal tools like apache, nginx, squid, etc..
>
> So in the end we have some work being done to address the perception
> that RDF is difficult to work with and on the other hand a suggestion
> of widespread putting in place of authentication infrastructure which,
> whilst obviously filling a need, stands to make working with the data
> behind it more difficult.
>
> How do we balance these two tendencies?

I think It's fine to use JSON / LD for a bunch of stuff, but adopting
WebID will pay back the investment over time, imho.

A machine readable, identity aware, world is where the Web is heading,
or maybe is where it started, but many of us failed to see it.

Perhaps it's possible to wait for the tool chains to catch up, or
build a bridge between the two, which is even better still.

>
> [1] examples of non-WebID aware clients: rapper / rasqal, python
> rdflib, curl, the javascript engine in my web browser that doesn't
> properly support client certificates, etc.
> --
> William Waites                
> http://river.styx.org/ww/        
> F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
>
>

Re: Think before you write Semantic Web crawlers


On 6/22/11 8:08 PM, Martin Hepp wrote:

Hi Andreas:

Please make a survey among typical Web site owners on how many of them have

1. access to this level of server configuration and
2. the skills necessary to implement these recommendations.


Okay, I think that answers my question from last post :-)

+1


The WWW was anti-pedantic by design.


I would say "deceptively simple" . Basically, there are complexities to 
AWWW, but never hitting you at the front door re. initial engagement. 
The great thing about "deceptively simple" is that this kind of systems 
architecture ultimately delivers pleasant surprises. This is also why 
turning RDF into a Linked Data distraction sets me off, big time! The 
AWWW at its core already had the mechanism for Linked Data via use of 
hyperlinks for whole data representation built in.

  This was the root of its success.


Yes!


The pedants were the traditional SGML/Hypertext communities. Why are we 
breeding new pedants?


I don't know :-)


Kingsley

Martin

On Jun 22, 2011, at 11:44 AM, Andreas Harth wrote:


Hi Christopher,

On 06/22/2011 10:14 AM, Christopher Gutteridge wrote:

Right now queries to data.southampton.ac.uk (eg.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live,
but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL
endpoint which caches results which we can turn on if the load gets too high,
which should at least mitigate the problem. Very few datasets change in a 24
hours period.

setting the Expires header and enabling mod_cache in Apache httpd (or adding
a Squid proxy in front of the HTTP server) works quite well in these cases.

Best regards,
Andreas.







--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

On 6/22/11 8:05 PM, Martin Hepp wrote:

Glenn:

If there isn't, why not? We're the Semantic Web, dammit. If we aren't the
masters of data interoperability, what are we?

The main question is: Is the Semantic Web an evolutionary improvement of the
Web, the Web understood as an ecosystem comprising protocols, data models,
people, and economics - or is it a tiny special interest branch.

As said: I bet a bottle of champagne that the academic Semantic Web community's technical
proposals will never gain more than 10 % market share among "real" site-owners,
because of
- unnecessary complexity (think of the simplicity of publishing an HTML page
vs. following LOD publishing principles),
- bad design decisions (e.g explicit datatyping of data instances in RDFa),
- poor documentation for non-geeks, and
- a lack of understanding of the economics of technology diffusion.

Hoping you don't place WebID in the academic adventure bucket, right?

WebID, like URI abstraction, is well thought out critical infrastructure
tech.

Kingsley

Never ever.

Best

Martin

On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:

> From my perspective as the designer of a system that both consumes and publishes data, the
load/burden issue here is not at all particular to the semantic web. Needle obeys robots.txt rules,
but that's a small deal compared to the difficulty of extracting whole data from sites set up to
deliver it only in tiny pieces. I'd say about 98% of the time I can describe the data I want from a
site with a single conceptual query. Indeed, once I've got the data into Needle I can almost always
actually produce that query. But on the source site, I usually can't, and thus we are forced to waste
everybody's time navigating the machines through superfluous presentation rendering designed for
people. 10-at-a-time results lists, interminable AJAX refreshes, animated DIV reveals, grafting back
together the splintered bits of tree-traversals, etc. This is all absurdly unnecessary. Why is anybody
having to "crawl" an open semantic-web dataset? Isn't there a "download" link,
and/or a SPARQL endpoint? If there isn't, why not? We're the Semantic Web, dammit. If we aren't the
masters of data interoperability, what are we?

glenn
(www.needlebase.com)

Regards,

Kingsley Idehen
President& CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

Hi Andreas:

Please make a survey among typical Web site owners on how many of them have

1. access to this level of server configuration and
2. the skills necessary to implement these recommendations.

The WWW was anti-pedantic by design. This was the root of its success. The 
pedants were the traditional SGML/Hypertext communities. Why are we breeding 
new pedants?

Martin

On Jun 22, 2011, at 11:44 AM, Andreas Harth wrote:

> Hi Christopher,
> 
> On 06/22/2011 10:14 AM, Christopher Gutteridge wrote:
>> Right now queries to data.southampton.ac.uk (eg.
>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made 
>> live,
>> but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL
>> endpoint which caches results which we can turn on if the load gets too high,
>> which should at least mitigate the problem. Very few datasets change in a 24
>> hours period.
> 
> setting the Expires header and enabling mod_cache in Apache httpd (or adding
> a Squid proxy in front of the HTTP server) works quite well in these cases.
> 
> Best regards,
> Andreas.
>

Re: Think before you write Semantic Web crawlers


On 6/22/11 7:57 PM, Martin Hepp wrote:

Jiri:
The crawlers causing problems were run by Universities, mostly in the context 
of ISWC submissions. No need to cast any doubt on that.

All:
As a consequence to those events, I will not publish sitemaps etc. of future 
GoodRelations datasets on these lists, but just inform non-toy consumers.
If you consider yourself a non-toy consumer of e-commerce data, please send me 
an e-mail, and we will add you to out ping chain.

We will also stop sending pings to PTSW, Watson, Swoogle, et al., because they 
will just expose sites adopting GoodRelations and related technology to 
academic crawling.

In the meantime, I recommend the LOD bubble diagram sources for 
self-referential research.


Martin,

Linked Data is Linked Data. Serendipitious Discovery Quotient of every 
LINK is inherently high, and higher as the mesh gets denser. Once a 
Linked Data is out there is a path to crawl.


Inevitably Linked Data access needs ACL control. Luckily we actually do 
have a solution in WebID. Maybe, we use this problem as another use case 
this time addressing:


1. HTML+RDFa
2. Access Control Lists
3. Crawling.

We have to protect innovations via innovation. If Linked Data is useful 
then it should provide foundation for its consumption and publication 
challenges. These issues are just beginning, there is so much more to come.


Let's just solve the problem. Requesting good behavior will never bring 
stability or decorum to a jungle full of critters. We have to make them 
feel the robustness of the system :-)


Kingsley


Best
M. Hepp



On Jun 22, 2011, at 4:03 PM, Jiří Procházka wrote:


I understand that, but I doubt your conclusion, that those crawlers are
targeting semantic web, since like you said they don't even properly
identify themselves and as far as I know, on Universities also regular
web search and crawling is researched. Maybe lot of them are targeting
semantic web, but we should look at all measures to conserve bandwidth,
from avoiding regular web crawler interest, aiding infrastructure like
Ping the Semantic Web to optimizing delivery and even distribution of
the data among resouces.

Best,
Jiri

On 06/22/2011 03:21 PM, Martin Hepp wrote:

Thanks, Jiri, but the load comes from academic crawler prototypes firing from 
broad University infrastructures.
Best
Martin


On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:


I wonder, are ways to link RDF data so that convential crawlers do not
crawl it, but only the semantic web aware ones do?
I am not sure how the current practice of linking by link tag in the
html headers could cause this, but it may be case that those heavy loads
come from a crawlers having nothing to do with semantic web...
Maybe we should start linking to our rdf/xml, turtle, ntriples files and
publishing sitemap info in RDFa...

Best,
Jiri

On 06/22/2011 09:00 AM, Steve Harris wrote:

While I don't agree with Andreas exactly that it's the site owners fault, this 
is something that publishers of non-semantic data have to deal with.

If you publish a large collection of interlinked data which looks interesting 
to conventional crawlers and is expensive to generate, conventional web 
crawlers will be all over it. The main difference is that a greater percentage 
of those are written properly, to follow robots.txt and the guidelines about 
hit frequency (maximum 1 request per second per domain, no parallel crawling).

Has someone published similar guidelines for semantic web crawlers?

The ones that don't behave themselves get banned, either in robots.txt, or 
explicitly by the server.

- Steve

On 2011-06-22, at 06:07, Martin Hepp wrote:


Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to blame 
for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by 
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

But for sure, it is necessary to advise all publishers of large RDF datasets to 
protect themselves against hungry crawlers and actual DoS attacks.

Imagine if a large site was brought down by a botnet that is exploiting 
Semantic Sitemap information for DoS attacks, focussing on the large dump files.
This could end LOD experiments for that site.


Best

Martin


On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:


Hi Martin,

Have you tried to put a Squid [1]  as reverse proxy in front of your servers 
and use delay pools [2] to catch hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:


Hi all:

For the third time in a few weeks, we had massive complaints from site-owners 
that Semantic Web crawlers from Universities visited their sites in a way close 
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a 
parallelized app

Re: Think before you write Semantic Web crawlers

Glenn:

> If there isn't, why not? We're the Semantic Web, dammit. If we aren't the 
> masters of data interoperability, what are we?
The main question is: Is the Semantic Web an evolutionary improvement of the 
Web, the Web understood as an ecosystem comprising protocols, data models, 
people, and economics - or is it a tiny special interest branch.

As said: I bet a bottle of champagne that the academic Semantic Web community's 
technical proposals will never gain more than 10 % market share among "real" 
site-owners, because of
- unnecessary complexity (think of the simplicity of publishing an HTML page 
vs. following LOD publishing principles),
- bad design decisions (e.g explicit datatyping of data instances in RDFa),
- poor documentation for non-geeks, and
- a lack of understanding of the economics of technology diffusion.

Never ever.

Best

Martin

On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:

> >From my perspective as the designer of a system that both consumes and 
> >publishes data, the load/burden issue here is not at all particular to the 
> >semantic web. Needle obeys robots.txt rules, but that's a small deal 
> >compared to the difficulty of extracting whole data from sites set up to 
> >deliver it only in tiny pieces. I'd say about 98% of the time I can describe 
> >the data I want from a site with a single conceptual query. Indeed, once 
> >I've got the data into Needle I can almost always actually produce that 
> >query. But on the source site, I usually can't, and thus we are forced to 
> >waste everybody's time navigating the machines through superfluous 
> >presentation rendering designed for people. 10-at-a-time results lists, 
> >interminable AJAX refreshes, animated DIV reveals, grafting back together 
> >the splintered bits of tree-traversals, etc. This is all absurdly 
> >unnecessary. Why is anybody having to "crawl" an open semantic-web dataset? 
> >Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why 
> >not? We're the Semantic Web, dammit. If we aren't the masters of data 
> >interoperability, what are we?
> 
> glenn
> (www.needlebase.com)

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)


On 6/22/11 4:14 PM, William Waites wrote:

* [2011-06-22 16:00:49 +0100] Kingsley Idehen  écrit:

] explain to me how the convention you espouse enables me confine access
] to a SPARQL endpoint for:
]
] A person identified by URI based Name (WebID) that a member of a
] foaf:Group (which also has its own WebID).

This is not a use case I encounter much. Usually I have some
application code that needs write access to the store and some public
code (maybe javascript in a browser, maybe some program run by a third
party) that needs read access.


I am assuming you seek multiple users of your end product (the 
application), right?


I assume all users aren't equal i.e., they have varying profiles, right?

If the answer is to teach my application code about WebID, it's going
to be a hard sell because really I want to be working on other things
than protocol plumbing.


Remember, I like to take the "problem solving approach" to technology. 
Never technology for the sake of it, never.


There is a fundamental problem: you seek >1 users of your apps. All 
users aren't the same, profile wise.


Simple case in point, right here, and right now. This thread is about a 
critical challenge (always there btw) that Linked Data propagation 
unveils. The very same problems hit us in the early '90s re. ODBC i.e., 
how do we control access to data bearing in mind ODBC application user 
profile variations. Should anyone be able to access pensions and payroll 
data to make a very obvious example.


The gaping security hole that ODBC introduced to the enterprise is still 
doing damage to this very day. I won't mention names, but as you hear 
about security breaches, do some little digging about what's behind many 
of these systems. Hint: a relational database, and free ODBC, JDBC, 
OLE-DB, ADO.NET providers, in many cases. Have one of those libraries  
on a system, you can get into the RDBMS via social engineering (in the 
absolute worst case, or throttle with CPUs for passwords).


Way back then we use Windows INI structure to construct a graph based 
data representation format that we called "session rules book". Via 
these rules we enabled organizations to say: Kingsley can only access 
records in certain ODBC/JDBC/OLE-DB/ADO.NET accessible databases if he 
met certain criteria that included the IP address he logs in from, his 
username, client application name, arbitrary identifiers that the system 
owner could conjure up etc.. The only drag for us what it was little 
OpenLink rather than a behemoth like Microsoft.


When we encountered RDF and the whole Semantic Web vision we realized 
there was a standardized route for addressing these fundamental issues. 
This is why WebID is simply a major deal. It is inherently quite 
contradictory to push Linked Data and push-back at WebID. That's only 
second to rejecting essence of URI abstraction by conflating Names and 
Addresses re. fine grained data access that address troubling problems 
of yore.




If you then go further and say that *all* access to the endpoint needs
to use WebID because of resource-management issues, then every client
now needs to do a bunch of things that end with shaving a yak before
they can even start on working on whatever they were meant to be
working on.



No.

This is what we (WebID implementers) are saying:

1. Publish Linked Data
2. Apply Linked Data prowess to the critical issue of controlled access 
to Linked Data Spaces.


Use Linked Data to solve a real problem. In doing so we'll achieve the 
critical mass we all seek because the early adopters of Linked Data will 
be associated with:


1. Showing how Linked Data solves a real problem
2. Using Linked Data to make its use and consumption easier for others 
who seek justification and use case examples en route to full investment.



On the other hand, arranging things so that access control can be done
by existing tools without burdening the clients is a lot easier, if
less general. And easier is what we want working with RDF to be.


It has nothing to do with RDF. It has everything to do with Linked Data 
i.e., Data Objects endowed with Names that resolve to their 
Representations. Said representations take the form of EAV/SPO based 
graphs. RDF is one of the options for achieving this goal via a syntax 
with high semantic fidelity (most of that comes from granularity 
covering datatypes and locale issues).


What people want, and have always sought is: open access to relevant 
data from platforms and tools of their choice without any performance or 
security compromises. HTTP, URIs, and exploitation of full URI 
abstraction as mechanism for graph based whole data representation, 
without graph format/syntax distractions is the beachhead we need right 
now. The semantic fidelity benefits of RDF re. datatypes and locale 
issues comes after that. Thus, first goal is to actually simplify Linked 
Data, and make its use and exploitation practical,  starting with 
appreciation of trust logic based AC

Re: Think before you write Semantic Web crawlers

Jiri:
The crawlers causing problems were run by Universities, mostly in the context 
of ISWC submissions. No need to cast any doubt on that.

All:
As a consequence to those events, I will not publish sitemaps etc. of future 
GoodRelations datasets on these lists, but just inform non-toy consumers.
If you consider yourself a non-toy consumer of e-commerce data, please send me 
an e-mail, and we will add you to out ping chain.

We will also stop sending pings to PTSW, Watson, Swoogle, et al., because they 
will just expose sites adopting GoodRelations and related technology to 
academic crawling.

In the meantime, I recommend the LOD bubble diagram sources for 
self-referential research.

Best
M. Hepp



On Jun 22, 2011, at 4:03 PM, Jiří Procházka wrote:

> I understand that, but I doubt your conclusion, that those crawlers are
> targeting semantic web, since like you said they don't even properly
> identify themselves and as far as I know, on Universities also regular
> web search and crawling is researched. Maybe lot of them are targeting
> semantic web, but we should look at all measures to conserve bandwidth,
> from avoiding regular web crawler interest, aiding infrastructure like
> Ping the Semantic Web to optimizing delivery and even distribution of
> the data among resouces.
> 
> Best,
> Jiri
> 
> On 06/22/2011 03:21 PM, Martin Hepp wrote:
>> Thanks, Jiri, but the load comes from academic crawler prototypes firing 
>> from broad University infrastructures.
>> Best
>> Martin
>> 
>> 
>> On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
>> 
>>> I wonder, are ways to link RDF data so that convential crawlers do not
>>> crawl it, but only the semantic web aware ones do?
>>> I am not sure how the current practice of linking by link tag in the
>>> html headers could cause this, but it may be case that those heavy loads
>>> come from a crawlers having nothing to do with semantic web...
>>> Maybe we should start linking to our rdf/xml, turtle, ntriples files and
>>> publishing sitemap info in RDFa...
>>> 
>>> Best,
>>> Jiri
>>> 
>>> On 06/22/2011 09:00 AM, Steve Harris wrote:
 While I don't agree with Andreas exactly that it's the site owners fault, 
 this is something that publishers of non-semantic data have to deal with.
 
 If you publish a large collection of interlinked data which looks 
 interesting to conventional crawlers and is expensive to generate, 
 conventional web crawlers will be all over it. The main difference is that 
 a greater percentage of those are written properly, to follow robots.txt 
 and the guidelines about hit frequency (maximum 1 request per second per 
 domain, no parallel crawling).
 
 Has someone published similar guidelines for semantic web crawlers?
 
 The ones that don't behave themselves get banned, either in robots.txt, or 
 explicitly by the server. 
 
 - Steve
 
 On 2011-06-22, at 06:07, Martin Hepp wrote:
 
> Hi Daniel,
> Thanks for the link! I will relay this to relevant site-owners.
> 
> However, I still challenge Andreas' statement that the site-owners are to 
> blame for publishing large amounts of data on small servers.
> 
> One can publish 10,000 PDF documents on a tiny server without being hit 
> by DoS-style crazy crawlers. Why should the same not hold if I publish 
> RDF?
> 
> But for sure, it is necessary to advise all publishers of large RDF 
> datasets to protect themselves against hungry crawlers and actual DoS 
> attacks.
> 
> Imagine if a large site was brought down by a botnet that is exploiting 
> Semantic Sitemap information for DoS attacks, focussing on the large dump 
> files. 
> This could end LOD experiments for that site.
> 
> 
> Best
> 
> Martin
> 
> 
> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> 
>> 
>> Hi Martin,
>> 
>> Have you tried to put a Squid [1]  as reverse proxy in front of your 
>> servers and use delay pools [2] to catch hungry crawlers?
>> 
>> Cheers,
>> Daniel
>> 
>> [1] http://www.squid-cache.org/
>> [2] http://wiki.squid-cache.org/Features/DelayPools
>> 
>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>> 
>>> Hi all:
>>> 
>>> For the third time in a few weeks, we had massive complaints from 
>>> site-owners that Semantic Web crawlers from Universities visited their 
>>> sites in a way close to a denial-of-service attack, i.e., crawling data 
>>> with maximum bandwidth in a parallelized approach.
>>> 
>>> It's clear that a single, stupidly written crawler script, run from a 
>>> powerful University network, can quickly create terrible traffic load. 
>>> 
>>> Many of the scripts we saw
>>> 
>>> - ignored robots.txt,
>>> - ignored clear crawling speed limitations in robots.txt,
>>> - did not identify themselves properly in the H

CfP: ACM RecSys 2011 International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)

2011-06-22 Thread Iván Cantador

[Apologies if you receive this more than once]


2nd Call for Papers

2nd International Workshop on Information Heterogeneity and Fusion in
Recommender Systems (HetRec 2011)
27th October 2011 | Chicago, IL, USA 
http://ir.ii.uam.es/hetrec2011 

Held in conjunction with the
5th ACM Conference on Recommender Systems (RecSys 2011) 
http://recsys.acm.org/2011 

 

+++
Important dates
+++

* Paper submission: 25 July 2011
* Notification of acceptance:   19 August 2011
* Camera-ready version due: 12 September 2011
* HetRec 2011 Workshop: 27 October 2011

++
Motivation
++

In recent years, increasing attention has been given to finding ways for
combining, integrating and mediating heterogeneous sources of information
for the purpose of providing better personalized services in many
information seeking and e-commerce applications. Information heterogeneity
can indeed be identified in any of the pillars of a recommender system: the
modeling of user preferences, the description of resource contents, the
modeling and exploitation of the context in which recommendations are made,
and the characteristics of the suggested resource lists.

Almost all current recommender systems are designed for specific domains and
applications, and thus usually try to make best use of a local user model,
using a single kind of personal data, and without explicitly addressing the
heterogeneity of the existing personal information that may be freely
available (on social networks, homepages, etc.). Recognizing this
limitation, among other issues: a) user models could be based on different
types of explicit and implicit personal preferences, such as ratings, tags,
textual reviews, records of views, queries, and purchases; b) recommended
resources may belong to several domains and media, and may be described with
multilingual metadata; c) context could be modeled and exploited in
multi-dimensional feature spaces; d) and ranked recommendation lists could
be diverse according to particular user preferences and resource attributes,
oriented to groups of users, and driven by multiple user evaluation
criteria.

The aim of HetRec workshop is to bring together students, faculty,
researchers and professionals from both academia and industry who are
interested in addressing any of the above forms of information heterogeneity
and fusion in recommender systems.

The workshop goals are broad. We would like to raise awareness of the
potential of using multiple sources of information, and look for sharing
expertise and suitable models and techniques. Another dire need is for
strong datasets, and one of our aims is to establish benchmarks and standard
datasets on which the problems could be investigated.

++
Topics of interest
++

The goal of the workshop is to bring together researchers and practitioners
interested in addressing the challenges posed by information heterogeneity
in recommender systems, and studying information fusion in this context.

Topics of interest include, but are not limited to:

* Fusion of user profiles from different representations, e.g. ratings,
text reviews, tags, and bookmarks
* Combination of short- and long-term user preferences
* Combination of different types of user preferences: tastes, interests,
needs, goals, mood
* Cross-domain recommendations, based on user preferences about
different interest aspects, e.g. by merging movie and music tastes
* Cross-representation recommendations, considering diverse sources of
user preferences: explicit and implicit feedback
* Recommendation of resources of different nature: news, reviews,
scientific papers, etc.
* Recommendation of resources belonging to different multimedia: text,
image, audio, video
* Recommendation of diverse resources, e.g. according to content
attributes, and user consuming behaviors
* Recommendation of resources annotated in different languages
* Contextualization of multiple user preferences, e.g. by distinguishing
user preferences at work and on holidays
* Cross-context recommendations, e.g. by merging information about
location, time and social aspects
* Multi-dimensional recommendation based on several contextual features,
e.g. physical and social environment, device and network settings, and
external events
* Multi-criteria recommendation, exploiting ratings and evaluations
about multiple user/item characteristics
* Group recommendation, oriented to several users, e.g. suggesting
tourist attractions to a group of friends, and suggesting a TV show to a
family

+++
Keynote
+++

We are pleased to announce that Yehuda Koren, from Yahoo! Research, will be
our invited speaker.

The title of his talk is "I Want to Answer, Who Has a Question? Yahoo!
Answers Recommender Sys

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

On 22 Jun 2011, at 17:14, William Waites wrote:

> * [2011-06-22 16:00:49 +0100] Kingsley Idehen  écrit:
> 
> ] explain to me how the convention you espouse enables me confine access 
> ] to a SPARQL endpoint for:
> ] 
> ] A person identified by URI based Name (WebID) that a member of a 
> ] foaf:Group (which also has its own WebID).
> 
> This is not a use case I encounter much. Usually I have some
> application code that needs write access to the store and some public
> code (maybe javascript in a browser, maybe some program run by a third
> party) that needs read access.
> 
> If the answer is to teach my application code about WebID, it's going
> to be a hard sell because really I want to be working on other things
> than protocol plumbing.

So you're in luck. https is shipped in all client libraries, so you just need
to get your application a webid certificate. That should be as easy as one post 
request
to get it. At least for browsers it's a one click affair for the end user, as 
shown
here

   http://bblfish.net/blog/2011/05/25/

It would be easy to do the same for robots. In fact that is why at the 
University
of Manchester Bruno Harbulot and Mike Jones are using WebID for their Grid 
computing 
work, because it makes access control to the grid so much easier that any of 
the other
top heavy technologies available.

> If you then go further and say that *all* access to the endpoint needs
> to use WebID because of resource-management issues, then every client
> now needs to do a bunch of things that end with shaving a yak before
> they can even start on working on whatever they were meant to be
> working on.

You can be very flexible there.  If users have WebId you give them a better 
service.
Seems fair deal. It can also be very flexible. You don't need all your site to 
be WebID
enabled. You could use cookie auth on http endpoints, and for clients that 
don't have
a cookie redirect them to an https endpoint where they can auth with webid. If 
they don't
ask them to auth with somithing like OpenId. I'd say pretty soon your crawlers 
and users will
be a lot happier with WebID.

> On the other hand, arranging things so that access control can be done
> by existing tools without burdening the clients is a lot easier, if
> less general. And easier is what we want working with RDF to be.

All your tools probably already are webId enabled. It's just a matter now of
giving a foaf profile to yourself and robots, getting a cert with the webid 
inthere,
and getting going. Seems to be that that's a lot easier than building crawlers, 
or semweb clients, or semweb servers, or pretty much anything.

Henry

> 
> Cheers,
> -w
> 
> -- 
> William Waites
> http://river.styx.org/ww/
> F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
> 

Social Web Architect
http://bblfish.net/

WebID and client tools - was: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

On 22 Jun 2011, at 16:41, William Waites wrote:

> 
> [1] examples of non-WebID aware clients: rapper / rasqal, python
> rdflib, curl, the javascript engine in my web browser that doesn't
> properly support client certificates, etc.

curl is WebID aware. You just need to get yourself a certificate for your 
crawler, and then use

  -E/--cert   

arguments to pass that certificate if the server requests it. 

The specs for HTTPS client certs are so old and well established that it is 
built by default into most libraries. So it would not take a lot to expose it, 
if it is not already in all the libs you mention.

But thanks for this new FAQ [1]. We'll try to fill in the details on how to 
work with the libs above using webid.

There is a Javascript layer for https too, but what is the point of doing that 
there? Let the browser do the https for you.

Henry

[1] http://www.w3.org/wiki/Foaf%2Bssl/FAQ

Social Web Architect
http://bblfish.net/

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread William Waites

* [2011-06-22 16:00:49 +0100] Kingsley Idehen  écrit:

] explain to me how the convention you espouse enables me confine access 
] to a SPARQL endpoint for:
] 
] A person identified by URI based Name (WebID) that a member of a 
] foaf:Group (which also has its own WebID).

This is not a use case I encounter much. Usually I have some
application code that needs write access to the store and some public
code (maybe javascript in a browser, maybe some program run by a third
party) that needs read access.

If the answer is to teach my application code about WebID, it's going
to be a hard sell because really I want to be working on other things
than protocol plumbing.

If you then go further and say that *all* access to the endpoint needs
to use WebID because of resource-management issues, then every client
now needs to do a bunch of things that end with shaving a yak before
they can even start on working on whatever they were meant to be
working on.

On the other hand, arranging things so that access control can be done
by existing tools without burdening the clients is a lot easier, if
less general. And easier is what we want working with RDF to be.

Cheers,
-w

-- 
William Waites
http://river.styx.org/ww/
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)


On 6/22/11 4:08 PM, Dave Reynolds wrote:

On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote:

Hi,

On 22 June 2011 15:41, William Waites  wrote:

What does WebID have to do with JSON? They're somehow representative
of two competing trends.

The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
easier to work with RDF for your average programmer, to remove the
need for complex parsers, etc. and generally to lower the barriers.

The WebID arrangement is about raising barriers. Not intended to be
the same kind of barriers, certainly the intent isn't to make
programmer's lives more difficult, rather to provide a good way to do
distributed authentication without falling into the traps of PKI and
such.

While I like WebID, and I think it is very elegant, the fact is that I
can use just about any HTTP client to retrieve a document whereas to
get rdf processing clients, agents, whatever, to do it will require
quite a lot of work [1]. This is one reason why, for example, 4store's
arrangement of /sparql/ for read operations and /data/ and /update/
for write operations is *so* much easier to work with than Virtuoso's
OAuth and WebID arrangement - I can just restrict access using all of
the normal tools like apache, nginx, squid, etc..

So in the end we have some work being done to address the perception
that RDF is difficult to work with and on the other hand a suggestion
of widespread putting in place of authentication infrastructure which,
whilst obviously filling a need, stands to make working with the data
behind it more difficult.

How do we balance these two tendencies?

By recognising that often we just need to use existing technologies
more effectively and more widely, rather than throw more technology at
a problem, thereby creating an even greater education and adoption
problem?

+1

Don't raise barriers to linked data use/publication by tying it to
widespread adoption and support for WebID.


-1

You are misunderstanding WebID and what it delivers.

I am popping out, but I expect a response. Should Henry not put this 
misconception to REST, I'll certainly reply.


Got to go do some walking for now :-)

Dave







--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)


On 6/22/11 3:41 PM, William Waites wrote:

So in the end we have some work being done to address the perception
that RDF is difficult to work with and on the other hand a suggestion
of widespread putting in place of authentication infrastructure which,
whilst obviously filling a need, stands to make working with the data
behind it more difficult.

That's really a misconception if WebID is the target of that commentary.

WebID (assuming its the target) is an unobtrusive at it comes. That's 
why I say: its only second to the URI re. inherent power to exploit AWWW.



--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Dave Reynolds

On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote: 
> Hi,
> 
> On 22 June 2011 15:41, William Waites  wrote:
> > What does WebID have to do with JSON? They're somehow representative
> > of two competing trends.
> >
> > The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
> > easier to work with RDF for your average programmer, to remove the
> > need for complex parsers, etc. and generally to lower the barriers.
> >
> > The WebID arrangement is about raising barriers. Not intended to be
> > the same kind of barriers, certainly the intent isn't to make
> > programmer's lives more difficult, rather to provide a good way to do
> > distributed authentication without falling into the traps of PKI and
> > such.
> >
> > While I like WebID, and I think it is very elegant, the fact is that I
> > can use just about any HTTP client to retrieve a document whereas to
> > get rdf processing clients, agents, whatever, to do it will require
> > quite a lot of work [1]. This is one reason why, for example, 4store's
> > arrangement of /sparql/ for read operations and /data/ and /update/
> > for write operations is *so* much easier to work with than Virtuoso's
> > OAuth and WebID arrangement - I can just restrict access using all of
> > the normal tools like apache, nginx, squid, etc..
> >
> > So in the end we have some work being done to address the perception
> > that RDF is difficult to work with and on the other hand a suggestion
> > of widespread putting in place of authentication infrastructure which,
> > whilst obviously filling a need, stands to make working with the data
> > behind it more difficult.
> >
> > How do we balance these two tendencies?
> 
> By recognising that often we just need to use existing technologies
> more effectively and more widely, rather than throw more technology at
> a problem, thereby creating an even greater education and adoption
> problem?

+1

Don't raise barriers to linked data use/publication by tying it to
widespread adoption and support for WebID.

Dave

Re: Think before you write Semantic Web crawlers


On 6/22/11 3:57 PM, Steve Harris wrote:

Yes, exactly.

I think that the problem is at least partly (and I say this as an ex-academic) 
that few people in academia have the slightest idea how much it costs to run a 
farm of servers in the Real World™.

 From the point of view of the crawler they're trying to get as much data as 
possible in a short a time as possible, but don't realise that the poor guy at 
the other end just got his 95th percentile shot through the roof, and now has a 
several thousand dollar bandwidth bill heading his way.

You can cap bandwidth, but that then might annoy paying customers, which is 
clearly not good.


Yes, so we need QoS algorithms or heuristics capable of fine-grained 
partitioning re. Who can do What, When, and Where :-)


Kingsley

- Steve

On 2011-06-22, at 12:54, Hugh Glaser wrote:


Hi Chris.
One way to do the caching really efficiently:
http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
Which is what rkb has always done.
But of course caching does not solve the problem of one bad crawler.
It actually makes it worse.
You add a cache write cost to the query, without a significant probability of a 
future cache hit. And increase disk usage.

Hugh

- Reply message -
From: "Christopher Gutteridge"
To: "Martin Hepp"
Cc: "Daniel Herzig", "semantic-...@w3.org", 
"public-lod@w3.org"
Subject: Think before you write Semantic Web crawlers
Date: Wed, Jun 22, 2011 9:18 am



The difference between these two scenarios is that there's almost no CPU 
involvement in serving the PDF file, but naive RDF sites use lots of cycles to 
generate the response to a query for an RDF document.

Right now queries to data.southampton.ac.uk (eg. 
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made 
live, but this is not efficient. My colleague, Dave Challis, has prepared a 
SPARQL endpoint which caches results which we can turn on if the load gets too 
high, which should at least mitigate the problem. Very few datasets change in a 
24 hours period.

Martin Hepp wrote:

Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to blame 
for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by 
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

But for sure, it is necessary to advise all publishers of large RDF datasets to 
protect themselves against hungry crawlers and actual DoS attacks.

Imagine if a large site was brought down by a botnet that is exploiting 
Semantic Sitemap information for DoS attacks, focussing on the large dump files.
This could end LOD experiments for that site.


Best

Martin


On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:



Hi Martin,

Have you tried to put a Squid [1]  as reverse proxy in front of your servers 
and use delay pools [2] to catch hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:



Hi all:

For the third time in a few weeks, we had massive complaints from site-owners 
that Semantic Web crawlers from Universities visited their sites in a way close 
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a 
parallelized approach.

It's clear that a single, stupidly written crawler script, run from a powerful 
University network, can quickly create terrible traffic load.

Many of the scripts we saw

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked 
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and 
re-crawling delays.

This irresponsible behavior can be the final reason for site-owners to say 
farewell to academic/W3C-sponsored semantic technology.

So please, please - advise all of your colleagues and students to NOT write simple 
crawler scripts for the billion triples challenge or whatsoever without familiarizing 
themselves with the state of the art in "friendly crawling".

Best wishes

Martin Hepp








--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/





--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)


On 6/22/11 3:52 PM, Leigh Dodds wrote:

Hi,

On 22 June 2011 15:41, William Waites  wrote:

What does WebID have to do with JSON? They're somehow representative
of two competing trends.

The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
easier to work with RDF for your average programmer, to remove the
need for complex parsers, etc. and generally to lower the barriers.

The WebID arrangement is about raising barriers. Not intended to be
the same kind of barriers, certainly the intent isn't to make
programmer's lives more difficult, rather to provide a good way to do
distributed authentication without falling into the traps of PKI and
such.

While I like WebID, and I think it is very elegant, the fact is that I
can use just about any HTTP client to retrieve a document whereas to
get rdf processing clients, agents, whatever, to do it will require
quite a lot of work [1]. This is one reason why, for example, 4store's
arrangement of /sparql/ for read operations and /data/ and /update/
for write operations is *so* much easier to work with than Virtuoso's
OAuth and WebID arrangement - I can just restrict access using all of
the normal tools like apache, nginx, squid, etc..

So in the end we have some work being done to address the perception
that RDF is difficult to work with and on the other hand a suggestion
of widespread putting in place of authentication infrastructure which,
whilst obviously filling a need, stands to make working with the data
behind it more difficult.

How do we balance these two tendencies?

By recognising that often we just need to use existing technologies
more effectively and more widely, rather than throw more technology at
a problem, thereby creating an even greater education and adoption
problem?


WebID is existing technology. It essence is:

1. URI based Names
2. URL based Data Access Addresses
3. Graph based Data Representation
4. Data Access Logic for determining Who can do what from what Address.

We are dogfooding the very technology we want people to use. We are 
applying it to a serious and unavoidable problem.




Cheers,

L.




--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers


On 6/22/11 3:44 PM, Karl Dubost wrote:

Le 22 juin 2011 à 10:41, Kingsley Idehen a écrit :

But that doesn't solve the big problem.

maybe… that solve the resource issue in the meantime ;)
small steps.

Yes, but it we make one small viral step. It just scales. Otherwise, we 
will continue to make little steps that don't scale etc..


--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)


On 6/22/11 3:41 PM, William Waites wrote:

While I like WebID, and I think it is very elegant, the fact is that I
can use just about any HTTP client to retrieve a document whereas to
get rdf processing clients, agents, whatever, to do it will require
quite a lot of work [1]. This is one reason why, for example, 4store's
arrangement of/sparql/  for read operations and/data/  and/update/
for write operations is*so*  much easier to work with than Virtuoso's
OAuth and WebID arrangement - I can just restrict access using all of
the normal tools like apache, nginx, squid, etc..

Huh?

WebID and SPARQL is about making an Endpoint with ACLs. ACL membership 
is driven by WebID for people, organizations, or groups (of either).


Don't really want to get into a Virtuoso vs 4-Store argument, but do 
explain to me how the convention you espouse enables me confine access 
to a SPARQL endpoint for:


A person identified by URI based Name (WebID) that a member of a 
foaf:Group (which also has its own WebID).


How does this approach leave ACL membership management to designated 
members of the foaf:Group?


Again, don't wanna do a 4-Store vs Virtuoso, but I really don't get your 
point re. WebID and the fidelity it brings to data access in general. 
Also note, SPARQL endpoints are but one type of data access address. 
WebID protects access to data accessible via Addresses by implicitly 
understanding the difference between a generic Name and a Name 
specifically used a Data Source Address or Location.




--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Steve Harris

Yes, exactly.

I think that the problem is at least partly (and I say this as an ex-academic) 
that few people in academia have the slightest idea how much it costs to run a 
farm of servers in the Real World™.

From the point of view of the crawler they're trying to get as much data as 
possible in a short a time as possible, but don't realise that the poor guy at 
the other end just got his 95th percentile shot through the roof, and now has a 
several thousand dollar bandwidth bill heading his way.

You can cap bandwidth, but that then might annoy paying customers, which is 
clearly not good.

- Steve

On 2011-06-22, at 12:54, Hugh Glaser wrote:

> Hi Chris.
> One way to do the caching really efficiently:
> http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
> Which is what rkb has always done.
> But of course caching does not solve the problem of one bad crawler.
> It actually makes it worse.
> You add a cache write cost to the query, without a significant probability of 
> a future cache hit. And increase disk usage.
> 
> Hugh
> 
> - Reply message -
> From: "Christopher Gutteridge" 
> To: "Martin Hepp" 
> Cc: "Daniel Herzig" , "semantic-...@w3.org" 
> , "public-lod@w3.org" 
> Subject: Think before you write Semantic Web crawlers
> Date: Wed, Jun 22, 2011 9:18 am
> 
> 
> 
> The difference between these two scenarios is that there's almost no CPU 
> involvement in serving the PDF file, but naive RDF sites use lots of cycles 
> to generate the response to a query for an RDF document.
> 
> Right now queries to data.southampton.ac.uk (eg. 
> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made 
> live, but this is not efficient. My colleague, Dave Challis, has prepared a 
> SPARQL endpoint which caches results which we can turn on if the load gets 
> too high, which should at least mitigate the problem. Very few datasets 
> change in a 24 hours period.
> 
> Martin Hepp wrote:
> 
> Hi Daniel,
> Thanks for the link! I will relay this to relevant site-owners.
> 
> However, I still challenge Andreas' statement that the site-owners are to 
> blame for publishing large amounts of data on small servers.
> 
> One can publish 10,000 PDF documents on a tiny server without being hit by 
> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
> 
> But for sure, it is necessary to advise all publishers of large RDF datasets 
> to protect themselves against hungry crawlers and actual DoS attacks.
> 
> Imagine if a large site was brought down by a botnet that is exploiting 
> Semantic Sitemap information for DoS attacks, focussing on the large dump 
> files.
> This could end LOD experiments for that site.
> 
> 
> Best
> 
> Martin
> 
> 
> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> 
> 
> 
> Hi Martin,
> 
> Have you tried to put a Squid [1]  as reverse proxy in front of your servers 
> and use delay pools [2] to catch hungry crawlers?
> 
> Cheers,
> Daniel
> 
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
> 
> On 21.06.2011, at 09:49, Martin Hepp wrote:
> 
> 
> 
> Hi all:
> 
> For the third time in a few weeks, we had massive complaints from site-owners 
> that Semantic Web crawlers from Universities visited their sites in a way 
> close to a denial-of-service attack, i.e., crawling data with maximum 
> bandwidth in a parallelized approach.
> 
> It's clear that a single, stupidly written crawler script, run from a 
> powerful University network, can quickly create terrible traffic load.
> 
> Many of the scripts we saw
> 
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked 
> contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and 
> re-crawling delays.
> 
> This irresponsible behavior can be the final reason for site-owners to say 
> farewell to academic/W3C-sponsored semantic technology.
> 
> So please, please - advise all of your colleagues and students to NOT write 
> simple crawler scripts for the billion triples challenge or whatsoever 
> without familiarizing themselves with the state of the art in "friendly 
> crawling".
> 
> Best wishes
> 
> Martin Hepp
> 
> 
> 
> 
> 
> 
> 
> 
> --
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> 
> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
> 
> 

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Leigh Dodds

Hi,

On 22 June 2011 15:41, William Waites  wrote:
> What does WebID have to do with JSON? They're somehow representative
> of two competing trends.
>
> The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
> easier to work with RDF for your average programmer, to remove the
> need for complex parsers, etc. and generally to lower the barriers.
>
> The WebID arrangement is about raising barriers. Not intended to be
> the same kind of barriers, certainly the intent isn't to make
> programmer's lives more difficult, rather to provide a good way to do
> distributed authentication without falling into the traps of PKI and
> such.
>
> While I like WebID, and I think it is very elegant, the fact is that I
> can use just about any HTTP client to retrieve a document whereas to
> get rdf processing clients, agents, whatever, to do it will require
> quite a lot of work [1]. This is one reason why, for example, 4store's
> arrangement of /sparql/ for read operations and /data/ and /update/
> for write operations is *so* much easier to work with than Virtuoso's
> OAuth and WebID arrangement - I can just restrict access using all of
> the normal tools like apache, nginx, squid, etc..
>
> So in the end we have some work being done to address the perception
> that RDF is difficult to work with and on the other hand a suggestion
> of widespread putting in place of authentication infrastructure which,
> whilst obviously filling a need, stands to make working with the data
> behind it more difficult.
>
> How do we balance these two tendencies?

By recognising that often we just need to use existing technologies
more effectively and more widely, rather than throw more technology at
a problem, thereby creating an even greater education and adoption
problem?

Cheers,

L.

-- 
Leigh Dodds
Programme Manager, Talis Platform
Mobile: 07850 928381
http://kasabi.com
http://talis.com

Talis Systems Ltd
43 Temple Row
Birmingham
B2 5LS

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Karl Dubost


Le 22 juin 2011 à 10:41, Kingsley Idehen a écrit :
> But that doesn't solve the big problem.

maybe… that solve the resource issue in the meantime ;)
small steps. 

-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

Re: Think before you write Semantic Web crawlers


On 6/22/11 3:34 PM, Karl Dubost wrote:

Le 21 juin 2011 à 03:49, Martin Hepp a écrit :

Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked 
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and 
re-crawling delays.


Do you have a list of those and how to identify them?
So we can put them in our blocking lists?

.htaccess or Apache config with rules such as:

# added for abusive downloads or not respecting robots.txt
SetEnvIfNoCase User-Agent ".*Technorati*." bad_bot
SetEnvIfNoCase User-Agent ".*WikioFeedBot*." bad_bot
# [… cut part of my list …]
Order Allow,Deny
Deny from 85.88.12.104
Deny from env=bad_bot
Allow from all





But that doesn't solve the big problem. An Apache module for WebID that 
allows QoS algorithms or heuristics based on Trust Logics is the only 
way this will scale, ultimately. Apache can get with the program, via 
modules. Henry and Joe and a few others are working on keeping Apache in 
step with the new Data Space dimension of the Web :-)


--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread William Waites

What does WebID have to do with JSON? They're somehow representative
of two competing trends.

The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
easier to work with RDF for your average programmer, to remove the
need for complex parsers, etc. and generally to lower the barriers.

The WebID arrangement is about raising barriers. Not intended to be
the same kind of barriers, certainly the intent isn't to make
programmer's lives more difficult, rather to provide a good way to do
distributed authentication without falling into the traps of PKI and
such.

While I like WebID, and I think it is very elegant, the fact is that I
can use just about any HTTP client to retrieve a document whereas to
get rdf processing clients, agents, whatever, to do it will require
quite a lot of work [1]. This is one reason why, for example, 4store's
arrangement of /sparql/ for read operations and /data/ and /update/
for write operations is *so* much easier to work with than Virtuoso's
OAuth and WebID arrangement - I can just restrict access using all of
the normal tools like apache, nginx, squid, etc..

So in the end we have some work being done to address the perception
that RDF is difficult to work with and on the other hand a suggestion
of widespread putting in place of authentication infrastructure which,
whilst obviously filling a need, stands to make working with the data
behind it more difficult.

How do we balance these two tendencies?

[1] examples of non-WebID aware clients: rapper / rasqal, python
rdflib, curl, the javascript engine in my web browser that doesn't
properly support client certificates, etc.
-- 
William Waites
http://river.styx.org/ww/
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Karl Dubost


Le 21 juin 2011 à 03:49, Martin Hepp a écrit :
> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked 
> contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and 
> re-crawling delays.


Do you have a list of those and how to identify them?
So we can put them in our blocking lists?

.htaccess or Apache config with rules such as:

# added for abusive downloads or not respecting robots.txt
SetEnvIfNoCase User-Agent ".*Technorati*." bad_bot
SetEnvIfNoCase User-Agent ".*WikioFeedBot*." bad_bot
# [… cut part of my list …]
Order Allow,Deny
Deny from 85.88.12.104
Deny from env=bad_bot
Allow from all



-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

Re: Think before you write Semantic Web crawlers


On 6/22/11 3:00 PM, Dieter Fensel wrote:

At 14:37 22.06.2011, Andreas Harth wrote:

Hi Martin,

first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion 
Protocol).


However, I am delighted that people actually start consuming Linked 
Data,

and we should encourage that.


The real challenge may be to achieve usage of data in a way that 
provide benefits to its provider.





Closer, but the core algorithm or heuristic will always require 
something like WebID plus some serious Linked Data dog-fooding re. trust 
logics. Owl ! :-)



--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers


On 6/22/11 2:18 PM, glenn mcdonald wrote:
Why is anybody having to "crawl" an open semantic-web dataset? Isn't 
there a "download" link, and/or a SPARQL endpoint? If there isn't, why 
not? We're the Semantic Web, dammit. If we aren't the masters of data 
interoperability, what are we?
Well, as I said, at the onset of this community, we did speak about the 
need for SPARQL endpoint and RDF dump combos. Unfortunately, SPARQL 
endpoints or RDF dumps became the norm re. Linked Data publishing. Thus, 
as is always the case with us humans, prevention never works because we 
prefer cures post catastrophe :-(


We've had to grapple with these matters and more re. DBpedia (in 
particular) and other Linked Data spaces we've contributed to the 
public. I am happy to see these matters are finally resurfacing courtesy 
of experiences elsewhere.


--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers


On 6/22/11 1:18 PM, Henry Story wrote:
You need to move to strong defences. This is what WebID provides very 
efficiently. Each resource can ask the requestor for their identity 
before giving access to a resource. It is completely decentralised and 
about as efficient as one can get.
So just as the power of computing has grown for everyone to write 
silly software so TLS and https has become cheaper and cheaper. Google 
is now moving to put all its servers behind https and so is Facebook. 
Soon all the web will be behind https - and that will massively 
increase the security on the whole web.


Increase security from snooping, yes-ish.

Only when you add WebID to the equation does TLS truly have an 
opportunity to be very smart.


The physics of computing has changed, genetic algorithms are going to 
become the norm rather than the exception. Trust Logics will separate 
the winners from the losers. This is an inevitability.


--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Dieter Fensel


At 14:37 22.06.2011, Andreas Harth wrote:

Hi Martin,

first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol).

However, I am delighted that people actually start consuming Linked Data,
and we should encourage that.


The real challenge may be to achieve usage of data in a way that 
provide benefits to its provider.



--
Dieter Fensel
Director STI Innsbruck, University of Innsbruck, Austria
http://www.sti-innsbruck.at/
phone: +43-512-507-6488/5, fax: +43-512-507-9872

Re: Think before you write Semantic Web crawlers


On 6/22/11 12:54 PM, Hugh Glaser wrote:

Hi Chris.
One way to do the caching really efficiently:
http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
Which is what rkb has always done.
But of course caching does not solve the problem of one bad crawler.


Or a SPARQL query gone horribly wrong albeit inadvertently. In short, 
this is the most challenging case of all. We even had to protect against 
the same thing re. SQL access via ODBC, JDBC, ADO.NET, and OLE-DB. 
Basically, why we still have a business selling drivers when DBMS 
vendors offer free variants.




It actually makes it worse.
You add a cache write cost to the query, without a significant probability of a 
future cache hit. And increase disk usage.


Yes, and WebID adds fidelity to such inevitable challenges.

This is why (IMHO) WebID is the second most important innovation 
following the URI re., Linked Data.



Kingsley

Hugh

- Reply message -
From: "Christopher Gutteridge"
To: "Martin Hepp"
Cc: "Daniel Herzig", "semantic-...@w3.org", 
"public-lod@w3.org"
Subject: Think before you write Semantic Web crawlers
Date: Wed, Jun 22, 2011 9:18 am



The difference between these two scenarios is that there's almost no CPU 
involvement in serving the PDF file, but naive RDF sites use lots of cycles to 
generate the response to a query for an RDF document.

Right now queries to data.southampton.ac.uk (eg. 
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made 
live, but this is not efficient. My colleague, Dave Challis, has prepared a 
SPARQL endpoint which caches results which we can turn on if the load gets too 
high, which should at least mitigate the problem. Very few datasets change in a 
24 hours period.

Martin Hepp wrote:

Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to blame 
for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by 
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

But for sure, it is necessary to advise all publishers of large RDF datasets to 
protect themselves against hungry crawlers and actual DoS attacks.

Imagine if a large site was brought down by a botnet that is exploiting 
Semantic Sitemap information for DoS attacks, focussing on the large dump files.
This could end LOD experiments for that site.


Best

Martin


On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:



Hi Martin,

Have you tried to put a Squid [1]  as reverse proxy in front of your servers 
and use delay pools [2] to catch hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:



Hi all:

For the third time in a few weeks, we had massive complaints from site-owners 
that Semantic Web crawlers from Universities visited their sites in a way close 
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a 
parallelized approach.

It's clear that a single, stupidly written crawler script, run from a powerful 
University network, can quickly create terrible traffic load.

Many of the scripts we saw

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked 
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and 
re-crawling delays.

This irresponsible behavior can be the final reason for site-owners to say 
farewell to academic/W3C-sponsored semantic technology.

So please, please - advise all of your colleagues and students to NOT write simple 
crawler scripts for the billion triples challenge or whatsoever without familiarizing 
themselves with the state of the art in "friendly crawling".

Best wishes

Martin Hepp








--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/






--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Jiří Procházka

I understand that, but I doubt your conclusion, that those crawlers are
targeting semantic web, since like you said they don't even properly
identify themselves and as far as I know, on Universities also regular
web search and crawling is researched. Maybe lot of them are targeting
semantic web, but we should look at all measures to conserve bandwidth,
from avoiding regular web crawler interest, aiding infrastructure like
Ping the Semantic Web to optimizing delivery and even distribution of
the data among resouces.

Best,
Jiri

On 06/22/2011 03:21 PM, Martin Hepp wrote:
> Thanks, Jiri, but the load comes from academic crawler prototypes firing from 
> broad University infrastructures.
> Best
> Martin
> 
> 
> On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
> 
>> I wonder, are ways to link RDF data so that convential crawlers do not
>> crawl it, but only the semantic web aware ones do?
>> I am not sure how the current practice of linking by link tag in the
>> html headers could cause this, but it may be case that those heavy loads
>> come from a crawlers having nothing to do with semantic web...
>> Maybe we should start linking to our rdf/xml, turtle, ntriples files and
>> publishing sitemap info in RDFa...
>>
>> Best,
>> Jiri
>>
>> On 06/22/2011 09:00 AM, Steve Harris wrote:
>>> While I don't agree with Andreas exactly that it's the site owners fault, 
>>> this is something that publishers of non-semantic data have to deal with.
>>>
>>> If you publish a large collection of interlinked data which looks 
>>> interesting to conventional crawlers and is expensive to generate, 
>>> conventional web crawlers will be all over it. The main difference is that 
>>> a greater percentage of those are written properly, to follow robots.txt 
>>> and the guidelines about hit frequency (maximum 1 request per second per 
>>> domain, no parallel crawling).
>>>
>>> Has someone published similar guidelines for semantic web crawlers?
>>>
>>> The ones that don't behave themselves get banned, either in robots.txt, or 
>>> explicitly by the server. 
>>>
>>> - Steve
>>>
>>> On 2011-06-22, at 06:07, Martin Hepp wrote:
>>>
 Hi Daniel,
 Thanks for the link! I will relay this to relevant site-owners.

 However, I still challenge Andreas' statement that the site-owners are to 
 blame for publishing large amounts of data on small servers.

 One can publish 10,000 PDF documents on a tiny server without being hit by 
 DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

 But for sure, it is necessary to advise all publishers of large RDF 
 datasets to protect themselves against hungry crawlers and actual DoS 
 attacks.

 Imagine if a large site was brought down by a botnet that is exploiting 
 Semantic Sitemap information for DoS attacks, focussing on the large dump 
 files. 
 This could end LOD experiments for that site.


 Best

 Martin


 On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:

>
> Hi Martin,
>
> Have you tried to put a Squid [1]  as reverse proxy in front of your 
> servers and use delay pools [2] to catch hungry crawlers?
>
> Cheers,
> Daniel
>
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
>
> On 21.06.2011, at 09:49, Martin Hepp wrote:
>
>> Hi all:
>>
>> For the third time in a few weeks, we had massive complaints from 
>> site-owners that Semantic Web crawlers from Universities visited their 
>> sites in a way close to a denial-of-service attack, i.e., crawling data 
>> with maximum bandwidth in a parallelized approach.
>>
>> It's clear that a single, stupidly written crawler script, run from a 
>> powerful University network, can quickly create terrible traffic load. 
>>
>> Many of the scripts we saw
>>
>> - ignored robots.txt,
>> - ignored clear crawling speed limitations in robots.txt,
>> - did not identify themselves properly in the HTTP request header or 
>> lacked contact information therein, 
>> - used no mechanisms at all for limiting the default crawling speed and 
>> re-crawling delays.
>>
>> This irresponsible behavior can be the final reason for site-owners to 
>> say farewell to academic/W3C-sponsored semantic technology.
>>
>> So please, please - advise all of your colleagues and students to NOT 
>> write simple crawler scripts for the billion triples challenge or 
>> whatsoever without familiarizing themselves with the state of the art in 
>> "friendly crawling".
>>
>> Best wishes
>>
>> Martin Hepp
>>
>


>>>
>>
> 



signature.asc
Description: OpenPGP digital signature

Re: Think before you write Semantic Web crawlers

Thanks, Jiri, but the load comes from academic crawler prototypes firing from 
broad University infrastructures.
Best
Martin


On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:

> I wonder, are ways to link RDF data so that convential crawlers do not
> crawl it, but only the semantic web aware ones do?
> I am not sure how the current practice of linking by link tag in the
> html headers could cause this, but it may be case that those heavy loads
> come from a crawlers having nothing to do with semantic web...
> Maybe we should start linking to our rdf/xml, turtle, ntriples files and
> publishing sitemap info in RDFa...
> 
> Best,
> Jiri
> 
> On 06/22/2011 09:00 AM, Steve Harris wrote:
>> While I don't agree with Andreas exactly that it's the site owners fault, 
>> this is something that publishers of non-semantic data have to deal with.
>> 
>> If you publish a large collection of interlinked data which looks 
>> interesting to conventional crawlers and is expensive to generate, 
>> conventional web crawlers will be all over it. The main difference is that a 
>> greater percentage of those are written properly, to follow robots.txt and 
>> the guidelines about hit frequency (maximum 1 request per second per domain, 
>> no parallel crawling).
>> 
>> Has someone published similar guidelines for semantic web crawlers?
>> 
>> The ones that don't behave themselves get banned, either in robots.txt, or 
>> explicitly by the server. 
>> 
>> - Steve
>> 
>> On 2011-06-22, at 06:07, Martin Hepp wrote:
>> 
>>> Hi Daniel,
>>> Thanks for the link! I will relay this to relevant site-owners.
>>> 
>>> However, I still challenge Andreas' statement that the site-owners are to 
>>> blame for publishing large amounts of data on small servers.
>>> 
>>> One can publish 10,000 PDF documents on a tiny server without being hit by 
>>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>> 
>>> But for sure, it is necessary to advise all publishers of large RDF 
>>> datasets to protect themselves against hungry crawlers and actual DoS 
>>> attacks.
>>> 
>>> Imagine if a large site was brought down by a botnet that is exploiting 
>>> Semantic Sitemap information for DoS attacks, focussing on the large dump 
>>> files. 
>>> This could end LOD experiments for that site.
>>> 
>>> 
>>> Best
>>> 
>>> Martin
>>> 
>>> 
>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>> 
 
 Hi Martin,
 
 Have you tried to put a Squid [1]  as reverse proxy in front of your 
 servers and use delay pools [2] to catch hungry crawlers?
 
 Cheers,
 Daniel
 
 [1] http://www.squid-cache.org/
 [2] http://wiki.squid-cache.org/Features/DelayPools
 
 On 21.06.2011, at 09:49, Martin Hepp wrote:
 
> Hi all:
> 
> For the third time in a few weeks, we had massive complaints from 
> site-owners that Semantic Web crawlers from Universities visited their 
> sites in a way close to a denial-of-service attack, i.e., crawling data 
> with maximum bandwidth in a parallelized approach.
> 
> It's clear that a single, stupidly written crawler script, run from a 
> powerful University network, can quickly create terrible traffic load. 
> 
> Many of the scripts we saw
> 
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or 
> lacked contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and 
> re-crawling delays.
> 
> This irresponsible behavior can be the final reason for site-owners to 
> say farewell to academic/W3C-sponsored semantic technology.
> 
> So please, please - advise all of your colleagues and students to NOT 
> write simple crawler scripts for the billion triples challenge or 
> whatsoever without familiarizing themselves with the state of the art in 
> "friendly crawling".
> 
> Best wishes
> 
> Martin Hepp
> 
 
>>> 
>>> 
>> 
>

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread glenn mcdonald

>From my perspective as the designer of a system that both consumes and
publishes data, the load/burden issue here is not at all particular to the
semantic web. Needle obeys robots.txt rules, but that's a small deal
compared to the difficulty of extracting whole data from sites set up to
deliver it only in tiny pieces. I'd say about 98% of the time I can describe
the data I want from a site with a single conceptual query. Indeed, once
I've got the data into Needle I can almost always actually produce that
query. But on the source site, I usually can't, and thus we are forced to
waste everybody's time navigating the machines through superfluous
presentation rendering designed for people. 10-at-a-time results lists,
interminable AJAX refreshes, animated DIV reveals, grafting back together
the splintered bits of tree-traversals, etc. This is all absurdly
unnecessary. Why is anybody having to "crawl" an open semantic-web dataset?
Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why
not? We're the Semantic Web, dammit. If we aren't the masters of data
interoperability, what are we?

glenn
(www.needlebase.com)

Re: Hackers - Re: Schema.org considered helpful

2011-06-22 Thread adasal

Hi
I haven't had time to follow link
I expect there is an issue of how to think about a semantic web.
I can see Google is about ruthlessly exploiting the atomisation of the
Bazaar. Of course from within the walls of their own Cathedral.
Recall is in inverse proportion to accuracy.
I think web behaviours influence our own (mind) behaviours. We respond
to environment. Hints from that environment are assimilated very
quickly.
The web is an (absorbing for important reasons undiscussed here) environment.
I rely on Google very happily. It brings fragments some times random
often according to rules I half guess at. This is how it deals with
recall/accuracy.
SemWeb should be different. It is machine/machine. But there is an
ultimate human arbiter of relevance and quality of data for human
consumption. SemWeb needs a series of a priories - the ontologies.
It seems there are two human arbiter questions.
1. What data would I like to see - describe a coherent package of concepts.
2. Describe an ontology as a package of concepts.
In other words concept packages should be able to function independent
of attachment to ontology. And there needs a function to translate
between them. Ontology is already too low level.
It is impossible to characterise what people may be able to agree upon
as concept packages - data aims.
What people agree on depends on all the mixes of any human situation.
Is there a base strata of factors, a common field. I don't know but
I'm sure work has been done in the area. At simplest this is relation
between beliefs, hopes and desires which can never fully be known and
intersect in some group such that an agreed model can be made.
Models aspire to this. Groups create rules to facilitate this.
This is the responsibility the semweb has.
1. To identify such means of modelling and
2. mediate (show what it takes; what it is like to mediate) the
movement between model and some norms.
Here I mean behavioural norms. (So they need to be established case by
case. WebId to prevent unfriendly crawlers is a good simple example)
Not logical rules.
It is only with this in mind that anything of interest can be created.
Note: this is not creating something in the Bazaar of random market
forces. And, as with all heavily patterned behaviour, this is very
expensive in effort. It is also without the background data generation
of google as we traverse their graph. No gleaning off users. Radically
different.

Best

Adam

On 17/06/2011, Henry Story  wrote:
>
> On 17 Jun 2011, at 19:27, adasal wrote:
>
>> That said the hacker is a various beast,
>
> Indeed, hackers are not angels. But the people on this list should get back
> to hacking or work together with open source projects to get initial minimal
> working pieces embedded there. WebID is one; foaf is another, pingback,
> access control, ...
> Get the really simple pieces working.
>
>> and I wonder if this sort of thing can really be addressed without
>> overarching political/ethical/idealogical concerns. It's tough.
>
> It all fits together really nicely. I gave a talk on the philosophy of the
> Social Web if you are interested.
>  http://www.slideshare.net/bblfish/philosophy-and-the-social-web-5583083
>
> Hackers tend to be engineers with a political attitude, so they are more
> receptive to the bigger picture. But solving the big picture problem should
> have an easy entry cost if we want to get it going.
>
> I talked to the BBC but they have limited themselves to what they will do in
> the Social Web space as far as profile hosting goes. Again, I'd start small.
> Facebook started in universities not that long ago.
>
> Henry
>
>
> Social Web Architect
> http://bblfish.net/
>
>

-- 
Sent from my mobile device

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth


Hi Martin,

first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol).

However, I am delighted that people actually start consuming Linked Data,
and we should encourage that.

On 06/22/2011 11:42 AM, Martin Hepp wrote:

OpenEAN - a transcript of>1 Mio product models and their EAN/UPC code at
http://openean.kaufkauf.net/id/ has been permanently shut down by the site
operator because fighting with bad semweb crawlers is taking too much of his
time.


I've put a wrapper online [1] that provides RDF based on their API (which,
incidentally, currently does not seem to work either).

The wrapper does some caching and has a limit of one lookup every 8 seconds,
which means (24*60*60)/8 = 10800 lookups per day.  Data transfer is capped
to 1 GB/day, which means a maximum cost of 0.15 Euro/day at Amazon AWS pricing.

At that rate, it would take 925 days to collect descriptions of just one
million products.  Whether the ratio of data size and lookup limit is sensible
in that case is open to debate.

If the OpenEAN guys can redirect requests to [1] there would even be some
continuity for data consumers.

Best regards,
Andreas.

[1] http://openeanwrap.appspot.com/

Re: Think before you write Semantic Web crawlers

On 22 Jun 2011, at 13:31, Lin Clark wrote:

> I was with you on this until the cathedral and the bazaar thing...

Yes, I think the metaphors there have ended up getting cross-wired.  The paper 
on the Cathedral and the Bazaar was a very good paper, and made in a world that 
thought that only centralised ways of thinking could build good software. It 
really helped spread an old idea of peer to peer development.

Peer to peer is great, but it brings with it the potential of viruses and 
various other problems. That is why even in peer 2 peer software development, 
you allow people to fork a project, but not necessarily to write to your 
repository. The Web is peer to peer (the bazaar) but you don't allow everyone 
to write to your home page. One could think of the web as a number of 
cathedrals linked up in a peer to peer fashion. A bazaar of cathedrals if you 
wish.

It is this diversity of peers that makes the richness of the web. This 
diversity is guaranteed by the protection each site has from being attacked, 
and the guarantee therefore that each site expresses a unique point of view. 

So until recently crawlers were few and far between because the computing 
resources just cost so much, that only specialised engineers wrote crawlers. At 
AltaVista the crawler was written initially by Louis Monier in 1996 then later 
by Spiderman. Spiderman was on the project for years, and his was carefully 
tested and reviewed. The DEC alpha machines at the times were 500Mhz 64 bit 
Alpha computers with 8GB of RAM and cost a fortune. DEC was selling clusters of 
8 of those together. You had to be very rich to get the bandwidth. 

Now every laptop has 8GB of RAM, 4 cores at 2.3Ghz, and every household has 
amazing bandwidth to the internet. So silly programs are going to become more 
prevalent. Going on conventions such as robots.txt files placed at a 
conventional location as described by some spec written out somewhere on the 
internet is not going to work in this new world. Neither is it really going to 
help to look at HTTP headers and other such conventional methods.

You need to move to strong defences. This is what WebID provides very 
efficiently. Each resource can ask the requestor for their identity before 
giving access to a resource. It is completely decentralised and about as 
efficient as one can get. 
So just as the power of computing has grown for everyone to write silly 
software so TLS and https has become cheaper and cheaper. Google is now moving 
to put all its servers behind https and so is Facebook. Soon all the web will 
be behind https - and that will massively increase the security on the whole 
web.

Henry

Many papers and implementations of WebID are here http://esw.w3.org/foaf+ssl
Also please join the WebID incubator group at the W3C 
http://www.w3.org/2005/Incubator/webid/charter

> I think it is a serious misreading of cathedral and bazaar to think that if 
> something is naive and irresponsible, it is by definition bazaar style 
> development. Bazaar style is about how code is developed (in the open by a 
> loosely organized and fluctuating group of developers). Cathedral means that 
> it is a smaller, generally hierarchically organized group which doesn't work 
> in a public, open way between releases. There is a good summary on wikipedia, 
> http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
> 
> The bazaar style of development can lead to things that are more responsible 
> than their cathedral counterparts. Bazaar means continuously documenting your 
> decisions in the public, posting patches for testing and review by everyone 
> (not just your extremely busy team mates), and opening your dev process to 
> co-developers who you don't already know. These organizational strategies 
> have lead to some REALLY BIG engineering wins... and these engineering wins 
> have resulted in more responsible products than their cathedral-built 
> counterparts.
> 
> I also would question the assertion that people want cathedrals... the 
> general direction on the Web seems to be away from cathedrals like Microsoft 
> and Flash and towards bazaar developed solutions.
> 
> However, the call to responsibility is still a very valid one. I'm quite 
> sorry to hear that a large data publisher has been pushed out of the 
> community effort by people who should be working on the same team. 
> 
> -Lin
> 
> 
> On Wed, Jun 22, 2011 at 11:59 AM,  wrote:
> Yes. But are there things such as Squid and WebId that can be instituted the 
> provider side? This is an interesting moment. Is it the academic SemWeb 
> running out of public facing steam. A retreat. Or is it a moment of 
> transition from naivety to responsibility. When we think about the Cathedral 
> and the Bazaar. There is a reason why people want Cathedrals. I suggest 
> SemWeb is about Cathedrals. Responsibility for some order and structure.
> 
> Adam
> Sent using BlackBerry® from Orange
> 
> -Original Message-
> From: Martin Hepp 
> S

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Hugh Glaser

Hi Chris.
One way to do the caching really efficiently:
http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
Which is what rkb has always done.
But of course caching does not solve the problem of one bad crawler.
It actually makes it worse.
You add a cache write cost to the query, without a significant probability of a 
future cache hit. And increase disk usage.

Hugh

- Reply message -
From: "Christopher Gutteridge" 
To: "Martin Hepp" 
Cc: "Daniel Herzig" , "semantic-...@w3.org" 
, "public-lod@w3.org" 
Subject: Think before you write Semantic Web crawlers
Date: Wed, Jun 22, 2011 9:18 am



The difference between these two scenarios is that there's almost no CPU 
involvement in serving the PDF file, but naive RDF sites use lots of cycles to 
generate the response to a query for an RDF document.

Right now queries to data.southampton.ac.uk (eg. 
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made 
live, but this is not efficient. My colleague, Dave Challis, has prepared a 
SPARQL endpoint which caches results which we can turn on if the load gets too 
high, which should at least mitigate the problem. Very few datasets change in a 
24 hours period.

Martin Hepp wrote:

Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to blame 
for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by 
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

But for sure, it is necessary to advise all publishers of large RDF datasets to 
protect themselves against hungry crawlers and actual DoS attacks.

Imagine if a large site was brought down by a botnet that is exploiting 
Semantic Sitemap information for DoS attacks, focussing on the large dump files.
This could end LOD experiments for that site.


Best

Martin


On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:



Hi Martin,

Have you tried to put a Squid [1]  as reverse proxy in front of your servers 
and use delay pools [2] to catch hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:



Hi all:

For the third time in a few weeks, we had massive complaints from site-owners 
that Semantic Web crawlers from Universities visited their sites in a way close 
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a 
parallelized approach.

It's clear that a single, stupidly written crawler script, run from a powerful 
University network, can quickly create terrible traffic load.

Many of the scripts we saw

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked 
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and 
re-crawling delays.

This irresponsible behavior can be the final reason for site-owners to say 
farewell to academic/W3C-sponsored semantic technology.

So please, please - advise all of your colleagues and students to NOT write 
simple crawler scripts for the billion triples challenge or whatsoever without 
familiarizing themselves with the state of the art in "friendly crawling".

Best wishes

Martin Hepp








--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Lin Clark

I was with you on this until the cathedral and the bazaar thing... I think
it is a serious misreading of cathedral and bazaar to think that if
something is naive and irresponsible, it is by definition bazaar style
development. Bazaar style is about how code is developed (in the open by a
loosely organized and fluctuating group of developers). Cathedral means that
it is a smaller, generally hierarchically organized group which doesn't work
in a public, open way between releases. There is a good summary on
wikipedia, http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

The bazaar style of development can lead to things that are more responsible
than their cathedral counterparts. Bazaar means continuously documenting
your decisions in the public, posting patches for testing and review by
everyone (not just your extremely busy team mates), and opening your dev
process to co-developers who you don't already know. These organizational
strategies have lead to some REALLY BIG engineering wins... and these
engineering wins have resulted in more responsible products than their
cathedral-built counterparts.

I also would question the assertion that people want cathedrals... the
general direction on the Web seems to be away from cathedrals like Microsoft
and Flash and towards bazaar developed solutions.

However, the call to responsibility is still a very valid one. I'm quite
sorry to hear that a large data publisher has been pushed out of the
community effort by people who should be working on the same team.

-Lin

On Wed, Jun 22, 2011 at 11:59 AM,  wrote:

> Yes. But are there things such as Squid and WebId that can be instituted
> the provider side? This is an interesting moment. Is it the academic SemWeb
> running out of public facing steam. A retreat. Or is it a moment of
> transition from naivety to responsibility. When we think about the Cathedral
> and the Bazaar. There is a reason why people want Cathedrals. I suggest
> SemWeb is about Cathedrals. Responsibility for some order and structure.
>
> Adam
> Sent using BlackBerry® from Orange
>
> -Original Message-
> From: Martin Hepp 
> Sender: semantic-web-requ...@w3.org
> Date: Wed, 22 Jun 2011 11:42:58
> To: Yves Raimond
> Cc: Christopher Gutteridge; Daniel Herzig<
> her...@kit.edu>; ; 
> Subject: Re: Think before you write Semantic Web crawlers
>
> Just to inform the community that the BTC / research crawlers have been
> successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a transcript of >1 Mio product models and their EAN/UPC code at
> http://openean.kaufkauf.net/id/ has been permanently shut down by the site
> operator because fighting with bad semweb crawlers is taking too much of his
> time.
>
> Thanks a lot for everybody who contributed to that. It trashes a month of
> work and many million useful triples.
>
> Best
>
> Martin Hepp
>
>
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
> > Hello!
> >
> >> The difference between these two scenarios is that there's almost no CPU
> >> involvement in serving the PDF file, but naive RDF sites use lots of
> cycles
> >> to generate the response to a query for an RDF document.
> >>
> >> Right now queries to data.southampton.ac.uk (eg.
> >> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are
> made
> >> live, but this is not efficient. My colleague, Dave Challis, has
> prepared a
> >> SPARQL endpoint which caches results which we can turn on if the load
> gets
> >> too high, which should at least mitigate the problem. Very few datasets
> >> change in a 24 hours period.
> >
> > Hmm, I would strongly argue it is not the case (and stale datasets are
> > a bit issue in LOD imho!). The data on the BBC website, for example,
> > changes approximately 10 times a second.
> >
> > We've also been hit in the past (and still now, to a lesser extent) by
> > badly behaving crawlers. I agree that, as we don't provide dumps, it
> > is the only way to generate an aggregation of BBC data, but we've had
> > downtime in the past caused by crawlers. After that happened, it
> > caused lots of discussions on whether we should publish RDF data at
> > all (thankfully, we succeeded to argue that we should keep it - but
> > that's a lot of time spent arguing instead of publishing new juicy RDF
> > data!)
> >
> > I also want to point out (in response to Andreas's email) that HTTP
> > caches are *completely* inefficient to protect a dataset against that,
> > as crawlers tend to be exhaustive. ETags and Expiry headers are
> > helpful, but chances are that 1) you don't know when the data will
> > change, you can just make a wild guess based on previous behavior 2)
> > the cache would have expired the time the crawler requests a document
> > a second time, as it has ~100M (in our case) documents to crawl
> > through.
> >
> > Request throttling would work, but you would have to find a way to
> > identify crawlers, which is tricky: most of them use multiple IPs and
> > don't set appropriate us

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread adam . saltiel

Yes. But are there things such as Squid and WebId that can be instituted the 
provider side? This is an interesting moment. Is it the academic SemWeb running 
out of public facing steam. A retreat. Or is it a moment of transition from 
naivety to responsibility. When we think about the Cathedral and the Bazaar. 
There is a reason why people want Cathedrals. I suggest SemWeb is about 
Cathedrals. Responsibility for some order and structure. 

Adam 
Sent using BlackBerry® from Orange

-Original Message-
From: Martin Hepp 
Sender: semantic-web-requ...@w3.org
Date: Wed, 22 Jun 2011 11:42:58 
To: Yves Raimond
Cc: Christopher Gutteridge; Daniel 
Herzig; ; 
Subject: Re: Think before you write Semantic Web crawlers

Just to inform the community that the BTC / research crawlers have been 
successful in killing a major RDF source for e-commerce:

OpenEAN - a transcript of >1 Mio product models and their EAN/UPC code at 
http://openean.kaufkauf.net/id/ has been permanently shut down by the site 
operator because fighting with bad semweb crawlers is taking too much of his 
time.

Thanks a lot for everybody who contributed to that. It trashes a month of work 
and many million useful triples.

Best

Martin Hepp



On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:

> Hello!
> 
>> The difference between these two scenarios is that there's almost no CPU
>> involvement in serving the PDF file, but naive RDF sites use lots of cycles
>> to generate the response to a query for an RDF document.
>> 
>> Right now queries to data.southampton.ac.uk (eg.
>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
>> live, but this is not efficient. My colleague, Dave Challis, has prepared a
>> SPARQL endpoint which caches results which we can turn on if the load gets
>> too high, which should at least mitigate the problem. Very few datasets
>> change in a 24 hours period.
> 
> Hmm, I would strongly argue it is not the case (and stale datasets are
> a bit issue in LOD imho!). The data on the BBC website, for example,
> changes approximately 10 times a second.
> 
> We've also been hit in the past (and still now, to a lesser extent) by
> badly behaving crawlers. I agree that, as we don't provide dumps, it
> is the only way to generate an aggregation of BBC data, but we've had
> downtime in the past caused by crawlers. After that happened, it
> caused lots of discussions on whether we should publish RDF data at
> all (thankfully, we succeeded to argue that we should keep it - but
> that's a lot of time spent arguing instead of publishing new juicy RDF
> data!)
> 
> I also want to point out (in response to Andreas's email) that HTTP
> caches are *completely* inefficient to protect a dataset against that,
> as crawlers tend to be exhaustive. ETags and Expiry headers are
> helpful, but chances are that 1) you don't know when the data will
> change, you can just make a wild guess based on previous behavior 2)
> the cache would have expired the time the crawler requests a document
> a second time, as it has ~100M (in our case) documents to crawl
> through.
> 
> Request throttling would work, but you would have to find a way to
> identify crawlers, which is tricky: most of them use multiple IPs and
> don't set appropriate user agents (the crawlers that currently hit us
> the most are wget and Java 1.6 :/ ).
> 
> So overall, there is no excuse for badly behaving crawlers!
> 
> Cheers,
> y
> 
>> 
>> Martin Hepp wrote:
>> 
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>> 
>> However, I still challenge Andreas' statement that the site-owners are to
>> blame for publishing large amounts of data on small servers.
>> 
>> One can publish 10,000 PDF documents on a tiny server without being hit by
>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>> 
>> But for sure, it is necessary to advise all publishers of large RDF datasets
>> to protect themselves against hungry crawlers and actual DoS attacks.
>> 
>> Imagine if a large site was brought down by a botnet that is exploiting
>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>> files.
>> This could end LOD experiments for that site.
>> 
>> 
>> Best
>> 
>> Martin
>> 
>> 
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>> 
>> 
>> 
>> Hi Martin,
>> 
>> Have you tried to put a Squid [1]  as reverse proxy in front of your servers
>> and use delay pools [2] to catch hungry crawlers?
>> 
>> Cheers,
>> Daniel
>> 
>> [1] http://www.squid-cache.org/
>> [2] http://wiki.squid-cache.org/Features/DelayPools
>> 
>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>> 
>> 
>> 
>> Hi all:
>> 
>> For the third time in a few weeks, we had massive complaints from
>> site-owners that Semantic Web crawlers from Universities visited their sites
>> in a way close to a denial-of-service attack, i.e., crawling data with
>> maximum bandwidth in a parallelized approach.
>> 
>> It's clear that a single, stupidly

Re: Think before you write Semantic Web crawlers

On 6/22/11 10:42 AM, Martin Hepp wrote:

Just to inform the community that the BTC / research crawlers have been
successful in killing a major RDF source for e-commerce:

OpenEAN - a transcript of>1 Mio product models and their EAN/UPC code at
http://openean.kaufkauf.net/id/ has been permanently shut down by the site
operator because fighting with bad semweb crawlers is taking too much of his time.

Thanks a lot for everybody who contributed to that. It trashes a month of work
and many million useful triples.

Martin,

Is there a dump anywhere? Can they at least continue to produce RDF dumps?

We have some of their data (from prior dump loads) in our lod cloud
cache [1].

Links:

1.
http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fopenean.kaufkauf.net%2Fid%2F&urilookup=1

Kingsley

Best

Martin Hepp

On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:

Hello!

The difference between these two scenarios is that there's almost no CPU
involvement in serving the PDF file, but naive RDF sites use lots of cycles
to generate the response to a query for an RDF document.

Right now queries to data.southampton.ac.uk (eg.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
live, but this is not efficient. My colleague, Dave Challis, has prepared a
SPARQL endpoint which caches results which we can turn on if the load gets
too high, which should at least mitigate the problem. Very few datasets
change in a 24 hours period.

Hmm, I would strongly argue it is not the case (and stale datasets are
a bit issue in LOD imho!). The data on the BBC website, for example,
changes approximately 10 times a second.

We've also been hit in the past (and still now, to a lesser extent) by
badly behaving crawlers. I agree that, as we don't provide dumps, it
is the only way to generate an aggregation of BBC data, but we've had
downtime in the past caused by crawlers. After that happened, it
caused lots of discussions on whether we should publish RDF data at
all (thankfully, we succeeded to argue that we should keep it - but
that's a lot of time spent arguing instead of publishing new juicy RDF
data!)

I also want to point out (in response to Andreas's email) that HTTP
caches are *completely* inefficient to protect a dataset against that,
as crawlers tend to be exhaustive. ETags and Expiry headers are
helpful, but chances are that 1) you don't know when the data will
change, you can just make a wild guess based on previous behavior 2)
the cache would have expired the time the crawler requests a document
a second time, as it has ~100M (in our case) documents to crawl
through.

Request throttling would work, but you would have to find a way to
identify crawlers, which is tricky: most of them use multiple IPs and
don't set appropriate user agents (the crawlers that currently hit us
the most are wget and Java 1.6 :/ ).

So overall, there is no excuse for badly behaving crawlers!

Cheers,
y

Martin Hepp wrote:

Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to
blame for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?

But for sure, it is necessary to advise all publishers of large RDF datasets
to protect themselves against hungry crawlers and actual DoS attacks.

Imagine if a large site was brought down by a botnet that is exploiting
Semantic Sitemap information for DoS attacks, focussing on the large dump
files.
This could end LOD experiments for that site.

Best

Martin

On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:

Hi Martin,

Have you tried to put a Squid [1] as reverse proxy in front of your servers
and use delay pools [2] to catch hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:

Hi all:

For the third time in a few weeks, we had massive complaints from
site-owners that Semantic Web crawlers from Universities visited their sites
in a way close to a denial-of-service attack, i.e., crawling data with
maximum bandwidth in a parallelized approach.

It's clear that a single, stupidly written crawler script, run from a
powerful University network, can quickly create terrible traffic load.

Many of the scripts we saw

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and
re-crawling delays.

This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.

So please, please - advise all of your colleagues and students to NOT write
simple crawler scripts for the billion

Re: Think before you write Semantic Web crawlers


On 6/22/11 10:37 AM, Yves Raimond wrote:

Request throttling would work, but you would have to find a way to
identify crawlers, which is tricky: most of them use multiple IPs and
don't set appropriate user agents (the crawlers that currently hit us
the most are wget and Java 1.6 :/ ).
Hence the requirement for incorporation of WebID as basis for QoS for 
identifier agents. Everyone else gets to be constrained with rate limits 
etc..


Anyway, Identification is the key, the InterWeb jungle needs WebID to 
help reduce costs of serving up Linked Data etc..


Amazing its taken us until 2011 to revisit this critical matter.

--

Regards,

Kingsley Idehen 
President&  CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Jiří Procházka

I wonder, are ways to link RDF data so that convential crawlers do not
crawl it, but only the semantic web aware ones do?
I am not sure how the current practice of linking by link tag in the
html headers could cause this, but it may be case that those heavy loads
come from a crawlers having nothing to do with semantic web...
Maybe we should start linking to our rdf/xml, turtle, ntriples files and
publishing sitemap info in RDFa...

Best,
Jiri

On 06/22/2011 09:00 AM, Steve Harris wrote:
> While I don't agree with Andreas exactly that it's the site owners fault, 
> this is something that publishers of non-semantic data have to deal with.
> 
> If you publish a large collection of interlinked data which looks interesting 
> to conventional crawlers and is expensive to generate, conventional web 
> crawlers will be all over it. The main difference is that a greater 
> percentage of those are written properly, to follow robots.txt and the 
> guidelines about hit frequency (maximum 1 request per second per domain, no 
> parallel crawling).
> 
> Has someone published similar guidelines for semantic web crawlers?
> 
> The ones that don't behave themselves get banned, either in robots.txt, or 
> explicitly by the server. 
> 
> - Steve
> 
> On 2011-06-22, at 06:07, Martin Hepp wrote:
> 
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>>
>> However, I still challenge Andreas' statement that the site-owners are to 
>> blame for publishing large amounts of data on small servers.
>>
>> One can publish 10,000 PDF documents on a tiny server without being hit by 
>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>
>> But for sure, it is necessary to advise all publishers of large RDF datasets 
>> to protect themselves against hungry crawlers and actual DoS attacks.
>>
>> Imagine if a large site was brought down by a botnet that is exploiting 
>> Semantic Sitemap information for DoS attacks, focussing on the large dump 
>> files. 
>> This could end LOD experiments for that site.
>>
>>
>> Best
>>
>> Martin
>>
>>
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>
>>>
>>> Hi Martin,
>>>
>>> Have you tried to put a Squid [1]  as reverse proxy in front of your 
>>> servers and use delay pools [2] to catch hungry crawlers?
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1] http://www.squid-cache.org/
>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>
>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>
 Hi all:

 For the third time in a few weeks, we had massive complaints from 
 site-owners that Semantic Web crawlers from Universities visited their 
 sites in a way close to a denial-of-service attack, i.e., crawling data 
 with maximum bandwidth in a parallelized approach.

 It's clear that a single, stupidly written crawler script, run from a 
 powerful University network, can quickly create terrible traffic load. 

 Many of the scripts we saw

 - ignored robots.txt,
 - ignored clear crawling speed limitations in robots.txt,
 - did not identify themselves properly in the HTTP request header or 
 lacked contact information therein, 
 - used no mechanisms at all for limiting the default crawling speed and 
 re-crawling delays.

 This irresponsible behavior can be the final reason for site-owners to say 
 farewell to academic/W3C-sponsored semantic technology.

 So please, please - advise all of your colleagues and students to NOT 
 write simple crawler scripts for the billion triples challenge or 
 whatsoever without familiarizing themselves with the state of the art in 
 "friendly crawling".

 Best wishes

 Martin Hepp

>>>
>>
>>
> 

signature.asc
Description: OpenPGP digital signature

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth


Hi Christopher,

On 06/22/2011 10:14 AM, Christopher Gutteridge wrote:

Right now queries to data.southampton.ac.uk (eg.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live,
but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL
endpoint which caches results which we can turn on if the load gets too high,
which should at least mitigate the problem. Very few datasets change in a 24
hours period.


setting the Expires header and enabling mod_cache in Apache httpd (or adding
a Squid proxy in front of the HTTP server) works quite well in these cases.

Best regards,
Andreas.

Re: Think before you write Semantic Web crawlers