-1. Obviously it is not useful to kill the web servers of small shops for the
sake of academic experiments.

At 02:29 PM 6/21/2011, Andreas Harth wrote:
Dear Martin,

I agree with you that software accessing large portions of the web
should adhere to basic principles (such as robots.txt).
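
[Editor's note: purely as an illustration, and not part of the original mail, the
following is a minimal Python sketch of what honoring robots.txt can look like in
a crawler; the agent name and URL are placeholders.]

    # Illustrative sketch only: consult robots.txt (and any Crawl-delay)
    # before fetching a URL.  Agent name and URL are hypothetical.
    import time
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    AGENT = "example-ld-crawler"                 # hypothetical user agent
    url = "http://example.org/data/resource1"    # hypothetical URI

    # Locate and parse the site's robots.txt.
    site = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(urljoin(site, "/robots.txt"))
    rp.read()

    if rp.can_fetch(AGENT, url):
        # Respect an explicit Crawl-delay; otherwise pause briefly anyway.
        time.sleep(rp.crawl_delay(AGENT) or 1)
        with urlopen(url) as response:
            data = response.read()
    else:
        print("robots.txt disallows fetching", url)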

However, I wonder why you publish large datasets and then complain when
people actually use the data.

If you provide a site with millions of triples, your infrastructure should
scale beyond "I have clicked on a few links and the server seems to be
doing something".  You should set the HTTP Expires header to leverage the
widely deployed HTTP caches.  You should have stable URIs.  Also, you should
configure your servers to shield them from both mad crawlers and DoS
attacks (see, e.g., [1]).
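
[Editor's note: again only as an illustration, not part of the original mail, a
minimal WSGI sketch of serving triples with Expires/Cache-Control headers so that
intermediary HTTP caches can absorb repeat requests; the cache lifetime and the
example triple are arbitrary.]

    # Illustrative sketch only: serve N-Triples with caching headers so
    # downstream HTTP caches can answer repeat requests from crawlers.
    import time
    from email.utils import formatdate
    from wsgiref.simple_server import make_server

    CACHE_SECONDS = 86400  # arbitrary example: let caches hold responses for a day

    def app(environ, start_response):
        body = b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
        headers = [
            ("Content-Type", "application/n-triples"),
            ("Cache-Control", "public, max-age=%d" % CACHE_SECONDS),
            ("Expires", formatdate(time.time() + CACHE_SECONDS, usegmt=True)),
        ]
        start_response("200 OK", headers)
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()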

Publishing millions of triples is slightly more complex than publishing your
personal homepage.

Best regards,
Andreas.

[1] http://code.google.com/p/ldspider/wiki/ServerConfig

--
Dieter Fensel
Director STI Innsbruck, University of Innsbruck, Austria
http://www.sti-innsbruck.at/
phone: +43-512-507-6488/5, fax: +43-512-507-9872

