Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-09 Thread Jürgen Umbrich
Hi all, 

> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set of 
> GoodRelations product model data, is experiencing a problematic amount of 
> traffic from unidentified crawlers located in Ireland (DERI?), the 
> Netherlands (VUA?), and the USA.
> 


Another crawler used at DERI is LDSpider [1], which we use to crawl data 
for the SWSE search engine and, more recently, for the BTC 2010 dataset. 
Admittedly, we have been doing an unusually large amount of crawling in 
the past month or two.

> The crawling has been so intense that he had to temporarily block all traffic 
> to this dataset.
> 
> In case you are operating any kind of Semantic Web crawlers that tried to 
> access this dataset, please
> 
> 1. check your crawler for bugs that create excessive traffic (e.g. by 
> redundant requests),

> 2. identify your crawler agent properly in the HTTP header, indicating a 
> contact person, and

User-agent of the LDSpider:
  * ldspider (http://code.google.com/p/ldspider/wiki/Robots)
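
For anyone writing their own crawler, here is a minimal Python sketch of 
point 2 (identifying the agent and a contact in the HTTP header). It is not 
LDSpider code; the bot name, info page, and mail address are hypothetical 
placeholders.

```python
from urllib.request import Request

# Hypothetical crawler name, info page and contact address (assumptions,
# not taken from any crawler mentioned in this thread).
USER_AGENT = (
    "examplebot/0.1 (+http://example.org/bot; mailto:crawler-admin@example.org)"
)


def build_request(url: str) -> Request:
    """Build a request that names the crawler and how to reach its operator."""
    return Request(
        url,
        headers={
            "User-Agent": USER_AGENT,
            # Ask for RDF/XML; adjust to whatever the crawler can parse.
            "Accept": "application/rdf+xml",
        },
    )
```

The request can then be passed to urllib.request.urlopen(); the point is 
only that a site operator can see in the access log who to contact.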

> 3. implement some bandwidth throttling technique that limits the bandwidth 
> consumption on a single host to a moderate amount.


The LDSpider uses a delay policy similar to the one proposed in the IRLBot 
system. 
We use the following delay times per pay-level domain (PLD); in the case of 
http://openean.kaufkauf.net/id the PLD is kaufkauf.net:
 * 500 ms for lookups which return content (200 response code)
 * 250 ms for lookups which return no content (e.g. 3xx, 4xx, 5xx).
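
As a rough illustration of such a per-PLD delay policy (a sketch only, not 
the LDSpider implementation), the Python below records when each PLD may 
next be contacted. The names PldThrottle and naive_pld are made up for this 
example, and taking the last two host labels as the PLD is a simplifying 
assumption; a real crawler would consult a public suffix list. Treating the 
delay as the minimum wait before the next lookup on the same PLD is also an 
assumption.

```python
from urllib.parse import urlsplit

DELAY_CONTENT_S = 0.5      # 500 ms after a lookup that returned content (200)
DELAY_NO_CONTENT_S = 0.25  # 250 ms after a lookup that returned no content


def naive_pld(url: str) -> str:
    """Approximate the pay-level domain as the last two host labels.

    Simplifying assumption: wrong for suffixes such as .co.uk.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def delay_for(status: int) -> float:
    """Pick the delay that follows a lookup with the given status code."""
    return DELAY_CONTENT_S if status == 200 else DELAY_NO_CONTENT_S


class PldThrottle:
    """Tracks, per PLD, the earliest time the next lookup is allowed."""

    def __init__(self) -> None:
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Seconds the caller should still wait before requesting url."""
        return max(0.0, self._next_allowed.get(naive_pld(url), 0.0) - now)

    def record(self, url: str, status: int, now: float) -> None:
        """Note a finished lookup so later requests to the PLD are delayed."""
        self._next_allowed[naive_pld(url)] = now + delay_for(status)
```

A fetch loop would call wait_time() with time.monotonic(), sleep for the 
returned value, fetch, and then call record() with the response code.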

There are also solutions for server-side bandwidth throttling (e.g., see [2]).

Please see also the reply of Andreas Harth at the semantic-web mailing list [3].

Best
   Juergen

[1] http://code.google.com/p/ldspider/
[2] http://code.google.com/p/ldspider/wiki/ServerConfig
[3] http://lists.w3.org/Archives/Public/semantic-web/2010Jun/0048.html


Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Robert Fuller

Kingsley Idehen wrote:

> The LOD Cloud Cache at DERI is a live Virtuoso instance with 15 Billion+
> Triples loaded. It covers as much of the LOD Cloud as we've been able to
> get our hands on plus 6.4 Billion Triples from the Data.Gov effort.
>
> I'll drop a more detailed note about this instance (via blog post) once
> we are done with data loading (there's a massive collection of eCommerce
> oriented Products & Services data to be loaded amongst others).

I wonder whether this data load is the culprit responsible for the "massive 
crawling"?


--
Robert Fuller
Research Associate
Sindice Team
DERI, Galway
http://sindice.com/



Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Kingsley Idehen

Robert Fuller wrote:

> Kingsley Idehen wrote:
>
> > The LOD Cloud Cache at DERI is a live Virtuoso instance with 15
> > Billion+ Triples loaded. It covers as much of the LOD Cloud as we've
> > been able to get our hands on plus 6.4 Billion Triples from the
> > Data.Gov effort.
> >
> > I'll drop a more detailed note about this instance (via blog post)
> > once we are done with data loading (there's a massive collection of
> > eCommerce oriented Products & Services data to be loaded amongst
> > others).
>
> I wonder whether this data load is the culprit responsible for the
> "massive crawling"?

I don't understand how it can be. That said, there might be services out 
there crawling the instance (as they do DBpedia) which then leads them 
to the actual original data space (even though all the data is actually 
in the lod.openlinksw.com instance) :-(


We'll double check to see that robots.txt is crystal clear re. crawl paths.


--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Kingsley Idehen

Robert Fuller wrote:

> Hi,
>
> Sindice clearly identifies itself in the User-Agent HTTP header.
> Currently we use these user agents:
>
> 1. "Mozilla/5.0 (compatible; sindice-fetcher/0.1.0 +http://sindice.com/developers/bot)"
>
> 2. "SindiceFetcher/Ping Manager (http://sindice.com/developers/bot)"
>
> 3. "sindice.net ontology fetcher"
>
> Niceness is implemented in our main fetcher. In some cases there may
> be bursts on sites providing distributed ontologies. Speaking with the
> group here, it seems unlikely that we have been hitting kaufkauf.net;
> however, if you can provide an IP address I can do some further
> verification.
>
> I understand that http://lod.openlinksw.com/sparql is now hosted at
> DERI, and I wonder could some of the traffic be related to that?
> Again, if you can provide an IP address I will do some further
> verification.

Robert,

As indicated by Martin, the instance hosted at DERI should negate the 
need to go back to the original source.

Others:

The LOD Cloud Cache at DERI is a live Virtuoso instance with 15 Billion+ 
Triples loaded. It covers as much of the LOD Cloud as we've been able to 
get our hands on plus 6.4 Billion Triples from the Data.Gov effort.

I'll drop a more detailed note about this instance (via blog post) once 
we are done with data loading (there's a massive collection of eCommerce 
oriented Products & Services data to be loaded amongst others).

Kingsley



Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Robert Fuller

Hi,

Sindice clearly identifies itself in the User-Agent HTTP header. 
Currently we use these user agents:

1. "Mozilla/5.0 (compatible; sindice-fetcher/0.1.0 +http://sindice.com/developers/bot)"

2. "SindiceFetcher/Ping Manager (http://sindice.com/developers/bot)"

3. "sindice.net ontology fetcher"

Niceness is implemented in our main fetcher. In some cases there may be 
bursts on sites providing distributed ontologies. Speaking with the 
group here, it seems unlikely that we have been hitting kaufkauf.net; 
however, if you can provide an IP address I can do some further 
verification.


I understand that http://lod.openlinksw.com/sparql is now hosted at 
DERI, and I wonder could some of the traffic be related to that? Again, 
if you can provide an IP address I will do some further verification.



Kind regards,
Rob.

--
Robert Fuller
Research Associate
DERI, Galway




Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Christophe Guéret

Dear Martin,

I guess the VUA crawler was ours. The faulty process has been stopped 
now and won't be restarted before it has been checked for bugs.

Sorry about all the problems caused.

Best regards,
Christophe


On 06/08/2010 10:03 AM, Martin Hepp (UniBW) wrote:

> Dear all:
>
> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge
> set of GoodRelations product model data, is experiencing a problematic
> amount of traffic from unidentified crawlers located in Ireland
> (DERI?), the Netherlands (VUA?), and the USA.
>
> The crawling has been so intense that he had to temporarily block all
> traffic to this dataset.
>
> In case you are operating any kind of Semantic Web crawlers that tried
> to access this dataset, please
>
> 1. check your crawler for bugs that create excessive traffic (e.g. by
> redundant requests),
> 2. identify your crawler agent properly in the HTTP header, indicating
> a contact person, and
> 3. implement some bandwidth throttling technique that limits the
> bandwidth consumption on a single host to a moderate amount.
>
> Note that the full dataset is always up to date in the LOD SPARQL
> endpoint at
>
> http://lod.openlinksw.com/sparql
>
> Thus, there is rarely a need to crawl the complete dataset.
>
> Thanks for your consideration.
>
> Best wishes
>
> Martin Hepp




--
Dr. Christophe Guéret (cgue...@few.vu.nl)
http://www.few.vu.nl/~cgueret/
Postdoc working on SOKS (http://www.few.vu.nl/soks)
Knowledge Representation & Reasoning Group
Computational Intelligence Group
Department of Computer Science, AI
VU University Amsterdam




Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Dan Brickley
On Tue, Jun 8, 2010 at 10:03 AM, Martin Hepp (UniBW) wrote:
> Dear all:
>
> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set of
> GoodRelations product model data, is experiencing a problematic amount of
> traffic from unidentified crawlers located in Ireland (DERI?), the
> Netherlands (VUA?), and the USA.
>
> The crawling has been so intense that he had to temporarily block all
> traffic to this dataset.

Any reason not to block the troublemakers by IP address?

> In case you are operating any kind of Semantic Web crawlers that tried to
> access this dataset, please
>
> 1. check your crawler for bugs that create excessive traffic (e.g. by
> redundant requests),
> 2. identify your crawler agent properly in the HTTP header, indicating a
> contact person, and
> 3. implement some bandwidth throttling technique that limits the bandwidth
> consumption on a single host to a moderate amount.

Yes, de-referencing is a privilege not a right!

Also folk should respect robots.txt -
http://en.wikipedia.org/wiki/Robots_exclusion_standard
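
For crawler authors working in Python, the standard library already covers 
the basic check. The sketch below is only an illustration: the bot name and 
the robots.txt content are invented for the example, and the file is parsed 
from a string so that nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /id/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, agent: str, url: str) -> bool:
    """True if the rules that apply to agent allow fetching url."""
    return parser.can_fetch(agent, url)


parser = make_parser(SAMPLE_ROBOTS_TXT)
# "examplebot" is a hypothetical agent name.
blocked = not may_fetch(parser, "examplebot", "http://example.org/id/123")
delay = parser.crawl_delay("examplebot")  # seconds requested by the site, or None
```

In a real crawler one would call set_url() and read() on the parser to load 
the site's own robots.txt, and honour the crawl delay when one is given.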

cheers,

Dan



Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Story Henry
One could put the data behind foaf+ssl, and so identify agents :-)

Henry

On 8 Jun 2010, at 10:03, Martin Hepp (UniBW) wrote:

> Dear all:
> 
> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set of 
> GoodRelations product model data, is experiencing a problematic amount of 
> traffic from unidentified crawlers located in Ireland (DERI?), the 
> Netherlands (VUA?), and the USA.
> 
> The crawling has been so intense that he had to temporarily block all traffic 
> to this dataset.
> 
> In case you are operating any kind of Semantic Web crawlers that tried to 
> access this dataset, please
> 
> 1. check your crawler for bugs that create excessive traffic (e.g. by 
> redundant requests),
> 2. identify your crawler agent properly in the HTTP header, indicating a 
> contact person, and
> 3. implement some bandwidth throttling technique that limits the bandwidth 
> consumption on a single host to a moderate amount.
> 
> Note that the full dataset is always up to date in the LOD SPARQL endpoint at
> 
> http://lod.openlinksw.com/sparql
> 
> Thus, there is rarely a need to crawl the complete dataset.
> 
> Thanks for your consideration.
> 
> Best wishes
> 
> Martin Hepp
> 




Please stop massive crawling against http://openean.kaufkauf.net/id/

2010-06-08 Thread Martin Hepp (UniBW)

Dear all:

The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set 
of GoodRelations product model data, is experiencing a problematic 
amount of traffic from unidentified crawlers located in Ireland (DERI?), 
the Netherlands (VUA?), and the USA.


The crawling has been so intense that he had to temporarily block all 
traffic to this dataset.


In case you are operating any kind of Semantic Web crawlers that tried 
to access this dataset, please


1. check your crawler for bugs that create excessive traffic (e.g. by 
redundant requests),
2. identify your crawler agent properly in the HTTP header, indicating a 
contact person, and
3. implement some bandwidth throttling technique that limits the 
bandwidth consumption on a single host to a moderate amount.


Note that the full dataset is always up to date in the LOD SPARQL 
endpoint at


http://lod.openlinksw.com/sparql

Thus, there is rarely a need to crawl the complete dataset.
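
To illustrate querying the endpoint rather than crawling, here is a small 
Python sketch that builds a SPARQL protocol request for the properties of a 
single resource. The function name, the example resource URI, and the 
requested result format are assumptions made for this example; only the 
endpoint URL comes from this message.

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "http://lod.openlinksw.com/sparql"


def properties_request(resource_uri: str, limit: int = 100) -> Request:
    """Build a GET request asking the endpoint for one resource's triples.

    resource_uri is inserted into the query text as-is, so it should come
    from a trusted source.
    """
    query = f"SELECT ?p ?o WHERE {{ <{resource_uri}> ?p ?o }} LIMIT {limit}"
    # The SPARQL protocol passes the query text in the "query" parameter.
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical resource URI, for illustration only.
req = properties_request("http://example.org/id/thing", limit=10)
```

One such request per resource of interest puts far less load on the 
original host than dereferencing every URI in the dataset.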

Thanks for your consideration.

Best wishes

Martin Hepp

--
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  h...@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================

Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Webcasts:
Overview - http://www.heppnetz.de/projects/goodrelations/webcast/
How-to   - http://vimeo.com/7583816

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Tutorial materials:
ISWC 2009 Tutorial: The Web of Data for E-Commerce in Brief: A Hands-on 
Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_ISWC2009