Hi Joseph,

> Are you sure you want to disallow Google from crawling your browse-titles or
> your entire repository? I like being able to find our items on Google.

You're absolutely correct - I do need Google to index the site!

I was a bit fixated on simply stopping it from crawling browse-titles in
order to stop it spamming me with errors, but on reflection, I'm guessing
this browse interface is the best way to ensure Google reaches all the
items . . . ?

It may be that a mail rule to auto-delete DSpace Internal Server Error
reports of this kind is the way to go!
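
(If it comes to that, I'm picturing something like the procmail recipe below
- a completely untested sketch, and it assumes the alert mails carry a
recognisable subject line:)

--------------------------------
# ~/.procmailrc - bin the DSpace internal error alerts
# (untested; the Subject pattern below is an assumption)
:0
* ^Subject:.*Internal Server Error
/dev/null
--------------------------------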

> Move your robots.txt like this (sorry if this is duh-obvious but I had that
> moment myself a few months back so no worries!):
> 
> # cp [tomcat]/webapps/dspace/robots.txt [tomcat]/ROOT/robots.txt

D'oh! Yup, I think I'm having a day full of "senior moments" ;-) 

Cheers,

Mike

Michael White 
eLearning Developer
Centre for eLearning Development (CeLD) 
3V3a, Cottrell
University of Stirling 
Stirling SCOTLAND 
FK9 4LA 
Email: michael.wh...@stir.ac.uk 
Tel: +44 (0) 1786 466877 
Fax: +44 (0) 1786 466880 
http://www.is.stir.ac.uk/celd/


-----Original Message-----
From: Joseph Greene [mailto:joseph.gre...@ucd.ie] 
Sent: 11 February 2010 15:49
To: dspace-tech@lists.sourceforge.net; Michael White
Subject: RE: Bad robot! Googlebot and Internal Server Errors

Are you sure you want to disallow Google from crawling your browse-titles or
your entire repository? I like being able to find our items on Google.

Move your robots.txt like this (sorry if this is duh-obvious but I had that
moment myself a few months back so no worries!):

# cp [tomcat]/webapps/dspace/robots.txt [tomcat]/ROOT/robots.txt
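
Crawlers only ever request /robots.txt from the root of the URL space, so a
copy sitting under /dspace/ never gets read at all. Once it's in ROOT you
can sanity-check it from outside (hostname below is just a placeholder):

# curl http://your.dspace.server/robots.txt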

I've been getting a similar one, related to collection pages' handles:

-- URL Was: http://irserver.ucd.ie/dspace/browse-title?top=10197/853
-- Method: GET
-- Parameters were:
-- top: "10197/853"

Bad robot indeed!

Joseph Greene
Institutional Repository Project Manager
325 James Joyce Library
University College Dublin
Belfield, Dublin 4

353 (0)1 716 7398
joseph.gre...@ucd.ie
http://irserver.ucd.ie/dspace/

Message: 1
Date: Thu, 11 Feb 2010 12:30:04 +0000
From: Michael White <michael.wh...@stir.ac.uk>
Subject: [Dspace-tech] Bad robot! Googlebot and Internal Server Errors
To: "dspace-tech@lists.sourceforge.net"
        <dspace-tech@lists.sourceforge.net>
Message-ID:
        <7c43cb6f3460394f9b5236c0f68d7b6a5d6baa4...@exch2007.ad.stir.ac.uk>
Content-Type: text/plain; charset="us-ascii"

Hi,

Our DSpace (v1.4.1) has recently started logging a lot of Internal Server
Errors that appear to be caused by Googlebot. They happen like clockwork
every 14 minutes and come in blocks (sometimes lasting several hours).

They are all associated with the IP address 66.249.71.176, which, when
looked up, resolves to "crawl-66-249-71-176.googlebot.com". The errors all
have the form:

============================
2010-02-11 11:34:07,739 WARN
org.dspace.app.webui.servlet.InternalErrorServlet @
:session_id=9E40BFD899A2AA5C23E81404AF5B97A5:internal_error:-- URL Was:
https://dspace.stir.ac.uk/dspace/browse-title?bottom=1893/214
-- Method: GET
-- Parameters were:
-- bottom: "1893/214"

java.lang.ClassCastException
        at org.dspace.app.webui.servlet.BrowseServlet.doDSGet(BrowseServlet.java:282)
        at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
        at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
==============================
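
(For the record, the IP-to-hostname mapping came from a reverse lookup,
something along these lines - output abbreviated from memory; a forward
lookup of that name should return the same IP if it really is Google:)

# host 66.249.71.176
176.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-176.googlebot.com.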

I have checked our robots.txt file (from /usr/src/dspace-1.4.1-source/jsp),
which contains:

--------------------------------
User-agent: *

Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
--------------------------------

I'm not that familiar with robots.txt, but I surmise that adding:

Disallow: /browse-title

- might do the trick? However, on further investigation, it seems the
googlebot is not obeying any of the existing rules either - it is accessing
the other "Disallow"ed browse interfaces too. I see a lot of this kind of
thing in the DSpace logs:

2010-02-11 02:09:16,746 INFO  org.dspace.app.webui.servlet.BrowseServlet @
anonymous:session_id=FBC689A1F89C3B962F0D9BFEC0B4D8ED:ip_addr=66.249.71.176:
browse_author:starts_with=Farkas, Jozsef Z.,results=21

- and mapping this to the Tomcat logs:

66.249.71.176 - - [11/Feb/2010:02:09:16 +0000] "GET
/dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z. HTTP/1.1" 200 16836


So, two (related?) issues here: googlebot is causing errors when it crawls
the site, and it also appears that it is not obeying the robots.txt file at
all :-( - or am I misunderstanding something?
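
One more thought while writing this: all our URLs live under /dspace/ (see
the log excerpts above), but the rules just say e.g. "Disallow:
/browse-author" - and as far as I understand it, robots.txt paths are
matched against the full URL path from the site root, so presumably the
rules need the context path too? If so, I'm guessing a web-root robots.txt
would need to look something like this (a sketch, not tested):

--------------------------------
User-agent: *

Disallow: /dspace/browse-author
Disallow: /dspace/items-by-author
Disallow: /dspace/browse-date
Disallow: /dspace/browse-subject
Disallow: /dspace/browse-title
--------------------------------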

Given that this has only just started happening (we have had no trouble with
bots or spiders in the past), I was wondering if anyone else had noticed
anything like this related to the googlebot, or if anyone was aware of
anything that may have changed to cause this to start happening?

More importantly, rather than me randomly trying things, any bot/robots.txt
experts out there able to tell me how I can stop this but still allow
legitimate crawling of the site for indexing purposes?

Cheers,

Mike

Michael White
eLearning Developer
Centre for eLearning Development (CeLD)
3V3a, Cottrell
University of Stirling
Stirling SCOTLAND
FK9 4LA
Email: michael.wh...@stir.ac.uk
Tel: +44 (0) 1786 466877
Fax: +44 (0) 1786 466880
http://www.is.stir.ac.uk/celd/


-- 
The Sunday Times Scottish University of the Year 2009/2010
The University of Stirling is a charity registered in Scotland, 
 number SC 011159.

