Re: [Dspace-tech] Google bots and web crawlers

2009-01-14 Thread Van Ly

This would be a good opportunity to construct a reasonably good default 
robots.txt file and add it to the documentation set.

At http://ses.library.usyd.edu.au/robots.txt, I have the following:

 User-agent: *
 Crawl-Delay: 11
 Disallow: /browse
 Disallow: /browse?
 Disallow: /browse-title
 Disallow: /bitstream
 Disallow: /dspace/
 Disallow: /feed/
 Disallow: /feedback
 Disallow: /password-login
 #Disallow: /retrieve/
 #Disallow: /handle/
 #Disallow: /oai/

The /bitstream rule is intended to deter crawlers from triggering the
Catalina error and DSpace warning.

Which lines should I re-use from Jeff's example, and why?

The lines I have are based on my best guess at what a crawler ought not to be 
interested in.

Thanks in advance.

--
Van Ly : University of Sydney Library


-----Original Message-----
From: Robert Tansley [mailto:roberttans...@google.com]
Sent: Thu 15/01/2009 7:52 AM
To: Shane Beers
Cc: dspace-tech Tech; Jeffrey Trimble
Subject: Re: [Dspace-tech] Google bots and web crawlers
 
As of DSpace 1.5, sitemaps are supported, which allow search engines to
selectively crawl only new or changed items while massively reducing the
server load:

http://www.dspace.org/1_5_1Documentation/ch03.html#N10B44

Unfortunately, it seems that relatively few DSpace instances actually
use this feature.
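
For anyone who wants to try it, the moving parts are small; the paths and
URL below are illustrative only, so check them against the 1.5 docs for
your own install:

 # Regenerate the sitemaps on a schedule (cron), e.g. nightly at 3am;
 # the generate-sitemaps script ships with DSpace 1.5:
 0 3 * * * [dspace]/bin/generate-sitemaps

 # Then advertise the result in robots.txt so crawlers find it
 # (hostname and context path are examples):
 Sitemap: http://www.example.edu/dspace/sitemap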

I would strongly recommend against blocking /dspace/bitstream/* and
/dspace/html/*, as these prevent crawlers from accessing the full text of
items, which is vital for effective indexing. As of DSpace 1.4.2 (and
possibly earlier), these URLs support the If-Modified-Since header, which
means that crawlers don't re-retrieve files if they haven't changed since
the last crawl.
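
You can check the conditional-GET behaviour by hand; the bitstream URL
here is made up, so substitute one from your own repository:

 # Ask for the file only if it changed after the given date; an
 # unchanged bitstream should answer "304 Not Modified" instead of
 # shipping the whole file again:
 curl -I -H "If-Modified-Since: Wed, 07 Jan 2009 00:00:00 GMT" \
  http://www.example.edu/dspace/bitstream/1920/123/1/thesis.pdf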

Rob

On Wed, Jan 14, 2009 at 14:20, Shane Beers  wrote:
> Jeff:
> We had an issue with our local google instance crawling our DSpace
> installation and causing huge issues. I re-wrote the robots.txt to disallow
> anything besides the item pages themselves - no browsing pages or search
> pages and whatnot. Here is a copy of ours:
> User-agent: *
> Disallow: /dspace/browse-author
> Disallow: /dspace/browse-author*
> Disallow: /dspace/items-by-author
> Disallow: /dspace/items-by-author*
> Disallow: /dspace/browse-date*
> Disallow: /dspace/browse-date
> Disallow: /dspace/browse-title*
> Disallow: /dspace/browse-title
> Disallow: /dspace/feedback
> Disallow: /dspace/feedback/*
> Disallow: /dspace/items-by-subject
> Disallow: /dspace/items-by-subject/*
> Disallow: /dspace/handle/1920/*/browse-title*
> Disallow: /dspace/handle/1920/*/browse-author*
> Disallow: /dspace/handle/1920/*/browse-subject*
> Disallow: /dspace/handle/1920/*/browse-date*
> Disallow: /dspace/handle/1920/*/items-by-subject*
> Disallow: /dspace/handle/1920/*/items-by-author*
> Disallow: /dspace/bitstream/*
> Disallow: /dspace/image/*
> Disallow: /dspace/html/*
> Disallow: /dspace/simple-search*
> This likely would live in your tomcat directory.
> Shane Beers
> Digital Repository Services Librarian
> George Mason University
> sbe...@gmu.edu
> http://mars.gmu.edu
> 703-993-3742
>
> On Jan 14, 2009, at 1:09 PM, Jeffrey Trimble wrote:
>
> Is there something simple I can place in the jsp that will prohibit the
> crawlers from using my server resources?
> TIA,
> Jeff
>
> Jeffrey Trimble
> Systems Librarian
> Maag Library
> Youngstown State University
> 330-941-2483 (Office)
> jtrim...@cc.ysu.edu
> http://www.maag.ysu.edu
> http://digital.maag.ysu.edu


Re: [Dspace-tech] Google bots and web crawlers

2009-01-14 Thread Tom De Mulder
On Wed, 14 Jan 2009, Shane Beers wrote:

> We had an issue with our local google instance crawling our DSpace
> installation and causing huge issues. I re-wrote the robots.txt to disallow
> anything besides the item pages themselves - no browsing pages or search
> pages and whatnot. Here is a copy of ours:

We've had to do that for years; without it DSpace just crumbles under the
load. I've got a small Perl script which generates a flat HTML file with
links to all our item pages (a sketch of the idea follows below), and we
put a link to that in the footer.
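
Something along these lines, as a sketch only -- the table names come from
the DSpace 1.x schema, but the database name, user, password and output
path are placeholders:

#!/usr/bin/perl
# Dump links to every archived item into one flat HTML page that a
# crawler can walk instead of hammering the browse pages.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=dspace', 'dspace', 'secret',
                       { RaiseError => 1 });

# resource_type_id = 2 marks item handles in the DSpace 1.x handle table.
my $rows = $dbh->selectall_arrayref(
    'SELECT h.handle FROM handle h
       JOIN item i ON i.item_id = h.resource_id
      WHERE h.resource_type_id = 2
        AND i.in_archive AND NOT i.withdrawn');

open my $out, '>', 'allitems.html' or die "cannot write: $!";
print $out "<html><body>\n";
# Adjust the /handle prefix to your servlet context path if needed.
printf $out qq{<a href="/handle/%s">%s</a><br/>\n}, $_->[0], $_->[0]
    for @$rows;
print $out "</body></html>\n";
close $out;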

So we can block all browse pages, but not item or bitstreams, and still 
get indexed.

DSpace 1.x has major scalability issues, alas, no matter how much hardware
you throw at it.


Best,

--
Tom De Mulder  - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 14/01/2009 : The Moon is Waning Gibbous (83% of Full)



Re: [Dspace-tech] Google bots and web crawlers

2009-01-14 Thread George Kozak

Jeff:

What I am using is a robots.txt file that I put in the dspace webapps
directory in Tomcat. I think it's working (at least we haven't crashed
lately). If you're interested in seeing my robots.txt file, I can send it
to you.
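
One caveat worth checking there: crawlers only ever ask for /robots.txt at
the root of the site, so a copy inside the dspace webapp only helps if
DSpace answers at the root (the ROOT webapp, or a front-end mapping). The
paths below are examples only:

 # Place the file where it will be served as /robots.txt
 # (assumes DSpace is deployed as Tomcat's ROOT webapp):
 cp robots.txt /usr/local/tomcat/webapps/ROOT/robots.txt
 # Verify it comes back from where crawlers will look:
 curl -s http://localhost:8080/robots.txt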


At 01:09 PM 1/14/2009, Jeffrey Trimble wrote:
Is there something simple I can place in the jsp that will prohibit the
crawlers from using my server resources?

TIA,

Jeff

Jeffrey Trimble
Systems Librarian
Maag Library
Youngstown State University
330-941-2483 (Office)
jtrim...@cc.ysu.edu
http://www.maag.ysu.edu
http://digital.maag.ysu.edu





***
George Kozak
Digital Library Information Technology
501 Olin Library
Cornell University
607-255-8924
***
g...@cornell.edu


Re: [Dspace-tech] Google bots and web crawlers

2009-01-14 Thread Robert Tansley
As of DSpace 1.5, sitemaps are supported, which allow search engines to
selectively crawl only new or changed items while massively reducing the
server load:

http://www.dspace.org/1_5_1Documentation/ch03.html#N10B44

Unfortunately, it seems that relatively few DSpace instances actually
use this feature.

I would strongly recommend against blocking /dspace/bitstream/* and
/dspace/html/*, as these prevent crawlers from accessing the full text of
items, which is vital for effective indexing. As of DSpace 1.4.2 (and
possibly earlier), these URLs support the If-Modified-Since header, which
means that crawlers don't re-retrieve files if they haven't changed since
the last crawl.

Rob

On Wed, Jan 14, 2009 at 14:20, Shane Beers  wrote:
> Jeff:
> We had an issue with our local google instance crawling our DSpace
> installation and causing huge issues. I re-wrote the robots.txt to disallow
> anything besides the item pages themselves - no browsing pages or search
> pages and whatnot. Here is a copy of ours:
> User-agent: *
> Disallow: /dspace/browse-author
> Disallow: /dspace/browse-author*
> Disallow: /dspace/items-by-author
> Disallow: /dspace/items-by-author*
> Disallow: /dspace/browse-date*
> Disallow: /dspace/browse-date
> Disallow: /dspace/browse-title*
> Disallow: /dspace/browse-title
> Disallow: /dspace/feedback
> Disallow: /dspace/feedback/*
> Disallow: /dspace/items-by-subject
> Disallow: /dspace/items-by-subject/*
> Disallow: /dspace/handle/1920/*/browse-title*
> Disallow: /dspace/handle/1920/*/browse-author*
> Disallow: /dspace/handle/1920/*/browse-subject*
> Disallow: /dspace/handle/1920/*/browse-date*
> Disallow: /dspace/handle/1920/*/items-by-subject*
> Disallow: /dspace/handle/1920/*/items-by-author*
> Disallow: /dspace/bitstream/*
> Disallow: /dspace/image/*
> Disallow: /dspace/html/*
> Disallow: /dspace/simple-search*
> This likely would live in your tomcat directory.
> Shane Beers
> Digital Repository Services Librarian
> George Mason University
> sbe...@gmu.edu
> http://mars.gmu.edu
> 703-993-3742
>
> On Jan 14, 2009, at 1:09 PM, Jeffrey Trimble wrote:
>
> Is there something simple I can place in the jsp that will prohibit the
> crawlers from using my server resources?
> TIA,
> Jeff
>
> Jeffrey Trimble
> Systems Librarian
> Maag Library
> Youngstown State University
> 330-941-2483 (Office)
> jtrim...@cc.ysu.edu
> http://www.maag.ysu.edu
> http://digital.maag.ysu.edu


Re: [Dspace-tech] Google bots and web crawlers

2009-01-14 Thread Shane Beers

Jeff:

We had an issue with our local google instance crawling our DSpace  
installation and causing huge issues. I re-wrote the robots.txt to  
disallow anything besides the item pages themselves - no browsing  
pages or search pages and whatnot. Here is a copy of ours:


User-agent: *
Disallow: /dspace/browse-author
Disallow: /dspace/browse-author*
Disallow: /dspace/items-by-author
Disallow: /dspace/items-by-author*
Disallow: /dspace/browse-date*
Disallow: /dspace/browse-date
Disallow: /dspace/browse-title*
Disallow: /dspace/browse-title
Disallow: /dspace/feedback
Disallow: /dspace/feedback/*
Disallow: /dspace/items-by-subject
Disallow: /dspace/items-by-subject/*
Disallow: /dspace/handle/1920/*/browse-title*
Disallow: /dspace/handle/1920/*/browse-author*
Disallow: /dspace/handle/1920/*/browse-subject*
Disallow: /dspace/handle/1920/*/browse-date*
Disallow: /dspace/handle/1920/*/items-by-subject*
Disallow: /dspace/handle/1920/*/items-by-author*
Disallow: /dspace/bitstream/*
Disallow: /dspace/image/*
Disallow: /dspace/html/*
Disallow: /dspace/simple-search*

This likely would live in your tomcat directory.

Shane Beers
Digital Repository Services Librarian
George Mason University
sbe...@gmu.edu
http://mars.gmu.edu
703-993-3742



On Jan 14, 2009, at 1:09 PM, Jeffrey Trimble wrote:

Is there something simple I can place in the jsp that will prohibit the
crawlers from using my server resources?

TIA,

Jeff

Jeffrey Trimble
Systems Librarian
Maag Library
Youngstown State University
330-941-2483 (Office)
jtrim...@cc.ysu.edu
http://www.maag.ysu.edu
http://digital.maag.ysu.edu





[Dspace-tech] Google bots and web crawlers

2009-01-14 Thread Jeffrey Trimble
Is there something simple I can place in the jsp that will prohibit the
crawlers from using my server resources?

TIA,

Jeff

Jeffrey Trimble
Systems Librarian
Maag Library
Youngstown State University
330-941-2483 (Office)
jtrim...@cc.ysu.edu
http://www.maag.ysu.edu
http://digital.maag.ysu.edu


