Jason,

You may want to allow people just to give you the robots.txt file, which references the sitemap. I also register the sitemaps individually with the big search engines for our site, but I found that very large sitemaps aren't processed very well. So, for our site, I limited the number of items per sitemap to 40,000, which results in ten sitemaps for the digital objects and an additional sitemap for all the collections.
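That splitting scheme can be sketched roughly as follows, per the sitemaps.org protocol. The example.org base URL and the `sitemapN.xml` filenames are hypothetical; only the 40,000-item chunk size comes from the setup described above.

```python
# Sketch: split a list of object URLs into sitemaps of at most 40,000
# entries and build a sitemap index that references them, following the
# sitemaps.org format. Base URL and filenames are hypothetical examples.
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def chunk(urls, size=40000):
    """Yield successive slices of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def sitemap_xml(urls):
    """Render one <urlset> sitemap for a chunk of URLs."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{entries}\n</urlset>\n')

def sitemap_index_xml(sitemap_urls):
    """Render the <sitemapindex> that points at the individual sitemaps."""
    entries = "\n".join(f"  <sitemap><loc>{escape(u)}</loc></sitemap>"
                        for u in sitemap_urls)
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}\n</sitemapindex>\n')

def build_sitemaps(urls, base="http://example.org/", size=40000):
    """Return (index_xml, [sitemap_xml, ...]) for the given object URLs."""
    chunks = list(chunk(urls, size))
    names = [f"{base}sitemap{i + 1}.xml" for i in range(len(chunks))]
    return sitemap_index_xml(names), [sitemap_xml(c) for c in chunks]
```

With ten object sitemaps plus one for collections, only the single index URL needs to be registered with the search engines.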
http://ufdc.ufl.edu/robots.txt

Or else perhaps give more boxes, so we can include all the sitemaps utilized in our systems.

Cheers!
Mark

Mark V Sullivan
Digital Development and Web Coordinator
Technology and Support Services
University of Florida Libraries
352-273-2907 (office)
352-682-9692 (mobile)
mars...@uflib.ufl.edu

________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Jason Ronallo [jrona...@gmail.com]
Sent: Friday, February 01, 2013 11:14 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] digital collections sitemaps

Hi,

I've seen registries for digital collections that make their metadata available through OAI-PMH, but I have yet to see a listing of digital collections that just make their resources available on the Web the way the Web works [1]. Sitemaps are the main mechanism for listing Web resources for automated crawlers [2]. Knowing about all of these various sitemaps could have many uses for research and for improving the discoverability of digital collections on the open Web [3].

So I thought I'd put up a quick form to start collecting digital collections sitemaps. There is one required field, for the sitemap itself. Please take a few seconds to add any digital collections sitemaps you know about--they don't necessarily have to be yours.

https://docs.google.com/spreadsheet/viewform?formkey=dE1JMDRIcXJMSzJ0YVlRaWdtVnhLcmc6MQ#gid=0

At this point I'll make the data available to anyone who asks for it.

Thank you,

Jason

[1] At least I don't recall seeing such a sitemap registry site or service. If you know of an existing registry of digital collections sitemaps, please let me know about it!

[2] http://www.sitemaps.org/ For more information on robots, see http://wiki.code4lib.org/index.php/Robots_Are_Our_Friends

[3] For instance, you can see how I've started to investigate whether digital collections are being crawled by the Common Crawl: http://jronallo.github.com/blog/common-crawl-url-index/
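Mark's suggestion above, accepting a robots.txt URL and pulling the declared sitemaps out of it, could be sketched like this. The `Sitemap:` directive is part of the robots.txt convention documented at sitemaps.org; the sample robots.txt content in the test is hypothetical.

```python
# Sketch: extract the Sitemap: directives from robots.txt content, so a
# registry form could accept a robots.txt URL instead of (or alongside)
# individual sitemap URLs.
def sitemaps_from_robots(text):
    """Return the sitemap URLs declared in the given robots.txt content."""
    urls = []
    for line in text.splitlines():
        # The Sitemap directive is case-insensitive and independent of any
        # User-agent group. partition() splits on the first colon only, so
        # the "http://" in the URL value is left intact.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```

A robots.txt file may declare several sitemaps (or a single sitemap index), so this returns a list rather than one URL.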