Sally,
The situation is much more complex than this. Yep, oversimplifying, but it's natural. Publishers' sites are crawled by Googlebot because (a) robots are allowed into the public areas of publisher sites, and (b) there are relatively few publishers, well indexed. Google Scholar is a selective service built on Googlebot's results: it chooses what to include (little) and what to leave out (the vast majority). Google Scholar has algorithms that select what on a publisher's site is an article or the metadata thereof, and what is plain publisher guff (like subscription info, guidelines for authors, etc.). BTW, Google Scholar does not crawl separately; it applies its selectivity to the Googlebot results.

Repositories are very unlikely to bar robot entry (through robots.txt), though I would not say categorically that it has never happened. You actually have to do extra work to bar robots from a website (the robots.txt sketch below shows what that looks like, and how to check a given site), and I can't understand why a repository manager would do so. (Of course, password-protected data, or data behind a search barrier, is inaccessible to a robot anyway.) However, it is not so clear what counts as a repository, how many there are, or where they are. The number keeps changing. This is Problem No 1.

The second problem is that Google Scholar seems to apply different rules to repositories than to publisher sites. Repositories contain all sorts of things that are not 'articles', such as archival material, unpublished works, conference presentations, etc. One theory is that Google Scholar is happiest if it finds an open-access PDF hanging off a metadata entry in a repository. In other words, if the file is in XML, XHTML, Word, iBook, etc. formats, it is not regarded as an article. And what to do if the metadata has several PDFs (or other formats) attached to it, which is common with theses? When in doubt, leave it out... Google Scholar is about being ultra-selective on the Internet.

The third problem is that Googlebot does not always crawl an entire site. It optimizes its time to best use. One trick it uses is to limit the depth of the link tree it searches; another is not to go too far at any one level. In the case of a publisher site the depth is relatively shallow and each list is short: one finds the list of issues, each leads to a list of articles, and bingo! Or possibly years -> issues -> articles. Repositories are not necessarily so well organized. Unless they are optimized for Googlebot (EPrints is), the robot might well find a year index leading to thousands of 'articles' per year. Googlebot gives up well before the end, and next time it may well do the same (the toy crawl sketch below illustrates the effect). Optimal is to have a link to 'most recent deposits' high on the home page (so the robot finds it early), and to provide Googlebot with an easy way to eventually reach all of the site. The Google database can then build up over time.

And finally, Problem No 4: how does Google Scholar regard the metadata? It prefers publisher formats (see the meta-tag example below). This is discussed in the paper Wouter cites. BTW, note that Wouter Gerritsma's comments indicate that Problem No 2 or Problem No 4 dominates over No 3 for his repository: Google knows about the article (Googlebot indexed it) but Google Scholar doesn't.
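To make Problem No 1 concrete, here is the promised robots.txt sketch: a minimal check using only the Python standard library. The repository URLs are invented for illustration.

    # Check what a repository's robots.txt permits. A site that wanted to
    # bar all robots would have to deliberately serve, at its root:
    #     User-agent: *
    #     Disallow: /
    # With no robots.txt at all (the common case), everything is allowed.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://repository.example.edu/robots.txt")
    rp.read()
    print(rp.can_fetch("Googlebot", "https://repository.example.edu/id/eprint/123"))

If that prints False, someone has done the extra work I mentioned, deliberately or not.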
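And the promised toy crawl sketch for Problem No 3. This is purely illustrative (nothing to do with Google's real code), but it shows why a crawler with a depth limit and a per-page link budget never reaches the bottom of a huge year index:

    # Toy breadth-first crawler with a depth limit and a per-page link
    # budget, to illustrate why deep, wide year indexes get truncated.
    from collections import deque

    def crawl(site, start, max_depth=3, links_per_page=100):
        """site maps each page to its list of outgoing links."""
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            page, depth = queue.popleft()
            if depth == max_depth:
                continue  # depth budget exhausted; deeper pages never visited
            for link in site.get(page, [])[:links_per_page]:  # breadth budget
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen

    # Hypothetical repository: one year page linking to 5000 records.
    site = {"home": ["year-2012"],
            "year-2012": [f"record-{i}" for i in range(5000)]}
    print(len(crawl(site, "home")))  # 102 pages found; 4900 records missed

A 'most recent deposits' link high on the home page keeps the new material inside that budget on every visit, which is why it helps.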
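Finally, the meta-tag example for Problem No 4. 'Publisher formats' in practice means bibliographic meta tags in the HTML head of each abstract page. The Highwire Press tag names below are the ones Google Scholar's inclusion guidelines recommend; the values are invented:

    <head>
      <!-- Highwire Press style tags that Google Scholar's parser looks for -->
      <meta name="citation_title" content="An Example Article Title">
      <meta name="citation_author" content="Smith, Jane">
      <meta name="citation_publication_date" content="2012/06/01">
      <meta name="citation_journal_title" content="Example Journal">
      <!-- This last tag hangs the open-access PDF off the metadata entry -->
      <meta name="citation_pdf_url" content="https://repository.example.edu/id/eprint/123.pdf">
    </head>

Note citation_pdf_url: that is exactly the 'PDF hanging off a metadata entry' of Problem No 2.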
I agree with Stevan Harnad that we need fuller repositories, but disagree that there is any inconsistency in also pressing for improved performance by Google Scholar. The people who deposit papers in repositories and the programmers at Google are almost completely disjoint groups (Google workers don't publish much; they keep the processes a commercial secret). We can do both at once, and they will have a synergistic effect on each other. Lack of synergy holds back open access.

But enough of that; I just wanted to explain what is happening with Google Scholar.

Arthur Sale
Tasmania, Australia

From: goal-boun...@eprints.org [mailto:goal-boun...@eprints.org] On Behalf Of Sally Morris
Sent: Saturday, 5 January 2013 11:14 PM
To: 'Global Open Access List (Successor of AmSci)'
Subject: [GOAL] Re: Searching for OA vs. Providing OA

It's my understanding that Google (and Google Scholar) find published articles because the publishers enable crawling - whether the content is freely available or not (if I'm oversimplifying, someone will no doubt set me right). Are repository managers unintentionally blocking this?

Sally

Sally Morris
South House, The Street, Clapham, Worthing, West Sussex, UK BN13 3UU
Tel: +44 (0)1903 871286
Email: sa...@morris-assocs.demon.co.uk

On Fri, Jan 4, 2013 at 5:03 PM, Gerritsma, Wouter <wouter.gerrit...@wur.nl> wrote:

Google Scholar is a very good full-text scholarly search engine, no doubt about it. But it doesn't find all the full text available on the web, although it does a good job. Take e.g. one of my articles: http://scholar.google.com/scholar?cluster=17014920805021872143&hl=en&as_sdt=0,5 GS found two PDF versions, but not the one in our university's repository, which is still not fully indexed. Although it gets close - http://library.wur.nl/WebQuery/wurpubs/lang/380005 - it found our metadata record, but not the full text. I guess this is still the case with many repositories. Earlier this year it was even reported in the literature:

Arlitsch, K. & P.S. O'Brien (2012). Invisible institutional repositories: addressing the low indexing ratios of IRs in Google. Library Hi Tech, 30(1): 60-81. http://dx.doi.org/10.1108/07378831211213210

So Google Scholar is still not the cure-all for all the OA available in the world. Interestingly, our repository is better indexed in the standard Google search engine than in the Scholar version. So my point is: doing a search on GS and finding a lot of hits still doesn't guarantee finding the full text of those papers.