Sally,
The situation is much more complex than this. Yep, oversimplifying, but it's natural. Publishers' sites are crawled by Googlebot because (a) robots are allowed into the public areas of publisher sites, and (b) there are relatively few publishers, well indexed. Google Scholar is a selective service built on Googlebot's results: it chooses what to include (little) and what to leave out (the vast majority). Google Scholar has algorithms that select what on a publisher's site is an article or the metadata thereof, and what is plain publisher guff (like subscription info, guidelines for authors, etc.). BTW, Google Scholar does not crawl separately; it applies its selectivity to the Googlebot results.

Repositories are very unlikely to bar robot entry (through robots.txt), though I would not say categorically that it has never happened. You actually have to do extra work to bar robots from a website (the robots.txt sketch below shows what that looks like, and how to check a given site), and I can't understand why a repository manager would do so. (Of course, password-protected data, or data behind a search barrier, is inaccessible to a robot anyway.) However, it is not so clear what counts as a repository, how many there are, or where they are. The number keeps changing. This is Problem No 1.

The second problem is that Google Scholar seems to apply different rules to repositories than to publisher sites. Repositories contain all sorts of things that are not 'articles', such as archival material, unpublished works, conference presentations, etc. One theory is that Google Scholar is happiest if it finds an open-access PDF hanging off a metadata entry in a repository. In other words, if the file is in XML, XHTML, Word, iBook, etc. formats, it is not regarded as an article. And what to do if the metadata has several PDFs (or other formats) attached to it, which is common with theses? When in doubt, leave it out... Google Scholar is about being ultra-selective on the Internet.

The third problem is that Googlebot does not always crawl an entire site. It optimizes its time to best use. One trick it uses is to limit the depth of the link tree it searches; another is not to go too far at any one level. In the case of a publisher site the depth is relatively shallow and each list is short: one finds the list of issues, each leads to a list of articles, and bingo! Or possibly years -> issues -> articles. Repositories are not necessarily so well organized. Unless they are optimized for Googlebot (EPrints is), the robot might well find a year index leading to thousands of 'articles' per year. Googlebot gives up well before the end, and next time it may well do the same (the toy crawl sketch below illustrates the effect). Optimal is to have a link to 'most recent deposits' high on the home page (so the robot finds it early), and to provide Googlebot with an easy way to eventually reach all of the site. The Google database can then build up over time.

And finally, Problem No 4: how does Google Scholar regard the metadata? It prefers publisher formats (see the meta-tag example below). This is discussed in the paper Wouter cites. BTW, note that Wouter Gerritsma's comments indicate that Problem No 2 or Problem No 4 dominates over No 3 for his repository: Google knows about the article (Googlebot indexed it) but Google Scholar doesn't.
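To make Problem No 1 concrete, here is the promised robots.txt sketch: a minimal check using only the Python standard library. The repository URLs are invented for illustration.

    # Check what a repository's robots.txt permits. A site that wanted to
    # bar all robots would have to deliberately serve, at its root:
    #     User-agent: *
    #     Disallow: /
    # With no robots.txt at all (the common case), everything is allowed.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://repository.example.edu/robots.txt")
    rp.read()
    print(rp.can_fetch("Googlebot", "https://repository.example.edu/id/eprint/123"))

If that prints False, someone has done the extra work I mentioned, deliberately or not.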
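And the promised toy crawl sketch for Problem No 3. This is purely illustrative (nothing to do with Google's real code), but it shows why a crawler with a depth limit and a per-page link budget never reaches the bottom of a huge year index:

    # Toy breadth-first crawler with a depth limit and a per-page link
    # budget, to illustrate why deep, wide year indexes get truncated.
    from collections import deque

    def crawl(site, start, max_depth=3, links_per_page=100):
        """site maps each page to its list of outgoing links."""
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            page, depth = queue.popleft()
            if depth == max_depth:
                continue  # depth budget exhausted; deeper pages never visited
            for link in site.get(page, [])[:links_per_page]:  # breadth budget
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen

    # Hypothetical repository: one year page linking to 5000 records.
    site = {"home": ["year-2012"],
            "year-2012": [f"record-{i}" for i in range(5000)]}
    print(len(crawl(site, "home")))  # 102 pages found; 4900 records missed

A 'most recent deposits' link high on the home page keeps the new material inside that budget on every visit, which is why it helps.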
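Finally, the meta-tag example for Problem No 4. 'Publisher formats' in practice means bibliographic meta tags in the HTML head of each abstract page. The Highwire Press tag names below are the ones Google Scholar's inclusion guidelines recommend; the values are invented:

    <head>
      <!-- Highwire Press style tags that Google Scholar's parser looks for -->
      <meta name="citation_title" content="An Example Article Title">
      <meta name="citation_author" content="Smith, Jane">
      <meta name="citation_publication_date" content="2012/06/01">
      <meta name="citation_journal_title" content="Example Journal">
      <!-- This last tag hangs the open-access PDF off the metadata entry -->
      <meta name="citation_pdf_url" content="https://repository.example.edu/id/eprint/123.pdf">
    </head>

Note citation_pdf_url: that is exactly the 'PDF hanging off a metadata entry' of Problem No 2.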
I agree with Stevan Harnad that we need fuller repositories, but disagree that there is any inconsistency in also pressing for improved performance by Google Scholar. The people who deposit papers in repositories and the programmers at Google are almost completely disjoint groups (Google workers don't publish much; they keep the processes a commercial secret). We can do both at once, and they will have a synergistic effect on each other. Lack of synergy holds back open access.

But enough of that; I just wanted to explain what is happening with Google Scholar.

Arthur Sale
Tasmania, Australia

From: goal-boun...@eprints.org [mailto:goal-boun...@eprints.org] On Behalf Of Sally Morris
Sent: Saturday, 5 January 2013 11:14 PM
To: 'Global Open Access List (Successor of AmSci)'
Subject: [GOAL] Re: Searching for OA vs. Providing OA

It's my understanding that Google (and Google Scholar) find published articles because the publishers enable crawling - whether the content is freely available or not (if I'm oversimplifying, someone will no doubt set me right). Are repository managers unintentionally blocking this?

Sally

Sally Morris
South House, The Street, Clapham, Worthing, West Sussex, UK BN13 3UU
Tel: +44 (0)1903 871286
Email: sa...@morris-assocs.demon.co.uk

On Fri, Jan 4, 2013 at 5:03 PM, Gerritsma, Wouter <wouter.gerrit...@wur.nl> wrote:

Google Scholar is a very good full-text scholarly search engine, no doubt about it. But it doesn't find all the full text available on the web, although it does a good job. Take e.g. one of my articles: http://scholar.google.com/scholar?cluster=17014920805021872143&hl=en&as_sdt=0,5 GS found two PDF versions, but not the one in our university's repository, which is still not fully indexed. Although it gets close - http://library.wur.nl/WebQuery/wurpubs/lang/380005 - it found our metadata record, but not the full text. I guess this is still the case with many repositories. Earlier this year it was even reported in the literature:

Arlitsch, K. & P.S. O'Brien (2012). Invisible institutional repositories: addressing the low indexing ratios of IRs in Google. Library Hi Tech, 30(1): 60-81. http://dx.doi.org/10.1108/07378831211213210

So Google Scholar is still not the cure-all for all the OA available in the world. Interestingly, our repository is better indexed in the standard Google search engine than in the Scholar version. So my point is: doing a search on GS and finding a lot of hits still doesn't guarantee finding the full text of those papers.