Re: Google Scholar

2005-02-16 Thread Lee Giles

Google Scholar is outstanding, but I still feel there is a place
for specialized search in topical domains such as CiteSeer, which
I maintain. Our community still very much likes CiteSeer but
also uses Google Scholar.

Best

Lee Giles

Thomas Walker wrote:


As T.S.Mahadevan recently pointed out on the BOAI Forum, what those who
are searching for open archive and other scholarly literature really want
is a single website where they can search the entire set of such
literature.

Google is already accounting for a significant portion of the hits on the
OA journal articles I monitor.  Might Google Scholar be that website?

===

Google Scholar (beta version online at http://scholar.google.com)
restricts Google searches to scholarly literature, including
peer-reviewed papers, theses, books, preprints, abstracts and technical
reports from all fields of research, and finds articles from a wide
variety of academic publishers, professional societies, preprint
repositories and universities, as well as scholarly articles available
across the web.  Google Scholar ranks search results by their relevance
to the query, so the most useful references should appear at the top of
the page. The relevance ranking takes into account the full text of each
article as well as the article's author, the publication in which the
article appeared and how often it has been cited in scholarly
literature. Google Scholar also automatically analyzes and extracts
citations and presents them as separate results, even if the documents
they refer to are not online. This means that search results may include
citations of older works and seminal articles that appear only in books
or other offline publications. [Parts of this description taken directly
from http://scholar.google.com/scholar/about.html#about.]

===

Tom Walker




Thomas J. Walker
Department of Entomology & Nematology
PO Box 110620 (or Natural Area Drive)
University of Florida, Gainesville, FL 32611-0620
E-mail: t...@ufl.edu  (or tjwal...@ifas.ufl.edu)
FAX: (352)392-0190
Web: http://tjwalker.ifas.ufl.edu



Re: Google Scholar

2005-02-16 Thread Heather Morrison

With all the emphasis on immediate open access, I'm wondering - how up
to date is Google Scholar?

A quick search by publication year yields the following:

2001:  62,000 items
2002:  68,600 items
2003:  63,700 items
2004:  8,060 items

While it is possible that the 2004 statistics are not yet complete due to
publishing delays, this does suggest to me that there is a delay in
Google Scholar harvesting - whether of open access or
subscription-based resources, or both, is hard to say of course.  I do
think these data suggest that if there is one place to look for OA
materials at the moment, it is not Google Scholar.
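
The counts above can be put on a common scale with a few lines of Python.
This is only a rough sketch: it takes the four counts exactly as quoted,
and using the 2001-2003 mean as the baseline is an assumption made for
illustration, not part of the original search.

```python
# Item counts by publication year, as quoted in the message above.
counts = {2001: 62_000, 2002: 68_600, 2003: 63_700, 2004: 8_060}

# Assumption: treat the mean of 2001-2003 as a "typical" complete year.
baseline_years = (2001, 2002, 2003)
baseline = sum(counts[y] for y in baseline_years) / len(baseline_years)

# Share of a typical year that the 2004 count represents.
ratio_2004 = counts[2004] / baseline

print(f"2001-2003 mean: {baseline:,.0f} items")
print(f"2004 as a share of that mean: {ratio_2004:.1%}")
```

On these figures 2004 comes out at roughly an eighth of the earlier
yearly average. The calculation only sizes the gap; it cannot say whether
publishing delays or harvesting delays account for it.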

My own searching confirms this suspicion - I am finding that if a
needed item is not found in Google Scholar, then an open access copy
may well still be found through a regular web search.

hope this helps,

Heather Morrison


On 16-Feb-05, at 6:16 AM, Thomas Walker wrote:


[...]



Heather G. Morrison
Project Coordinator
BC Electronic Library Network

Phone: 604-268-7001
Fax: 604-291-3023
Email:  heath...@eln.bc.ca
Web: http://www.eln.bc.ca


Re: Google Scholar

2005-02-16 Thread Hamaker, Chuck
Heather and others: In my experience with Google Scholar, the date search
can be misleading and is generally inaccurate. Blackwell's journals, for
example, were indexed by GS before they were in CINAHL when I checked
Blackwell's nursing journals and their most recent issues against Google
Scholar content. EBSCO has introduced a "Pre-CINAHL" to speed up the
identification of nursing content. I didn't compare GS to Pre-CINAHL,
but that would be a good way to check the timeliness of coverage in GS.

In GS, several publishers and aggregators were indexed very quickly in
what I looked at in December. As far as I could tell, 100% of Extenza
content was indexed, for example, and I noted particularly rapid
indexing of a significant percentage, though not all, of Ingenta content.

The other area where currency of coverage seems pretty good is the 35 or
so CrossRef publishers working with Google. For Open Access, the links
to "other" versions were particularly useful: for some journal titles and
in some subject fields, I found that significant portions of the content
were also available on individual or institutional servers. Bringing the
original article together with the archived versions is a unique service,
and for secondary searching (i.e. if your local resources fail to
provide access to the article you need) it is a powerful tool.
 
Chuck Hamaker
Associate University Librarian Collections and Technical Services
Atkins Library
University of North Carolina Charlotte
Charlotte, NC 28223
phone 704 687-2825


-Original Message-
From: American Scientist Open Access Forum
[mailto:american-scientist-open-access-fo...@listserver.sigmaxi.org] On
Behalf Of Heather Morrison
Sent: Wednesday, February 16, 2005 12:48 PM
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Subject: Re: Google Scholar

[...]


[GOAL] Fwd: Re: Google Scholar discoverability of repository content

2012-02-17 Thread Stevan Harnad
Important feedback from Tim Brody, one of the developers of EPrints:

Begin forwarded message:

From: Tim Brody
Date: February 17, 2012 6:33:22 AM EST
To: eprints-t...@ecs.soton.ac.uk
Cc: jisc-repositor...@jiscmail.ac.uk
Subject: [EP-tech] Re: Google Scholar discoverability of repository content


Hi All,

Here is some specific advice for existing repository administrators from
Google Scholar:
http://roar.eprints.org/help/google_scholar.html

As far as I'm aware there isn't anyone running EPrints 2 now, so
EPrints-based repositories are already (and have been for a long time)
the "best in class" for Google Scholar.


Right, this paper ...

Table 1 is irrelevant and misleading. Scholar links first to the
publisher and, only if there is no publisher link, directly to the IR
version. That's a policy decision on the part of Scholar and nothing to
do with IRs.

Table 2 gives us some useful data. The headline rate for EPrints is 88%
(based on CalTech). Unfortunately the authors haven't provided an
analysis of what happened to the missing records. I've done a quick
random sample of CalTech and I suspect the missing records will consist
of:
1) Non-OA/non-full-text records (I'm sure a query to the CalTech
repository admin could supply the data).
2) A percentage of PDFs that Scholar won't be able to parse. CalTech
contains some old (1950s), scanned PDFs from journals. Where the article
isn't at the top of the page, Scholar will struggle to parse the
title/authors/abstract and therefore won't be able to match it to their
records, e.g. http://authors.library.caltech.edu/5815/


The remainder of the paper describes the authors' process of fixing
their own IR (based on CONTENTdm).


The authors then wrongly conclude:

"Despite GS’s endorsement of three software packages, the surveys
conducted for this paper demonstrates that software is not a deciding
factor for indexing ratio in GS. Each of the three recommended software
packages showed good indexing ratios for some repositories and poor
ratios for others."

The authors looked at one instance of EPrints and, despite its being a
relatively old version, found 88% of its records indexed in GS.

It is unfortunate that this paper has suggested that IR software in
general is poorly indexed in GS. On the contrary, some badly implemented
IR software is poorly indexed in GS.


After all that is said, the most critical factor for IR visibility is
having (BOAI definition) open access content. Hiding content behind
search forms, click-throughs and other things that emphasise the IR at
the expense of the content will hurt your visibility.

Lastly, Google will index your metadata-only records while Google
Scholar is looking for full-texts. Your GS/Google ratio will approximate
how many of your records have an attached open access PDF (.doc etc).


Sincerely,
Tim Brody
(EPrints Developer)

On Wed, 2012-02-15 at 11:31 +, Stevan Harnad wrote:
  Can we enhance the google-scholar discoverability of EPrints (and
  DSpace) repositories?

  http://linksource.ebsco.com/linking.aspx?sid=google&auinit=K&aulast=Arlitsch&atitle=Invisible+Institutional+Repositories:+Addressing+the+Low+Indexing+Ratios+of+IRs+in+Google+Scholar&title=Library+Hi+Tech&volume=30&issue=1&date=2012&spage=4&issn=0737-8831

  Kenning Arlitsch, Patrick Shawn OBrien, (2012) "Invisible Institutional
  Repositories: Addressing the Low Indexing Ratios of IRs in Google
  Scholar", Library Hi Tech, Vol. 30 Iss: 1

  Purpose - Google Scholar has difficulty indexing the contents of
  institutional repositories, and the authors hypothesize the reason is
  that most repositories use Dublin Core, which cannot express
  bibliographic citation information adequately for academic papers.
  Google Scholar makes specific recommendations for repositories,
  including the use of publishing industry metadata schemas over Dublin
  Core. This paper tests a theory that transforming metadata schemas in
  institutional repositories will lead to increased indexing by Google
  Scholar.

  Design/methodology/approach - The authors conducted two surveys of
  institutional and disciplinary repositories across the United States,
  using different methodologies. They also conducted three pilot projects
  that transformed the metadata of a subset of papers from USpace, the
  University of Utah's institutional repository, and examined the results
  of Google Scholar's explicit harvests.

  Findings - Repositories that use GS recommended metadata schemas and
  express them in HTML meta tags experienced significantly higher indexing
  ratios. The eas
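
The Findings above refer to GS-recommended metadata expressed in HTML
meta tags. As a minimal sketch of what that can look like, the Python
below renders a handful of the Highwire Press-style tag names listed in
Google Scholar's inclusion guidelines (citation_title, citation_author,
citation_publication_date, citation_pdf_url). The helper name
scholar_meta_tags, the record layout and the PDF URL are hypothetical and
chosen only for illustration; it is not how EPrints or any other
repository package actually emits its tags.

```python
from html import escape


def scholar_meta_tags(record):
    """Render Highwire Press-style <meta> tags for one repository record."""
    tags = [("citation_title", record["title"])]
    # One citation_author tag per author, in the order given.
    tags += [("citation_author", author) for author in record["authors"]]
    tags.append(("citation_publication_date", record["date"]))
    tags.append(("citation_pdf_url", record["pdf_url"]))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in tags
    )


# Hypothetical record; the bibliographic details come from the citation above.
record = {
    "title": ("Invisible Institutional Repositories: Addressing the Low "
              "Indexing Ratios of IRs in Google Scholar"),
    "authors": ["Arlitsch, Kenning", "OBrien, Patrick Shawn"],
    "date": "2012",
    "pdf_url": "http://repository.example.org/12345/1/paper.pdf",  # made-up URL
}

print(scholar_meta_tags(record))
```

The output would go in the head of the record's landing page, next to
whatever Dublin Core tags the repository already produces.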

[GOAL] {Disarmed} Re: Google Scholar discoverability of repository content

2012-02-17 Thread Stevan Harnad
Begin forwarded message:

From: Betsy Coles
Date: February 17, 2012 5:48:42 PM EST
To: jisc-repositor...@jiscmail.ac.uk
Subject: Re: [EP-tech] Re: Google Scholar discoverability of repository content

I'm the technical manager for the main IR at Caltech, CaltechAUTHORS
(http://authors.library.caltech.edu), currently running EPrints 3.1.3.

Tim's conjecture 1) below seems to account almost exactly for the result
the article authors found: 87.7% of the 25,072 eprints in CaltechAUTHORS
have OA documents attached; the remainder have only documents that are
either restricted to campus or to repository staff.  I don't think there
are very many cases of Tim's conjecture 2), since we have concentrated on
adding current content.

I haven't read the article in question (we don't subscribe), but the
percentage of open access eprints is almost exactly the same as the
authors' report of GS indexed items in Table 2.  I haven't tested
specifically, but it's tempting to conclude that GS is indexing 100% of
our open access content.
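
The arithmetic behind that comparison is short enough to spell out. The
sketch below uses only the figures quoted in this thread (25,072 eprints,
87.7% with OA documents, and the 88% headline rate reported for Table 2).
The 88% is a rounded figure, so the size of the gap is indicative at
best.

```python
total_eprints = 25_072    # eprints in CaltechAUTHORS, as stated above
oa_share = 0.877          # share with OA documents attached, as stated above
gs_indexed_share = 0.88   # rounded headline GS rate quoted from Table 2

# Approximate record counts implied by the stated share.
oa_eprints = round(total_eprints * oa_share)
restricted_eprints = total_eprints - oa_eprints

# Difference between the GS rate and the OA share, in percentage points.
gap_points = (gs_indexed_share - oa_share) * 100

print(f"OA eprints:             ~{oa_eprints:,}")
print(f"Restricted eprints:     ~{restricted_eprints:,}")
print(f"GS rate minus OA share: {gap_points:.1f} percentage points")
```

A gap of a few tenths of a percentage point is within the rounding of the
88% figure, which is consistent with the two shares being "almost exactly
the same"; it does not by itself show that the same records are involved.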

Betsy Coles
Caltech Library IT Group
bco...@caltech.edu

-Original Message-
From: eprints-tech-boun...@ecs.soton.ac.uk
[mailto:eprints-tech-boun...@ecs.soton.ac.uk] On Behalf Of Tim Brody
Sent: Friday, February 17, 2012 3:33 AM
To: eprints-t...@ecs.soton.ac.uk
Cc: jisc-repositor...@jiscmail.ac.uk
Subject: [EP-tech] Re: Google Scholar discoverability of repository content

[...]

[GOAL] Re: {Disarmed} Re: Google Scholar discoverability of repository content

2012-02-20 Thread Dirk Pieper
Hi,

there were several articles by Peter Jacso in the past regarding search quality
in Google Scholar (GS), see for example

http://www.libraryjournal.com/article/CA6698580.html

The GS "inclusion guidelines" are about three years old now, so I'm wondering
about the discussion now. In the past, the number of documents from repositories
was even higher in Google than in GS! Our experience with our own repositories
shows that providing GS metadata clearly increased the number of documents from
our repositories in GS, but coverage is still below 50%. BASE covers repository
content much better than GS, though of course GS has other qualities (citation
counts, library links, ...).

From my point of view, the exciting question is whether GS uses the GS metadata
only to get the full text more easily from a repository, or whether it uses the
metadata in addition to the full text in order to improve search quality within
GS. The other question is how much the Google/GS ratio of documents from
repositories has changed in recent years.

Best
Dirk

--
Dirk Pieper
Bielefeld UL - BASE
Universitätsstr. 25, D-33615 Bielefeld
E-mail: dirk.pie...@uni-bielefeld.de | Tel.: +49 521 106-4010
Fax: +49 521 106-4052

www.ub.uni-bielefeld.de
www.base-search.net
--


+++ Welcome to the 10th International Bielefeld Conference,
24. - 26. April 2012,
http://conference.ub.uni-bielefeld.de +++



- Original Message -
From: Stevan Harnad
Date: Saturday, 18 February 2012, 8:31
Subject: [GOAL] {Disarmed} Re: Google Scholar discoverability of repository
content
To: "Global Open Access List (Successor of AmSci)"
Cc: SPARC IR

[...]
[GOAL] Re: {Disarmed} Re: Google Scholar discoverability of repository content

2012-02-20 Thread Dirk Pieper
Hi,
 
there have been several articles by Péter Jacsó regarding search
quality in Google Scholar (GS); see for example

  http://www.libraryjournal.com/article/CA6698580.html

The GS "inclusion guidelines" are about three years old now, so I'm wondering
about the discussion now. In the past, the number of documents from
repositories was even higher in Google than in GS! Experience with our own
repositories shows that providing GS metadata clearly increased the number of
documents from our repositories in GS, but coverage is still below 50%. BASE
covers repository content much better than GS, although GS has other qualities
(citation counts, library links, ...).

From my point of view the exciting question is whether GS uses the GS metadata
only to retrieve the full text more easily from a repository, or whether GS
uses the metadata in addition to the full text in order to improve search
quality within GS. The other question is how much the Google/GS ratio of
documents from repositories has changed in recent years.
 
 Best
 Dirk
 
 --
 Dirk Pieper
 Bielefeld UL - BASE
 Universitätsstr. 25, D-33615 Bielefeld
 E-mail: dirk.pieper at uni-bielefeld.de | Tel.: +49 521 106-4010
 Fax: +49 521 106-4052
 
 www.ub.uni-bielefeld.de
 www.base-search.net
 --
 
 
 +++ Welcome to the 10th International Bielefeld Conference,
24. - 26. April 2012,
http://conference.ub.uni-bielefeld.de +++



- Original Message -
From: Stevan Harnad
Date: Saturday, 18 February 2012, 8:31
Subject: [GOAL] {Disarmed} Re: Google Scholar discoverability of repository
content
To: "Global Open Access List (Successor of AmSci)"
Cc: SPARC IR




> Begin forwarded message:
>
> From: Betsy Coles
> Date: February 17, 2012 5:48:42 PM EST
> To: JISC-REPOSITORIES at JISCMAIL.AC.UK
> Subject: Re: [EP-tech] Re: Google Scholar discoverability of repository
> content
>
> I'm the technical manager for the main IR at Caltech, CaltechAUTHORS
> (http://authors.library.caltech.edu), currently running EPrints 3.1.3.
>
> Tim's conjecture 1) below seems to account almost exactly for the result
> the article authors found: 87.7% of the 25,072 eprints in CaltechAUTHORS
> have OA documents attached; the remainder have only documents that are
> either restricted to campus or to repository staff.  I don't think there
> are very many cases of Tim's conjecture 2), since we have concentrated on
> adding current content.
>
> I haven't read the article in question (we don't subscribe), but the
> percentage of open access eprints is almost exactly the same as the
> authors' report of GS indexed items in Table 2.  I haven't tested
> specifically, but it's tempting to conclude that GS is indexing 100% of
> our open access content.
> 
> Betsy Coles
> Caltech Library IT Group
> bcoles at caltech.edu
> 
> -Original Message-
> From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at 
> ecs.soton.ac.uk] On Behalf Of Tim Brody
> Sent: Friday, February 17, 2012 3:33 AM
> To: eprints-tech at ecs.soton.ac.uk
> Cc: JISC-REPOSITORIES at JISCMAIL.AC.UK
> Subject: [EP-tech] Re: Google Scholar discoverability of repository content
> 
> Hi All,
> 
> Here is some specific advice for existing repository administrators from 
> Google Scholar:
> http://roar.eprints.org/help/google_scholar.html
> 
> As far as I'm aware there isn't anyone running EPrints 2 now, so
> EPrints-based repositories are already (and for a long time) the "best in
> class" for Google Scholar.
> 
> 
> Right, this paper ...
> 
> Table 1 is irrelevant and misleading. Scholar links first to the publisher 
> and, only if there is no publisher link, directly to the IR version. That's a 
> policy decision on the part of Scholar and nothing to do with IRs.
> 
> Table 2 gives us some useful data. The headline rate for EPrints is 88% 
> (based on CalTech). Unfortunately the authors haven't provided an analysis of 
> what happened to the missing records. I've done a quick random sample of 
> CalTech and I suspect the missing records will consist
> of:
> 1) Non-OA/non-full-text records (I'm sure a query to the CalTech repository 
> admin could supply the data).
> 2) A percentage of PDFs that Scholar won't be able to parse. CalTech contains 
> some old (1950s), scanned PDFs from Journals. Where the article isn't at the 
> top of the page Scholar will struggl