Re: Google Scholar
Google Scholar is outstanding, but I still feel there is a place for specialized search in topical domains, such as CiteSeer, which I maintain. Our community still very much likes CiteSeer but also uses Google Scholar.

Best,
Lee Giles

Thomas Walker wrote:

> As T.S.Mahadevan recently pointed out on the BOAI Forum, what those who are searching for open archive and other scholarly literature really want is a single website where they can search the entire set of such literature.
>
> Google is already accounting for a significant portion of the hits on the OA journal articles I monitor. Might Google Scholar be that website?
>
> ===
>
> Google Scholar (beta version online at http://scholar.google.com) restricts Google searches to scholarly literature, including peer-reviewed papers, theses, books, preprints, abstracts and technical reports from all fields of research, and finds articles from a wide variety of academic publishers, professional societies, preprint repositories and universities, as well as scholarly articles available across the web. Google Scholar ranks search results by their relevance to the query, so the most useful references should appear at the top of the page. The relevance ranking takes into account the full text of each article as well as the article's author, the publication in which the article appeared and how often it has been cited in scholarly literature. Google Scholar also automatically analyzes and extracts citations and presents them as separate results, even if the documents they refer to are not online. This means that search results may include citations of older works and seminal articles that appear only in books or other offline publications. [Parts of this description taken directly from http://scholar.google.com/scholar/about.html#about.]
>
> ===
>
> Tom Walker
>
> Thomas J. Walker
> Department of Entomology & Nematology
> PO Box 110620 (or Natural Area Drive)
> University of Florida, Gainesville, FL 32611-0620
> E-mail: t...@ufl.edu (or tjwal...@ifas.ufl.edu)
> FAX: (352)392-0190
> Web: http://tjwalker.ifas.ufl.edu
Re: Google Scholar
With all the emphasis on immediate open access, I'm wondering - how up to date is Google Scholar? A quick search by publication year yields the following:

2001: 62,000 items
2002: 68,600 items
2003: 63,700 items
2004: 8,060 items

While it is possible that the 2004 figure is incomplete because of publishing delays, this does suggest to me that there is a delay in Google Scholar harvesting - whether of open access resources, subscription-based resources, or both is hard to say, of course.

I do think these data suggest that if there is one place to look for OA materials at the moment, it is not Google Scholar. My own searching confirms this suspicion - I am finding that if a needed item is not found in Google Scholar, an open access copy may well still be found through a regular web search.

Hope this helps,

Heather Morrison

On 16-Feb-05, at 6:16 AM, Thomas Walker wrote:

> As T.S.Mahadevan recently pointed out on the BOAI Forum, what those who are searching for open archive and other scholarly literature really want is a single website where they can search the entire set of such literature.
>
> Google is already accounting for a significant portion of the hits on the OA journal articles I monitor. Might Google Scholar be that website?
>
> [...]

Heather G. Morrison
Project Coordinator
BC Electronic Library Network
Phone: 604-268-7001
Fax: 604-291-3023
Email: heath...@eln.bc.ca
Web: http://www.eln.bc.ca
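To put the year counts quoted in the message above in proportion, here is a minimal sketch that compares the 2004 figure with the average of the three preceding years. The counts are the ones reported in the message; treating the 2001-2003 mean as a baseline is an assumption made only for illustration, not a claim about how Google Scholar harvests content.

```python
# Result counts reported in the message above (search by publication year).
counts = {2001: 62_000, 2002: 68_600, 2003: 63_700, 2004: 8_060}

# Assumed baseline for illustration: the mean of the three complete years.
baseline = sum(counts[year] for year in (2001, 2002, 2003)) / 3

# Share of that baseline which the 2004 count had reached when searched.
ratio_2004 = counts[2004] / baseline

print(f"baseline per year: {baseline:,.0f}")
print(f"2004 as share of baseline: {ratio_2004:.1%}")
```

On these numbers the 2004 count is roughly an eighth of a typical earlier year, which is the gap the message points to. As a later reply notes, date-restricted searches may themselves be unreliable, so the ratio is only as good as the counts behind it.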
Re: Google Scholar
Heather and others:

In my experience with Google Scholar, the date search can be misleading and is generally inaccurate. For example, when I checked Blackwell's nursing journals and their most recent issues against Google Scholar content, they had been indexed by GS before they appeared in CINAHL. EBSCO has introduced a "Pre-CINAHL" to speed up the identification of nursing content. I didn't compare GS to Pre-CINAHL, but that would be a good way to check the timeliness of coverage in GS.

In GS, several publishers and aggregators were indexed very quickly in what I looked at in December. As far as I could tell, 100% of Extenza content was indexed, for example, and I noted particularly rapid indexing of a significant percentage, but not all, of Ingenta content. The other area where currency of coverage seems pretty good is the some 35 CrossRef publishers working with Google.

For Open Access, the links to "other" versions were particularly useful, as I found that for some journal titles and in some subject fields, significant portions of the content were also available on individual or institutional servers. Bringing the original article together with the archived versions is a unique service that is a powerful tool for secondary searching (i.e. when your local resources fail to provide access to the article you need).

Chuck Hamaker
Associate University Librarian Collections and Technical Services
Atkins Library
University of North Carolina Charlotte
Charlotte, NC 28223
phone 704 687-2825

-----Original Message-----
From: American Scientist Open Access Forum [mailto:american-scientist-open-access-fo...@listserver.sigmaxi.org] On Behalf Of Heather Morrison
Sent: Wednesday, February 16, 2005 12:48 PM
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Subject: Re: Google Scholar

With all the emphasis on immediate open access, I'm wondering - how up to date is google scholar?

[...]
[GOAL] Fwd: Re: Google Scholar discoverability of repository content
Important feedback from Tim Brody, one of the developers of EPrints:

Begin forwarded message:

From: Tim Brody
Date: February 17, 2012 6:33:22 AM EST
To: eprints-t...@ecs.soton.ac.uk
Cc: jisc-repositor...@jiscmail.ac.uk
Subject: [EP-tech] Re: Google Scholar discoverability of repository content

Hi All,

Here is some specific advice for existing repository administrators from Google Scholar:
http://roar.eprints.org/help/google_scholar.html

As far as I'm aware there isn't anyone running EPrints 2 now, so EPrints-based repositories are already (and have been for a long time) the "best in class" for Google Scholar.

Right, this paper ...

Table 1 is irrelevant and misleading. Scholar links first to the publisher and, only if there is no publisher link, directly to the IR version. That's a policy decision on the part of Scholar and nothing to do with IRs.

Table 2 gives us some useful data. The headline rate for EPrints is 88% (based on CalTech). Unfortunately the authors haven't provided an analysis of what happened to the missing records. I've done a quick random sample of CalTech and I suspect the missing records will consist of:
1) Non-OA/non-full-text records (I'm sure a query to the CalTech repository admin could supply the data).
2) A percentage of PDFs that Scholar won't be able to parse. CalTech contains some old (1950s), scanned PDFs from journals. Where the article isn't at the top of the page, Scholar will struggle to parse the title/authors/abstract and therefore won't be able to match it to their records, e.g. http://authors.library.caltech.edu/5815/

The remainder of the paper describes the authors' process of fixing their own IR (based on CONTENTdm).

The authors then wrongly conclude:

"Despite GS's endorsement of three software packages, the surveys conducted for this paper demonstrates that software is not a deciding factor for indexing ratio in GS. Each of the three recommended software packages showed good indexing ratios for some repositories and poor ratios for others."

The authors looked at one instance of EPrints and, despite its being a relatively old version, found 88% of its records indexed in GS.

It is unfortunate that this paper has suggested that IR software in general is poorly indexed in GS. On the contrary, some badly implemented IR software is poorly indexed in GS.

After all that is said, the most critical factor in IR visibility is having (BOAI definition) open access content. Hiding content behind search forms, click-throughs and other things that emphasise the IR at the expense of the content will hurt your visibility.

Lastly, Google will index your metadata-only records while Google Scholar is looking for full-texts. Your GS/Google ratio will approximate how many of your records have an attached open access PDF (.doc etc).

Sincerely,
Tim Brody
(EPrints Developer)

On Wed, 2012-02-15 at 11:31 +, Stevan Harnad wrote:

> Can we enhance the google-scholar discoverability of EPrints (and DSpace) repositories?
>
> http://linksource.ebsco.com/linking.aspx?sid=google&auinit=K&aulast=Arlitsch&atitle=Invisible+Institutional+Repositories:+Addressing+the+Low+Indexing+Ratios+of+IRs+in+Google+Scholar&title=Library+Hi+Tech&volume=30&issue=1&date=2012&spage=4&issn=0737-8831
>
> Kenning Arlitsch, Patrick Shawn OBrien, (2012) "Invisible Institutional Repositories: Addressing the Low Indexing Ratios of IRs in Google Scholar", Library Hi Tech, Vol. 30 Iss: 1
>
> Purpose - Google Scholar has difficulty indexing the contents of institutional repositories, and the authors hypothesize the reason is that most repositories use Dublin Core, which cannot express bibliographic citation information adequately for academic papers. Google Scholar makes specific recommendations for repositories, including the use of publishing industry metadata schemas over Dublin Core. This paper tests a theory that transforming metadata schemas in institutional repositories will lead to increased indexing by Google Scholar.
>
> Design/methodology/approach - The authors conducted two surveys of institutional and disciplinary repositories across the United States, using different methodologies. They also conducted three pilot projects that transformed the metadata of a subset of papers from USpace, the University of Utah's institutional repository, and examined the results of Google Scholar's explicit harvests.
>
> Findings - Repositories that use GS recommended metadata schemas and express them in HTML meta tags experienced significantly higher indexing ratios. The eas
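The abstract above refers to expressing GS-recommended metadata in HTML meta tags. As a hedged illustration of what that can look like, the sketch below renders one record as `<meta>` elements using the `citation_*` tag names listed in Google Scholar's inclusion guidelines. The helper name `scholar_meta_tags`, the record layout, and the example.org PDF URL are assumptions made for this example; it is not the code used by EPrints, CONTENTdm, or the paper's authors.

```python
from html import escape


def scholar_meta_tags(record):
    """Render one repository record as HTML <meta> lines (illustrative helper)."""
    tags = [("citation_title", record["title"])]
    # One citation_author tag per author, in the order they appear on the paper.
    tags += [("citation_author", author) for author in record["authors"]]
    tags.append(("citation_publication_date", record["date"]))
    tags.append(("citation_journal_title", record["journal"]))
    tags.append(("citation_pdf_url", record["pdf_url"]))
    # Escape attribute values so quotes or ampersands cannot break the markup.
    return [
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    ]


# Bibliographic details are from the citation above; the PDF URL is a placeholder.
record = {
    "title": "Invisible Institutional Repositories: Addressing the Low "
             "Indexing Ratios of IRs in Google Scholar",
    "authors": ["Arlitsch, Kenning", "OBrien, Patrick Shawn"],
    "date": "2012",
    "journal": "Library Hi Tech",
    "pdf_url": "https://repository.example.org/123/1/paper.pdf",
}

for line in scholar_meta_tags(record):
    print(line)
```

The resulting lines would go in the `<head>` of the record's abstract page. Whether this improves indexing for a given repository depends on the other factors discussed in this thread, above all whether an open access full text is actually attached.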
[GOAL] {Disarmed} Re: Google Scholar discoverability of repository content
Begin forwarded message:

From: Betsy Coles
Date: February 17, 2012 5:48:42 PM EST
To: jisc-repositor...@jiscmail.ac.uk
Subject: Re: [EP-tech] Re: Google Scholar discoverability of repository content

I'm the technical manager for the main IR at Caltech, CaltechAUTHORS (http://authors.library.caltech.edu), currently running EPrints 3.1.3.

Tim's conjecture 1) below seems to account almost exactly for the result the article authors found: 87.7% of the 25,072 eprints in CaltechAUTHORS have OA documents attached; the remainder have only documents that are either restricted to campus or to repository staff. I don't think there are very many cases of Tim's conjecture 2), since we have concentrated on adding current content.

I haven't read the article in question (we don't subscribe), but the percentage of open access eprints is almost exactly the same as the authors' report of GS indexed items in Table 2. I haven't tested specifically, but it's tempting to conclude that GS is indexing 100% of our open access content.

Betsy Coles
Caltech Library IT Group
bco...@caltech.edu

-----Original Message-----
From: eprints-tech-boun...@ecs.soton.ac.uk [mailto:eprints-tech-boun...@ecs.soton.ac.uk] On Behalf Of Tim Brody
Sent: Friday, February 17, 2012 3:33 AM
To: eprints-t...@ecs.soton.ac.uk
Cc: jisc-repositor...@jiscmail.ac.uk
Subject: [EP-tech] Re: Google Scholar discoverability of repository content

[...]

Table 2 gives us some useful data. The headline rate for EPrints is 88% (based on CalTech). Unfortunately the authors haven't provided an analysis of what happened to the missing records. I've done a quick random sample of CalTech and I suspect the missing records will consist of:
1) Non-OA/non-full-text records (I'm sure a query to the CalTech repository admin could supply the data).
2) A percentage of PDFs that Scholar won't be able to parse. CalTech contains some old (1950s), scanned PDFs from Journals. Where the article isn't at the top of the page Scholar will struggle to parse the title/authors/abstract and therefore won't be able to match it to their records e.g. http://authors.library.caltech.edu/5815/

[...]
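The comparison made in the message above can be checked with simple arithmetic. The sketch below uses only figures quoted in the thread (25,072 eprints, 87.7% with OA documents attached, and the 88% headline indexing rate). Reading "88%" as a value rounded to the nearest whole percent is an assumption, and the sketch says nothing about which individual records Google Scholar actually indexed.

```python
total_records = 25_072   # eprints in CaltechAUTHORS, as quoted in the message
oa_share = 0.877         # share with OA documents attached, as quoted
gs_headline = 0.88       # GS indexing rate reported for EPrints (Table 2, as quoted)

# Number of records with an open access document attached.
oa_records = round(total_records * oa_share)

# Assumption: "88%" is rounded to the nearest percent, so the underlying
# indexed count could lie anywhere between 87.5% and 88.5% of all records.
low = total_records * (gs_headline - 0.005)
high = total_records * (gs_headline + 0.005)

print(f"records with OA documents: {oa_records:,}")
print(f"indexed count consistent with 88%: {low:,.0f} to {high:,.0f}")
print("OA count falls in that range:", low <= oa_records <= high)
```

Under that assumption the OA count sits inside the range implied by the headline rate, which is consistent with, though not proof of, the suggestion that the unindexed records are the ones without open access full text.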
[GOAL] Re: {Disarmed} Re: Google Scholar discoverability of repository content
Hi,

there were several articles by Peter Jacso in the past regarding search quality in Google Scholar (GS); see for example http://www.libraryjournal.com/article/CA6698580.html

The GS "inclusion guidelines" are about three years old now, so I'm wondering about the discussion now. In the past, the number of documents from repositories was even higher in Google than in GS!

Experience with our own repositories shows that providing GS metadata clearly increased the number of our repository documents in GS, but coverage is still below 50%. BASE covers repository content much better than GS; of course, GS has other qualities (citation counts, library links, ...).

From my point of view, the exciting question is whether GS uses the GS metadata only to retrieve the full text more easily from a repository, or whether it also uses the metadata alongside the full text to improve search quality within GS.

The other question is how much the Google/GS ratio of documents from repositories has changed in recent years.

Best
Dirk

--
Dirk Pieper
Bielefeld UL - BASE
Universitätsstr. 25, D-33615 Bielefeld
E-mail: dirk.pie...@uni-bielefeld.de | Tel.: +49 521 106-4010
Fax: +49 521 106-4052
www.ub.uni-bielefeld.de
www.base-search.net
--
+++ Welcome to the 10th International Bielefeld Conference, 24. - 26. April 2012, http://conference.ub.uni-bielefeld.de +++

----- Original Message -----
From: Stevan Harnad
Date: Saturday, 18 February 2012, 8:31
Subject: [GOAL] {Disarmed} Re: Google Scholar discoverability of repository content
To: "Global Open Access List (Successor of AmSci)"
Cc: SPARC IR

> Begin forwarded message:
>
> From: Betsy Coles
> Date: February 17, 2012 5:48:42 PM EST
> To: jisc-repositor...@jiscmail.ac.uk
> Subject: Re: [EP-tech] Re: Google Scholar discoverability of repository content
>
> [...]
[GOAL] Re: {Disarmed} Re: Google Scholar discoverability of repository content
Hi,

there have been several articles by Peter Jacso in the past regarding search quality in Google Scholar (GS); see for example http://www.libraryjournal.com/article/CA6698580.html

The GS "inclusion guidelines" are about three years old now, so I wonder why this discussion is coming up now. In the past, the number of documents from repositories was even higher in Google than in GS!

Our experience with our own repositories shows that providing GS metadata clearly increased the number of our repository documents in GS, but coverage is still below 50%. BASE covers repository content much better than GS; of course, GS has other strengths (citation counts, library links, ...).

From my point of view, the exciting question is whether GS uses the GS metadata only to retrieve the full text from a repository more easily, or whether it uses the metadata in addition to the full text in order to improve search quality within GS.

The other question is how much the Google/GS ratio of documents from repositories has changed in recent years.

Best
Dirk

--
Dirk Pieper
Bielefeld UL - BASE
Universitätsstr. 25, D-33615 Bielefeld
E-mail: dirk.pieper at uni-bielefeld.de | Tel.: +49 521 106-4010
Fax: +49 521 106-4052
www.ub.uni-bielefeld.de
www.base-search.net
--
+++ Welcome to the 10th International Bielefeld Conference, 24-26 April 2012, http://conference.ub.uni-bielefeld.de +++

----- Original Message -----
From: Stevan Harnad
Date: Saturday, 18 February 2012, 8:31
Subject: [GOAL] {Disarmed} Re: Google Scholar discoverability of repository content
To: "Global Open Access List (Successor of AmSci)"
Cc: SPARC IR

> Begin forwarded message:
>
> From: Betsy Coles
> Date: February 17, 2012 5:48:42 PM EST
> To: JISC-REPOSITORIES at JISCMAIL.AC.UK
> Subject: Re: [EP-tech] Re: Google Scholar discoverability of repository content
>
> I'm the technical manager for the main IR at Caltech, CaltechAUTHORS
> (http://authors.library.caltech.edu), currently running EPrints 3.1.3.
>
> Tim's conjecture 1) below seems to account almost exactly for the result
> the article authors found: 87.7% of the 25,072 eprints in CaltechAUTHORS
> have OA documents attached; the remainder have only documents that are
> restricted either to campus or to repository staff. I don't think there are
> very many cases of Tim's conjecture 2), since we have concentrated on adding
> current content.
>
> I haven't read the article in question (we don't subscribe), but the
> percentage of open access eprints is almost exactly the same as the authors'
> report of GS-indexed items in Table 2. I haven't tested specifically, but
> it's tempting to conclude that GS is indexing 100% of our open access content.
>
> Betsy Coles
> Caltech Library IT Group
> bcoles at caltech.edu
>
> -----Original Message-----
> From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Tim Brody
> Sent: Friday, February 17, 2012 3:33 AM
> To: eprints-tech at ecs.soton.ac.uk
> Cc: JISC-REPOSITORIES at JISCMAIL.AC.UK
> Subject: [EP-tech] Re: Google Scholar discoverability of repository content
>
> Hi All,
>
> Here is some specific advice for existing repository administrators from
> Google Scholar:
> http://roar.eprints.org/help/google_scholar.html
>
> As far as I'm aware there isn't anyone running EPrints 2 now, so
> EPrints-based repositories are already (and have been for a long time) the
> "best in class" for Google Scholar.
>
> Right, this paper ...
>
> Table 1 is irrelevant and misleading. Scholar links first to the publisher
> and, only if there is no publisher link, directly to the IR version. That's a
> policy decision on the part of Scholar and nothing to do with IRs.
>
> Table 2 gives us some useful data. The headline rate for EPrints is 88%
> (based on CalTech). Unfortunately the authors haven't provided an analysis of
> what happened to the missing records. I've done a quick random sample of
> CalTech and I suspect the missing records will consist of:
> 1) Non-OA/non-full-text records (I'm sure a query to the CalTech repository
> admin could supply the data).
> 2) A percentage of PDFs that Scholar won't be able to parse. CalTech contains
> some old (1950s), scanned PDFs from Journals. Where the article isn't at the
> top of the page Scholar will struggl