Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine inclusion implications

Tim Donohue Fri, 19 Jun 2015 06:34:24 -0700

Hi All,

Anurag from Google Scholar is tracking this ongoing discussion. He's 
asked me to forward on this response from him.

-----------------------------------------------------------------------------------------------------------------------------------------------

Hi: I had reached out to Tim about DSpace support for PDF cover pages. I
would like to thank him for initiating this discussion.

As I had mentioned in my talk, cover pages impact the effectiveness of
all downstream automated systems that analyze PDF articles. This
includes search services as well as widely used personal collection
managers such as Zotero, Mendeley, and Papers.

Given the diversity of layouts and structure, automated analysis &
metadata extraction from PDF articles is always a challenge. This is
even more of a challenge if you consider that indexing systems need to
run their algorithms over not just documents in specific repositories
but all documents on the web.

Handling PDF articles with cover pages is far more error-prone than
handling the original PDF. What happens to work today may stop working
as algorithms are updated to handle the expanding diversity of
layouts. We have seen this happen for several repositories :(

The concern most frequently mentioned as a reason for keeping cover
pages is that the original PDF may not have sufficient/suitable
information about where/how it was published, which would keep it from
being cited/referred to. However, pretty much no one tracks or cites
articles or references by looking at the article by hand. For new
articles, researchers use referencing tools like EndNote, BibTeX, etc. and
save structured bibliographic info from the article source. Publisher
sites, repositories, A&Is, search services, all provide structured
references in multiple formats. For older collections, the common
approach, by far, is to use collection managers like Zotero, Mendeley,
and Papers. All of these depend on being able to analyze the PDF to
automate the management. Making the original PDF available as-is, with
no changes, helps collection managers in the same way as it helps
search services.

Based on my experience, I would recommend disabling automated cover
pages, whether front or back. The potential upside is no longer
significant for most researchers. The potential downside, however, is
pretty large: it impacts all downstream automated systems that make the
research process easier.

I would like to thank the community for your consideration.

cheers,
anurag

------------------------------------------------------------------------------
_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general
