Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine inclusion implications

Tim Donohue Thu, 18 Jun 2015 10:14:09 -0700

Hi Hilton,

First off, thanks for the feedback.  A few comments inline based on your 
thoughts and observations.

On 6/18/2015 11:52 AM, Hilton Gibson wrote:
> The use case is very important. People download PDFs and do not remember
> where they came from. More importantly the PDF itself usually has no
> permanent identifiers to help with citations.

I definitely can understand the use case, and it is the same one I've 
heard from other users as well.

But, at the same time, we do need to find a balance with *preserving* 
the original PDF (which is the counter use case). Dynamically altering a 
PDF may never be entirely error-free, so there's the risk that 
inserting these cover pages could cause issues with the downloaded PDF 
itself.

Google Scholar has made it clear that they are much more interested in 
the original PDF than any locally modified version.

> Why is Google concentrating on extracting metadata from the PDF files?
> The PDF format is not standard to start with, secondly DSpace already
> does this. So just expose the metadata DSpace extracts to Google.

Google does also grab the metadata from the repository itself. But, from 
my understanding, Google has found that the repository metadata is often 
either incomplete or wrong (not just in DSpace but everywhere). There 
may be spelling errors in the metadata, authors missing (some 
institutions only enter metadata for authors *at* their institution), 
incorrect dates of publication, or other important metadata fields which 
are just missing.
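
As a rough illustration of what "exposing the repository metadata" can 
look like, here is a minimal sketch that renders the citation_* meta 
tags described in Google Scholar's inclusion guidelines for an item 
page. The function name and the sample values are hypothetical, and this 
is not DSpace's own code; it only shows the general shape of the tags.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Render citation_* <meta> tags for one item (hypothetical helper)."""
    # One tag per author, in the order the authors appear on the paper.
    tags = [("citation_title", title)]
    tags += [("citation_author", author) for author in authors]
    tags.append(("citation_publication_date", publication_date))
    tags.append(("citation_pdf_url", pdf_url))
    # Escape values so characters such as & or " cannot break the HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Sample values are made up for illustration only.
print(scholar_meta_tags(
    "Cover Pages & Indexing",
    ["Doe, Jane", "Roe, Richard"],
    "2015/06/18",
    "https://repository.example.org/bitstream/1234/1/article.pdf",
))
```

If an author is left out of the item record, the corresponding 
citation_author tag is simply absent, which is the kind of gap 
described above.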

So, Google's practice has been to also extract this information from the 
PDF itself as an additional source and to try to resolve discrepancies 
between multiple sites. For example, multiple repositories may include 
the same PDF article, but Google Scholar wants to only list it once in 
their results page, providing multiple links to where that same article 
can be downloaded.

From my understanding Google Scholar has figured out ways to extract 
this metadata based on the structure of a "normal" scholarly 
document/article (which often includes a title, author, abstract and 
even dates/citation information all on the first page). Therefore, the 
addition of custom cover pages can throw things off, and may cause 
Google Scholar to no longer be able to verify the reported metadata or 
resolve discrepancies. This in turn can sometimes cause the item to not 
be indexed by Google Scholar.

But, please understand, I obviously don't know exactly how Google 
performs all these metadata extraction activities. This is just based on 
what I've heard from Anurag (co-creator of Google Scholar).

> Perhaps add a warning about Google and suggest putting the "cover page"
> at the back of the PDF.
> See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/PDF_Cover_Page/5.X
> for our config.

This might be an option for institutions who really want to have some 
sort of repository-based metadata added to their PDFs. Moving the "cover 
page" to the last page might be a possible compromise here.
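
To make the compromise concrete, here is a minimal sketch of the page 
reordering involved, assuming the generated cover page currently sits at 
the front of the file. It only computes the new page order; the function 
name is hypothetical, and actually rewriting the PDF would need whatever 
PDF library the repository already uses.

```python
def move_cover_to_back(page_count, cover_pages=1):
    """Return 0-based page indices with the leading cover page(s) last.

    Hypothetical helper: it assumes the first `cover_pages` pages are
    the generated cover and the rest is the original document.
    """
    if not 0 <= cover_pages <= page_count:
        raise ValueError("cover_pages must be between 0 and page_count")
    original = list(range(cover_pages, page_count))  # the author's pages
    cover = list(range(cover_pages))                 # the generated cover
    return original + cover


# A 4-page file whose first page is the cover: the original first page
# of the article becomes page 1 again, and the cover moves to the end.
print(move_cover_to_back(4))
```

With this ordering the title, authors and abstract are back on the 
first page, which is where the extraction described above looks for them.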

- Tim
