Sorry for another intrusion from the trenches.  Our full text processing 
guidelines include entering basic bibliographic data (author, title, date, 
subject or keyword; no abstract) into the PDF's internal metadata.  My opinion 
is that maintaining this metadata is part of the repository staff workflow: even 
if the author provided some metadata, it should be checked and edited for 
consistency.

Susan
__________________________
Susan Matveyeva, PhD, MLIS, B.Mus
Associate Professor, Catalog & 
Institutional Repository Librarian
Wichita State University Libraries
1845 Fairmount, Wichita, KS 67260-0068

Office: (316) 978-5139
Fax: (316) 978-3496
[email protected]
http://soar.wichita.edu



-----Original Message-----
From: Mark H. Wood [mailto:[email protected]] 
Sent: 19 June 2015 10:30
To: [email protected]
Subject: Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine 
inclusion implications

On Fri, Jun 19, 2015 at 07:52:40AM -0700, Mark Diggory wrote:
> Putting all this "tail wagging the dog" aside. I think it would be 
> very good to get the appropriate "metadata" added to the PDF.
> 
> I wanted to contribute that we recently had a "non-coverpage" case 
> where the title of a paper was correct in the first page of the pdf 
> and in the DSpace metadata, but the PDF had the incorrect title in its 
> internal metadata. This caused Google Scholar to show the incorrect 
> title in its search results, which caused much confusion for the owner of 
> that document.
> Changing the metadata resulted in the GS record changing. From this 
> point, it is clear that GS is leaning heavily on PDF internal metadata 
> as its primary source for its records.
> 
> I think that if the appropriate metadata were populated in the pdf 
> process, that it would take precedence over the cover page in GS.

Hear, hear.  Having correct, complete machine-readable metadata in the document 
itself is a Good Thing.

Researcher:  if you do this yourself, it's in your interest to ensure that you 
do it well.  If you have an assistant to take care of such things, it's in your 
interest to ensure that your assistant knows how to do it well.  If you depend 
on Google Scholar or something like it, you (all) get out of it what you (all) 
put into it.

The notion of a repository doing this automatically, whether machine-readably 
or by generated cover pages, leads to some interesting corner cases.  If the 
title page, repo. metadata, and document metadata disagree, which one is 
correct?  If the document contains poor-quality metadata, but it does contain 
them, then should the repo. *replace* them with corrected values?  On the other 
end of the ingestion process, what if we *extract* metadata from the document 
and then have to correct them?  Do we fix the document?  And regardless of how 
much we trust our own process, will search engines trust our repo., the 
document metadata, or their own heuristic fishing in the first page?
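
To make the first of those corner cases concrete, here is a minimal sketch of 
flagging a disagreement between the three title sources.  It is not anything 
DSpace does today; the names (TitleSources, find_disagreements) are made up 
for illustration, and treating the repository record as the reference is only 
an assumption, since which source should win is exactly the open question.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def normalize(title: Optional[str]) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join((title or "").lower().split())


@dataclass
class TitleSources:
    """Hypothetical container for the three places a title can live."""
    title_page: Optional[str]     # title as read from the first page
    repo_metadata: Optional[str]  # title in the repository record
    pdf_metadata: Optional[str]   # title in the PDF's internal metadata


def find_disagreements(src: TitleSources) -> Dict[str, Optional[str]]:
    """Return the sources whose title differs from the repository record.

    ASSUMPTION: the repository record is used as the reference value.
    That is a policy choice made for this sketch, not a settled answer.
    """
    reference = normalize(src.repo_metadata)
    candidates = {
        "title_page": src.title_page,
        "pdf_metadata": src.pdf_metadata,
    }
    return {
        name: value
        for name, value in candidates.items()
        if normalize(value) != reference
    }
```

A check like this only reports the mismatch for a person to review; it does 
not decide which value is correct or whether the document should be rewritten.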

To gather some ideas, we might want to see what commercial publishers do about 
these issues.  (Oh, boy: what if an academic repo. and a publisher make 
*different* adjustments to document metadata?  Can we get repo.s, publishers, 
and researchers to agree on priorities and a process for polishing and 
harmonizing document metadata?)

I think that, in the end, all parties want "the best we can reasonably do."  
But how do we get there?

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
