Hi Hilton,

First off, thanks for the feedback. A few comments inline based on your thoughts and observations.

On 6/18/2015 11:52 AM, Hilton Gibson wrote:
> The use case is very important. People download PDFs and do not remember
> where they came from. More importantly, the PDF itself usually has no
> permanent identifiers to help with citations.

I definitely understand the use case, and it is the same one I've heard from other users as well. But, at the same time, we do need to find a balance with *preserving* the original PDF (which is the counter use case). Dynamically altering a PDF may never be entirely error-free, so there's a risk that inserting these cover pages could cause issues with the downloaded PDF itself. Google Scholar has made it clear that they are much more interested in the original PDF than in any locally modified version.

> Why is Google concentrating on extracting metadata from the PDF files?
> The PDF format is not standard to start with; secondly, DSpace already
> does this. So just expose the metadata DSpace extracts to Google.

Google does also grab the metadata from the repository itself. But, from my understanding, Google has found that the repository metadata is often either incomplete or wrong (not just in DSpace but everywhere). There may be spelling errors in the metadata, missing authors (some institutions only enter metadata for authors *at* their institution), incorrect dates of publication, or other important metadata fields that are simply missing.

So, Google's practice has been to also extract this information from the PDF itself as an additional source, and to try to resolve discrepancies between multiple sites. For example, multiple repositories may include the same PDF article, but Google Scholar wants to list it only once in its results page, providing multiple links to where that same article can be downloaded.
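
To make "the metadata from the repository itself" a bit more concrete: Google Scholar's inclusion guidelines ask for bibliographic metadata as `citation_*` meta tags in the HTML head of the item page. The snippet below is only a minimal sketch of what such tags look like. It is not DSpace's actual code, and the function name, sample values, and example.org URL are all made up for illustration.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Build citation_* meta tags for an item page (hypothetical helper)."""
    pairs = [("citation_title", title)]
    # One citation_author tag per author, in the order given on the paper.
    pairs += [("citation_author", author) for author in authors]
    pairs.append(("citation_publication_date", publication_date))
    pairs.append(("citation_pdf_url", pdf_url))
    # Escape values so quotes or ampersands cannot break the attribute.
    return [
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in pairs
    ]


if __name__ == "__main__":
    # Sample values are invented purely for illustration.
    for tag in scholar_meta_tags(
        title="An Example Article",
        authors=["Doe, Jane", "Roe, Richard"],
        publication_date="2015/06/18",
        pdf_url="https://repository.example.org/bitstream/1/article.pdf",
    ):
        print(tag)
```

If an author is left out of the list passed in here, the tags are silently incomplete, which is the kind of gap Google tries to catch by also reading the PDF.
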

From my understanding, Google Scholar has figured out ways to extract this metadata based on the structure of a "normal" scholarly document/article (which often includes a title, author, abstract and even dates/citation information all on the first page). Therefore, the addition of custom cover pages can throw things off, and may cause Google Scholar to no longer be able to verify the reported metadata or resolve discrepancies. This in turn can sometimes cause the item to not be indexed by Google Scholar.

But, please understand, I obviously don't know exactly how Google performs all these metadata extraction activities. This is just based on what I've heard from Anurag (co-creator of Google Scholar).

> Perhaps add a warning about Google and suggest putting the "cover page"
> at the back of the PDF.
> See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/PDF_Cover_Page/5.X
> for our config.

This might be an option for institutions who really want to have some sort of repository-based metadata added to their PDFs. Moving the "cover page" to the last page might be a possible compromise here.

- Tim

_______________________________________________
Dspace-general mailing list
https://lists.sourceforge.net/lists/listinfo/dspace-general
