I welcome the Wellcome stance on OA archiving and, like Stevan, believe that the issue at stake is one of strategy. After all, since the formation of the Open Archives Initiative in 1999, repositories have been built with an eye to interoperability, because it is recognised that they operate in a larger context than any single repository can accommodate. Even Robert's remarks on subject-based science can seem parochial in a world of increasing inter- and multi-disciplinarity! From a pragmatic point of view, I know that my repository can put procedures in place to harvest Southampton University papers and metadata from PMC so that they appear in our IR's record of our institution's research (appropriately slaved to the PMC versions to avoid wanton version proliferation).
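As a rough illustration of the harvesting step mentioned above, here is a minimal sketch of an OAI-PMH harvest that keeps only a link back to the central copy rather than duplicating the file. The base URL, the set name "soton" and the sample response are all hypothetical; the verb, the oai_dc metadata prefix and the namespaces are the standard OAI-PMH ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH and Dublin Core namespaces.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, shaped like an OAI-PMH ListRecords reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
          <dc:identifier>https://central.example.org/articles/123</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec      # set names are defined by each repository
    if from_date:
        params["from"] = from_date    # incremental harvest since this date
    return base_url + "?" + urlencode(params)


def parse_records(response_xml):
    """Return one dict per record: OAI identifier, title, and links to the source copy."""
    root = ET.fromstring(response_xml)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier",
                                   default="", namespaces=OAI_NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=OAI_NS),
            # The IR record points at the central version instead of copying it.
            "links": [e.text for e in rec.iterfind(".//dc:identifier", OAI_NS)],
        })
    return records
```

A real harvester would also need to fetch the URL and follow resumption tokens; this sketch covers only request building and record parsing.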
But there are larger issues that Wellcome's position opens up, so bearing in mind that I agree with Robert on all but strategy, I'd like to explore some of his comments.

On 19 May 2005, at 18:36, Terry, Mr Robert wrote:
it is important to remember that the Trust operates globally supporting 4000 researchers in more than 40 countries - we need a repository that meets all our needs today and PMC offers that.
I think that this is the heart of the matter - Wellcome has its own specific requirements and it wants to control the software (and the repository ingest process and hence the information) to ensure that those requirements are met. As we shall see below, Wellcome's requirements fall into two categories: document preservation and data integration. I don't believe that "a separate funding body archive" is necessary to fulfil either of these requirements.
we want a long-term digital archive (i.e. not Word or PDF files but XML files) that will integrate the research literature with the data. We fund research from a scientific perspective, not its geographical location, and we want to ensure that when the literature is searched the search engine can go deeper than the metadata and provide links between, for example, genome sequence, chemical compounds or MRI scan images embedded in an article and databases such as PubChem and Genbank. It will move between the databases and PMC and vice versa - a Japanese or French team working on a gene but not publishing in English will be able to discover other research groups working on the same sequence. Teams working on drug compounds but investigating different uses will be able to discover who else is working on that compound either by searching the literature or the database.
As I understand it, the PMC ingest process involves the translation of the submitted document into an XML-based format (with necessary rounds of manuscript checking and reviewing by authors). Of course this XML is only used internally - visitors to PMC see an HTML presentation generated from the XML sources. With regard to research data, it would appear that PMC takes data "as is" in whatever format the author supplies (sometimes XML, more often Word, tar.gz bundles, perl scripts etc). While I applaud the effort to have material stored in XML, a look at the current contents of PMC shows that many of the "Supplementary Materials" are provided in Word form. Even those that are provided in XML have no DTD, Schema or instructions on how they should be interpreted.

As far as Open Access is concerned, all of this functionality is available from any Institutional Repository (whether it's EPrints or DSpace). The storage of multiple formats of a document (text, HTML, XML, PDF, Word, RTF, LaTeX, PostScript, JPEG, MPEG) can be accommodated, as can the storage of supplementary research data. What the repositories lack is the editorial processes to support document translation - and Wellcome could easily provide that as a separate service to interoperate with any archive.

[[ Also as far as Open Access is concerned, can I just say that certainly PDF and probably Word are pretty much as good as "XML". What gets lost in some "long term archiving" discussions is that there is no such document format as "XML". It is a meta-language for defining document vocabularies. Even then (when one uses a specific DTD or Schema to enforce a particular grammar on your documents' structure) all you have is a well-formed but (literally) meaningless tree. What is required is a way of applying some sort of interpretation to this tree (e.g. a way of rendering the document onto a screen using CSS, or XSL stylesheets to convert it into HTML or XSL-FO) and it is there that the complexity starts to come in. XML, PDF and Word formats all rely on the existence of documented ways of interpretation, and on available software renderers. PDF and Word have these, from various organisations. We are not living in the 1980s any more, when formats were opaque and interoperability was a dirty word! ]]

XML is DEFINITELY a thousand times preferable to Word or PDF when you need to reuse (reformat, republish, repurpose) a document. I can imagine situations where it will be useful to do this in an OA context (e.g. representing papers for small, handheld devices), but that is providing an added value service on top of Plain Old Open Access.
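To make the "meaningless tree" point concrete, here is a minimal sketch in which the same parsed XML only becomes a readable document once a set of rendering rules is applied to it. The element names and both rule tables are invented for illustration and do not correspond to PMC's actual DTD or to any real stylesheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny document in a made-up vocabulary; the parser knows nothing about
# what "article", "para" or "gene" are supposed to mean.
DOC = ("<article><title>Gene X</title>"
       "<para>We sequenced <gene>BRCA2</gene>.</para></article>")

# Two hypothetical interpretations of the same vocabulary.
SCREEN_RULES = {"article": "div", "title": "h1", "para": "p", "gene": "em"}
HANDHELD_RULES = {"article": "div", "title": "h3", "para": "p", "gene": "b"}


def render(elem, rules):
    """Recursively turn an element into HTML using a tag-to-tag rule table."""
    tag = rules.get(elem.tag, "span")   # unknown elements fall back to <span>
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(render(child, rules))
        parts.append(escape(child.tail or ""))  # text following the child
    parts.append(f"</{tag}>")
    return "".join(parts)


tree = ET.fromstring(DOC)
screen_html = render(tree, SCREEN_RULES)
handheld_html = render(tree, HANDHELD_RULES)
```

The tree is identical in both cases; only the rule table differs, which is where the reuse value of XML (and its extra complexity) lies.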
PMC already offers this functionality and that's vital to enhance the potential that the Internet offers.
Please excuse my unfamiliarity with PMC - can you give an example of a PMC entry showing this integration (i.e. beyond listing supplementary materials)?
The life sciences have already moved beyond the need to read a Word document on a local website
I definitely agree with you! And more - it's not only the life sciences. It's all the experimental sciences. And engineering. And social sciences.
Institutional repositories may never offer the same degree of functionality until every single institution uses the same ingestion and storage system
You are thinking in terms of monolithic and centrally controlled software. In the web-based, distributed and interoperable environment in which we find ourselves, I could easily deposit my research articles inside my Plain Old Institutional Repository and my research data inside my Learned Society's Advanced Chemistry-Aware Repository, and have the scientific record seamlessly and automatically tied together. Document and data. Measurement, analysis and interpretation, all interoperable, all open for scrutiny and use.
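One way this tying-together could work is sketched below, assuming each repository exposes harvested metadata with an identifier and that data records carry Dublin Core style "relation" entries pointing at the article they support. The record shapes and all identifiers are hypothetical.

```python
from collections import defaultdict


def link_data_to_articles(articles, datasets):
    """Map each article identifier to the datasets that declare a relation to it."""
    known = {a["identifier"] for a in articles}
    links = defaultdict(list)
    for ds in datasets:
        for target in ds.get("relation", []):
            if target in known:           # ignore relations to articles held elsewhere
                links[target].append(ds["identifier"])
    return dict(links)


# Hypothetical records harvested from an institutional repository ...
articles = [
    {"identifier": "oai:ir.example.ac.uk:42", "title": "A crystal structure study"},
]
# ... and from a subject-specific data repository.
datasets = [
    {"identifier": "oai:data.example.org:7", "relation": ["oai:ir.example.ac.uk:42"]},
    {"identifier": "oai:data.example.org:8", "relation": ["oai:elsewhere.example.org:1"]},
]

linked = link_data_to_articles(articles, datasets)
```

Neither repository needs to run the same software for this to work; they only need to agree on how identifiers and relations are exposed.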
OAI only links the metadata to files that might be in Word or PDF which may be unreadable in the years to come.
There is indeed no constraint within OAI on the formats in which its items are to be provided. However, PDF documents could only become unreadable if all the public PDF specifications were systematically destroyed. (And no-one had bothered to create a translation program from PDF to the majority formats of the day.)

There is a lot of work being undertaken on these topics by various projects. The JISC-funded E-Bank project (http://www.ukoln.ac.uk/projects/ebank-uk/), in which UKOLN and Southampton are partners, is producing the kind of integration between data and document that you are describing, precisely for supplementing Institutional Repositories. In particular, the project is taking the view that the data format must be well-understood, and that it must be exposed to harvesters to allow chemistry-specific searching. The new JISC Digital Repositories programme will soon have a raft of related data-based repository work.

Despite my comments about PDF and Word, I agree with Robert that repositories should be managed with an understanding of preservation! Our repository has a cheap policy of "including at least one safe format" whereas Wellcome has a relatively expensive conversion process in place. In the end we disagree about which formats are, practically speaking, safe.

I applaud Wellcome for putting their money where their mouth is and providing a service. BUT, that service could easily be made to work within a network of institutional repositories. ALSO, the data integration could be made to work within a network of institutional repositories. So we're back to strategy, because there is no technical barrier against Wellcome's policy working with Institutions.

Finally, I hope that Robert will accept an invitation to visit the EBank project and to discuss the nature of scientific communication and the advantage that our respective repositories can offer scientists.

---
Dr Leslie Carr
Eprints Technical Director
EBank Project partner