On the Deep Disanalogy Between Text and Software and Between Text and Data Insofar as Free/Open Access is Concerned
Stevan Harnad

It would be a *great* conceptual and strategic mistake for the movement dedicated to open access to peer-reviewed research (BOAI) http://www.soros.org/openaccess/ to conflate its sense of "free vs. open" with the sense of "free vs. open" as it is used in the free/open-source software movements. The two senses are not at all the same, and importing the software movements' distinction just adds to the still widespread confusion and misunderstanding in the research community about toll-free access.

I will try to state it in the simplest and most direct terms possible:

Software is code that you use to *do* things. It may not be enough to let you use the code for free to do things, because one of the things you may want to do is to modify the code so it will do *other* things. Hence you may need not only free use of the code, but the code itself has to be open, so you can see and modify it.

There is simply *no counterpart* to this in peer-reviewed research article use. None. Researchers, in using one another's articles, are using and re-using the *content* (what the articles are reporting), and not the *code* (i.e., the actual words in the text). Yes, they read the text. Yes (within limits) they may quote it. Yes, it is helpful to be able to navigate the code by character-string and boolean searching. But what researchers are fundamentally *not* doing in writing their own articles (which build on the articles they have read) is anything faintly analogous to modifying the code for the original article!

I hope that that is now transparent, having been pointed out and written in longhand like this.

So if it is obvious that what researchers do with the articles they read is not to modify the text in order to generate a new text, as programmers may modify a program to generate a new program, then where on earth did this open/free source/access conflation come from?

And there is a second conflation inherent in it, namely, a conflation between research publishing (i.e., peer-reviewed journal articles) and public data-archiving (scientific and scholarly databases consisting of the raw and processed data on which the research reports are based).

Digital data archiving (e.g., the various genome databases, astrophysical databases, etc.) is relatively new, and it is a powerful *supplement* to peer-reviewed article publishing. In general, the data are not *in* the published article, they are *associated with* it. In paper days, there was not the page-quota or the money to publish all the data. And even in digital days, there is no standardized practice yet of making the raw data as public as the research findings themselves; but there is definite movement in that direction, because of its obvious power and utility.

The point, however, is this: As of today, articles and data are not the same thing. The 2,000,000 new articles appearing every year in the planet's 20,000 peer-reviewed journals (the full-text literature that -- as we apparently cannot remind ourselves often enough -- the open/free access movement is dedicated to freeing from access-tolls) consist of articles only, *not* the research data on which the articles are based.

Hence, today, the access problem concerns toll-access to the full-texts of 2,000,000 articles published yearly, not access to the data on which they are based (most of which are not yet archived online, let alone published; and, when they *are* archived online, they are often already publicly accessible toll-free!).

No doubt research practices will evolve toward making all data accessible to would-be users, along with the articles reporting the research findings. This is quite natural, and in line with researchers' desire to maximize the use and hence the impact of their research.

What may happen is that journals will eventually include some or all of the underlying data as part of the peer-reviewed publication itself (there may even be "peer-reviewed data"), but in an online digital supplement only, rather than in the paper edition.

(What is *dead-certain*, though, is that, as this happens, authors will not be idiotic enough to sign over copyright for their research data to their publishers, the same way they have been signing over copyright for the texts of their research reports! So let's not even waste time on that implausible hypothetical contingency. The research community may be slow off the mark in reaching for the free access that is already within its grasp in the online era, but they have not altogether taken leave of their senses!)

But that bridge (digital data supplements), if it ever comes, can be crossed if/when we get to it. Right now, when we are talking about the peer-reviewed literature to which we are trying to free access, we are talking about *articles* and not about *data*.

Hence, exactly as in the conflation of text with software in the invalid and misleading open/free source analogy, the conflation of open/free full-text access to the refereed literature with hypothetical questions about data-access and data re-use and re-analysis capability is likewise invalid and misleading.

Article-access and data-access are different, and it is only the first that is at issue today. Open/free access -- (in this flurry of definitional fussiness and fancy one no longer knows which word to use!) -- to the refereed research literature is already vastly overdue, even though it has been 100% within our practical reach for several years now.
http://cogprints.soton.ac.uk/documents/disk0/00/00/16/85/index.html

Research usage and impact and productivity are still being needlessly lost daily, in untold quantities, because of access-denial by toll-barriers.

Why on earth do we keep wasting our time, energy and attention on minor diversions and irrelevancies, while keeping the solution to the real, pressing problem on hold, as we ponder the ramifications of incoherent analogies with software and with data-archiving, when there is a real job to be done: freeing (sic) full-text access to the planet's yearly 2,000,000 peer-reviewed research articles, now!
http://www.nature.com/nature/debates/e-access/Articles/harnad.html

I will now quote/comment this latest variant of that Protean microbe that keeps on infecting us with Zeno's Paralysis in our progress along the road to the optimal and inevitable. In the past, the source of this persistent virus and its ever-mutating variants had been the adversaries of free access (some toll-access publishers), as well as its over-timorous potential beneficiaries (researchers, librarians, administrators).
http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm#8

But now the paralysis-inducing bug is also originating from the ranks of free-access activists, who risk balkanizing the free-access movement by driving an ideological wedge between "free" and "open," despite the fact that nothing substantive is to be gained, and only more time to be lost thereby.

I will pass to quote/comment mode to illustrate this:

On Thu, 14 Aug 2003, Matthew Cockerill wrote:

> The open source software community [uses] the shorthand 'free, as in beer'

The open/free distinction in software is based on the modifiability of the code. This is irrelevant to refereed-article full-text. (And the beer analogy was silly and uninformative in both cases! Lots of laughs, but little light cast.)

> Sure, if you are given some limited access to something and that access is
> 'free, as in beer', that can be very useful.
> In the world of software, say, that would apply to Windows Media Player,
> which you can download for free from the Microsoft website (even though the
> software itself is highly proprietary, and Microsoft would not take kindly
> to you reverse-engineering it or distributing a modified version).

This is all irrelevant to article-access, except that toll-access publishers can, like every other product- or service-provider, use partial or temporary give-aways as a marketing "hook." Temporary access is not free access (or rather it is free access only while it is free). And partial access is free only for whatever it is access to, not for what it is not access to. (We're all "non-smokers" while we are asleep...)

But none of this provides any basis at all for the analogy with proprietary code, as in software, nor with any need for code modifiability, whatsoever.

> But free/open source software is more than 'free as in beer', it is 'free as
> in speech', and this offers hugely significant extra freedoms (which is why
> open source software has had such a revolutionary effect on the software
> industry).

This free-beer/free-speech analogy was already dubious in the software case (not all programmers wish to give away their code [the freedom to produce non-give-away products/services is a freedom too!], either for use or for modification, or both; and my speech, whether spoken or written, is spoken/written for you to hear/read, not for you to alter or to claim to have been your own words, whether in unaltered or altered form; and we are free to say or write what we like, as long as it is indeed our own words and ideas [some of this enforceable by law, most of it only enforceable by social convention -- these days with some help from technology], etc., etc.). But never mind.
We will not try to repair another domain's incoherent analogy here; but, please, let us not import it where it just sows still more confusion in an already confused terrain:

Refereed-research-article authors (unlike the authors of most other forms of "written speech") are not interested in earning access-royalties from the sale or use of their words. They just want their words *used,* as much as possible. (That's "research impact.") But to use their words is not to modify their *form* (the code) and then re-issue them, perhaps as the modifier's own. To use their words is to use their *content*, by incorporating that content into the user's own content, in his *own* words, with proper source attribution, so as to produce another text, another "written speech."

It would be nice if all programmers were willing and motivated to make all their code free, not just for use, but for modification too. It would also be nice if the writers of all words were willing and motivated to make their words free, not just for use, but for modification too. But alas humans and their egos and their selfish genes are monadic, not distributed and diffuse, and their motivation is usually local, and quid pro quo.

So there will always be programmers who program only if it pays by the unit-sold, and they may want the credit as well as the first-dibs at modification and development. Nolo contendere there.

But the same is true of writers. Some will always want to be paid for access to their words by the unit-sold, and virtually all will want to keep their own words as their own alone.
http://cogprints.ecs.soton.ac.uk/archive/00001700/index.html

Refereed-article writers, however, don't want to be paid for access to their words, any which way, because access-tolls reduce the usage of their ideas and findings, and usage is what they really want to maximize (because that research impact is what brings them their rewards, both financial and scholarly/scientific).

Because the words are in natural language, there is no question of researchers concealing their code (if they choose to publish at all). But what they want you freely using is its *content* (with proper attribution). There is no question of your modifying its form. As software does not have this form/content duality, the analogy simply does not apply; it is incoherent.

> The Free Software Foundation defines these freedoms as:
> * The freedom to run the program, for any purpose (freedom 0).

Inapplicable to text: "Running the program" is accessing the text.

> * The freedom to study how the program works, and adapt it to your needs
> (freedom 1). Access to the source code is a precondition for this.

Irrelevant to text. You may study and use the *content* of my (giveaway, refereed-article) text (with attribution) in any way you like, and you may quote it (with attribution). That's all. And there all analogy between text and software ends.

There are also many new software-based uses (indexing, search, navigation, digitometric analyses) that one can make of online text, which refereed-article authors also welcome, but the big hurdle is free full-text access, and not these perks, which will come with the territory. But no reprocessing of *my* text code in order to turn it into *your* text code (other than via its content, as processed by your brain)! (And remember that data, and data-processing, are not part of refereed-article text.)

> * The freedom to redistribute copies so you can help your neighbor
> (freedom 2).

Moot for text, when all you need redistribute is the URL of its toll-free full-text online.

> * The freedom to improve the program, and release your improvements to the
> public, so that the whole community benefits (freedom 3). Access to the
> source code is a precondition for this.
> (see http://www.gnu.org/philosophy/free-sw.html )

Irrelevant to refereed-article text. You may only improve on the content, in text of your own, with proper attribution.
(And again, data re-analysis is an orthogonal matter.) Only *I* can improve on my own text.

> This philosophy fits exceptionally well with the needs of the scientific
> community to share and build on each others research, which is why very many
> academic software development projects are developed using an open source
> model.

Scientific *software*. But we were talking about scientific-article *text*, and this was supposed to be an analogy! There is no counterpart to collective software development at the article-code level. It is only *content* that the scientific community develops collectively -- and even that, only while faithfully tracking sources through citation (and quotation, where verbatim text is used).

Nor did the collective, cumulative use of scientific content require any cues from the software community! Open-source *content* has been the rule with scholarship for centuries: That's why scholars *publish*. The new question is only about online-access to their content (via their text). Please let's not forget or obscure that fundamental new question in this welter of free-associative digital analogies of doubtful relevance and coherence.

> BioMed Central's policy of Open Access is based on giving the scientific
> community a similarly broad freedom to make use of the research articles
> that we publish.

The scientific community already has the freedom to make use of published articles. What it lacks is toll-free access to their texts!

> This includes giving access to the structured form of the articles,

We're back to XML mark-up again: a perk, a welcome perk, but we first, and far more urgently, need the basics, namely, toll-free access to the full-text. Please let us focus on that, rather than getting side-tracked onto perks, especially those that make it seem as if free access were somehow not enough, somehow not "truly open." We do not have free access today.
We don't need advice on the shortcomings of free access; we need help in getting free access, as soon as possible.

> and giving the right to redistribute and create derivative works
> from the articles.

I've already replied to this in an earlier posting: When the full-text is online and toll-free, the only relevant mode of "redistribution" is to distribute the URL. Ditto for "derivative works." Quotes, as always, require attribution. And text without attribution may be neither "re-used" nor modified. So what is really the point here?

> This isn't just a philosophical issue - it has practical implications:
>
> e.g. in the August 14 issue of Nature (Vol 424 p727), Donat Agosti, from the
> American Museum of Natural History, New York, laments the fact that the
> www.antbase.org database of ant taxonomy is missing much critical
> information because a large fraction of all descriptions of new ant species
> are covered by publisher copyright.

I couldn't follow this. If the database is toll-free, the database is toll-free. If making the database useful requires toll-free access to the full-text of refereed-articles, then the full-text of refereed-articles needs to be made accessible toll-free! We knew that already! What is the point of all these further free-associations and free-floating analogies? We are running in circles instead of breaking out of the circle.

> In a true Open Access environment, not only could Antbase link to the
> articles on the publishers web site, but it could also make use the images
> and the text within those published descriptions to compile a universal and
> authoritative catalog of Ant taxonomy.

Translation: We need free access not only to the database, but to the full-text. This can be clearly seen without conflating the two.

(Please jettison this "true open access" locution, or save it for when we at last have universal false-but-toll-free full-text access, and we have nothing more urgent left to do than to optimize it further.
My guess is that the rest will already have come with the territory of its own accord. But please, let's go for the territory, before the "truth" [see Keats quote at the end of this posting]).

> Finally, to respond to Sally's point questioning the benefits of
> deposition in a standard repository:

I re-read Sally Morris's point, and I now see that (in agreeing on #5) I misconstrued it as addressing only the trivial differences between the types of "databases" -- "archives," "repositories": how we unfailingly prefer to fuss with and multiply terminological trivia instead of staying focussed on matters of substance! -- in which a full-text might be deposited (e.g., Eprints vs Dspace, or central vs. institutional).

I now realize that Sally was referring there to BioMedCentral's (BMC's) [requirement? recommendation?] that BMC authors self-archive their BMC full-texts in an open-access database such as PubMed Central. Hence what my reply to Sally should have been was this:

>sm> 5) Whether the item and/or its metadata are deposited in certain
>sm> types of databases (this last seems to me supremely irrelevant)

I agree it's irrelevant, if by "certain type" you mean, say, Eprints vs. Dspace.
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2670.html

But it's certainly not irrelevant whether the item (full-text) is deposited in *some* type of database *at all*, for if it is not deposited in a free-access database of *some* type, it is not free access! Whether that database type is institutional and distributed, disciplinary and central, or the toll-free access database of an open-access or a toll-access publisher is an implementational and strategic matter. And whether or not that database is OAI-compliant is a matter of functionality and efficiency (interoperable OAI-compliant databases greatly preferred!).

> Although theoretically it might not matter where something is available, or
> in what format, it should be clear that in practical terms these are
> absolutely vital issues.

Absolutely vital *relative to what*? In practical terms, we do not have free full-text online access to most of the refereed literature (2,000,000 annual articles, in 20,000 refereed journals) today. What is absolutely vital is getting that free access, now, and putting an end at last to the needless daily impact-loss that continues until that happens.

Whether that free access is via this type of archive or that, and has or lacks these perks or those, is certainly not the absolutely vital issue today. On the contrary, foregrounding such minor details when we still lack the basics, and thereby raising the goal post for what we should all be aiming for, slows and diverts rather than speeds progress.

Free access, now! Never mind the rest until we have those long-overdue basics in hand, at last!

> So for example, theoretically, every DNA sequencing
> lab could put up its own web page and make available the sequences they
> themselves have obtained, using their own choice of format. The scientific
> community would thereby have free access to all those DNA sequences.

Correct. And this has absolutely *nothing* to do with the free-access movement, which is about toll-free access to the 2M articles in the 20K toll-access journals, not about data-archiving, which is a parallel but independent development that proceeds apace, and does not need free-access's (or publishers') permission! (Data-archiving, on the other hand, might help accelerate article-archiving!)
http://www.ecs.soton.ac.uk/~harnad/Temp/data-archiving.htm

> But in
> fact, the deposition of all DNA sequences in a standard format with Genbank
> has a truly enormous benefit in practical terms, and has served as a crucial
> foundation for the development of tools to mine the genome.
> PubMed Central's
> role as a repository for biomedical research articles is very much
> analogous to Genbank's role as a repository for DNA sequence data.

An archive is an archive. There is an analogy (as well as a complementarity) between data-archives and article-archives, but the big difference is that both data archiving and data-archives are (1) new, and (2) do not have a prior tradition and current status quo of being non-free, whereas articles are (1) old, and (2) do have a prior tradition and current status quo of being non-free. Publishers' relatively new toll-based online article-archives are also non-free.

So the relevant point about article archiving is that article-archives should be free.

"that is all ye know on earth, and all ye need to know"

Stevan Harnad

NOTE: A complete archive of the ongoing discussion of providing open access to the peer-reviewed research literature online is available at the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
or
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/index.html

Discussion can be posted to: american-scientist-open-access-fo...@amsci.org