Min-Yen KAN writes:
> we are currently trying to estimate the amount of data that
> will eventually flow into the institutional archive.
"archive"? "eventually"? As Stevan Harnad has noted, the volume of data for preprints, postprints, and theses is likely to be minuscule, even if you are successful in getting widespread adoption by your faculty. At most institutions the average annual peer-reviewed production [per author] is probably between 1 and 5 papers. Let's say you got total buy-in from a faculty of 1000, that all of your faculty submit 10 papers per year, and that each paper is a PDF document 400KB. That's 10K items and 4GB/year max, which is tiny in terms of disk space. Note that a plausible first-order model of growth is initial exponential expansion settling down to linear growth when the service matures. That would be great, because the cost of technology will decline exponentially (Moore's law). Note that although the cost of storage and server is minimal, the cost of archival is potentially very large. If you agree with Stevan you don't care much about long-term access. We don't agree, and hence have to budget for preservation, which for us means regularly scheduled (every 5 years or so) collection surveys and remediation through format conversion (e.g. from PDF version 27 to PDF version 57, or HTML to XML, or GIF to PNG, or ...). That's expensive. Achieving faculty buy-in and self-archiving is expensive too; we don't think the top-down approach (provost mandates) is likely to be successful in most places, so you should budget a lot for marketing and hand-holding. A huge issue in planning an institutional repository, though, is that you are unlikely to collect just preprints, postprints, and printable theses. A natural extension of a preprint goal would be to collect supporting materials for those preprints. Such supporting materials may be very large; it's not too unusual to have a multi-TB dataset in some fields such as astronomy or biology. It only takes one such large dataset to completely blow away any space calculations based only on collecting the paper-publishable text. Even if you are collecting just preprints and theses, the size estimates depend on how you are handling acquisition of multimedia materials; if you collect theses in dance you might have videos of performances, each of which is several GB. Conclusion: space needs depend sensitively on the details of your submission/collection policies, and on the behavior and needs of a tiny fraction of your faculty clientele. [aside: we believe that if we DON'T collect such unprintable items we'll never get faculty buy-in for Stevan's laudable goal of collecting the printable peer-reviewed works] We put quite a bit of effort into estimating the expected rate of growth of our institutional repository, and eventually gave up. We took a very pragmatic approach and sized our initial server based on hardware we happened to have available (about 25GB at the moment), with the expectation that we will radically increase the disk space (probably into the low-TB range) over the next 1 to 3 years if the service catches on. However, our IR goals and policies are quite different in detail from Stevan's and probably yours, so the one thing guaranteed is that your mileage will vary. JQ Johnson Office: 115F Knight Library Academic Education Coordinator e-mail: j...@darkwing.uoregon.edu 1299 University of Oregon 1-541-346-1746 (v); -3485 (fax) Eugene, OR 97403-1299 http://darkwing.uoregon.edu/~jqj