Just a few musings that are interesting to think about.

DSpace can be described as having two parts: the database, which handles all the metadata, and the actual stored file objects such as PDF documents or image/audio files. The file objects occupy most of the space and are stored as files on a hierarchical file system. The files are conveniently split into directory hierarchies and numbered with DSpace identifiers.
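To make the directory split concrete, here is a minimal sketch. It assumes that the first three pairs of digits of a file's internal identifier name three nested directories, which is how the default asset store layout is usually described; check your own assetstore before relying on it. The function name, root path and identifier below are made up for illustration.

```python
from pathlib import PurePosixPath

# Assumed layout constants (not stated in this thread): three directory
# levels, each named after two digits of the file's internal identifier.
DIGITS_PER_LEVEL = 2
DIRECTORY_LEVELS = 3


def asset_path(asset_root: str, internal_id: str) -> PurePosixPath:
    """Map an internal identifier to its assumed location under asset_root."""
    needed = DIGITS_PER_LEVEL * DIRECTORY_LEVELS
    if len(internal_id) < needed:
        raise ValueError(f"identifier needs at least {needed} characters")
    path = PurePosixPath(asset_root)
    for level in range(DIRECTORY_LEVELS):
        start = level * DIGITS_PER_LEVEL
        # Each level is the next two-character slice of the identifier.
        path /= internal_id[start:start + DIGITS_PER_LEVEL]
    # The file itself is named after the full identifier.
    return path / internal_id


if __name__ == "__main__":
    # Hypothetical root directory and identifier, for illustration only.
    print(asset_path("/dspace/assetstore", "12345678901234567890"))
```

A mapping like this is what would let a backup job go from an identifier recorded in the database to the one file that needs copying, rather than walking the whole tree.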
So yes, data is stored external to the database, and that data might be roughly estimated to be 500 times greater in size than the database.

Cloud storage is, to me, highly suspicious: it is not under local administration and was recently confirmed to be sifted by various agencies. Nevertheless, cloud storage might provide high availability (in theory, and if you pay).

Fifty terabytes would have a whole set of interesting problems. If you handle the backup yourself by maintaining a second copy, then you would need to continually poll the data file system for changes and propagate those changes, perhaps using rsync and using the Postgres logs to locate the place in the file system that needs to be updated. Another approach is to duplicate the Postgres commands on two systems, but with this method the two systems will not be identical; for example, they will have different handle allocations.

Here at USyd we do rsync updates daily, but updates are infrequent and our system is only a terabyte. The system runs on a virtual machine, which in theory provides high availability. Central IT are unable to do backups of the database, so we use rsync onto duplicate machines (DEV, UAT and DR) for both the database and the file system. The best scenario would initiate rsync whenever a file system change was made. If the database is very active, then keeping the backup database and data file system synchronised will give you a number of interesting software projects.

I can't comment about NoSQL. I think Hadoop is an Apache flavour which DSpace is likely to support. Linux's LVS might be another possibility.

On Fri, 2013-08-23 at 22:21 +0000, Charles Keagle wrote:
> We are looking at building a 50TB DSpace Repository in AWS and are new
> to DSpace. At this scale, it does not look like Amazon Relational
> Database Service can meet the 50TB requirement. RDS has a 3 TB
> maximum size limitation. Some questions we have come up with will
> help us decide how to go about this task:
>
> 1. Can DSpace store data content external to the database? Amazon S3
>    is a good place to store the data, but it is not a database, it is
>    object storage. The database can then store a pointer to that
>    external data. A URL in DSpace would be a good way to access S3
>    data. Comparing S3 (Simple Storage Service) and EBS (Elastic Block
>    Storage) costs for 50TB makes S3 look very attractive.
>
> 2. What are the High Availability solutions for DSpace?
>
> 3. Is there a replication mechanism in DSpace for High Availability
>    if we store in Amazon Ephemeral Storage, which is not persistent?
>    This replication would synchronize the database in multiple Amazon
>    Availability Zones in the same Region. This is another, much less
>    costly alternative to EBS. Not all that reliable though: when an
>    instance fails, data is lost.
>
> 4. How far along is the MySQL implementation in DSpace? I saw an
>    article in the email lists about MySQL that was several years old.
>
> 5. Is there a Hadoop alternative for DSpace storage?
>
> 6. Is there a NoSQL alternative for DSpace storage?
>
> Thank you. The storage requirement grew from 100GB to 50TB in the
> blink of an eye. Now the scaling part of it.
>
> Charles Keagle
> Sr. Cloud Engineer | 2nd Watch
> 603 Stewart St, Suite 707 | Seattle, WA | 98101
> http://www.2ndwatch.com
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

