Just a few musings that are interesting to think about.

DSpace can be described as having two parts: the database, which handles
all the metadata, and the actual stored file objects such as PDF
documents or image/audio files. The file objects occupy most of the
space and are stored as files on a hierarchical file system. The files
are conveniently split into directory hierarchies and numbered with
DSpace identifiers.
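
As a rough illustration of that layout, here is a minimal Python sketch.
It assumes the default bitstream store scheme as I understand it, where
the first six digits of a bitstream's internal ID are taken in pairs to
form three directory levels. The assetstore root and the sample ID are
made up, so check your own assetstore before relying on it.

```python
from pathlib import PurePosixPath

# Assumed layout: 3 directory levels, 2 digits per level (verify locally).
DIRECTORY_LEVELS = 3
DIGITS_PER_LEVEL = 2


def bitstream_path(assetstore_root: str, internal_id: str) -> PurePosixPath:
    """Return the assumed on-disk path for a bitstream's internal ID."""
    needed = DIRECTORY_LEVELS * DIGITS_PER_LEVEL
    if len(internal_id) < needed:
        raise ValueError(f"internal ID needs at least {needed} characters")
    # Slice the leading digits into one directory name per level.
    parts = [
        internal_id[i * DIGITS_PER_LEVEL:(i + 1) * DIGITS_PER_LEVEL]
        for i in range(DIRECTORY_LEVELS)
    ]
    return PurePosixPath(assetstore_root).joinpath(*parts, internal_id)


if __name__ == "__main__":
    # Hypothetical root and ID, for illustration only.
    print(bitstream_path("/dspace/assetstore", "12345678901234567890"))
```

With these made-up values the file would sit under 12/34/56/ below the
assetstore root.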

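So yes, data is stored external to the database, and that data might be
roughly estimated to be 500 times greater in size than the database. At
that ratio a 50 terabyte store would imply a database of only about 100
gigabytes, well under the 3 TB limit you mention.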

Cloud storage is, to me, highly suspect: it is not under local
administration, and it was recently confirmed to be sifted by various
agencies.

Nevertheless, cloud storage might provide high availability (in
theory, and if you pay). Fifty terabytes would bring a whole set of
interesting problems.

If you handle the backup yourself by maintaining a second copy, then you
would need to continually poll the data file system for changes and
copy those changes across, perhaps using rsync, and using the Postgres
logs to locate the place in the file system that needs to be updated.
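
A minimal sketch of the polling idea follows. Note that it finds changes
by file modification time rather than by reading the Postgres logs, which
is a simpler substitute and not the same technique. The function names,
paths and the backup host are all hypothetical; only the rsync options
-a and --files-from are real.

```python
import os


def changed_since(root: str, since_epoch: float) -> list:
    """Return paths (relative to root) of files modified after since_epoch."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) > since_epoch:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)


def rsync_command(root: str, list_file: str, destination: str) -> list:
    """Build an rsync invocation that copies only the listed files."""
    # -a preserves permissions and timestamps; --files-from reads a list
    # of paths that are relative to the source directory.
    source = root.rstrip("/") + "/"
    return ["rsync", "-a", f"--files-from={list_file}", source, destination]


if __name__ == "__main__":
    # Hypothetical locations; nothing is copied here, the command is only printed.
    print(rsync_command("/dspace/assetstore", "/tmp/changed.txt",
                        "backuphost:/dspace/assetstore"))
```

You would write the output of changed_since() to the list file, run the
command, and record the time of the run for the next poll.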

Another approach is to duplicate the Postgres commands on two systems,
but with this method the two systems will not be identical. For example,
they may end up with different handle allocations.
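
To see how the same commands can leave two systems with different
handles, here is a toy model in Python. It is not DSpace code: the
Repository class, the 123456789 prefix and the item name are invented.
It assumes only that a counter value, once consumed (say by a submission
that later fails), is not handed back, which is how PostgreSQL sequences
behave.

```python
class Repository:
    """Toy repository that allocates handles from its own local counter."""

    def __init__(self, prefix: str = "123456789"):
        self.prefix = prefix
        self.next_suffix = 1
        self.handles = {}

    def submit(self, item: str, fails: bool = False):
        # The counter value is consumed even if the submission later fails.
        suffix = self.next_suffix
        self.next_suffix += 1
        if fails:
            return None
        handle = f"{self.prefix}/{suffix}"
        self.handles[item] = handle
        return handle


if __name__ == "__main__":
    primary, replica = Repository(), Repository()
    # Same submission on both, but one attempt fails only on the primary.
    primary.submit("thesis-a", fails=True)
    primary.submit("thesis-a")
    replica.submit("thesis-a")
    print(primary.handles["thesis-a"], replica.handles["thesis-a"])
```

The same item ends up with suffix 2 on one system and suffix 1 on the
other, so links built from the handles would not match.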

Here at USyd we do rsync updates daily, but updates are infrequent and
our system is only a terabyte. The system runs on a virtual machine,
which in theory provides high availability. Central IT are unable to do
backups of the database, so we use rsync onto duplicate machines (DEV,
UAT and DR) for both the database and the file system.

The best scenario would initiate rsync whenever a file system change was
made. If the database is very active then keeping the backup database
and data file system synchronised will give you a number of interesting
software projects.
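
One small piece of such a project is checking that the two halves still
agree. Below is a sketch, assuming you can export the bitstream internal
IDs from the database by some means of your own (the query is not shown,
and db_ids is just a set of strings) and that each stored file is named
after its internal ID. The function name is hypothetical.

```python
import os


def compare_store(db_ids: set, assetstore_root: str):
    """Return (missing_on_disk, orphans_on_disk) as sorted lists of IDs."""
    on_disk = set()
    for _dirpath, _dirnames, filenames in os.walk(assetstore_root):
        on_disk.update(filenames)
    missing = sorted(db_ids - on_disk)   # known to the database, no file found
    orphans = sorted(on_disk - db_ids)   # file present, unknown to the database
    return missing, orphans
```

Running it against both the live system and the backup copy would show
where the database and the file system have drifted apart.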

I can't comment on NoSQL.

I think Hadoop is an Apache project, which DSpace is likely to support.
Linux's LVS (Linux Virtual Server) might be another possibility.



On Fri, 2013-08-23 at 22:21 +0000, Charles Keagle wrote:
> We are looking at building a 50TB DSpace Repository in AWS and are new
> to DSpace.  At this scale, it does not look like Amazon Relational
> Database Service can meet the 50TB requirement.  RDS has a 3 TB
> maximum size limitation.  Some questions we have come up with will
> help us decide how to go about this task:
> 
> 1.     Can DSpace store data content external to the database?  The
> Amazon S3 is a good place to store the data, but it is not a database,
> it is object storage.  The database can then store a pointer to that
> external data.  A URL in DSpace would be a good way to access S3 data.
> Comparing S3 (Simple Storage Service) and EBS (Elastic Block Storage)
> costs for 50TB makes S3 look very attractive.
> 
> 2.     What are the High Availability solutions for DSpace?
> 
> 3.     Is there a replication mechanism in DSpace for High
> Availability if we store in Amazon Ephemeral Storage which is not
> persistent?  This replication would synchronize the database in
> multiple Amazon Availability Zones in the same Region.  This is
> another much less costly alternative than EBS.  Not all that reliable,
> though: when an instance fails, data is lost.
> 
> 4.     How far along is the MySQL implementation in DSpace?  I saw an
> article in the email lists about MySQL that was several years old.
> 
> 5.     Is there a Hadoop alternative for DSpace storage?
> 
> 6.     Is there a NoSQL alternative for DSpace storage?
> 
>  
> 
> Thank you.  The storage requirement grew from 100GB to 50TB in the
> blink of an eye.  Now the scaling part of it.
> 
>  
> 
> Charles Keagle
> 
> Sr. Cloud Engineer | 2nd Watch 
> 
> 603 Stewart St, Suite 707 | Seattle, WA | 98101
> 
> Mobile 425-417-3434 | Office 888.747.8254
> 
> http://www.2ndwatch.com
> 
> 
> _______________________________________________ DSpace-tech mailing list 
> [email protected] 
> https://lists.sourceforge.net/lists/listinfo/dspace-tech List Etiquette: 
> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette


