Re: [Sedna-discussion] Disk Space Usage

Justin Johansson Tue, 08 Sep 2009 13:27:33 -0700

Hi Dave,

You will find that this issue is not confined to Sedna but applies to 
most XML databases, whether they are "native XML databases" or implemented 
over a relational DB such as MonetDB.  Unless your application is 
running in a very disk-space-limited environment (such as a mobile 
device), disk space these days is so cheap that it's not really an issue 
to worry too much about.  Having said that, I'll try to explain why it is 
like that.

Going from a 70MB log file (presumably plain text with variable-length 
log lines) to 144MB in XML format is easily explained by the space that 
the added XML tags take up.  (That's not telling you anything new, as you 
seem to appreciate that bit.)  Going from XML text to persisting the data 
in an XML database adds storage overhead for an analogous reason: if 
nothing else, the database adds XPath-axis relationship information 
between the nodes in the XML.
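
To make the tag overhead concrete, here is a minimal sketch in Python.  The 
log line and the element names (entry, date, time, level, source, message) 
are made up for illustration; they are not taken from Dave's data, so the 
growth factor it prints says nothing about his 70MB-to-144MB case.

```python
from xml.sax.saxutils import escape

# Hypothetical log line; the format is an assumption for illustration only.
log_line = "2009-09-08 13:27:33 INFO worker-3 request served in 42 ms"


def to_xml(line: str) -> str:
    """Wrap the five fields of one log line in made-up XML elements."""
    date, time, level, source, message = line.split(" ", 4)
    return (
        "<entry>"
        f"<date>{escape(date)}</date>"
        f"<time>{escape(time)}</time>"
        f"<level>{escape(level)}</level>"
        f"<source>{escape(source)}</source>"
        f"<message>{escape(message)}</message>"
        "</entry>"
    )


xml_line = to_xml(log_line)
plain_bytes = len(log_line.encode("utf-8"))
xml_bytes = len(xml_line.encode("utf-8"))
growth = xml_bytes / plain_bytes

print(f"plain: {plain_bytes} bytes, XML: {xml_bytes} bytes, growth: {growth:.2f}x")
```

The shorter the field values are relative to the tag names, the larger the 
growth factor becomes.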

Think for a moment about how an XML DOM (Document Object Model) is 
implemented.  (I'm not saying that Sedna is implemented as a persistent 
DOM, but it's useful to analyze your issue this way.)  For each node in 
the document, in order to implement XPath navigation efficiently, the 
database needs to store "pointers" to parent and ancestor nodes, 
child nodes, previous-sibling and following-sibling nodes, the list of 
attribute nodes (in the case of element nodes) and so on for all the 
different XPath axes (13 in number, I think).  This all takes space.  
Even if there are no child nodes, the node would have to record "NULL" 
for the children, and even "NULL" takes space.
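
A rough back-of-the-envelope sketch of that pointer overhead, in Python.  
This is not Sedna's actual storage layout; the six links per node and the 
8-byte pointer size are assumptions picked purely to show how the 
navigation links can outweigh the text they describe.

```python
import xml.etree.ElementTree as ET

POINTER_BYTES = 8   # assumption: 64-bit references
# Assumed links stored per node: parent, first child, last child,
# previous sibling, following sibling, first attribute.
LINKS_PER_NODE = 6

# Tiny made-up document standing in for a converted log file.
sample = (
    "<log>"
    "<entry level='INFO'><msg>ok</msg></entry>"
    "<entry level='WARN'><msg>slow</msg></entry>"
    "</log>"
)


def count_nodes(xml_text: str) -> int:
    """Count element, attribute and non-empty text nodes in a document."""
    root = ET.fromstring(xml_text)
    elements = attributes = texts = 0
    for el in root.iter():
        elements += 1
        attributes += len(el.attrib)
        if el.text and el.text.strip():
            texts += 1
    return elements + attributes + texts


nodes = count_nodes(sample)
link_bytes = nodes * LINKS_PER_NODE * POINTER_BYTES
text_bytes = len(sample.encode("utf-8"))

print(f"{nodes} nodes, {link_bytes} bytes of links for {text_bytes} bytes of XML text")
```

Every link slot is counted even when it would hold "NULL", which is the 
point made above: an empty pointer still occupies space.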

The problem is exacerbated in an 
"XML-database-on-top-of-a-relational-database" scenario whereby all 
these relationships take tons of rubble (multitudes of tables) to 
express with any hope of runtime performance benefit.

All in all, it's back to one of the fundamental principles in computer 
science.  (Memory) space and (execution) time are generally inversely 
related:  If you want to use the smallest amount of space for data 
storage you "zip it up", but then it will take a long time to find your 
data in the compressed file.  If you want to access your data in the 
smallest amount of time, you "expand it out" and use whatever amount of 
memory you can to get the best time performance out of your algorithms.

I wonder how much space your 70MB log file takes when zipped up? Betcha 
there's lots of redundancy in the information and the compression ratio 
will be high.  Storing the data in an XML database simply takes the 
ratio in the other direction :-)  It is reasonable to expect a decent 
(execution time) performance benefit though in accessing/navigating the 
data.  If there wasn't this benefit (amongst others) the tradeoff would 
not be worth it.
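
If you want to try the "zip it up" experiment on something small first, 
here is a minimal sketch using Python's zlib module.  The log lines are 
generated and entirely made up, so the ratio it prints is only indicative 
of how well repetitive log text tends to compress, not of your file.

```python
import zlib

# Made-up, highly repetitive log lines standing in for a real log file.
lines = [
    f"2009-09-08 13:{i % 60:02d}:{(i * 7) % 60:02d} INFO worker-{i % 4} "
    f"request served in {i % 50} ms\n"
    for i in range(10_000)
]
raw = "".join(lines).encode("utf-8")

# Level 9 trades more CPU time for a smaller result: the space/time
# tradeoff in miniature.
packed = zlib.compress(raw, 9)
ratio = len(raw) / len(packed)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, ratio: {ratio:.1f}:1")

# Getting at any single line again means decompressing first.
restored = zlib.decompress(packed)
```

For a real file, reading its bytes with open(path, "rb").read() in place of 
the generated lines gives the actual ratio.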

Trust this rather wordy explanation helps.

Cheers
Justin Johansson

btw.  So just what is the zipped up compression ratio for your log file?

Dave Stav wrote:
 > Hi List Members,
 >
 > I noticed that sedna database takes quite a lot of disk space,
 > compared to the data it contains, and I was wondering why it is like that.
 >
 > I am converting a 70MB log file to an xml file which takes up 144 MB.
 > After loading this xml file to a newly created sedna database, I can
 > see that the database directory takes up 450MB.
 >
 > Does anyone know why this is happening and/or if there is anything we
 > can do to reduce the disk usage?
 >
 > Thanks!
 >
 >  - Dave
 >
