Re: [Sedna-discussion] Disk Space Usage

Dave Stav Wed, 09 Sep 2009 00:16:15 -0700

Hi Justin,

Thank you for your detailed explanation! It does indeed make a lot of sense.


The 66MB log file gzips to 5.5MB (zip gives the same result). This was a test
log file.
My plan is to convert and import a total of about 50GB of log files into Sedna.
Do you think the ratio will be the same? I.e., will 50GB of log files turn
into 109GB of XML, which will be saved as 340GB?
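
For what it's worth, the extrapolation above can be checked with a few lines
of Python. This is only a sketch: it assumes the ratios measured on the 66MB
test file scale linearly to 50GB, which is exactly the open question, and the
names (scale, estimate) are just illustrative.

```python
# Sizes reported in this thread, in MB.
LOG_MB = 66.0    # plain-text test log
GZIP_MB = 5.5    # the same log, gzipped
XML_MB = 144.0   # after conversion to XML
DB_MB = 450.0    # Sedna database directory after loading the XML


def scale(total_log_gb: float) -> dict:
    """Linearly extrapolate the measured ratios to a larger log volume.

    Assumption: the expansion ratios stay constant as the data grows.
    """
    return {
        "xml_gb": total_log_gb * XML_MB / LOG_MB,
        "db_gb": total_log_gb * DB_MB / LOG_MB,
        "gzip_gb": total_log_gb * GZIP_MB / LOG_MB,
    }


estimate = scale(50.0)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption the XML comes out at about 109GB and the database at
about 340GB, i.e. roughly 2.2x and 6.8x the raw log size, while gzip shrinks
the same data by a factor of 12.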

BTW, do you have any recommendation on how the data should be organized? I am
considering separate databases, separate documents, or one document with many
sub-nodes. From your explanation I understand that the more the data is divided
into nodes, the more disk space it will require, so perhaps I'm better off
separating the data into several documents.
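
To make the "more nodes, more space" point concrete, here is a rough Python
sketch of the per-node bookkeeping Justin describes. The layout is entirely
hypothetical (four 8-byte links per node, one element plus one text node per
field, one million log entries); it is not Sedna's actual storage format, only
a way to see how the overhead grows with node count.

```python
POINTER_BYTES = 8    # assumption: 64-bit links
LINKS_PER_NODE = 4   # assumption: parent, first child, previous and next sibling


def structural_overhead(node_count: int) -> int:
    """Bytes spent on navigation links alone, excluding the payload text."""
    return node_count * LINKS_PER_NODE * POINTER_BYTES


def nodes_per_entry(fields: int) -> int:
    """One element per log entry, plus an element and a text node per field."""
    return 1 + 2 * fields


entries = 1_000_000  # hypothetical number of log lines

# Fine-grained: every log line split into six tagged fields.
fine = structural_overhead(entries * nodes_per_entry(fields=6))
# Coarse: every log line stored as a single text field.
coarse = structural_overhead(entries * nodes_per_entry(fields=1))

print(f"fine-grained: {fine / 2**20:.0f} MiB of links")
print(f"coarse:       {coarse / 2**20:.0f} MiB of links")
```

In this toy model the overhead depends only on the total number of nodes, so
spreading the same nodes over several documents would not by itself save
space; storing fewer, larger nodes would, at the cost of less precise queries.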

I am also concerned about performance. Has a 340GB database ever been tried on
Sedna, to your knowledge?

Thanks for your help!

 - Dave

On Wed, Sep 09, 2009 at 05:56:29AM +0930, Justin Johansson wrote:
> Hi Dave,
>
> You will find that this issue is not confined to Sedna but applies to
> most XML databases, whether they are "native XML databases" or implemented
> over a relational DB, such as MonetDB.  Unless your application is
> running in a very disk-space-limited environment (such as a mobile
> device), disk space these days is so cheap that it's not really an issue
> to worry too much about.  Having said that, I'll try to explain why it is
> like that.
>
> Going from a 70MB log file (presumably plain text with variable-length log
> lines) to 144MB in XML format is easily explained by the space
> that the added XML tags take up. (That's not telling you anything
> new, as you seem to appreciate that bit.)  Going from XML text to
> persisting the data in an XML database has a storage overhead for analogous
> reasons: the addition of XPath-axis relationship information
> between the nodes in the XML, if for no other reason.
>
> Think for a moment about how XML-DOM (Document Object Model) is
> implemented.  (I'm not saying that Sedna is implemented as a persistent
> DOM, but it's useful to analyze your issue this way.)  For each node in
> the document, in order for the database to implement XPath navigation
> efficiently, it needs to store "pointers" to parent and ancestor nodes,
> child nodes, previous sibling and following sibling nodes, the list of
> attribute nodes (in the case of element nodes) and so on for all 13 (I
> think) different XPath axes.  This all takes space.  Even if
> there are no child nodes, the node would have to record "NULL" for the
> children, and even "NULL" takes space.
>
> The problem is exacerbated in an  
> "XML-database-on-top-of-a-relational-database" scenario whereby all  
> these relationships take tons of rubble (multitudes of tables) to  
> express with any hope of runtime performance benefit.
>
> All in all, it's back to one of the fundamental principles in computer
> science.  (Memory) space and (execution) time are generally inversely
> related:  If you want to use the smallest amount of space for data
> storage, you "zip it up", but then it will take a long time to find your
> data in the compressed file.  If you want to access your data in the
> smallest amount of time, you "expand it out" and use whatever amount of
> memory you can to get the best time performance out of your algorithms.
>
> I wonder how much space your 70MB log file takes when zipped up? Betcha  
> there's lots of redundancy in the information and the compression ratio  
> will be high.  Storing the data in an XML database simply takes the  
> ratio in the other direction :-)  It is reasonable to expect a decent  
> (execution time) performance benefit though in accessing/navigating the  
> data.  If there wasn't this benefit (amongst others) the tradeoff would  
> not be worth it.
>
> Trust this rather wordy explanation helps.
>
> Cheers
> Justin Johansson
>
> btw.  So just what is the zipped up compression ratio for your log file?
>
>
>
> Dave Stav wrote:
> > Hi List Members,
> >
> > I noticed that sedna database takes quite a lot of disk space,
> > compared to the data it contains, and I was wondering why it is like
> > that.
> >
> > I am converting a 70MB log file to an xml file which takes up 144 MB.
> > After loading this xml file to a newly created sedna database, I can
> > see that the database directory takes up 450MB.
> >
> > Does anyone know why this is happening and/or if there is anything we
> > can do to reduce the disk usage?
> >
> > Thanks!
> >
> >  - Dave
> >
>
>

-- 
 EE 77 7F 30 4A 64 2E C5  83 5F E7 49 A6 82 29 BA    ~. .~   Tk Open Systems
 =}-----------------------------------------------ooO--U--Ooo-------------{=
      - [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -

