Hi Ivan,
Thanks for your reply.
Here is an example of the XML file I am loading. I generate this XML with an
awk script that parses plain-text log files.
<?xml version="1.0" standalone="yes"?>
<mike>
<mike_logs device_id="938880111">
<device_events>
<device_event>
<datetime>2009-01-27T09:37:29+02:00</datetime>
<event_id>1</event_id>
<numerical_severity>801</numerical_severity>
<category>DISTORTION</category>
<text>Module #245 has been distorted.</text>
</device_event>
<device_event>
<datetime>2009-01-27T09:37:48+02:00</datetime>
<event_id>2</event_id>
<severity>Critical</severity>
<category>POWER</category>
<text>Power loss due to error BAD_JHU.</text>
</device_event>
<device_event>
<datetime>2009-01-27T09:37:48+02:00</datetime>
<event_id>3</event_id>
<category>MANUAL</category>
<text>Received manual interruption: tried disconnecting as directed by John
(E-Mail ID #43213)</text>
</device_event>
</device_events>
</mike_logs>
</mike>
Most of my queries search the text data (using fn:contains()) and look for
events (device_event nodes) that occur between two other events (also matched
with fn:contains()).
For example: how many MANUAL events occurred between an event whose text
contains "John" and a Critical error?
I would love to hear your recommendation on how to structure the nodes.
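To pin down the semantics of that example query, here is a rough sketch in Python (stdlib ElementTree) over the sample document above; in Sedna itself this would of course be XQuery with fn:contains(). The function name and the "strictly between, in either document order" reading are my own assumptions:

```python
import xml.etree.ElementTree as ET

# The sample document from the mail above, verbatim.
SAMPLE = """<?xml version="1.0" standalone="yes"?>
<mike>
<mike_logs device_id="938880111">
<device_events>
<device_event>
<datetime>2009-01-27T09:37:29+02:00</datetime>
<event_id>1</event_id>
<numerical_severity>801</numerical_severity>
<category>DISTORTION</category>
<text>Module #245 has been distorted.</text>
</device_event>
<device_event>
<datetime>2009-01-27T09:37:48+02:00</datetime>
<event_id>2</event_id>
<severity>Critical</severity>
<category>POWER</category>
<text>Power loss due to error BAD_JHU.</text>
</device_event>
<device_event>
<datetime>2009-01-27T09:37:48+02:00</datetime>
<event_id>3</event_id>
<category>MANUAL</category>
<text>Received manual interruption: tried disconnecting as directed by John
(E-Mail ID #43213)</text>
</device_event>
</device_events>
</mike_logs>
</mike>"""

def manual_events_between(xml_text, needle="John"):
    """Count MANUAL events strictly between the first event whose text
    contains `needle` and the first Critical event, in document order."""
    events = ET.fromstring(xml_text).findall(".//device_event")
    i_needle = next(i for i, e in enumerate(events)
                    if needle in (e.findtext("text") or ""))
    i_crit = next(i for i, e in enumerate(events)
                  if e.findtext("severity") == "Critical")
    lo, hi = sorted((i_needle, i_crit))
    return sum(1 for e in events[lo + 1:hi]
               if e.findtext("category") == "MANUAL")

print(manual_events_between(SAMPLE))  # -> 0 (events 2 and 3 are adjacent)
```

In the sample the Critical event and the "John" event are adjacent, so nothing lies between them.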
Thanks!
- Dave
On Wed, Sep 09, 2009 at 11:51:23AM +0400, Ivan Shcheklein wrote:
> Hi Dave,
>
>
>
> > My plan is to convert and import a total of about 50GB of log files to
> > sedna. Do you think that the ratio will be the same? i.e. 50GB of log files
> > will turn to 109GB of xml which will be saved as 340GB?
> >
>
>
> Yes, it's likely the ratio will be approximately the same, although it
> depends heavily on the structure of your data. Can you give us an example of
> the XML you want to load?
>
>
> > BTW, do you have any recommendation as to the way the data is saved? I am
> > considering separate databases or separate documents, one document with many
> > sub-nodes. From your explanation I understand that the more the data is
> > divided into nodes the more disk space it will require, so perhaps I'm
> > better off separating the data into several documents.
> >
>
>
> Usually it's not a good idea to use many databases. For example, you
> won't be able to query them simultaneously. Do you have one big document or
> many small documents? You will get approximately the same result whether
> you load your data as one big document into a database or as several
> documents into a collection (
> http://modis.ispras.ru/sedna/progguide/ProgGuidesu8.html#x14-470002.5.2 ).
>
> > I am also concerned about performance. Has a 340GB database ever been tried
> > on sedna to your knowledge?
> >
>
>
> Sure. We had experience with 500-600GB databases. BTW, WikiXMLDB demo has
> 130GB database.
>
>
> Ivan Shcheklein,
> Sedna Team
>
>
> > Thanks for your help!
> >
> > - Dave
> >
> > On Wed, Sep 09, 2009 at 05:56:29AM +0930, Justin Johansson wrote:
> > > Hi Dave,
> > >
> > > You will find that this issue is not confined to Sedna but is common to
> > > most XML databases, whether they are "native XML databases" or implemented
> > > over a relational DB such as MonetDB. Unless your application is
> > > running in a very disk-space-limited environment (such as a mobile
> > > device), disk space these days is so cheap that it's not really an issue
> > > to worry too much about. Having said that, I'll try to explain why it is
> > > like that.
> > >
> > > Going from a 70MB log file (presumably plain text with variable-length
> > > log lines) to 144MB in XML format is easily explained by the space that
> > > the added XML tags take up. (That's not telling you anything new, as you
> > > seem to appreciate that bit.) Going from XML text to persisting the data
> > > in an XML database carries a storage overhead for analogous reasons: the
> > > addition of XPath-axis relationship information between the nodes in the
> > > XML, if for no other reason.
> > >
> > > Think for a moment about how an XML DOM (Document Object Model) is
> > > implemented. (I'm not saying that Sedna is implemented as a persistent
> > > DOM, but it's a useful way to analyze your issue.) For each node in
> > > the document, in order for the database to implement XPath navigation
> > > efficiently, it needs to store "pointers" to parent and ancestor nodes,
> > > child nodes, preceding-sibling and following-sibling nodes, the list of
> > > attribute nodes (in the case of element nodes), and so on for all 13 (I
> > > think) XPath axes. This all takes space. Even if there are no child
> > > nodes, the node has to record "NULL" for the children, and even "NULL"
> > > takes space.
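A toy illustration of the point above (my own sketch, not how Sedna actually lays out nodes): even a minimal DOM-style node carries several per-node references, and each slot costs memory whether or not it is used.

```python
import sys

class Node:
    # One slot per navigational relationship a DOM-style store must keep.
    __slots__ = ("name", "parent", "children",
                 "prev_sibling", "next_sibling", "attributes")

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent       # pointer "up"
        self.children = []         # pointers "down"; an empty list still costs
        self.prev_sibling = None   # even None occupies a slot
        self.next_sibling = None
        self.attributes = {}

leaf = Node("text")
overhead = (sys.getsizeof(leaf) + sys.getsizeof(leaf.children)
            + sys.getsizeof(leaf.attributes))
print(f"one empty leaf node: ~{overhead} bytes before any text is stored")
```

So a document of millions of tiny text nodes pays this fixed per-node cost millions of times, which is where the XML-text-to-database growth comes from.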
> > >
> > > The problem is exacerbated in an
> > > "XML-database-on-top-of-a-relational-database" scenario, where all
> > > these relationships take a multitude of tables to express with any hope
> > > of runtime performance.
> > >
> > > All in all, it's back to one of the fundamental principles of computer
> > > science: (memory) space and (execution) time are generally inversely
> > > related. If you want to use the smallest amount of space for data
> > > storage, you "zip it up", but then it takes a long time to find your
> > > data in the compressed file. If you want to access your data in the
> > > smallest amount of time, you "expand it out" and use whatever amount of
> > > memory you can to get the best time performance out of your algorithms.
> > >
> > > I wonder how much space your 70MB log file takes when zipped up? I bet
> > > there's a lot of redundancy in the information, so the compression ratio
> > > will be high. Storing the data in an XML database simply takes the
> > > ratio in the other direction :-) It is reasonable to expect a decent
> > > (execution-time) performance benefit, though, in accessing/navigating
> > > the data. If there weren't this benefit (amongst others), the tradeoff
> > > would not be worth it.
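The compression claim is easy to check with a quick sketch (Python's stdlib gzip; the sample line is made up, but it has the repetitive shape of real log data):

```python
import gzip

# Highly redundant log-like text: repeated timestamps and fixed phrases.
line = "2009-01-27T09:37:48+02:00 POWER Power loss due to error BAD_JHU.\n"
raw = (line * 10000).encode()

packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
print(f"{len(raw)} -> {len(packed)} bytes, ratio ~{ratio:.0f}:1")
```

Repetitive logs routinely compress by an order of magnitude or more, which matches the gzip figure quoted at the end of this thread.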
> > >
> > > Trust this rather wordy explanation helps.
> > >
> > > Cheers
> > > Justin Johansson
> > >
> > > btw. So just what is the zipped up compression ratio for your log file?
> > >
> > >
> > >
> > > Dave Stav wrote:
> > > > Hi List Members,
> > > >
> > > > I noticed that sedna database takes quite a lot of disk space,
> > > compared to the data it contains, and I was wondering why it is like
> > > that.
> > > >
> > > > I am converting a 70MB log file to an xml file which takes up 144 MB.
> > > > After loading this xml file to a newly created sedna database, I can
> > > see that the database directory takes up 450MB.
> > > >
> > > > Does anyone know why this is happening and/or if there is anything we
> > > can do to reduce the disk usage?
> > > >
> > > > Thanks!
> > > >
> > > > - Dave
> > > >
> > >
> > >
> >
> The 66MB log file was gzipped down to 5.5MB (about the same with zip). This
> was a test log file.
--
EE 77 7F 30 4A 64 2E C5 83 5F E7 49 A6 82 29 BA ~. .~ Tk Open Systems
=}-----------------------------------------------ooO--U--Ooo-------------{=
- [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
_______________________________________________
Sedna-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/sedna-discussion