Re: [Sedna-discussion] Disk Space Usage

Ivan Shcheklein Wed, 09 Sep 2009 02:05:17 -0700

Dave, how many files do you have? Am I right that each one describes a unique
user's log (e.g. mike's logs)?


On Wed, Sep 9, 2009 at 12:57 PM, Dave Stav <[email protected]> wrote:

> Hi Ivan,
>
> Thanks for your reply.
>
> Here is an example of the XML file I am loading. I create this xml using an
> awk script that parses plain text log files.
>
> <?xml version="1.0" standalone="yes"?>
> <mike>
>  <mike_logs device_id="938880111">
>  <device_events>
>   <device_event>
>    <datetime>2009-01-27T09:37:29+02:00</datetime>
>    <event_id>1</event_id>
>    <numerical_severity>801</numerical_severity>
>    <category>DISTORTION</category>
>    <text>Module #245 has been distorted.</text>
>   </device_event>
>   <device_event>
>    <datetime>2009-01-27T09:37:48+02:00</datetime>
>    <event_id>2</event_id>
>    <severity>Critical</severity>
>    <category>POWER</category>
>    <text>Power loss due to error BAD_JHU.</text>
>   </device_event>
>   <device_event>
>    <datetime>2009-01-27T09:37:48+02:00</datetime>
>    <event_id>3</event_id>
>    <category>MANUAL</category>
>    <text>Received manual interruption: tried disconnecting as directed by
> John (E-Mail ID #43213)</text>
>   </device_event>
>  </device_events>
>  </mike_logs>
> </mike>
>
> Most of my queries search for text data ( using fn:contains() ) and search
> for events (device_event nodes) that occur between other events (also using
> fn:contains()).
> For example: How many MANUAL events occurred between a text containing
> "John" and a critical error.
>
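A rough sketch of one reading of that example query, written in Python rather than XQuery and not using any Sedna API. It assumes "between" means: after the first event whose text contains "John" and before the next event with severity Critical. The sample data and the function name manual_between are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the same shape as the sample file above.
DOC = """<mike><mike_logs device_id="938880111"><device_events>
<device_event><event_id>1</event_id><category>MANUAL</category>
  <text>Disconnected as directed by John</text></device_event>
<device_event><event_id>2</event_id><category>MANUAL</category>
  <text>Manual restart</text></device_event>
<device_event><event_id>3</event_id><category>DISTORTION</category>
  <text>Module #245 has been distorted.</text></device_event>
<device_event><event_id>4</event_id><category>MANUAL</category>
  <text>Manual check</text></device_event>
<device_event><event_id>5</event_id><severity>Critical</severity>
  <category>POWER</category><text>Power loss</text></device_event>
<device_event><event_id>6</event_id><category>MANUAL</category>
  <text>Manual reset</text></device_event>
</device_events></mike_logs></mike>"""


def manual_between(xml_text, needle="John"):
    """Count MANUAL events after the first text containing `needle`
    and before the next Critical event (document order)."""
    counting = False
    count = 0
    for ev in ET.fromstring(xml_text).iter("device_event"):
        if not counting:
            # Still looking for the opening event; it is not counted itself.
            counting = needle in (ev.findtext("text") or "")
            continue
        if ev.findtext("severity") == "Critical":
            break  # closing event reached
        if ev.findtext("category") == "MANUAL":
            count += 1
    return count


print(manual_between(DOC))
```

If no Critical event follows, this version counts to the end of the document; whether that is the wanted behaviour depends on the real query.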
> I would love to hear your recommendation regarding how to build the nodes.
>
> Thanks!
>
>  - Dave
>
>
> On Wed, Sep 09, 2009 at 11:51:23AM +0400, Ivan Shcheklein wrote:
> > Hi Dave,
> >
> >
> >
> > > My plan is to convert and import a total of about 50GB of log files to
> > > sedna. Do you think that the ratio will be the same? i.e. 50GB of log files
> > > will turn to 109GB of xml which will be saved as 340GB?
> > >
> >
> >
> > Yes, it's likely the ratio will be approximately the same. It depends
> > heavily on the structure of your data. Can you give us an example of the XML
> > you want to load?
> >
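A small sketch of the arithmetic behind that estimate, in Python. It uses only the sizes reported in this thread (a 66MB test log, 144MB of XML, a 450MB database directory) and assumes both ratios stay constant as the data grows, which is only an approximation. The function name project is made up for illustration.

```python
# Sizes measured on the test file, as reported in this thread (in MB).
LOG_MB = 66.0    # plain-text log (first quoted as roughly 70MB)
XML_MB = 144.0   # same data after conversion to XML
DB_MB = 450.0    # database directory after loading the XML


def project(log_gb):
    """Scale a log size in GB by the two measured ratios.

    Assumes the log-to-XML and XML-to-database ratios do not change
    with the amount of data.
    """
    xml_gb = log_gb * (XML_MB / LOG_MB)
    db_gb = xml_gb * (DB_MB / XML_MB)
    return xml_gb, db_gb


xml_gb, db_gb = project(50.0)
print(f"XML: {xml_gb:.0f} GB, database: {db_gb:.0f} GB")
```

This lands close to the 109GB and 340GB figures in the question.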
> >
> > > BTW, do you have any recommendation as to the way the data is saved? I am
> > > considering separate databases, separate documents, or one document with many
> > > sub-nodes. From your explanation I understand that the more the data is
> > > divided into nodes the more disk space it will require, so perhaps I'm
> > > better off separating the data into several documents.
> > >
> >
> >
> > Usually it's not a good idea to use many databases. For example, you
> > won't be able to query them simultaneously. Do you have one big document or
> > many small documents? You will get approximately the same result whether
> > you load your data as one big document into a database or as several documents
> > into a collection (
> > http://modis.ispras.ru/sedna/progguide/ProgGuidesu8.html#x14-470002.5.2).
> >
> > > I am also concerned about performance. Has a 340GB database ever been tried
> > > on sedna to your knowledge?
> > >
> >
> >
> > Sure. We have had experience with 500-600GB databases. BTW, the WikiXMLDB demo
> > has a 130GB database.
> >
> >
> > Ivan Shcheklein,
> > Sedna Team
> >
> >
> > > Thanks for your help!
> > >
> > >  - Dave
> > >
> > > On Wed, Sep 09, 2009 at 05:56:29AM +0930, Justin Johansson wrote:
> > > > Hi Dave,
> > > >
> > > > You will find that this issue is not confined to Sedna; it affects
> > > > most XML databases, whether they are "native XML databases" or implemented
> > > > over a relational DB, such as MonetDB.  Unless your application is
> > > > running in a very disk-space-limited environment (such as a mobile
> > > > device), disk space these days is so cheap that it's not really an issue
> > > > to worry too much about.  Having said that, I'll try to explain why it is
> > > > like that.
> > > >
> > > > Going from a 70MB log file (presumably plain text with variable-length log
> > > > lines) to 144MB in XML format is easily explained by the space
> > > > that the added XML tags take up. (That's not telling you anything
> > > > new as you seem to appreciate that bit).  Going from XML text to
> > > > persisting the data in an XML database has a storage overhead for analogous
> > > > reasons: at the very least, the database adds XPath-axis relationship
> > > > information between the nodes in the XML.
> > > >
> > > > Think for a moment about how XML-DOM (Document Object Model) is
> > > > implemented.  (Not saying that Sedna is implemented as a persistent
> > > > DOM but it's useful to analyze your issue this way).  For each node in
> > > > the document, in order for the database to implement XPath navigation
> > > > efficiently it needs to store "pointers" to parent and ancestor nodes,
> > > > child nodes, previous sibling and following sibling nodes, list of
> > > > attribute nodes (in case of element nodes) and so on for all 13 (in
> > > > number I think) different XPath axes.  This all takes space.  Even if
> > > > there are no child nodes, the node would have to record "NULL" for the
> > > > children and even "NULL" takes space.
> > > >
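To put rough numbers on the pointer argument, here is a back-of-the-envelope sketch in Python. It is not Sedna's storage layout. It assumes a DOM-like node that keeps five 8-byte links (parent, first child, previous sibling, next sibling, attribute list), and both constants are assumptions chosen for illustration. The sample is one device_event from the file shown earlier in the thread.

```python
import xml.etree.ElementTree as ET

POINTER_BYTES = 8    # assumed pointer size on a 64-bit machine
LINKS_PER_NODE = 5   # assumed: parent, first child, prev/next sibling, attributes

SAMPLE = """<device_event>
<datetime>2009-01-27T09:37:29+02:00</datetime>
<event_id>1</event_id>
<numerical_severity>801</numerical_severity>
<category>DISTORTION</category>
<text>Module #245 has been distorted.</text>
</device_event>"""


def link_overhead(xml_text):
    """Return (node_count, bytes spent on links) for the assumed layout."""
    nodes = 0
    for elem in ET.fromstring(xml_text).iter():
        nodes += 1                          # the element node itself
        nodes += len(elem.attrib)           # one node per attribute
        if elem.text and elem.text.strip():
            nodes += 1                      # its non-whitespace text node
    return nodes, nodes * LINKS_PER_NODE * POINTER_BYTES


nodes, overhead = link_overhead(SAMPLE)
print(nodes, "nodes,", overhead, "bytes of links for", len(SAMPLE), "bytes of XML")
```

Under these assumptions the links alone take more bytes than the XML text of the event, before any tag names, values or indexes are stored.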
> > > > The problem is exacerbated in an
> > > > "XML-database-on-top-of-a-relational-database" scenario whereby all
> > > > these relationships take tons of rubble (multitudes of tables) to
> > > > express with any hope of runtime performance benefit.
> > > >
> > > > All in all, it's back to one of the fundamental principles in computer
> > > > science.  (Memory) space and (execution) time are generally inversely
> > > > related:  If you want to use the smallest amount of space for data
> > > > storage you "zip it up" but then it will take a long time to find your
> > > > data in the compressed file.  If you want to access your data in the
> > > > smallest amount of time, you "expand it out" and use whatever amount of
> > > > memory you can to get the best time performance out of your algorithms.
> > > >
> > > > I wonder how much space your 70MB log file takes when zipped up?  Betcha
> > > > there's lots of redundancy in the information and the compression ratio
> > > > will be high.  Storing the data in an XML database simply takes the
> > > > ratio in the other direction :-)  It is reasonable to expect a decent
> > > > (execution time) performance benefit though in accessing/navigating the
> > > > data.  If there wasn't this benefit (amongst others) the tradeoff would
> > > > not be worth it.
> > > >
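A quick way to see the redundancy point, sketched in Python with the standard zlib module. The log lines are invented for illustration and are far more repetitive than a real log, so the ratio printed here will not match a real file. For comparison, the test log in this thread went from 66MB to 5.5MB with gzip, about 12:1.

```python
import zlib

# Hypothetical log lines with the kind of repetition log files tend to have.
lines = [
    f"2009-01-27T09:37:{i % 60:02d}+02:00 POWER Power loss due to error BAD_JHU.\n"
    for i in range(10_000)
]
raw = "".join(lines).encode("utf-8")
packed = zlib.compress(raw, 9)  # level 9: smallest output, slowest

ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes (ratio {ratio:.1f}:1)")

# The price of the small size: nothing can be searched until it is inflated.
restored = zlib.decompress(packed)
print(restored == raw)
```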
> > > > Trust this rather wordy explanation helps.
> > > >
> > > > Cheers
> > > > Justin Johansson
> > > >
> > > > btw.  So just what is the zipped up compression ratio for your log file?
> > > >
> > > >
> > > >
> > > > Dave Stav wrote:
> > > > > Hi List Members,
> > > > >
> > > > > I noticed that a sedna database takes quite a lot of disk space
> > > > > compared to the data it contains, and I was wondering why it is like
> > > > > that.
> > > > >
> > > > > I am converting a 70MB log file to an xml file which takes up 144MB.
> > > > > After loading this xml file to a newly created sedna database, I can
> > > > > see that the database directory takes up 450MB.
> > > > >
> > > > > Does anyone know why this is happening and/or if there is anything we
> > > > > can do to reduce the disk usage?
> > > > >
> > > > > Thanks!
> > > > >
> > > > >  - Dave
> > > > >
> > > >
> > > >
> > >
> > > --
> > >  EE 77 7F 30 4A 64 2E C5  83 5F E7 49 A6 82 29 BA    ~. .~   Tk Open
> > > Systems
> > >
> > >
>  =}-----------------------------------------------ooO--U--Ooo-------------{=
> > >      - [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -
> > >
> > >
> > >
> ------------------------------------------------------------------------------
> > > Let Crystal Reports handle the reporting - Free Crystal Reports 2008
> 30-Day
> > > trial. Simplify your report design, integration and deployment - and
> focus
> > > on
> > > what you do best, core application coding. Discover what's new with
> > > Crystal Reports now.  http://p.sf.net/sfu/bobj-july
> > > _______________________________________________
> > > Sedna-discussion mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/sedna-discussion
> > >
> > The 66MB log file is gzipped to 5.5MB (same with zip). This was a test log
> > file.
>
> --
>  EE 77 7F 30 4A 64 2E C5  83 5F E7 49 A6 82 29 BA    ~. .~   Tk Open
> Systems
>
>  =}-----------------------------------------------ooO--U--Ooo-------------{=
>      - [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -
>

