Hi Ivan,

Actually, "mike" is the log group type. Most of the logs are of type mike, and the other types have similar syntax. I have about 100 log files for each device_id. Let's assume I only have mike logs.

Initially, my awk parser prints the whole thing to one big XML file, which I pipe into the sedna load command. Later I will need to periodically add new log files in. Let's say... every 10 minutes, add another 10MB of log lines (parsed to 20MB of XML) -- i.e., add more events to a device_id.
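For concreteness, here is a minimal sketch of the parsing step. It is a hypothetical Python stand-in for the awk script (the tab-separated log layout -- datetime, category, free text -- is an assumption; adapt the split to the real format), producing the XML shape shown earlier in the thread:

```python
# Hypothetical stand-in for the awk parser; the tab-separated field layout
# is an assumption, not the real log format.
import xml.etree.ElementTree as ET

def events_to_xml(device_id, lines):
    root = ET.Element("mike")
    logs = ET.SubElement(root, "mike_logs", device_id=device_id)
    events = ET.SubElement(logs, "device_events")
    for i, line in enumerate(lines, start=1):
        when, category, text = line.rstrip("\n").split("\t", 2)
        ev = ET.SubElement(events, "device_event")
        ET.SubElement(ev, "datetime").text = when
        ET.SubElement(ev, "event_id").text = str(i)
        ET.SubElement(ev, "category").text = category
        ET.SubElement(ev, "text").text = text
    return ET.tostring(root, encoding="unicode")

sample = ["2009-01-27T09:37:29+02:00\tDISTORTION\tModule #245 has been distorted."]
xml = events_to_xml("938880111", sample)
print(xml)
```

The resulting file can then be bulk-loaded as before; for the periodic 10-minute appends, Sedna's update language may be worth a look as an alternative to re-loading the whole document.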
Thanks,
- Dave

On Wed, Sep 09, 2009 at 01:04:17PM +0400, Ivan Shcheklein wrote:
> Dave, how many files do you have? Am I right that each describes a unique
> user's log (e.g. mike's logs)?
>
> On Wed, Sep 9, 2009 at 12:57 PM, Dave Stav <[email protected]> wrote:
>
> > Hi Ivan,
> >
> > Thanks for your reply.
> >
> > Here is an example of the XML file I am loading. I create this xml using
> > an awk script that parses plain text log files.
> >
> > <?xml version="1.0" standalone="yes"?>
> > <mike>
> >   <mike_logs device_id="938880111">
> >     <device_events>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:29+02:00</datetime>
> >         <event_id>1</event_id>
> >         <numerical_severity>801</numerical_severity>
> >         <category>DISTORTION</category>
> >         <text>Module #245 has been distorted.</text>
> >       </device_event>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:48+02:00</datetime>
> >         <event_id>2</event_id>
> >         <severity>Critical</severity>
> >         <category>POWER</category>
> >         <text>Power loss due to error BAD_JHU.</text>
> >       </device_event>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:48+02:00</datetime>
> >         <event_id>3</event_id>
> >         <category>MANUAL</category>
> >         <text>Received manual interruption: tried disconnecting as
> >         directed by John (E-Mail ID #43213)</text>
> >       </device_event>
> >     </device_events>
> >   </mike_logs>
> > </mike>
> >
> > Most of my queries search for text data (using fn:contains()) and search
> > for events (device_event nodes) that occur between other events (also
> > using fn:contains()).
> > For example: how many MANUAL events occurred between a text containing
> > "John" and a critical error?
> >
> > I would love to hear your recommendation regarding how to build the nodes.
> >
> > Thanks!
> >
> > - Dave
> >
> > On Wed, Sep 09, 2009 at 11:51:23AM +0400, Ivan Shcheklein wrote:
> > > Hi Dave,
> > >
> > > > My plan is to convert and import a total of about 50GB of log files
> > > > to sedna.
> > > > Do you think that the ratio will be the same? i.e. 50GB of log files
> > > > will turn into 109GB of xml which will be saved as 340GB?
> > >
> > > Yes, it's likely the ratio will be approximately the same. It strictly
> > > depends on the structure of your data. Can you give us an example of
> > > the XML you want to load?
> > >
> > > > BTW, do you have any recommendation as to the way the data is saved?
> > > > I am considering separate databases, separate documents, or one
> > > > document with many sub-nodes. From your explanation I understand that
> > > > the more the data is divided into nodes, the more disk space it will
> > > > require, so perhaps I'm better off separating the data into several
> > > > documents.
> > >
> > > Usually it's not a good idea to use many databases. For example, you
> > > won't be able to query them simultaneously. Do you have one big
> > > document or many small documents? You will get approximately the same
> > > result whether you load your data as one big document into a database
> > > or as several documents into a collection
> > > (http://modis.ispras.ru/sedna/progguide/ProgGuidesu8.html#x14-470002.5.2).
> > >
> > > > I am also concerned about performance. Has a 340GB database ever been
> > > > tried on sedna to your knowledge?
> > >
> > > Sure. We have had experience with 500-600GB databases. BTW, the
> > > WikiXMLDB demo has a 130GB database.
> > >
> > > Ivan Shcheklein,
> > > Sedna Team
> > >
> > > > Thanks for your help!
> > > >
> > > > - Dave
> > > >
> > > > On Wed, Sep 09, 2009 at 05:56:29AM +0930, Justin Johansson wrote:
> > > > > Hi Dave,
> > > > >
> > > > > You will find that this issue is not confined to Sedna alone;
> > > > > rather it affects most XML databases, whether they are "native XML
> > > > > databases" or implemented over a relational DB such as MonetDB.
> > > > > Except if your application is running in a very disk-space-limited
> > > > > environment (such as a mobile device), disk space these days is so
> > > > > cheap that it's not really an issue to worry too much about. Having
> > > > > said that, I'll try to explain why it is like that.
> > > > >
> > > > > Going from a 70MB log file (presumably plain text with
> > > > > variable-length log lines) to 144MB in XML format is easily
> > > > > explained by the space that the added XML tags take up. (That's not
> > > > > telling you anything new, as you seem to appreciate that bit.)
> > > > > Going from XML text to persisting the data in an XML database has a
> > > > > storage overhead for analogous reasons, namely the addition of
> > > > > XPath-axis relationship information between the nodes in the XML,
> > > > > if for no other reason.
> > > > >
> > > > > Think for a moment about how XML-DOM (Document Object Model) is
> > > > > implemented. (I'm not saying that Sedna is implemented as a
> > > > > persistent DOM, but it's useful to analyze your issue this way.)
> > > > > For each node in the document, in order for the database to
> > > > > implement XPath navigation efficiently it needs to store "pointers"
> > > > > to parent and ancestor nodes, child nodes, previous-sibling and
> > > > > following-sibling nodes, the list of attribute nodes (in the case
> > > > > of element nodes), and so on for all 13 (I think) different XPath
> > > > > axes. This all takes space. Even if there are no child nodes, the
> > > > > node would have to record "NULL" for the children, and even "NULL"
> > > > > takes space.
> > > > >
> > > > > The problem is exacerbated in an
> > > > > "XML-database-on-top-of-a-relational-database" scenario, whereby
> > > > > all these relationships take tons of rubble (multitudes of tables)
> > > > > to express with any hope of runtime performance benefit.
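[Editor's note: the per-node bookkeeping described above can be made concrete with a back-of-the-envelope sketch. The 8-byte reference size and the particular set of stored links are illustrative assumptions, not Sedna's actual on-disk layout.]

```python
# Rough linkage cost per node in a persistent-DOM-style store.
# Assumptions: 64-bit node references; each node keeps links for parent,
# first child, previous sibling, next sibling, and an attribute list.
POINTER_BYTES = 8
LINKS_PER_NODE = 5

overhead_per_node = POINTER_BYTES * LINKS_PER_NODE  # 40 bytes before any text
# one <device_event> from the sample holds roughly 5 element + 5 text nodes
nodes_per_event = 10
overhead_per_event = overhead_per_node * nodes_per_event
print(overhead_per_event)  # 400 bytes of pure structure per logged event
```

Even if NULL links are stored compactly, multiplying a few tens of bytes by millions of event nodes accounts for a large share of the XML-to-database growth discussed in this thread.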
> > > > >
> > > > > All in all, it's back to one of the fundamental principles in
> > > > > computer science: (memory) space and (execution) time are generally
> > > > > inversely related. If you want to use the smallest amount of space
> > > > > for data storage you "zip it up", but then it will take a long time
> > > > > to find your data in the compressed file. If you want to access
> > > > > your data in the smallest amount of time, you "expand it out" and
> > > > > use whatever amount of memory you can to get the best time
> > > > > performance out of your algorithms.
> > > > >
> > > > > I wonder how much space your 70MB log file takes when zipped up?
> > > > > Betcha there's lots of redundancy in the information and the
> > > > > compression ratio will be high. Storing the data in an XML database
> > > > > simply takes the ratio in the other direction :-) It is reasonable
> > > > > to expect a decent (execution-time) performance benefit, though, in
> > > > > accessing/navigating the data. If there weren't this benefit
> > > > > (amongst others), the tradeoff would not be worth it.
> > > > >
> > > > > Trust this rather wordy explanation helps.
> > > > >
> > > > > Cheers,
> > > > > Justin Johansson
> > > > >
> > > > > btw. So just what is the zipped-up compression ratio for your log
> > > > > file?
> > > > >
> > > > > Dave Stav wrote:
> > > > > > Hi List Members,
> > > > > >
> > > > > > I noticed that the sedna database takes quite a lot of disk space
> > > > > > compared to the data it contains, and I was wondering why that is.
> > > > > >
> > > > > > I am converting a 70MB log file to an xml file which takes up
> > > > > > 144MB. After loading this xml file into a newly created sedna
> > > > > > database, I can see that the database directory takes up 450MB.
> > > > > >
> > > > > > Does anyone know why this is happening and/or if there is
> > > > > > anything we can do to reduce the disk usage?
> > > > > > Thanks!
> > > > > >
> > > > > > - Dave
> > > >
> > > > --
> > > > EE 77 7F 30 4A 64 2E C5 83 5F E7 49 A6 82 29 BA  ~. .~  Tk Open Systems
> > > > =}-----------------------------------------------ooO--U--Ooo-------------{=
> > > > - [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -
> > > >
> > > > ------------------------------------------------------------------------------
> > > > Let Crystal Reports handle the reporting - Free Crystal Reports 2008
> > > > 30-Day trial. Simplify your report design, integration and deployment -
> > > > and focus on what you do best, core application coding. Discover what's
> > > > new with Crystal Reports now. http://p.sf.net/sfu/bobj-july
> > > > _______________________________________________
> > > > Sedna-discussion mailing list
> > > > [email protected]
> > > > https://lists.sourceforge.net/lists/listinfo/sedna-discussion
> > >
> > > The 66MB log file is gzipped to 5.5MB (same in zip). This was a test
> > > log file.
