Hi Ivan,

Actually, "mike" is the log group type. Most of the logs are of type mike, and the other types have similar syntax. I have about 100 log files for each device_id. Let's assume I only have mike logs.

Initially, my awk parser prints the whole thing to one big XML file, which I pipe into the sedna load command. Later I will need to periodically add new log files in. Let's say... every 10 minutes, add another 10MB of log lines (parsed to 20MB of XML) -- i.e., add more events to a device_id.
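For concreteness, here is a minimal sketch of the parsing step. It is a hypothetical Python stand-in for the awk script (the tab-separated log layout -- datetime, category, free text -- is an assumption; adapt the split to the real format), producing the XML shape shown earlier in the thread:

```python
# Hypothetical stand-in for the awk parser; the tab-separated field layout
# is an assumption, not the real log format.
import xml.etree.ElementTree as ET

def events_to_xml(device_id, lines):
    root = ET.Element("mike")
    logs = ET.SubElement(root, "mike_logs", device_id=device_id)
    events = ET.SubElement(logs, "device_events")
    for i, line in enumerate(lines, start=1):
        when, category, text = line.rstrip("\n").split("\t", 2)
        ev = ET.SubElement(events, "device_event")
        ET.SubElement(ev, "datetime").text = when
        ET.SubElement(ev, "event_id").text = str(i)
        ET.SubElement(ev, "category").text = category
        ET.SubElement(ev, "text").text = text
    return ET.tostring(root, encoding="unicode")

sample = ["2009-01-27T09:37:29+02:00\tDISTORTION\tModule #245 has been distorted."]
xml = events_to_xml("938880111", sample)
print(xml)
```

The resulting file can then be bulk-loaded as before; for the periodic 10-minute appends, Sedna's update language may be worth a look as an alternative to re-loading the whole document.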
Thanks,
- Dave

On Wed, Sep 09, 2009 at 01:04:17PM +0400, Ivan Shcheklein wrote:
> Dave, how many files do you have? Am I right that each describes a unique
> user's log (e.g. mike's logs)?
>
> On Wed, Sep 9, 2009 at 12:57 PM, Dave Stav <[email protected]> wrote:
>
> > Hi Ivan,
> >
> > Thanks for your reply.
> >
> > Here is an example of the XML file I am loading. I create this xml using
> > an awk script that parses plain text log files.
> >
> > <?xml version="1.0" standalone="yes"?>
> > <mike>
> >   <mike_logs device_id="938880111">
> >     <device_events>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:29+02:00</datetime>
> >         <event_id>1</event_id>
> >         <numerical_severity>801</numerical_severity>
> >         <category>DISTORTION</category>
> >         <text>Module #245 has been distorted.</text>
> >       </device_event>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:48+02:00</datetime>
> >         <event_id>2</event_id>
> >         <severity>Critical</severity>
> >         <category>POWER</category>
> >         <text>Power loss due to error BAD_JHU.</text>
> >       </device_event>
> >       <device_event>
> >         <datetime>2009-01-27T09:37:48+02:00</datetime>
> >         <event_id>3</event_id>
> >         <category>MANUAL</category>
> >         <text>Received manual interruption: tried disconnecting as
> >         directed by John (E-Mail ID #43213)</text>
> >       </device_event>
> >     </device_events>
> >   </mike_logs>
> > </mike>
> >
> > Most of my queries search for text data (using fn:contains()) and search
> > for events (device_event nodes) that occur between other events (also
> > using fn:contains()).
> > For example: how many MANUAL events occurred between a text containing
> > "John" and a critical error?
> >
> > I would love to hear your recommendation regarding how to build the nodes.
> >
> > Thanks!
> >
> > - Dave
> >
> > On Wed, Sep 09, 2009 at 11:51:23AM +0400, Ivan Shcheklein wrote:
> > > Hi Dave,
> > >
> > > > My plan is to convert and import a total of about 50GB of log files
> > > > to sedna.
> > > > Do you think that the ratio will be the same? i.e. 50GB of log files
> > > > will turn into 109GB of xml which will be saved as 340GB?
> > >
> > > Yes, it's likely the ratio will be approximately the same. It strictly
> > > depends on the structure of your data. Can you give us an example of
> > > the XML you want to load?
> > >
> > > > BTW, do you have any recommendation as to the way the data is saved?
> > > > I am considering separate databases, separate documents, or one
> > > > document with many sub-nodes. From your explanation I understand that
> > > > the more the data is divided into nodes, the more disk space it will
> > > > require, so perhaps I'm better off separating the data into several
> > > > documents.
> > >
> > > Usually it's not a good idea to use many databases. For example, you
> > > won't be able to query them simultaneously. Do you have one big
> > > document or many small documents? You will get approximately the same
> > > result whether you load your data as one big document into a database
> > > or as several documents into a collection
> > > (http://modis.ispras.ru/sedna/progguide/ProgGuidesu8.html#x14-470002.5.2).
> > >
> > > > I am also concerned about performance. Has a 340GB database ever been
> > > > tried on sedna to your knowledge?
> > >
> > > Sure. We have had experience with 500-600GB databases. BTW, the
> > > WikiXMLDB demo has a 130GB database.
> > >
> > > Ivan Shcheklein,
> > > Sedna Team
> > >
> > > > Thanks for your help!
> > > >
> > > > - Dave
> > > >
> > > > On Wed, Sep 09, 2009 at 05:56:29AM +0930, Justin Johansson wrote:
> > > > > Hi Dave,
> > > > >
> > > > > You will find that this issue is not confined to Sedna alone;
> > > > > rather it affects most XML databases, whether they are "native XML
> > > > > databases" or implemented over a relational DB such as MonetDB.
> > > > > Except if your application is running in a very disk-space-limited
> > > > > environment (such as a mobile device), disk space these days is so
> > > > > cheap that it's not really an issue to worry too much about. Having
> > > > > said that, I'll try to explain why it is like that.
> > > > >
> > > > > Going from a 70MB log file (presumably plain text with
> > > > > variable-length log lines) to 144MB in XML format is easily
> > > > > explained by the space that the added XML tags take up. (That's not
> > > > > telling you anything new, as you seem to appreciate that bit.)
> > > > > Going from XML text to persisting the data in an XML database has a
> > > > > storage overhead for analogous reasons, namely the addition of
> > > > > XPath-axis relationship information between the nodes in the XML,
> > > > > if for no other reason.
> > > > >
> > > > > Think for a moment about how XML-DOM (Document Object Model) is
> > > > > implemented. (I'm not saying that Sedna is implemented as a
> > > > > persistent DOM, but it's useful to analyze your issue this way.)
> > > > > For each node in the document, in order for the database to
> > > > > implement XPath navigation efficiently it needs to store "pointers"
> > > > > to parent and ancestor nodes, child nodes, previous-sibling and
> > > > > following-sibling nodes, the list of attribute nodes (in the case
> > > > > of element nodes), and so on for all 13 (I think) different XPath
> > > > > axes. This all takes space. Even if there are no child nodes, the
> > > > > node would have to record "NULL" for the children, and even "NULL"
> > > > > takes space.
> > > > >
> > > > > The problem is exacerbated in an
> > > > > "XML-database-on-top-of-a-relational-database" scenario, whereby
> > > > > all these relationships take tons of rubble (multitudes of tables)
> > > > > to express with any hope of runtime performance benefit.
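[Editor's note: the per-node bookkeeping described above can be made concrete with a back-of-the-envelope sketch. The 8-byte reference size and the particular set of stored links are illustrative assumptions, not Sedna's actual on-disk layout.]

```python
# Rough linkage cost per node in a persistent-DOM-style store.
# Assumptions: 64-bit node references; each node keeps links for parent,
# first child, previous sibling, next sibling, and an attribute list.
POINTER_BYTES = 8
LINKS_PER_NODE = 5

overhead_per_node = POINTER_BYTES * LINKS_PER_NODE  # 40 bytes before any text
# one <device_event> from the sample holds roughly 5 element + 5 text nodes
nodes_per_event = 10
overhead_per_event = overhead_per_node * nodes_per_event
print(overhead_per_event)  # 400 bytes of pure structure per logged event
```

Even if NULL links are stored compactly, multiplying a few tens of bytes by millions of event nodes accounts for a large share of the XML-to-database growth discussed in this thread.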
> > > > >
> > > > > All in all, it's back to one of the fundamental principles in
> > > > > computer science: (memory) space and (execution) time are generally
> > > > > inversely related. If you want to use the smallest amount of space
> > > > > for data storage you "zip it up", but then it will take a long time
> > > > > to find your data in the compressed file. If you want to access
> > > > > your data in the smallest amount of time, you "expand it out" and
> > > > > use whatever amount of memory you can to get the best time
> > > > > performance out of your algorithms.
> > > > >
> > > > > I wonder how much space your 70MB log file takes when zipped up?
> > > > > Betcha there's lots of redundancy in the information and the
> > > > > compression ratio will be high. Storing the data in an XML database
> > > > > simply takes the ratio in the other direction :-) It is reasonable
> > > > > to expect a decent (execution-time) performance benefit, though, in
> > > > > accessing/navigating the data. If there weren't this benefit
> > > > > (amongst others), the tradeoff would not be worth it.
> > > > >
> > > > > Trust this rather wordy explanation helps.
> > > > >
> > > > > Cheers,
> > > > > Justin Johansson
> > > > >
> > > > > btw. So just what is the zipped-up compression ratio for your log
> > > > > file?
> > > > >
> > > > > Dave Stav wrote:
> > > > > > Hi List Members,
> > > > > >
> > > > > > I noticed that the sedna database takes quite a lot of disk space
> > > > > > compared to the data it contains, and I was wondering why that is.
> > > > > >
> > > > > > I am converting a 70MB log file to an xml file which takes up
> > > > > > 144MB. After loading this xml file into a newly created sedna
> > > > > > database, I can see that the database directory takes up 450MB.
> > > > > >
> > > > > > Does anyone know why this is happening and/or if there is
> > > > > > anything we can do to reduce the disk usage?
> > > > > > Thanks!
> > > > > >
> > > > > > - Dave
> > > >
> > > > --
> > > > EE 77 7F 30 4A 64 2E C5 83 5F E7 49 A6 82 29 BA  ~. .~  Tk Open Systems
> > > > =}-----------------------------------------------ooO--U--Ooo-------------{=
> > > > - [email protected] - tel: +972.2.679.5364, http://www.tkos.co.il -
> > > >
> > > > ------------------------------------------------------------------------------
> > > > Let Crystal Reports handle the reporting - Free Crystal Reports 2008
> > > > 30-Day trial. Simplify your report design, integration and deployment -
> > > > and focus on what you do best, core application coding. Discover what's
> > > > new with Crystal Reports now. http://p.sf.net/sfu/bobj-july
> > > > _______________________________________________
> > > > Sedna-discussion mailing list
> > > > [email protected]
> > > > https://lists.sourceforge.net/lists/listinfo/sedna-discussion
> > >
> > > The 66MB log file is gzipped to 5.5MB (same in zip). This was a test
> > > log file.
