Hi,

I'm trying to load a "large" (13 GB) set of XML documents into Sedna. I'm aware that such a large pile of files requires a lot of space, but the actual free-space requirements wildly surpassed my expectations (and yes, I have read this thread: http://article.gmane.org/gmane.text.xml.sedna/1749). So I'm going to provide some information in the hope that you can point out my mistakes or just assure me that this is expected behaviour.
The data consists of a large number of small files (1 - 400 KB) representing articles, so it is similar to the wikixmldb demo. I set up a dedicated database (in the se_cdb sense) with just one collection for those files. A shell script then creates a list of files to load:

    \nac
    LOAD "/var/pages/000/1000.xml" "000-1000.xml" "inex" &
    LOAD "/var/pages/000/1009000.xml" "000-1009000.xml" "inex" &
    ... and so on ...
    \commit

This list is then executed by se_term.

The problem is that even for a 3 GB fraction of the data, 52 GB of free space is not enough. Right now I'm halfway through loading a 300 MB portion and the database directory already occupies 35 GB (I deleted the database after each experiment, so there are no remnants of older data). This is quite a shock for me :)

The text in the articles has a lot of markup (see the example for just one word below), which I suspect affects the resulting size:

    <region wordnetid="108630985" confidence="0.8">
      <administrative_district wordnetid="108491826" confidence="0.8">
        <location wordnetid="100027167" confidence="0.8">
          <commune wordnetid="108541609" confidence="0.8">
            <district wordnetid="108552138" confidence="0.8">
              <link xlink:type="simple" xlink:href="../436/2166436.xml">
                Berville-sur-Mer</link>
            </district>
          </commune>
        </location>
      </administrative_district>
    </region>

So, do you think that a 300 MB -> 35 GB (and counting) transformation is expected here?

Thanks a lot.

Martin B.
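P.S. For anyone who wants to reproduce the load list: the original generator is a shell script that is not shown here, so the following is only a minimal Python sketch of the same idea. The function name build_load_script and the directory-walking logic are assumptions for illustration; the document naming ("<directory>-<file>") and the \nac ... \commit framing simply mirror the list quoted in this post.

```python
from pathlib import Path


def build_load_script(root, collection):
    """Return se_term script text that loads every *.xml file under root.

    Hypothetical helper: it mirrors the format of the list in this post,
    not the actual shell script that produced it.
    """
    lines = ["\\nac"]  # first line of the list in the post
    for path in sorted(Path(root).rglob("*.xml")):
        # Document name is "<parent directory>-<file name>", e.g. 000-1000.xml
        doc_name = f"{path.parent.name}-{path.name}"
        lines.append(f'LOAD "{path.as_posix()}" "{doc_name}" "{collection}" &')
    lines.append("\\commit")  # last line of the list in the post
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the script to stdout so it can be saved and passed to se_term.
    print(build_load_script("/var/pages", "inex"), end="")
```

Generating the list for a small subdirectory first makes it easy to compare the size of the input files with the size of the database directory after loading.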
