Hi

I'm trying to load a "large" (13 GB) set of XML documents into Sedna.
I'm aware that such a large pile of files requires lots of
space, but the actual free-space requirements wildly surpassed my
expectations (and yes, I have read this thread:
http://article.gmane.org/gmane.text.xml.sedna/1749). So I'm going to
provide some information in the hope that you can point out
errors of mine or just assure me that this is expected behaviour.

The data consists of a large number of small files (1 - 400 KB)
representing articles, so it is similar to the wikixmldb demo. I set up
a dedicated database (in the se_cdb sense) with just one collection for
those files. A shell script then creates a list of files to load:

\nac
LOAD "/var/pages/000/1000.xml" "000-1000.xml" "inex" &
LOAD "/var/pages/000/1009000.xml" "000-1009000.xml" "inex" &
... and so on ...
\commit
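
A minimal sketch of how such a list could be generated, written in Python
rather than shell. It assumes the layout visible in the two LOAD lines above
(files at <root>/<subdirectory>/<name>.xml, document name
"<subdirectory>-<name>.xml", collection "inex"); the function name
build_load_script is hypothetical, and the \nac and \commit lines are copied
verbatim from the list above.

```python
from pathlib import Path


def build_load_script(root, collection="inex"):
    """Return se_term script text with one LOAD statement per XML file.

    Assumes files live at <root>/<subdirectory>/<name>.xml, as in the
    example paths above.
    """
    lines = ["\\nac"]  # first line of the list shown above
    for path in sorted(Path(root).glob("*/*.xml")):
        # Document name is "<subdirectory>-<file name>", e.g. 000-1000.xml
        doc_name = f"{path.parent.name}-{path.name}"
        lines.append(f'LOAD "{path.as_posix()}" "{doc_name}" "{collection}" &')
    lines.append("\\commit")  # last line of the list shown above
    return "\n".join(lines) + "\n"
```

Writing the return value of build_load_script("/var/pages") to a file gives a
list in the same shape as the one above.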

This list is then executed by se_term. The problem is that even for a
3 GB fraction of the data, 52 GB of free space is not enough. Right now I'm
halfway through loading a 300 MB portion and the database directory
already occupies 35 GB (I deleted the database after each
experiment, so there are no remnants of older data) - this is quite a
shock for me :)

The text in the articles has a lot of markup (see the example of just one
word below), which I suspect has an effect on the resulting size:

<region wordnetid="108630985" confidence="0.8">
<administrative_district wordnetid="108491826" confidence="0.8">
<location wordnetid="100027167" confidence="0.8">
<commune wordnetid="108541609" confidence="0.8">
<district wordnetid="108552138" confidence="0.8">
<link xlink:type="simple" xlink:href="../436/2166436.xml">
Berville-sur-Mer</link>
</district>
</commune>
</location>
</administrative_district>
</region>
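
To put a number on "a lot of markup", here is a rough sketch that counts
element nodes, attribute nodes and non-whitespace text characters in the
fragment above. The wrapper element and the xlink namespace declaration are
assumptions added only so the fragment parses on its own; markup_stats is a
hypothetical helper name. It says nothing about how Sedna stores these nodes,
only how many of them surround one word.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<region wordnetid="108630985" confidence="0.8">
<administrative_district wordnetid="108491826" confidence="0.8">
<location wordnetid="100027167" confidence="0.8">
<commune wordnetid="108541609" confidence="0.8">
<district wordnetid="108552138" confidence="0.8">
<link xlink:type="simple" xlink:href="../436/2166436.xml">
Berville-sur-Mer</link>
</district>
</commune>
</location>
</administrative_district>
</region>
"""


def markup_stats(fragment):
    """Return (elements, attributes, text_chars) for an XML fragment."""
    # Assumption: wrap the fragment so the xlink prefix is bound.
    wrapped = (
        '<wrapper xmlns:xlink="http://www.w3.org/1999/xlink">'
        + fragment
        + "</wrapper>"
    )
    root = ET.fromstring(wrapped)
    elements = attributes = text_chars = 0
    for el in root.iter():
        if el is root:
            continue  # do not count the artificial wrapper
        elements += 1
        attributes += len(el.attrib)
        # Count only non-whitespace text inside and after each element.
        for chunk in (el.text, el.tail):
            if chunk:
                text_chars += len(chunk.strip())
    return elements, attributes, text_chars


elements, attributes, text_chars = markup_stats(FRAGMENT)
print(elements, attributes, text_chars)
```

For this fragment that is six elements and twelve attributes wrapped around a
single 16-character word.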

So, do you think that a 300 MB -> 35 GB (and counting) expansion is
expected here?
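
For scale, a back-of-the-envelope sketch of what these figures imply. It
assumes decimal units (1 GB = 1000 MB) and that database size grows linearly
with input size, which is only an assumption; since the 300 MB load was only
about half done at 35 GB, the real factor is probably higher. The function
names are hypothetical.

```python
def blowup_factor(db_gb, input_mb):
    """Ratio of database size to input size, assuming 1 GB = 1000 MB."""
    return db_gb * 1000 / input_mb


def projected_db_gb(input_gb, factor):
    """Projected database size in GB, assuming linear growth."""
    return input_gb * factor


# Figures from this message: 35 GB on disk for (part of) a 300 MB input.
factor = blowup_factor(35, 300)

for input_gb in (3, 13):
    print(f"{input_gb} GB input -> about {projected_db_gb(input_gb, factor):.0f} GB")
```

Under those assumptions the factor is roughly 117x, so the 3 GB fraction
would need about 350 GB and the full 13 GB about 1.5 TB.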

Thanks a lot.

Martin B.

