Hi Ivan,

Thank you for the explanation! Now I will probably write an STX script to
change the structure of the data.
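(For reference, the restructuring such a script would perform can be sketched in
plain Python with the standard library. This is a rough prototype, not the
actual STX script; the element names are taken from Ivan's `creator/person/
writer/novelist` example, and `flatten_entity` is a hypothetical name.)

```python
import xml.etree.ElementTree as ET

def flatten_entity(root: ET.Element) -> ET.Element:
    """Flatten a single-child wrapper chain such as
    creator > person > writer > novelist into one <entity> element:
    each former wrapper becomes an empty child carrying its attributes,
    followed by the innermost content."""
    entity = ET.Element("entity")
    node = root
    while True:
        ET.SubElement(entity, node.tag, dict(node.attrib))
        kids = list(node)
        # keep descending while the wrapper has exactly one element child
        # and no text content of its own
        if len(kids) == 1 and not (node.text and node.text.strip()):
            node = kids[0]
            continue
        # innermost level reached: move its text and children into <entity>
        if node.text and node.text.strip():
            entity[-1].tail = node.text
        entity.extend(kids)
        return entity

src = ET.fromstring(
    '<creator wordnetid="w1"><person wordnetid="w2">'
    '<writer wordnetid="w3"><novelist wordnetid="w4">'
    'text here</novelist></writer></person></creator>')
flat = flatten_entity(src)
print(ET.tostring(flat, encoding="unicode"))
```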
As for concatenation into one document: since documents stored in a collection
share a common descriptive schema, I suppose the data will be physically stored
almost the same way as before, so there are no negative side effects of this
reorganization (by negative I mean something becoming impossible or
significantly slower). Am I right?

Martin B.

On Tue, May 11, 2010 at 7:42 PM, Ivan Shcheklein <[email protected]> wrote:
> Hi Martin,
>
> Thank you for the data provided. We were able to reproduce the issue. This
> is not actually a bug but a peculiarity of Sedna's internal representation.
> Let me explain briefly how XML is stored internally and why it is so hard
> to load data in your particular case.
>
> The internal storage can be viewed as an index based on a descriptive
> schema. For example, consider the following XML snippet:
>
> <persons>
>   <person id="person1">
>     <name>Ivan</name>
>   </person>
>   <person id="person2">
>     <name>Martin</name>
>   </person>
> </persons>
>
> The descriptive schema of this document is the following:
>
> persons (S)
>  |
>  == person (S)
>      |
>      == @id (S)
>      |
>      == name (S)
>          |
>          == text() (S)
>
> By definition, every path of the document has exactly one path in the
> descriptive schema, and every path of the descriptive schema is a path of
> the document. Thus each node in the XML document is connected with exactly
> one schema node, while each schema node may have many nodes connected with
> it. In our example the person (S) schema node has two connected XML nodes.
>
> In Sedna all document (collection) nodes are stored in block chains (each
> block is 64 KB), one chain per descriptive schema node. In our example,
> again, we have five chains of blocks: one chain for "persons" nodes, one
> chain for "person" nodes, one for "id" attributes, and so on.
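The "each document node maps to exactly one schema node" rule can be
illustrated by enumerating label paths with Python's standard library. This is
only a sketch of the concept (Sedna's real schema lives inside the storage
engine); the `descriptive_schema` helper is hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def descriptive_schema(xml_text: str) -> Counter:
    """Count document nodes per descriptive-schema node, where a schema
    node is identified by its label path from the root (attributes are
    prefixed with '@', text nodes written as 'text()')."""
    counts = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        counts[path] += 1
        for name in elem.attrib:
            counts[path + "/@" + name] += 1
        if elem.text and elem.text.strip():
            counts[path + "/text()"] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

doc = """<persons>
  <person id="person1"><name>Ivan</name></person>
  <person id="person2"><name>Martin</name></person>
</persons>"""
print(descriptive_schema(doc))
```

For Ivan's example this yields exactly five schema nodes, matching the five
block chains he describes, with the `person`-related paths each counting two
document nodes.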
>
> To retrieve the descriptive schema of a document, a collection, or the
> whole database, one may run doc("$schema_<document_name>"),
> doc("$schema_<collection_name>") or just doc("$schema"), respectively.
> These queries also return how many blocks and nodes there are in each
> chain:
>
> <schema>
>   <document name="auction">
>     <document name="" total_nodes="1" total_blocks="1">
>       <element name="persons" total_nodes="1" total_blocks="1">
>         <element name="person" total_nodes="2" total_blocks="1">
>           <attribute name="id" total_nodes="2" total_blocks="1"/>
>           <element name="name" total_nodes="2" total_blocks="1">
>             <text name="" total_nodes="2" total_blocks="1"/>
>           </element>
>         </element>
>       </element>
>     </document>
>   </document>
> </schema>
>
> For further details on Sedna's internal representation refer to
> http://panda.ispras.ru/~grinev/mypapers/sedna.pdf . For an illustration see
> http://www.slideshare.net/shcheklein/sedna-xml-database-system-internal-representation
> .
>
> Now let's consider your data. It has a very complex descriptive schema
> (almost every node within the XML document has a unique path). That means
> we get an enormous number of almost empty blocks (each storing one or two
> nodes). The main reason for this complexity is the hierarchy of
> entity-describing tags, which are nested, may have many different names,
> and in some places enclose entire articles:
>
> <creator>
>   <person ...>
>     <writer ...>
>       <novelist ...>
>         {content here}
>       </novelist>
>     </writer>
>   </person>
> </creator>
>
> If you want to load the data into Sedna you have to change the
> representation a bit:
>
> 1. Simplify the entity description blocks. For example, the following
> representation will be much easier to load:
>
> <entity>
>   <creator wordnetid="..." confidence="..."/>
>   <person wordnetid="..." confidence="..."/>
>   <writer wordnetid="..." confidence="..."/>
>   <novelist wordnetid="..." confidence="..."/>
>   {content here}
> </entity>
>
> 2.
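Ivan's point about "almost empty blocks" follows directly from the storage
model: with one 64 KB block chain per schema node, a document whose schema has
N distinct label paths occupies at least N blocks, however little data each
path holds. A back-of-the-envelope check, with purely illustrative numbers
(the real path count is in Martin's posted $schema dump, not reproduced here):

```python
BLOCK_SIZE = 64 * 1024  # one Sedna block, as described above

# Illustrative figures only: suppose the 5-document INEX sample's schema
# has ~3500 distinct label paths for ~360 kB of raw XML.
distinct_paths = 3500
data_bytes = 360 * 1024

# One block chain per schema node => at least one 64 KB block per path.
min_storage = distinct_paths * BLOCK_SIZE
blowup = min_storage / data_bytes
print(f"lower bound: {min_storage / 2**20:.1f} MiB (~{blowup:.0f}x raw size)")
```

With figures of that order, the lower bound alone lands in the low hundreds of
megabytes, which is consistent in spirit with the 235 MB database directory
Martin reports further down the thread.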
> If you have several million files, it is better to concatenate them into
> one document (bulk load will be better optimized, since all the data is
> known in advance):
>
> <articles>
>   <article id="00000.xml">
>     {content here}
>   </article>
>   <article id="00001.xml">
>     {content here}
>   </article>
>   ...
> </articles>
>
> 3. Use the -bufs-num parameter to increase the number of buffers allocated
> by Sedna's storage manager (se_sm). It will significantly speed up bulk
> loading:
>
> ./se_sm -bufs-num 32000
>
> Moreover, it does not matter whether you are going to use Sedna or not. I
> believe such an XML representation is very hard for almost every XML tool
> or XML processing language.
>
> Ivan Shcheklein,
> Sedna Team
>
> On Sun, May 9, 2010 at 12:00 AM, Martin Bukatovic
> <[email protected]> wrote:
>>
>> On Sat, May 8, 2010 at 12:21 PM, Ivan Shcheklein <[email protected]>
>> wrote:
>> > I can give you access to the private ftp on modis server and guarantee
>> > that we'll not use your data except for testing purposes.
>>
>> Seems OK. I will provide you with roughly 300 MB of XML data.
>>
>> > At least you can print the schema of your collection: doc("$schema").
>> > The more different nodes it has, the bigger the burst factor is. Try
>> > to load 300 MB and send me the result of this command.
>>
>> Even 300 MB is huge enough to make it quite time consuming, therefore I
>> loaded just 5 documents (with a total size of about 360 kB) successfully,
>> and the database directory has 235 MB (using version 3.3.55). The schema
>> of this collection can be reached at
>> http://www.fi.muni.cz/~xbukatov/nxd/tmp/sedna-inex-schema.xml
>>
>> Martin B.

------------------------------------------------------------------------------
_______________________________________________
Sedna-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/sedna-discussion
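Ivan's suggestion 2 (concatenating millions of small files into one document)
can be done in a streaming fashion, so the combined document never has to fit
in memory. A minimal sketch, assuming the ids follow his `<article
id="00000.xml">` convention; the `concatenate` function, the glob pattern, and
the output name are all hypothetical:

```python
import glob
import os

def concatenate(pattern: str, out_path: str) -> None:
    """Wrap each XML file matching `pattern` in an <article id="..."> element
    under a single <articles> root, streaming line by line to `out_path`.
    (Real file names may need XML-escaping in the id attribute.)"""
    paths = sorted(glob.glob(pattern))  # resolve before creating the output
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<articles>\n")
        for path in paths:
            out.write(f'<article id="{os.path.basename(path)}">\n')
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # drop per-file XML declarations; keep everything else
                    if line.lstrip().startswith("<?xml"):
                        continue
                    out.write(line)
            out.write("</article>\n")
        out.write("</articles>\n")
```

The per-file XML declarations must be dropped because the prolog is only legal
at the very start of the combined document.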
