Hi Ivan,

Thank you for the explanation! Now I will probably write an STX script to
change the structure of the data.
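(For reference, the restructuring such a script would perform can be sketched in
plain Python with the standard library. This is a rough prototype, not the
actual STX script; the element names are taken from Ivan's `creator/person/
writer/novelist` example, and `flatten_entity` is a hypothetical name.)

```python
import xml.etree.ElementTree as ET

def flatten_entity(root: ET.Element) -> ET.Element:
    """Flatten a single-child wrapper chain such as
    creator > person > writer > novelist into one <entity> element:
    each former wrapper becomes an empty child carrying its attributes,
    followed by the innermost content."""
    entity = ET.Element("entity")
    node = root
    while True:
        ET.SubElement(entity, node.tag, dict(node.attrib))
        kids = list(node)
        # keep descending while the wrapper has exactly one element child
        # and no text content of its own
        if len(kids) == 1 and not (node.text and node.text.strip()):
            node = kids[0]
            continue
        # innermost level reached: move its text and children into <entity>
        if node.text and node.text.strip():
            entity[-1].tail = node.text
        entity.extend(kids)
        return entity

src = ET.fromstring(
    '<creator wordnetid="w1"><person wordnetid="w2">'
    '<writer wordnetid="w3"><novelist wordnetid="w4">'
    'text here</novelist></writer></person></creator>')
flat = flatten_entity(src)
print(ET.tostring(flat, encoding="unicode"))
```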
As for concatenation into one document: since documents stored in a collection
share a common descriptive schema, I suppose the data will be physically stored
almost the same way as before, so there are no negative side effects of this
reorganization (by negative I mean something becoming impossible or
significantly slower). Am I right?

Martin B.

On Tue, May 11, 2010 at 7:42 PM, Ivan Shcheklein <[email protected]> wrote:
> Hi Martin,
>
> Thank you for the data provided. We were able to reproduce the issue. This
> is not actually a bug but a peculiarity of Sedna's internal representation.
> Let me explain briefly how XML is stored internally and why it is so hard
> to load data in your particular case.
>
> The internal storage can be viewed as an index based on a descriptive
> schema. For example, consider the following XML snippet:
>
> <persons>
>   <person id="person1">
>     <name>Ivan</name>
>   </person>
>   <person id="person2">
>     <name>Martin</name>
>   </person>
> </persons>
>
> The descriptive schema of this document is the following:
>
> persons (S)
>  |
>  == person (S)
>      |
>      == @id (S)
>      |
>      == name (S)
>          |
>          == text() (S)
>
> By definition, every path of the document has exactly one path in the
> descriptive schema, and every path of the descriptive schema is a path of
> the document. Thus each node in the XML document is connected with exactly
> one schema node, while each schema node may have many nodes connected with
> it. In our example the person (S) schema node has two connected XML nodes.
>
> In Sedna all document (collection) nodes are stored in block chains (each
> block is 64 KB), one chain per descriptive schema node. In our example,
> again, we have five chains of blocks: one chain for "persons" nodes, one
> chain for "person" nodes, one for "id" attributes, and so on.
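The "each document node maps to exactly one schema node" rule can be
illustrated by enumerating label paths with Python's standard library. This is
only a sketch of the concept (Sedna's real schema lives inside the storage
engine); the `descriptive_schema` helper is hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def descriptive_schema(xml_text: str) -> Counter:
    """Count document nodes per descriptive-schema node, where a schema
    node is identified by its label path from the root (attributes are
    prefixed with '@', text nodes written as 'text()')."""
    counts = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        counts[path] += 1
        for name in elem.attrib:
            counts[path + "/@" + name] += 1
        if elem.text and elem.text.strip():
            counts[path + "/text()"] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

doc = """<persons>
  <person id="person1"><name>Ivan</name></person>
  <person id="person2"><name>Martin</name></person>
</persons>"""
print(descriptive_schema(doc))
```

For Ivan's example this yields exactly five schema nodes, matching the five
block chains he describes, with the `person`-related paths each counting two
document nodes.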
>
> To retrieve the descriptive schema of a document, a collection, or the
> whole database, one may run doc("$schema_<document_name>"),
> doc("$schema_<collection_name>") or just doc("$schema"), respectively.
> These queries also return how many blocks and nodes there are in each
> chain:
>
> <schema>
>   <document name="auction">
>     <document name="" total_nodes="1" total_blocks="1">
>       <element name="persons" total_nodes="1" total_blocks="1">
>         <element name="person" total_nodes="2" total_blocks="1">
>           <attribute name="id" total_nodes="2" total_blocks="1"/>
>           <element name="name" total_nodes="2" total_blocks="1">
>             <text name="" total_nodes="2" total_blocks="1"/>
>           </element>
>         </element>
>       </element>
>     </document>
>   </document>
> </schema>
>
> For further details on Sedna's internal representation refer to
> http://panda.ispras.ru/~grinev/mypapers/sedna.pdf . For an illustration see
> http://www.slideshare.net/shcheklein/sedna-xml-database-system-internal-representation
> .
>
> Now let's consider your data. It has a very complex descriptive schema
> (almost every node within the XML document has a unique path). That means
> we get an enormous number of almost empty blocks (each storing one or two
> nodes). The main reason for this complexity is the hierarchy of
> entity-describing tags, which are nested, may have many different names,
> and in some places enclose entire articles:
>
> <creator>
>   <person ...>
>     <writer ...>
>       <novelist ...>
>         {content here}
>       </novelist>
>     </writer>
>   </person>
> </creator>
>
> If you want to load the data into Sedna you have to change the
> representation a bit:
>
> 1. Simplify the entity description blocks. For example, the following
> representation will be much easier to load:
>
> <entity>
>   <creator wordnetid="..." confidence="..."/>
>   <person wordnetid="..." confidence="..."/>
>   <writer wordnetid="..." confidence="..."/>
>   <novelist wordnetid="..." confidence="..."/>
>   {content here}
> </entity>
>
> 2.
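Ivan's point about "almost empty blocks" follows directly from the storage
model: with one 64 KB block chain per schema node, a document whose schema has
N distinct label paths occupies at least N blocks, however little data each
path holds. A back-of-the-envelope check, with purely illustrative numbers
(the real path count is in Martin's posted $schema dump, not reproduced here):

```python
BLOCK_SIZE = 64 * 1024  # one Sedna block, as described above

# Illustrative figures only: suppose the 5-document INEX sample's schema
# has ~3500 distinct label paths for ~360 kB of raw XML.
distinct_paths = 3500
data_bytes = 360 * 1024

# One block chain per schema node => at least one 64 KB block per path.
min_storage = distinct_paths * BLOCK_SIZE
blowup = min_storage / data_bytes
print(f"lower bound: {min_storage / 2**20:.1f} MiB (~{blowup:.0f}x raw size)")
```

With figures of that order, the lower bound alone lands in the low hundreds of
megabytes, which is consistent in spirit with the 235 MB database directory
Martin reports further down the thread.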
> If you have several million files, it is better to concatenate them into
> one document (bulk load will be better optimized, since all the data is
> known in advance):
>
> <articles>
>   <article id="00000.xml">
>     {content here}
>   </article>
>   <article id="00001.xml">
>     {content here}
>   </article>
>   ...
> </articles>
>
> 3. Use the -bufs-num parameter to increase the number of buffers allocated
> by Sedna's storage manager (se_sm). It will significantly speed up bulk
> loading:
>
> ./se_sm -bufs-num 32000
>
> Moreover, it does not matter whether you are going to use Sedna or not. I
> believe such an XML representation is very hard for almost every XML tool
> or XML processing language.
>
> Ivan Shcheklein,
> Sedna Team
>
> On Sun, May 9, 2010 at 12:00 AM, Martin Bukatovic
> <[email protected]> wrote:
>>
>> On Sat, May 8, 2010 at 12:21 PM, Ivan Shcheklein <[email protected]>
>> wrote:
>> > I can give you access to the private ftp on modis server and guarantee
>> > that we'll not use your data except for testing purposes.
>>
>> Seems OK. I will provide you with roughly 300 MB of XML data.
>>
>> > At least you can print the schema of your collection: doc("$schema").
>> > The more different nodes it has, the bigger the burst factor is. Try
>> > to load 300 MB and send me the result of this command.
>>
>> Even 300 MB is huge enough to make it quite time consuming, therefore I
>> loaded just 5 documents (with a total size of about 360 kB) successfully,
>> and the database directory has 235 MB (using version 3.3.55). The schema
>> of this collection can be reached at
>> http://www.fi.muni.cz/~xbukatov/nxd/tmp/sedna-inex-schema.xml
>>
>> Martin B.

------------------------------------------------------------------------------
_______________________________________________
Sedna-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/sedna-discussion
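Ivan's suggestion 2 (concatenating millions of small files into one document)
can be done in a streaming fashion, so the combined document never has to fit
in memory. A minimal sketch, assuming the ids follow his `<article
id="00000.xml">` convention; the `concatenate` function, the glob pattern, and
the output name are all hypothetical:

```python
import glob
import os

def concatenate(pattern: str, out_path: str) -> None:
    """Wrap each XML file matching `pattern` in an <article id="..."> element
    under a single <articles> root, streaming line by line to `out_path`.
    (Real file names may need XML-escaping in the id attribute.)"""
    paths = sorted(glob.glob(pattern))  # resolve before creating the output
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<articles>\n")
        for path in paths:
            out.write(f'<article id="{os.path.basename(path)}">\n')
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # drop per-file XML declarations; keep everything else
                    if line.lstrip().startswith("<?xml"):
                        continue
                    out.write(line)
            out.write("</article>\n")
        out.write("</articles>\n")
```

The per-file XML declarations must be dropped because the prolog is only legal
at the very start of the combined document.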
