jbates 2002/12/03 06:21:20
Modified: src/documentation/content/xdocs/dev guide-internals.xml
Added: src/documentation/resources/images element.png element.xcf
Log:
Started Compressed DOM chapter in Internals Guide
Revision Changes Path
1.3 +96 -1
xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml
Index: guide-internals.xml
===================================================================
RCS file:
/home/cvs/xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- guide-internals.xml 26 Nov 2002 09:20:42 -0000 1.2
+++ guide-internals.xml 3 Dec 2002 14:21:20 -0000 1.3
@@ -286,6 +286,12 @@
further, the start of the page's data.</p>
<section>
<title>3.1.1. Paged file header</title>
+ <p>The paged file header consists of a number of fixed-length fields.
+ Fields that are longer than one byte are <em>always</em> stored in
+ big-endian format, which means the most significant byte is written at the
+ lowest address. This holds regardless of the architecture the server
+ process is running on, so your data files are portable between
+ architectures.</p>
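The byte order described above can be illustrated with a minimal sketch. The 4-byte field width and the helper names `encode_u32_be` and `decode_u32_be` are assumptions made for this example only; they are not taken from the Xindice source.

```python
import struct

def encode_u32_be(value: int) -> bytes:
    """Encode an unsigned 32-bit integer with the most significant byte first.

    The ">" prefix selects big-endian order regardless of the host CPU.
    The 4-byte width is an assumption for this sketch.
    """
    return struct.pack(">I", value)

def decode_u32_be(data: bytes) -> int:
    """Decode a 4-byte big-endian field back into an integer."""
    return struct.unpack(">I", data)[0]

encoded = encode_u32_be(0x0A0B0C0D)
print(encoded.hex())  # 0a0b0c0d: most significant byte at the lowest offset
```

Because the byte order is fixed in the format string rather than taken from the host, a file written this way reads back identically on little-endian and big-endian machines.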
<figure src="/images/pagedfilehdr.png" alt="File header
structure"/>
<p>The meaning of the various fields in the file header,
whose structure
is shown above, is as follows:</p>
@@ -349,11 +355,100 @@
</section>
<section>
<title>4. XML storage</title>
+ <p>As we saw in the preceding chapter, the B+-Tree file format allows for the
+ efficient storage of (name, value) pairs. In this chapter we concern ourselves
+ with using such a (name, value) storage facility to store the XML content of all
+ XML documents in a collection.</p>
+ <p>The principle Xindice uses here is deceptively simple: for every XML <em>document</em>,
+ Xindice calculates something called the <em>Compressed DOM</em>. This is an array of bytes
+ which can be used to reconstruct the complete XML document at any time. An XML document is
+ then stored as a (name, value) pair in the B+-Tree, where the name is the name given to the XML document,
+ and the value is the calculated Compressed DOM.</p>
+ <p>The remaining mechanism to investigate is thus how to
construct the Compressed DOM
+ of a document.</p>
<section>
<title>4.1. The symbol tables</title>
+ <p>In order to store the XML content in a space-efficient manner, Xindice uses
+ something called a <em>symbol table</em>. This is an XML file which associates
+ a 16-bit number with every (QName, namespace URI) pair used as an element or attribute name
+ in <em>all</em> XML files stored in a collection (i.e. there is <em>one</em>
+ symbol table per collection).</p>
+ <p>Consider the following XML document, to be added to a
Xindice collection:</p>
+<source><![CDATA[
+<?xml version="1.0"?>
+<p:person xmlns:p="http://www.xindice.org/Examples/PersonData"
+ gender="female"
+ xml:lang="fr">
+ <p:first-name>Susanne</p:first-name>
+ <p:last-name>Carpentier</p:last-name>
+ <p:e-mail active="yes">[EMAIL PROTECTED]</p:e-mail>
+</p:person>
+]]></source>
+ <p>When this document is stored into an empty Xindice
collection, the following
+ symbol table is created:</p>
+<source><![CDATA[
+<?xml version="1.0"?>
+<?xindice-class org.apache.xindice.xml.SymbolTable?>
+<symbols>
+ <symbol name="p:first-name" nsuri="http://www.xindice.org/Examples/PersonData" id="4" />
+ <symbol name="p:e-mail" nsuri="http://www.xindice.org/Examples/PersonData" id="6" />
+ <symbol name="p:last-name" nsuri="http://www.xindice.org/Examples/PersonData" id="5" />
+ <symbol name="gender" id="2" />
+ <symbol name="xml:lang" id="3" />
+ <symbol name="p:person" nsuri="http://www.xindice.org/Examples/PersonData" id="0" />
+ <symbol name="active" id="7" />
+ <symbol name="xmlns:p" nsuri="http://www.w3.org/2000/xmlns/" id="1" />
+</symbols>
+]]></source>
+ <p>As you can see, the symbol table is itself an XML
document which contains
+ an element for every (QName, namespace URI) pair used in
element and attribute
+ names in the XML documents of the collection. The
<code>id</code> attribute is
+ the 16-bit number that Xindice has assigned to the
(QName, namespace URI) pair.</p>
+ <p>As more documents are added to the
+ collection using different element and attribute names,
entries are added to the
+ collection's symbol table.</p>
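As a rough illustration of the mapping described above, the following sketch loads the example symbol table into a dictionary keyed by (QName, namespace URI) and hands out an id for a name that has not been seen before. The function names are hypothetical, and the "next free number" policy in `symbol_id` is an assumption; the text only says that entries are added as new names appear.

```python
import xml.etree.ElementTree as ET

NS = "http://www.xindice.org/Examples/PersonData"

# The example symbol table from the text (XML declaration and
# processing instruction left out to keep the sketch short).
SYMBOL_TABLE_XML = f"""<symbols>
  <symbol name="p:first-name" nsuri="{NS}" id="4" />
  <symbol name="p:e-mail" nsuri="{NS}" id="6" />
  <symbol name="p:last-name" nsuri="{NS}" id="5" />
  <symbol name="gender" id="2" />
  <symbol name="xml:lang" id="3" />
  <symbol name="p:person" nsuri="{NS}" id="0" />
  <symbol name="active" id="7" />
  <symbol name="xmlns:p" nsuri="http://www.w3.org/2000/xmlns/" id="1" />
</symbols>"""

def load_symbols(xml_text):
    """Map each (QName, namespace URI) pair to its numeric id."""
    table = {}
    for symbol in ET.fromstring(xml_text).iter("symbol"):
        # nsuri is None when the attribute is absent
        key = (symbol.get("name"), symbol.get("nsuri"))
        table[key] = int(symbol.get("id"))
    return table

def symbol_id(table, qname, nsuri=None):
    """Return the id for a pair, adding a new entry if needed.

    Assumption: a new entry gets the next unused number.
    """
    key = (qname, nsuri)
    if key not in table:
        table[key] = max(table.values(), default=-1) + 1
    return table[key]

table = load_symbols(SYMBOL_TABLE_XML)
print(symbol_id(table, "p:person", NS))  # 0
```

Looking up a pair that is already in the table leaves it unchanged; only a previously unseen element or attribute name grows the table.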
+ <p>Usually, a collection's symbol table is stored like any other XML document in
+ the Xindice database. All symbol tables stored in Xindice are kept in the
+ <code>system/SysSymbols</code> collection, using as document name the path of the collection,
+ with underscores (_) substituted for the slashes (/) in the collection path.</p>
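The naming rule above amounts to a single string substitution. The sketch below uses a hypothetical helper name, and it assumes the collection path is given without a leading slash, as in `system/SysSymbols`; how Xindice treats a leading slash or the database root is not stated here.

```python
def symbol_table_document_name(collection_path: str) -> str:
    """Derive the name of the document holding a collection's symbol table.

    Assumption: the path has no leading slash (e.g. "system/SysSymbols").
    """
    return collection_path.replace("/", "_")

print(symbol_table_document_name("system/SysSymbols"))  # system_SysSymbols
```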
+ <p>Being a collection in Xindice,
<code>system/SysSymbols</code> itself has
+ a symbol table too. It is:</p>
+<source><![CDATA[
+<symbols>
+ <symbol name="symbols" id="0" />
+ <symbol name="symbol" id="1" />
+ <symbol name="name" id="2" />
+ <symbol name="id" id="3" />
+ <symbol name="nsuri" id="4" />
+</symbols>
+]]></source>
+ <p>Following that rule, this symbol table would be stored in an XML document named
+ <code>system_SysSymbols</code> in the <code>system/SysSymbols</code>
+ collection. Doing so, however, would create a circular dependency, as
+ the symbol table of <code>system/SysSymbols</code> would be needed to read itself!
+ This particular symbol table is therefore hardcoded into the Xindice
+ source code.</p>
+ <p>For any other collection, you can always request the symbol table
+ yourself by running the following Xindice command:</p>
+<source>xindice rd -c /db/system/SysSymbols -n [your_collection_path]</source>
</section>
<section>
<title>4.2. The Compressed DOM</title>
+ <p>Now that we understand symbol tables, we can take a look at the way in
+ which Xindice generates a byte string from any given XML document.</p>
+ <p>The key is that Xindice simply walks the XML document
+ recursively, building a byte sequence for each node in the tree
+ representation of the XML. This sequence contains the byte data for the children
+ of the node, and those sub-sequences in turn contain the data for their children, and so on.</p>
+ <p>Xindice thus starts by generating the byte sequence for
the document node, which
+ will set off generation for the whole XML document.</p>
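The recursive idea can be sketched with a toy encoding. This is NOT the actual Compressed DOM layout, which this chapter has not yet defined: the record shape used here (a 2-byte symbol id, a 4-byte payload length, then the children's bytes) and the function name `encode_node` are invented purely to show how a node's byte sequence nests the sequences of its children. Text and attributes are ignored.

```python
import struct
import xml.etree.ElementTree as ET

def encode_node(element, symbols):
    """Toy recursive encoder (hypothetical layout, not Xindice's format).

    Each node becomes: 2-byte symbol id, 4-byte payload length (both
    big-endian), followed by the concatenated encodings of its children.
    """
    # Assign the next free id to names not seen before.
    sym = symbols.setdefault(element.tag, len(symbols))
    # Recursion: the payload of a node is the byte data of its children.
    payload = b"".join(encode_node(child, symbols) for child in element)
    return struct.pack(">HI", sym, len(payload)) + payload

symbols = {}
doc = ET.fromstring("<person><first-name/><last-name/></person>")
data = encode_node(doc, symbols)
print(len(data))  # 18: three 6-byte headers, two of them nested in the first
```

Encoding the root is enough to encode the whole tree, which mirrors how generation for the document node sets off generation for everything beneath it.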
+ <section>
+ <title>4.2.1. Element nodes</title>
+ <p>An element node is encoded as shown in the diagram
below:</p>
+ <figure src="images/element.png" alt="Element compressed
DOM format"/>
+ </section>
+
+
</section>
</section>
<section>
1.1
xml-xindice/src/documentation/resources/images/element.png
<<Binary file>>
1.1
xml-xindice/src/documentation/resources/images/element.xcf
<<Binary file>>