Hi Martin,

sorry for keeping you waiting, and thanks for the summary of your project.

Storing, indexing and querying gigabytes of XML data should be no
major problem (some outdated statistics can be found here [1]; please
note that the databases created for those statistics did not include
any index structures). I assume you have already stumbled upon XQuery
Full Text, which also allows you to do text-based search [2].

Talking about scalability, do you have a rough estimate of the total
size (in bytes) of the XML documents to be managed? Maybe the easiest
thing would be to simply run BaseX and create a first database from an
initial collection.
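
In case it helps with the estimate, here is a minimal Python sketch
that sums the sizes of all .xml files below a directory. It is
independent of BaseX; the function name and the ".xml" suffix filter
are my own assumptions, so adjust them to your collection.

```python
import os
import sys


def total_xml_bytes(root_dir):
    """Sum the sizes (in bytes) of all .xml files below root_dir."""
    total = 0
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            # Assumption: documents are recognizable by a .xml suffix.
            if name.lower().endswith(".xml"):
                total += os.path.getsize(os.path.join(folder, name))
    return total


if __name__ == "__main__":
    # Directory to scan; defaults to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{total_xml_bytes(root) / 1024 ** 2:.1f} MiB")
```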

It surely gets more interesting and challenging when the original data
is to be changed, i.e. if texts are annotated. In this case, I would
recommend keeping the original documents untouched and well-indexed,
and storing changes in an additional database. Node IDs could be used
as back references [3], and the updates could be merged back into the
original data at regular intervals. As more than one database can be
addressed by a single query, original and updated nodes can also be
merged on the fly, using XQuery Update [4].
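
To illustrate the idea of the previous paragraph outside of BaseX,
here is a rough Python sketch using only the standard library. It is
not BaseX code: the <note> elements, the "ref" attribute and the
function name are hypothetical, and the position of an element in
document order merely stands in for a node ID (BaseX numbers nodes
differently, see [3]). The original tree is left untouched; the
annotations live in a second document and are merged into a copy on
demand.

```python
import copy
import xml.etree.ElementTree as ET

# The original document; it is only ever read.
original = ET.fromstring(
    "<text><p>First paragraph.</p><p>Second paragraph.</p></text>"
)

# Hypothetical annotation store: each note points back at an element of
# the original via its position in document order (root element = 0).
annotations = ET.fromstring(
    '<annotations><note ref="2">needs review</note></annotations>'
)


def merge_view(original_root, annotation_root):
    """Return an annotated copy; original_root is not modified."""
    merged = copy.deepcopy(original_root)
    # iter() visits elements in document order, starting with the root.
    by_position = dict(enumerate(merged.iter()))
    for note in annotation_root.iter("note"):
        target = by_position[int(note.get("ref"))]
        # Attach the annotation text as an attribute of the target.
        target.set("note", note.text)
    return merged


merged = merge_view(original, annotations)
print(ET.tostring(merged, encoding="unicode"))
```

Merging the updates back into the original data at intervals would
then amount to replacing the original with such a merged copy and
clearing the annotation store.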

Feel free to ask for more details,
Christian

[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/Full-Text
[3] http://docs.basex.org/wiki/Database_Module#db:open-pre
[4] http://docs.basex.org/wiki/XQuery_Update
