Over the last couple of years I’ve developed the Mirabel system, which provides
DITA link management and query features over large volumes of content (the
ServiceNow product documentation source). In particular, it knows what links to
what and enables viewing the content with all link information available.
For a given version of the product docs we have about 60K DITA topics and 100
root maps that organize those topics into publications.
The primary job of Mirabel is to capture all the hyperlink details as defined
by the DITA source and enable queries about the element-to-element and
document-to-document relationships established by those links.
My implementation approach for loading the link knowledge uses a multi-step
process:
1. Load the entire source content into a database
2. Create a “key space” database that reflects the DITA key-to-resource
mappings defined by each root DITA map. The key spaces are XQuery maps from
key names to resources identified by their database node IDs (essentially,
each use of a topic from a map has an associated unique key by which that use
of the topic can be referenced). The key spaces are a prerequisite for
resolving key-based cross references from one topic to other topics in the
context of some root map (the same root map or a different one).
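A minimal sketch of what such a key-space map might look like, assuming keys are defined via `@keys` attributes on topicrefs and the content lives in a database named `content` (all names here are illustrative, not Mirabel’s actual code; real DITA key resolution also has to handle key scopes and precedence):

```xquery
(: Hypothetical sketch: build a key space for one root map.
   db:node-id() provides the stable node ID used as the resource handle. :)
declare function local:key-space($root-map as element()) as map(xs:string, xs:integer) {
  map:merge(
    (
      for $topicref in $root-map//*[@keys]
      for $key in tokenize($topicref/@keys, '\s+')
      (: Resolve the reference to the topic document in the content DB :)
      let $target := db:get('content', $topicref/@href)/*
      where exists($target)
      return map:entry($key, db:node-id($target))
    ),
    map { 'duplicates': 'use-first' }  (: DITA: first effective key definition wins :)
  )
};
```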
3. Create a “link record-keeping” database that contains the “where-used”
index for the content.
The where-used index maps element node IDs to a record of every reference to
that node (cross references, content references, topic references from maps).
The where-used index is the core data for determining where a given map or
topic is used, answering questions like “what publications use this topic?” or
“is this topic used at all?”. The where-used table is constructed as an XQuery
map that is then converted to XML for storage. (I implemented this before
BaseX added direct storage of maps, but given the size I think it still makes
sense to store it as XML; I could be wrong.)
* Process all map-to-map and map-to-topic references and create the
initial where-used entries, one for each map and topic.
* For topics referenced from maps, process all topic-to-topic references
and update the records for each target topic to reflect the references to it.
The map context of a given topic determines the targets of key references from
that topic, so it is necessary to process the topics in the context of the root
maps that use them (in DITA, root maps determine the key-to-resource bindings
to which key references resolve).
* For topics not referenced from any maps, add entries for them to the
where-used table and process any topic-to-topic references (key references
cannot be resolved but direct URI references can be).
Convert the XQuery map to a single XML document and store it in the link
record-keeping database. The resulting database takes about 150 MB of storage.
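The conversion-and-store step above could be sketched roughly as follows. This is a hypothetical illustration: the database name `link-records-temp`, the element names, and the `type`/`source` fields are assumptions, and the real reference records are richer than this:

```xquery
(: Hypothetical sketch: turn the where-used XQuery map into a single
   XML document and store it in the temporary record-keeping database. :)
declare function local:where-used-xml($index as map(*)) as element(where-used) {
  <where-used>{
    map:for-each($index, function($target-id, $refs) {
      <entry target="{$target-id}">{
        for $ref in $refs
        return <ref type="{$ref?type}" source="{$ref?source}"/>
      }</entry>
    })
  }</where-used>
};

(: Tiny example index: node 1234 is referenced by an xref in node 5678 :)
let $index := map { 1234: map { 'type': 'xref', 'source': 5678 } }
return db:put('link-records-temp', local:where-used-xml($index), 'where-used.xml')
```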
This third step can take two to three hours: 60K topics at 0.2 seconds per
topic is 3.3 hours. Based on my testing, about 0.1 seconds per topic is as
fast as the link processing can go.
This is all done using temporary databases so as not to disturb the working
databases used by the running Mirabel web application. The work is performed by
a BaseX “worker” server, not the main server that serves the web site. I
essentially run one BaseX HTTP server per core on my server and allocate
work to them based on load, so queries coming from the web app will not be
sent to a worker that is currently doing a content update.
Once all the new link data is loaded, the temporary databases are swapped into
production by renaming the production databases, renaming the temp databases to
their production names, then dropping the old databases. (Saying this just now
I’m realizing that I don’t know how to pause or wait for active queries against
the in-production databases to finish so I can swap the databases.)
Because all the index entries use node IDs, the content database and
record-keeping databases have to be put into production at the same time,
otherwise the content node IDs will be out of sync with the indexed record IDs.
I’m working on the assumption that renaming databases is essentially
instantaneous and so I can use that to swap the temp databases into production
reliably.
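The swap I have in mind could be sketched with the standard `db:alter` and `db:drop` functions (database names are illustrative). Because the node IDs tie the two databases together, the content and record-keeping databases must be swapped in the same step; whether all six operations can safely run in a single query depends on BaseX’s update ordering, so in practice each line might need to be its own job in the orchestration sequence:

```xquery
(: Hypothetical sketch of the production swap.
   db:alter renames a database; db:drop deletes one. :)
db:alter('content', 'content-old'),
db:alter('link-records', 'link-records-old'),
db:alter('content-temp', 'content'),
db:alter('link-records-temp', 'link-records'),
db:drop('content-old'),
db:drop('link-records-old')
```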
I use my job orchestration module
(https://github.com/ekimbernow/basex-orchestration) to manage the sequence of
operations, where each job calls the next job in the sequence once it has
finished.
This process works reliably for smaller volumes of content—for example, a
content set with only a couple of thousand topics and four or five root maps.
But at full scale I’m consistently seeing that the link record-keeping
database, which contains only two large XML documents, never completes
optimization: the database listing shows the database with two documents in
it, but when you open the database’s page, the documents do not show up, and
the job that performs the optimization never completes, leaving the database
in a locked state. This means the new where-used index can’t be put into
production.
I feel like I’m going about this the wrong way to make best use of BaseX and
avoid this problem with very large databases, but I don’t see any obvious
alternative approaches. It feels like I’m missing something fundamental or
making a silly error that I can’t see.
So my question:
How would you solve this problem?
In particular, how would you go about constructing the where-used index in a
way that works best with BaseX?
Or maybe the question is: should I be updating the in-production database with
the new data and doing the swap within the database itself (i.e., by renaming
the where-used index document rather than the database)?
I am currently using BaseX 11.6 and can move to 12 once it is released.
Thanks,
Eliot
_____________________________________________
Eliot Kimber
Sr. Staff Content Engineer
O: 512 554 9368
servicenow
servicenow.com<https://www.servicenow.com>