Re: [basex-talk] Strategy for Persisting Maps that Contain Nodes: db:node-id()

Eliot Kimber Mon, 24 Jan 2022 05:44:15 -0800

The indexes I’m constructing are:


  1.  Where used: For each DITA map or topic, indexed by document URI (but 
probably better indexed by node ID), capture the direct references to that map 
or topic from other maps and topics.
  2.  Document-to-bundle map: For each DITA map or topic capture the “bundle” 
that document is in. (Bundles are a Zoomin software concept that becomes a 
major organizational label for our content. A bundle is represented by a DITA 
map and is used as a unit of publishing to Zoomin). Determining the bundle 
requires walking back up the reference path from a topic or map to the DITA 
bundle maps that ultimately refer to the topic or map. This is an expensive 
process even with the where-used table, so it is worth persisting. In a more 
general DITA context this index could be generalized to a “doc-to-root-map” 
index, where you provide the business logic for determining which maps are root 
maps (root-mapness is not an intrinsic property of DITA maps).
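As a rough illustration of the first index, the sketch below inverts a table of direct references into a where-used table. It is written in Python rather than XQuery purely to show the logic; the function name `build_where_used` and the sample file names are hypothetical and are not taken from the actual implementation.

```python
from collections import defaultdict


def build_where_used(references):
    """Invert {document: [documents it references directly]} into
    {document: {documents that reference it directly}}."""
    where_used = defaultdict(set)
    for source, targets in references.items():
        # Make sure every known document gets an entry, even if nothing uses it.
        where_used[source]
        for target in targets:
            where_used[target].add(source)
    return dict(where_used)


# Hypothetical sample data: each key refers directly to the listed documents.
references = {
    "bundle.ditamap": ["admin.ditamap"],
    "admin.ditamap": ["list-filter-sorting.dita", "mobile-list-filters.dita"],
    "mobile-list-filters.dita": ["list-filter-sorting.dita"],
    "list-filter-sorting.dita": [],
}

where_used = build_where_used(references)
```

Only direct references are recorded here; finding the bundle for a document still requires walking this table upward.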


The strategy I have working for both indexes is a single top-level document for 
each index that then has a flat list of index entry elements, one for each 
topic, for example:

<doc-where-used-index>
  <where-used-entry
      key="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"
      tagname="task" class="topic/topic task/task" id="list-filter-sorting">
    <title>Configure sorting capabilities within mobile filters</title>
    <conrefs/>
    <topicrefs/>
    <doc>
      <noderef node-id="2493717" database="pce-test-data-01" tagname="task"
          baseuri="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"/>
    </doc>
    <xrefs>
      <noderef node-id="2476418" database="pce-test-data-01" tagname="xref"
          baseuri="/pce-test-data-01/administer/tablet-mobile-ui/concept/mobile-list-filters.dita"
          href="../task/list-filter-sorting.dita"/>
    </xrefs>
  </where-used-entry>
…
</doc-where-used-index>

I then have some utility functions to resolve <noderef> elements back to nodes 
and the index works great.
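To make the shape of a <noderef> concrete, here is a minimal Python sketch that reads an abbreviated copy of the entry above and pulls out the (database, node-id) pair that a resolver would need. The helper name `noderef_keys` is hypothetical, and the actual lookup of the node inside BaseX is not shown.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample entry (title and some attributes omitted).
ENTRY_XML = """
<where-used-entry id="list-filter-sorting">
  <conrefs/>
  <topicrefs/>
  <doc>
    <noderef node-id="2493717" database="pce-test-data-01" tagname="task"/>
  </doc>
  <xrefs>
    <noderef node-id="2476418" database="pce-test-data-01" tagname="xref"
        href="../task/list-filter-sorting.dita"/>
  </xrefs>
</where-used-entry>
"""


def noderef_keys(entry, group):
    """Return (database, node-id) pairs for the noderefs in one group
    of an entry, e.g. "doc", "xrefs", "conrefs" or "topicrefs"."""
    return [
        (ref.get("database"), int(ref.get("node-id")))
        for ref in entry.findall(f"{group}/noderef")
    ]


entry = ET.fromstring(ENTRY_XML)
doc_keys = noderef_keys(entry, "doc")
xref_keys = noderef_keys(entry, "xrefs")
conref_keys = noderef_keys(entry, "conrefs")
```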

By using single documents for the index I can use the “construct index doc and 
then either create DB or replace existing doc in one go” model as shown in the 
custom index example. Otherwise, as far as I can determine, one has to ensure 
that the database to hold the index already exists, since you can’t create the 
index database and then separately add to it in a single query. Alternatively, 
I could construct a very large sequence of individual document nodes and add 
those to the index database as it’s created. I suspect it comes to the same 
thing, but I haven’t tried it.

Using the where-used index to calculate the doc-to-bundle index, it takes about 
50ms per topic or map to determine the bundle (on my laptop), which is still 
10x slower than I’d like but certainly tolerable (at 50ms per topic it takes 
about 7.5 minutes to process 9400 topics). I’d like to know if there are things 
I can do to reduce this time, but I can take that up later; the current result 
is more than good enough for my immediate purposes (which is to report data 
about the topics grouped by bundle, thus the need for the topic-to-bundle index).
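The walk from a document up to its bundles can be sketched as a breadth-first traversal of the where-used table. This is an illustration in Python, not the XQuery I am actually running: `find_bundles`, the `is_bundle` predicate (standing in for the business logic that recognizes bundle maps), and the sample file names are all hypothetical. The sketch also assumes the walk should continue past a bundle map in case that map is itself referenced from elsewhere.

```python
from collections import deque


def find_bundles(doc, where_used, is_bundle):
    """Collect every bundle map that directly or indirectly refers to doc."""
    seen = {doc}          # guards against reference cycles
    queue = deque([doc])
    bundles = set()
    while queue:
        current = queue.popleft()
        if is_bundle(current):
            bundles.add(current)
        for referrer in where_used.get(current, ()):
            if referrer not in seen:
                seen.add(referrer)
                queue.append(referrer)
    return bundles


# Hypothetical where-used table: document -> documents that refer to it directly.
where_used = {
    "list-filter-sorting.dita": {"admin.ditamap", "mobile-list-filters.dita"},
    "mobile-list-filters.dita": {"admin.ditamap"},
    "admin.ditamap": {"bundle-a.ditamap", "bundle-b.ditamap"},
    "bundle-a.ditamap": set(),
    "bundle-b.ditamap": set(),
}


def is_bundle(doc):
    # Placeholder rule; the real test depends on how bundle maps are marked.
    return doc.startswith("bundle-")


doc_to_bundle = {
    doc: sorted(find_bundles(doc, where_used, is_bundle)) for doc in where_used
}
```

Each document is visited at most once per walk, so the cost of one lookup is bounded by the number of documents and references reachable upward from it.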

From the topic-to-bundle index I can generate a JSON representation of it 
almost instantly by generating JSON XML and then serializing it (this JSON is 
then consumed by an XSLT running elsewhere, at least for now).

What I haven’t done yet is implement updating these indexes to reflect file 
changes from git repo updates: that should be a relatively simple application 
of XQuery update but I’m not sure what the performance implications are of 
modifying individual nodes within a single document as opposed to modifying 
entire documents (i.e., if I made each index entry a separate document).

I also determined that constructing an XQuery map from the index data is very 
slow—clearly a “don’t do that” kind of thing, while constructing a JSON XML 
representation of the index is very fast. Not a surprising result but worth 
confirming.

I’ll be refining this system as I hammer it into a server-based web application 
for reporting information about our entire corpus of topics as they change over 
time.

Cheers,

E.


From: Christian Grün <christian.gr...@gmail.com>
Date: Monday, January 24, 2022 at 6:57 AM
To: Eliot Kimber <eliot.kim...@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Strategy for Persisting Maps that Contain Nodes: 
db:node-id()


> My approach is to create a separate element for each index entry, rather than 
> creating a single element that then contains all the index entries as shown 
> in the index construction example in the docs.

You mean you don’t group the nodes by the index key, as shown in the
docs? That should be fine as well. If the entries are grouped, a
single element may get larger, but the overall number of nodes to be
added or replaced will be smaller. If single entries need to be
updated in your scenario (e.g. because the key changes), grouping
might not be the solution, though.

There are usually various solutions for achieving the same goal. The
presented example is fairly simple indeed (most of our index
structures in real-world applications are certainly more complex). I
guess that 16 GB should be more than sufficient for a 70 MB index
database, but feel free to share your experiences.
