In my Mirabel system, I create a link database that records all the links made 
in a set of documents. This becomes a “where used” index over the content.

We have on the order of 200K links for one content set, so at 0.1 second per 
link it takes about 7 hours to build this index.

I’m currently doing this in one process that builds the whole index and then 
stores it in a database. This is failing in hard-to-diagnose ways. For example, 
a database may have a write lock on it when I go to rename it from its temp 
name to its production name (to replace the current production version).

The data is such that I could parallelize the processing but I’m not sure how I 
would do that in BaseX so that I can safely write to a single database from 
multiple threads.

The xquery:fork-join() docs clearly say the functions must be “non-updating”, 
so that doesn’t seem to be an option.

I have multiple BaseX HTTP servers running so I could farm processing across 
them, but I think I would then run into write lock issues.

I could create separate databases for each thread of operation and then combine 
those at the end. That seems like it might be the best option.
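
To make that last option concrete, here is a rough sketch of the shape I have 
in mind. It is in Python rather than XQuery and uses no BaseX API at all: the 
function names are hypothetical, each in-memory dict stands in for a per-worker 
temporary database, and merge() stands in for a single-writer step that combines 
them at the end. It assumes the links can be split into disjoint shards.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(links, n):
    """Split (source, target) pairs into n disjoint shards, round-robin."""
    return [links[i::n] for i in range(n)]


def build_partial_index(shard):
    """Hypothetical worker: builds its own partial "where used" index.

    The returned dict stands in for a separate temporary database that
    only this worker writes to, so no write lock is shared.
    """
    index = defaultdict(list)
    for source, target in shard:
        index[target].append(source)
    return index


def merge(partials):
    """Single-writer step: combine the partial indexes into one."""
    merged = defaultdict(list)
    for partial in partials:
        for target, sources in partial.items():
            merged[target].extend(sources)
    # Sort so the result does not depend on how links were sharded.
    return {target: sorted(sources) for target, sources in merged.items()}


def build_where_used(links, workers=4):
    """Build partial indexes in parallel, then merge them once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(build_partial_index, partition(links, workers)))
    return merge(partials)


if __name__ == "__main__":
    links = [("a.xml", "c.xml"), ("b.xml", "c.xml"), ("a.xml", "d.xml")]
    print(build_where_used(links, workers=2))
```

The point of the design is that only the final merge touches the production 
index, so the parallel workers never compete for the same write lock.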

Have I missed anything?

Thanks,

Eliot
_____________________________________________
Eliot Kimber
Sr. Staff Content Engineer
O: 512 554 9368
