I don't know if my experience is helpful, but I went a different way to
solve these sorts of problems.
I avoid adding business logic while generating the RDF from the source
data systems. It is almost entirely a simple transliteration from one
format to another (I use an R2RML mapping.) The purpose is to combine
data about the same subject from different source systems.
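As a sketch of what I mean by plain transliteration, a minimal R2RML mapping might look like the following; the table, column, and vocabulary names are invented for illustration, not taken from my actual systems:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/> .

# Hypothetical mapping: each row of a relational table "person"
# becomes one ex:Person resource. No business logic is applied here;
# columns are simply carried over as properties.
ex:PersonMap a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "person" ] ;
  rr:subjectMap [
    rr:template "http://example.org/person/{id}" ;
    rr:class ex:Person
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    rr:objectMap [ rr:column "name" ]
  ] .
```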
Separately, I wrote rules that check for several different desired
constraints on the data. I think this corresponds to your "application
logic". The source rules are written in RIF, then translated to HTML for
display and SPARQL for execution.
Using the rules (in SPARQL format) I materialize additional triples and
add them to the datastore.
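To give a flavor of this, one such rule in its SPARQL form might look like the sketch below; the ex: vocabulary and the rule itself are invented for illustration, not taken from my actual ruleset:

```sparql
PREFIX ex: <http://example.org/>

# Illustrative materialization rule: anyone who manages someone
# is a manager. The inferred triples are inserted back into the
# datastore alongside the source data.
INSERT {
  ?person a ex:Manager .
}
WHERE {
  ?person ex:manages ?report .
}
```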
This approach decouples data from business logic and lets each evolve
separately. When the source data schema changes, I change the R2RML
mapping. When the business logic changes, I change the RIF rules.
The logic (rules) may be applied in any of several different ways to
suit your needs (for example, using a different rules engine, or ShEx or
SHACL, or generating results on-the-fly instead of materializing them
back into the datastore).
Regards,
--Paul
On 2/9/24 08:17, balduin.landolt@dasch.swiss.INVALID wrote:
Hi Andy,
> If I understand correctly, this is a schema change requiring the data to change.
Correct, but we don't enforce any schema on the database level (no SHACL
involved), that's only done programmatically in the application.
> The transformation of the data to the updated data model could be done offline,
> that would reduce downtime. If the data is being continuously updated, that's
> harder because the offline copy will get out of step with the live data.
> How often does the data change (not due to application logic changes)?
The data might theoretically change constantly, so just doing it offline on a
copy isn't really possible.
A compromise I've been thinking about, which would still be better than
downtime, would be a read-only mode for the duration of the migration. But the
application doesn't support anything like that yet.
(And if we can get the downtime to something reasonable, that would be good
enough.)
> Do you have a concrete example of such a change?
These changes can vary from very simple to very complex:
- The simplest case is perhaps that a property that used to be
optional on a certain type of resource becomes mandatory; for all instances
where it is not present, a default value needs to be supplied.
=> this we could easily do with a SPARQL update.
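A hedged sketch of such an update, with invented ex: names standing in for the real vocabulary:

```sparql
PREFIX ex: <http://example.org/>

# Supply a default for every resource of the given type that lacks
# the now-mandatory property. All names here are placeholders.
INSERT {
  ?r ex:status "unknown" .
}
WHERE {
  ?r a ex:Resource .
  FILTER NOT EXISTS { ?r ex:status ?any }
}
```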
- The most complex case I encountered so far was roughly this:
Given in graph A (representing the data model for a subset of the data), a
particular statement on something of type P (defining some kind of property) is
present, and in graph B (the subset of data corresponding to the model in A), a
certain statement holds true for all V (which have a reference to P), then P
should be modified. If the statement does not hold true for all V, then each V
where it does not, must be modified to become a more complex object.
(More concretely: V represents a text value. If P says that V may contain
markup, then check if any V contains markup. If not, change P to say that it
does not contain markup; if any V contains markup, then all Vs that represent
text without markup need to be changed to contain text with markup. Text
without markup here represents a bit of reification around a string literal;
text with markup follows a sophisticated standoff markup model, and even if no
markup is present, it needs to contain information on the nature of the markup
that is used.)
=> this is something I would not know how to do, or would not feel
comfortable attempting, in SPARQL, so it needs to happen in code.
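The detection half of this, at least, seems expressible in SPARQL; a sketch with an invented graph IRI and placeholder predicates (the actual transformation of each V would still happen in code):

```sparql
PREFIX ex: <http://example.org/>

# Does any text value V referencing the property definition P
# contain markup? Graph IRI and predicates are placeholders for
# the real model.
ASK {
  GRAPH <http://example.org/graphB> {
    ?v ex:definedBy ex:P ;
       ex:hasMarkup ?m .
  }
}
```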
Long story short: some simple changes I could easily do in SPARQL; the more
complex ones would require doing the changes in code, but it might be
possible to set things up so that the code essentially has read access to the
data and generates update queries from it.
Our previous setup worked like this (durations not measured, just from
experience):
On application start, if a migration needs doing, it won't start right away,
but kick that process off:
- download an entire dump of fuseki to a file on disk (ca. 20 min.)
- load the dump into an in-memory Jena model (10 min., plus huge memory
consumption that will keep growing in proportion to our data)
- perform the migration on the in-memory model (1 sec. - 1 min.)
- dump the model to a file on disk
- drop all graphs from fuseki (20 min.)
- upload the dump into fuseki (20 min.)
Then the application would start... so at least an hour of downtime; clearly
room for improvement.
The good thing about this approach is that if the migration fails, the data
would not be corrupted because the data loaded in fuseki is not affected.
My best bet at this point is to say, we take the risk of data corrupt