Hi Adrian,
On 24/01/2025 16:29, Adrian Gschwend wrote:
Hi group,
I've been using riot --validate regularly to identify issues in RDF
datasets, and it has been a great tool for ensuring data quality.
Aside: do you use SHACL?
I’ve noticed that it currently doesn’t offer a way to produce a "cleaned"
version of a dataset as output. Unless I’m overlooking something, this
could be a helpful addition.
What I’m envisioning is an option to generate a reduced dataset
containing only valid triples. Ideally, this could be implemented in two
modes:
1. "Super strict" mode: Filters out everything that triggers warnings or
errors.
2. "Clean" mode: Strips out only the triples with errors while retaining
those with warnings.
This would be particularly useful for scenarios where a "strict" version
of a dataset is required. Currently, I resort to some creative grep
scripting to manually filter out problematic triples based on the issues
flagged by riot --validate, but this is slow and far from ideal.
In this proposed mode, it would also be great if riot could:
- Avoid stopping on errors and simply log them instead.
Depends on the error. If it is a structural error in the language
syntax, we have to be careful not to generate fake triples.
e.g. the parser has seen subject S1 and property P1, then hits junk
while reading the object, so it skips ahead and recovers. If the
recovery point is at the next subject S2, a triple "S1 property object"
might come out which should have been "S2 property object". Recovery is
imperfect.
Recovery means reworking the parsers. Which RDF syntaxes are you
interested in? N-Triples can make use of having one statement per line.
Turtle can't, though recovery could try skip-to-DOT, at some risk.
RDF/XML is based on an XML parser. JSON-LD is a 3rd party engine.
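
To illustrate the one-statement-per-line point for N-Triples, here is a
minimal Python sketch. It is not riot and not how the Jena parser works:
the pattern is a deliberately simplified approximation of the N-Triples
grammar, and the names TRIPLE and split_ntriples are made up for this
example. Because each line stands alone, a bad line can be set aside
without affecting the triples on either side of it.

```python
import re

# Simplified term patterns (an assumption, not the full N-Triples grammar;
# for example, \u escapes inside IRIs are rejected here).
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
LITERAL = r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'

# One complete statement per line: subject, predicate, object, DOT.
TRIPLE = re.compile(
    rf'^(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*(?:#.*)?$'
)


def split_ntriples(lines):
    """Split lines into (good, bad); bad entries keep their line number."""
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # blank lines and comments are neither good nor bad
        if TRIPLE.match(stripped):
            good.append(stripped)
        else:
            bad.append((lineno, stripped))
    return good, bad


if __name__ == '__main__':
    sample = [
        '<http://example/s1> <http://example/p> "ok" .',
        '<http://example/s2> <http://example/p> junk here .',
        '<http://example/s3> <http://example/p> <http://example/o> .',
    ]
    good, bad = split_ntriples(sample)
    print(len(good), 'kept;', len(bad), 'rejected')
```

The rejected list could be written to a separate file for fixing at the
source. The same trick does not carry over to Turtle, where a statement
can span many lines and share a subject with its neighbours.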
Warnings: do you mean IRI warnings, or some syntax-level warnings?
There's a new IRI subsystem in the pipeline, and its errors/warnings
are more controllable:
https://lists.apache.org/[email protected]#:~:text=Fuseki%20development%20features.-,%3D%3D%3D%3D%20IRI3986,-Issue%3A%20https
For IRIs, structural problems (i.e. the string does not parse by the
grammar of IRIs) are errors; problems with scheme-specific rules are
warnings.
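
As a rough picture of that split, here is a small Python sketch. It is
not the new Jena IRI code and does not reproduce its checks: the
function name check_iri and the particular rules are assumptions chosen
for illustration. Structural checks (illegal characters, malformed
percent-encoding, no scheme) give an error, and one scheme-specific rule
(an http or https IRI with no host) gives a warning.

```python
import re

# Structural checks: a simplified stand-in for parsing by the IRI grammar.
ILLEGAL = re.compile(r'[\x00-\x20<>"{}|\\^`]')
BAD_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')
SCHEME = re.compile(r'^([A-Za-z][A-Za-z0-9+.-]*):')


def check_iri(iri):
    """Return (level, message) where level is 'error', 'warning' or 'ok'."""
    if ILLEGAL.search(iri):
        return ('error', 'illegal character')
    if BAD_PERCENT.search(iri):
        return ('error', 'malformed percent-encoding')
    m = SCHEME.match(iri)
    if not m:
        return ('error', 'missing or malformed scheme')

    # Scheme-specific rule (illustrative): http(s) needs a non-empty host.
    scheme = m.group(1).lower()
    rest = iri[m.end():]
    if scheme in ('http', 'https'):
        if not rest.startswith('//'):
            return ('warning', 'http(s) IRI without an authority')
        authority = re.split(r'[/?#]', rest[2:], maxsplit=1)[0]
        host = authority.rsplit('@', 1)[-1]
        if host == '' or host.startswith(':'):
            return ('warning', 'http(s) IRI with an empty host')
    return ('ok', '')


if __name__ == '__main__':
    for candidate in ('http://example.org/a',
                      'http://example.org/a b',
                      'http:///path'):
        print(candidate, '->', check_iri(candidate))
```

In a "clean" mode along the lines proposed, triples whose IRIs come back
as 'error' would be dropped and those with 'warning' kept; a "super
strict" mode would drop both.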
- Optionally write warnings and/or error triples to a separate file for
later analysis and fixes at the source.
Andy
I understand this may not align with everyone's use cases, but for those
of us who often need to work with cleaned datasets for downstream
processing, this could be a very helpful enhancement.
In all the years I have not found another tool that is as useful as riot
--validate.
Thanks!
What do you think?
regards
Adrian