Hi group,
I've been using riot --validate regularly to identify issues in RDF
datasets, and it has been a great tool for ensuring data quality. I’ve
noticed that it currently doesn’t offer a way to produce a "cleaned"
version of a dataset as output. Unless I’m overlooking something, this
could be a helpful addition.
What I’m envisioning is an option to generate a reduced dataset
containing only valid triples. Ideally, this could be implemented in two
modes:
1. "Super strict" mode: Filters out everything that triggers warnings or
errors.
2. "Clean" mode: Strips out only the triples with errors while retaining
those with warnings.
This would be particularly useful for scenarios where a "strict" version
of a dataset is required. Currently, I resort to some creative grep
scripting to filter out problematic triples based on the issues flagged
by riot --validate, but this is slow and far from ideal.
In this proposed mode, it would also be great if riot could:
- Avoid stopping on errors and simply log them instead.
- Optionally write the triples that trigger warnings and/or errors to a
separate file for later analysis and fixes at the source.
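To make the two modes concrete, here is a minimal Python sketch of the
behaviour for line-based input such as N-Triples, where each line holds
one triple. It is only an illustration, not an existing riot feature: the
function names are hypothetical, and the log pattern assumes riot
messages contain a WARN or ERROR level followed by "[line: N, col: M]",
which may differ between Jena versions.

```python
import re

# Assumed shape of a riot log line, for example:
#   "12:00:00 WARN  riot :: [line: 2, col: 30] Lexical form ..."
# This pattern is an assumption; adjust it to the actual log output.
LOG_PATTERN = re.compile(r"\b(WARN|ERROR)\b.*?\[line:\s*(\d+)")


def flagged_lines(log_lines):
    """Map 1-based input line numbers to the worst severity reported."""
    worst = {}
    for entry in log_lines:
        match = LOG_PATTERN.search(entry)
        if not match:
            continue  # not a message tied to an input line
        level, line_no = match.group(1), int(match.group(2))
        # ERROR always wins; WARN is recorded only if nothing is stored yet.
        if level == "ERROR" or line_no not in worst:
            worst[line_no] = level
    return worst


def filter_triples(nt_lines, log_lines, mode="clean"):
    """Split N-Triples lines into (kept, rejected).

    mode="clean":  reject lines with errors, keep lines with warnings.
    mode="strict": reject lines with errors or warnings.
    """
    if mode not in ("clean", "strict"):
        raise ValueError(f"unknown mode: {mode}")
    worst = flagged_lines(log_lines)
    kept, rejected = [], []
    for line_no, triple in enumerate(nt_lines, start=1):
        level = worst.get(line_no)
        drop = level == "ERROR" or (mode == "strict" and level == "WARN")
        (rejected if drop else kept).append(triple)
    return kept, rejected
```

The rejected list corresponds to the separate file mentioned above. The
sketch only works when line numbers map one-to-one to triples, so it does
not carry over to Turtle or RDF/XML, and it depends on the validator
continuing past the first error.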
I understand this may not align with everyone's use cases, but for those
of us who often need to work with cleaned datasets for downstream
processing, this could be a very helpful enhancement.
In all these years, I have not found another tool as useful as riot
--validate.
What do you think?
Regards,
Adrian