Re: Feature Request: Filtered Output Option for riot --validate

Andy Seaborne Sun, 26 Jan 2025 10:15:01 -0800

Hi Adrian,

On 24/01/2025 16:29, Adrian Gschwend wrote:

Hi group,
I've been using riot --validate regularly to identify issues in RDF datasets, and it has been a great tool for ensuring data quality.


Aside: do you use SHACL?

I’ve noticed that it currently doesn’t offer a way to produce a "cleaned" version of a dataset as output. Unless I’m overlooking something, this could be a helpful addition.
What I’m envisioning is an option to generate a reduced dataset containing only valid triples. Ideally, this could be implemented in two modes:
1. "Super strict" mode: Filters out everything that triggers warnings or errors.
2. "Clean" mode: Strips out only the triples with errors while retaining those with warnings.
This would be particularly useful for scenarios where a "strict" version of a dataset is required. Currently, I resort to some creative grep scripting to manually filter out problematic triples based on the issues flagged by riot --validate, but this is far from ideal and slow.
In this proposed mode, it would also be great if riot could:

- Avoid stopping on errors and simply log them instead.

Depends on the error. If it is a structural error in the language syntax, we have to be careful not to generate fake triples.

e.g. the parser has seen subject S1 and property P1, then sees junk while parsing the object. It skips and recovers, but if the recovery point is inside the statement for the next subject S2, a triple "S1 property object" might come out which should have been "S2 property object". Recovery is imperfect.

Recovery means reworking parsers. Which RDF syntaxes are you interested in? N-Triples can make use of the statement per line. Turtle can't, but recovery might risk skip-to-DOT. RDF/XML is based on an XML parser. JSON-LD is a 3rd party engine.
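
To illustrate why the statement-per-line property makes N-Triples the easy case, here is a minimal sketch in Python. It is not how riot works and it does not use any Jena API; the function name `split_ntriples` and the regular expressions are made up for this example, and the pattern is a deliberately simplified approximation of the N-Triples grammar (for instance, it rejects `\u` escapes inside IRIs). Because each statement sits on its own line, a bad line can be dropped without any risk of gluing S1 onto the object of S2.

```python
import re

# Deliberately simplified token shapes; NOT the full N-Triples grammar.
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:[A-Za-z0-9][A-Za-z0-9_.-]*'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI
           + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')

# One complete statement per line: subject predicate object DOT [comment]
TRIPLE = re.compile(
    r'^\s*(?:%s|%s)\s+%s\s+(?:%s|%s|%s)\s*\.\s*(?:#.*)?$'
    % (IRI, BNODE, IRI, IRI, BNODE, LITERAL)
)


def split_ntriples(lines):
    """Split lines into (good, bad); bad entries carry their line number."""
    good, bad = [], []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Blank lines and comment-only lines are neither good nor bad.
        if not text.strip() or text.lstrip().startswith("#"):
            continue
        if TRIPLE.match(text):
            good.append(text)
        else:
            # Recovery is trivial here: the next line starts a new statement.
            bad.append((number, text))
    return good, bad


if __name__ == "__main__":
    sample = [
        '<http://example/s1> <http://example/p> "ok" .\n',
        '<http://example/s1> <http://example/p> @@junk\n',
        '<http://example/s2> <http://example/p> <http://example/o> .\n',
    ]
    good, bad = split_ntriples(sample)
    print("kept:", good)
    print("rejected:", bad)
```

The rejected list could be written to a separate file for later analysis. The same trick does not carry over to Turtle, where a statement can span many lines and the parser carries state (prefixes, current subject) across them.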


Warnings - do you mean IRI warnings? - or some syntax level warnings?

There's a new IRI subsystem in the pipeline - and its errors/warnings are more controllable -


https://lists.apache.org/[email protected]#:~:text=Fuseki%20development%20features.-,%3D%3D%3D%3D%20IRI3986,-Issue%3A%20https

For IRIs, structural problems, i.e. does not parse by the grammar of IRIs, are errors, problems with scheme-specific rules are warnings.
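
A rough sketch of that split, and of how the two requested modes could sit on top of it, might look like the following Python. None of this is Jena code: `check_iri` and `keep` are hypothetical names, and the two rules are stand-ins chosen for illustration (a character that the IRI grammar never allows raw counts as a structural error; an http/https IRI with an empty host counts as a scheme-specific warning). The real rule set is whatever the IRI subsystem implements.

```python
import re
from urllib.parse import urlsplit

# Characters that may not appear raw in an IRI: controls, space, <>"{}|\^`
ILLEGAL = re.compile(r'[\x00-\x20<>"{}|\\^`]')
# An absolute IRI must start with "scheme:".
SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*:')


def check_iri(iri):
    """Return 'error', 'warning' or 'ok' for an absolute IRI string."""
    # Structural problem: the string does not fit the generic grammar.
    if ILLEGAL.search(iri) or not SCHEME.match(iri):
        return "error"
    try:
        parts = urlsplit(iri)
    except ValueError:
        # urlsplit rejects some malformed authorities outright.
        return "error"
    # Scheme-specific problem (illustrative): http(s) needs a host.
    if parts.scheme.lower() in ("http", "https") and not parts.hostname:
        return "warning"
    return "ok"


def keep(iri, mode):
    """Hypothetical filter: 'strict' drops warnings and errors, 'clean' drops errors only."""
    level = check_iri(iri)
    if mode == "strict":
        return level == "ok"
    return level != "error"


if __name__ == "__main__":
    for iri in ("http://example.org/x", "http:///path", "http://example/a b"):
        print(iri, "->", check_iri(iri))
```

With a classification like this, "super strict" and "clean" differ only in where the cut-off is placed, which is why controllable severities in the IRI layer matter for the request.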

- Optionally write warnings and/or error triples to a separate file for later analysis and fixes at the source.

        Andy

I understand this may not align with everyone's use cases, but for those of us who often need to work with cleaned datasets for downstream processing, this could be a very helpful enhancement.
In all the years I have not found another tool that is as useful as riot --validate.


Thanks!


What do you think?

regards

Adrian
