Laura created JENA-1325:
---------------------------

             Summary: RIOT parse many files at once, output only valid ones
                 Key: JENA-1325
                 URL: https://issues.apache.org/jira/browse/JENA-1325
             Project: Apache Jena
          Issue Type: Improvement
          Components: RIOT
         Environment: GNU/Linux
            Reporter: Laura


This issue is loosely related to JENA-1322:
https://issues.apache.org/jira/browse/JENA-1322

I have a folder with thousands of files, mostly small RDF/XML files. I'm using 
RIOT to validate them and dump the valid ones into N-Triples files. Calling 
RIOT once per file is not going to cut it: the overhead of each invocation is 
significant enough that the whole operation takes hours. So I've tried calling 
RIOT only once on all files together using

    riot \
        --verbose \
        --stop \
        --check \
        --strict \
        --output=nt \
        files/*.rdf > files.nt

and this way validation is much faster. The problem is that it still dumps 
invalid files into the .nt output file. I'm downloading these files from the 
Internet, so I'm not going to fix them myself; I just want to skip the bad 
files.

To be clear, I understand that RIOT is not meant to fix bad data, and I'm not 
asking for that. What I'm suggesting is to add an *--option* so that RIOT can 
do the following:

1. parse multiple files at once (so that there is no need to invoke the same 
RIOT command for each file)
2. for every file, check/validate it
3. if *--output* is set, only output those files or triples that didn't raise 
any ERROR
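
To illustrate the behaviour described in steps 1 to 3, here is a minimal shell 
sketch that buffers each file's output and keeps it only when the conversion 
succeeds. The names `CONVERT` and `emit_valid_only` are hypothetical, and the 
sketch assumes the converter exits with a non-zero status when a file raises an 
error; that should be verified for riot before relying on it. It still starts 
one process per file, so it shows only the intended semantics, not the speed 
that a built-in option would give.

```shell
#!/bin/sh
# Sketch: buffer each file's output, keep it only if the converter succeeded.
# CONVERT is the command run once per file. The default flags are the ones
# from the riot command above. Assumption: it exits non-zero on a bad file.
CONVERT=${CONVERT:-"riot --check --strict --output=nt"}

emit_valid_only() {
    tmp=$(mktemp) || return 1
    for f in "$@"; do
        # $CONVERT is left unquoted on purpose so its flags split into words.
        if $CONVERT "$f" > "$tmp"; then
            cat "$tmp"                  # valid: pass the buffered triples on
        else
            echo "skipped: $f" >&2      # invalid: discard the buffered output
        fi
    done
    rm -f "$tmp"
}
```

It could then be called as `emit_valid_only files/*.rdf > files.nt`, with the 
names of skipped files reported on standard error.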

I think this is well in the scope of RIOT functionalities. Could this option 
please be added to RIOT?

Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
