[
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979041#comment-15979041
]
Laura commented on JENA-1325:
-----------------------------
> What the OP is asking for is a two-pass algorithm (the parse whole file into
> a graph, output if and only if no errors)
I don't think this is what I'm asking at all. Maybe I didn't phrase my
question very well?
To give an example of what I'm doing: I have a folder with hundreds of small
files that happen to be in RDF/XML. I'm calling RIOT to parse all of them and
output N-Triples. All the N-Triples are then concatenated with `cat` into a
single .nt file, because all the original files describe entities of the same
graph. It's just that whoever made the files decided to write one RDF/XML file
per entity instead of a single huge RDF/XML file. So there is no problem for
RIOT here, right? Big or small files, it should work fine. Calling RIOT on
every single file works: well-formed files produce a set of N-Triples, while
bad files raise an exception and the processing **for that single file**
stops. The problem is that I can't call RIOT on so many files one at a time,
because it takes forever. I'd simply like to tell RIOT "hey buddy, process all
these files" so that I only have to start the JVM once. And again... I don't
understand how this would affect streaming.
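
To make the request concrete, here is a minimal sketch of the behaviour
described above: one process, many files, and a bad file only loses its own
output. It is not RIOT code and does not use the Jena API. `convert_all` and
the `parse_file` callback are hypothetical names; `parse_file` stands in for
whatever parser is used and is assumed to write N-Triples lines to a sink and
raise an exception on the first error.

```python
import io
from typing import Callable, Iterable, List, TextIO, Tuple

# Hypothetical parser signature (an assumption, not a RIOT API): read one
# input file, write its N-Triples lines to `sink`, raise on the first error.
ParseFn = Callable[[str, TextIO], None]


def convert_all(
    paths: Iterable[str], parse_file: ParseFn, out: TextIO
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Convert every file in one run, skipping files that fail to parse."""
    converted: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        # Buffer the output of this one file so that a late error cannot
        # leave partial triples in the combined output.
        buffer = io.StringIO()
        try:
            parse_file(path, buffer)
        except Exception as err:
            # Only this file is dropped; the run continues with the next one.
            skipped.append((path, str(err)))
            continue
        out.write(buffer.getvalue())
        converted.append(path)
    return converted, skipped
```

In this sketch only the output of the file currently being parsed is held in
memory, so the buffer is bounded by the largest single file rather than by
the whole data set.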
> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
> Key: JENA-1325
> URL: https://issues.apache.org/jira/browse/JENA-1325
> Project: Apache Jena
> Issue Type: Improvement
> Components: RIOT
> Environment: GNU/Linux
> Reporter: Laura
> Labels: easyfix, performance
>
> This issue is more or less related to this other one
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm
> using RIOT to validate them and dump the valid ones into ntriples files. The
> problem is that calling RIOT on each file is not going to cut it. The
> overhead is significant enough that this operation is just too slow (hours).
> So I've tried to call RIOT only once on all files together using
> {noformat}
> riot \
> --verbose \
> --stop \
> --check \
> --strict \
> --output=nt \
> files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is that it's still
> dumping invalid files to the .nt output file. I'm downloading these files
> from the Internet, so I'm not going to fix them myself; I just want to skip
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad
> data, and I'm not asking for this. I'm suggesting however to add an
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise
> any ERROR
> I think this is well within the scope of RIOT's functionality. Could this
> option please be added to RIOT?
> Thank you.