[ 
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979041#comment-15979041
 ] 

Laura commented on JENA-1325:
-----------------------------

> What the OP is asking for is a two-pass algorithm (the parse whole file into 
> a graph, output if and only if no errors)

I don't think this is what I'm asking at all. Maybe I didn't formulate my 
question very well?
To give an example of what I'm doing: I have a folder with hundreds of small 
files that happen to be in RDF/XML. I'm calling RIOT to parse all of them and 
output N-Triples. All the N-Triples output is then concatenated with `cat` 
into a single .nt file, because all the original files happen to describe 
entities of the same graph. It's just that whoever made the files decided to 
create one RDF/XML file per entity instead of a single huge RDF/XML file. So 
there is no problem for RIOT here, right? Big or small files, it should work 
fine. Calling RIOT on every single file works: well-formed files produce a set 
of N-Triples, while bad files raise an exception and the processing **for that 
single file** stops. The problem is that I can't call RIOT on so many files one 
at a time, because it takes forever. I'd simply like to tell RIOT "hey buddy, 
process all these files" so that I only have to start the VM once. And 
again... I don't understand how this would affect streaming.
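
As a rough sketch of the per-file workflow described above (not the exact script in use): each file is converted on its own, and its output is appended to the combined file only when the conversion succeeds. The function name `convert_valid` and the `RIOT_CMD` variable are hypothetical, and the sketch assumes that `riot` exits with a non-zero status when a file has errors, which should be checked against the installed version. It starts one JVM per file, so it reproduces the slowness being reported; it only illustrates the behaviour wanted from a single invocation.

```shell
#!/bin/sh
# Hypothetical helper, not part of Jena.
# RIOT_CMD is an assumed variable so the converter can be swapped out.
RIOT_CMD="${RIOT_CMD:-riot}"

# convert_valid OUTPUT FILE...
# Converts each FILE separately and appends its N-Triples to OUTPUT
# only if the converter reported success for that file.
convert_valid() {
    out="$1"; shift
    tmp="$(mktemp)" || return 1
    : > "$out"                      # start with an empty combined file
    for f in "$@"; do
        # Assumption: the converter exits non-zero when the file has errors.
        # One process (one JVM) is started per file, which is the slow part.
        if $RIOT_CMD --output=nt "$f" > "$tmp" 2>/dev/null; then
            cat "$tmp" >> "$out"    # keep output of files that parsed cleanly
        else
            echo "skipped: $f" >&2  # drop any partial output of a bad file
        fi
    done
    rm -f "$tmp"
}
```

It would be called as `convert_valid files.nt files/*.rdf`. Buffering each file's output in a temporary file is what keeps the partial triples of a bad file out of the combined result.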

> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>              Labels: easyfix, performance
>
> This issue is more or less related to this other one 
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm 
> using RIOT to validate them and dump the valid ones into ntriples files. The 
> problem is that calling RIOT on each file is not going to cut it. The 
> overhead is significant enough that this operation is just too slow (hours). 
> So I've tried to call RIOT only once on all files together using
> {noformat}
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is, that it's still 
> dumping invalid files to the .nt output file. I'm downloading these files 
> from the Internet, so I'm not going to fix them myself, I just want to skip 
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad 
> data, and I'm not asking for this. I'm suggesting however to add an 
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same 
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise 
> any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option 
> please be added to RIOT?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
