[jira] [Comment Edited] (JENA-1325) RIOT parse many files at once, output only valid ones

Andy Seaborne (JIRA) Fri, 21 Apr 2017 09:38:07 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979035#comment-15979035
 ]


Andy Seaborne edited comment on JENA-1325 at 4/21/17 4:36 PM:
--------------------------------------------------------------

Re: rapper.

1/
If you are parsing each file separately, write the N-Triples output of each 
run to its own file, then load all the files in a single tdbloader run.

(unchecked)
{code}
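# "parse" is a placeholder for the actual parser command
for F in files/*.rdf
do
    # strip the directory and the .rdf suffix: files/a.rdf -> a
    B="$(basename "$F" .rdf)"
    parse < "$F" > "$B.nt"
done
# load every N-Triples file in one tdbloader run
tdbloader --loc DB *.nt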
{code}

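2/
Or use "sed" to replace `_:genId` with `_:gen_unique_`, where "unique" is a 
string that differs for each parser run, such as 1, 2, 3. Blank node labels 
from different runs then cannot clash when the output is combined.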
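
(unchecked) A minimal sketch of option 2, not a tested recipe. The function name `relabel`, the `PREFIX` variable and its default value `_:genid` are assumptions made for illustration; set `PREFIX` to whatever label prefix the parser really writes. The replace is purely textual, so it would also rewrite a literal that happens to contain the prefix.

```shell
#!/bin/sh
# Assumption: blank node labels written by the parser start with this prefix.
PREFIX="${PREFIX:-_:genid}"

# relabel RUN_ID
# Reads N-Triples on stdin and writes them to stdout with every
# occurrence of $PREFIX rewritten to "_:gen_<RUN_ID>_".
relabel() {
    sed "s/${PREFIX}/_:gen_${1}_/g"
}

# Possible use, with "parse" standing in for the real parser command:
#   n=0
#   for F in files/*.rdf
#   do
#       n=$((n + 1))
#       parse < "$F" | relabel "$n" >> all.nt
#   done
```

With a per-run counter as the run id, the outputs can be appended to one file without blank nodes from different source files being merged.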


> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>              Labels: easyfix, performance
>
> This issue is more or less related to this other one 
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm 
> using RIOT to validate them and dump the valid ones into ntriples files. The 
> problem is that calling RIOT on each file is not going to cut it. The 
> overhead is significant enough that this operation is just too slow (hours). 
> So I've tried to call RIOT only once on all files together using
> {noformat}
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is, that it's still 
> dumping invalid files to the .nt output file. I'm downloading these files 
> from the Internet, so I'm not going to fix them myself, I just want to skip 
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad 
> data, and I'm not asking for this. I'm suggesting however to add an 
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same 
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise 
> any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option 
> please be added to RIOT?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
