[jira] [Commented] (JENA-1325) RIOT parse many files at once, output only valid ones

Andy Seaborne (JIRA) Sat, 22 Apr 2017 10:30:51 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980038#comment-15980038
 ]


Andy Seaborne commented on JENA-1325:
-------------------------------------

If you have a contribution then please do create a JIRA and make a pull request 
on GitHub.

https://github.com/apache/jena/blob/master/CONTRIBUTING.md

----

Skipping bad triples is a different problem to skipping bad files.
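
To make the "bad files" case concrete, here is a minimal sketch of 
all-or-nothing output per file: each file is parsed into a buffer, and the 
buffer is written out only if the whole file parsed. This is not RIOT code. 
{{emit_valid_files}} and the {{parse_file}} callable are hypothetical names 
for illustration; {{parse_file}} stands in for whatever parser is used and is 
assumed to raise an exception on the first error.

```python
import io
from typing import Callable, Iterable, List, TextIO, Tuple


def emit_valid_files(
    paths: Iterable[str],
    parse_file: Callable[[str, TextIO], None],
    out: TextIO,
) -> List[Tuple[str, str]]:
    """Write the output of every file that parses cleanly; return the rejects.

    parse_file(path, sink) is a hypothetical callable: it writes the parsed
    triples to sink and raises an exception if the file is not valid.
    """
    rejected = []
    for path in paths:
        buffer = io.StringIO()  # per-file buffer, so partial output is never emitted
        try:
            parse_file(path, buffer)
        except Exception as err:  # broad catch is for the sketch only
            rejected.append((path, str(err)))
            continue
        out.write(buffer.getvalue())
    return rejected
```

The caller gets the list of rejected files and reasons back, and the output 
only ever contains complete files.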

Allowing incomplete data into a database will cause problems later. 
Incomplete data is bad data.

ETL in RDF is no different to ETL generally. Fix the data before loading.

A "skip bad triples" JIRA is not well defined; there are many different cases. 
This has come up many times before, so new information is required.

What counts as a "bad" triple?  Bad URI? Bad structure? 

Bad URIs break output later on.

h4. N-Triples

Recovering from parse errors in N-Triples/N-Quads input is relatively easy, but 
how are the dropped bad triples communicated to the caller of the parser? It 
will need a new tokenizer and a new grammar-parser (scan carefully to DOT; note 
the "carefully").

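As an illustration of the caller question, here is a minimal sketch of 
line-level recovery that reports each dropped line through a callback. It is 
not how RIOT works and it is not a scan to DOT: it resynchronises at end of 
line, which only works because N-Triples has one statement per line. The name 
{{recovering_ntriples}} and the {{on_skip}} callback are hypothetical, and the 
regular expressions are deliberately simplified (for example, no {{\u}} 
escapes in IRIs, and escapes in literals are not checked individually).

```python
import re
from typing import Callable, Iterable, Iterator

# Simplified term patterns (assumption: good enough for a sketch, looser and
# stricter than the real N-Triples grammar in places).
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:[A-Za-z0-9_](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?'
LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"'
    r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
)
# subject predicate object DOT, then an optional trailing comment.
TRIPLE = re.compile(
    rf'\s*(?:{IRI}|{BNODE})\s*{IRI}\s*(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*(?:#.*)?'
)


def recovering_ntriples(
    lines: Iterable[str],
    on_skip: Callable[[int, str], None],
) -> Iterator[str]:
    """Yield lines that look like valid triples; report the rest to on_skip.

    on_skip(line_number, line) is a hypothetical callback: it is how the
    caller learns which statements were dropped.
    """
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment: nothing to report
        if TRIPLE.fullmatch(line):
            yield stripped
        else:
            on_skip(lineno, line)  # resynchronise at the next line
```

Even in this simple form, the literal pattern has to be written so that a 
{{.}} inside a string is not mistaken for the end of the statement, and the 
caller has to decide what to do with the skipped lines.
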
h4. Turtle

Recovery in Turtle is possible, but more than just the "bad" triples will 
likely be dropped. Example: a badly written list. Recovery is harder. 
Recovering from bad URIs may be _partially_ possible, but if the trailing 
{{>}} is missing, the file is broken and recovery is pragmatic. Note that this 
will require rewriting the tokenizer and also the parser-grammar, which means 
a complete new Turtle reader. The one provided is highly tuned, and there is a 
tight coupling of tokenizer and grammar for performance, such as buffered 
input. Many hours have gone into performance, and changing it to accommodate 
recovery will impact performance. (NT data parsed by the TTL parser is slower 
for a reason: the grammar is more complicated, so adding recovery cases will 
impact overall performance.)

h4. RDF/XML

Recovery in RDF/XML is impossible without using an error-recovering XML 
parser. RDF/XML is layered: RDF processing over XML parsing. Jena uses Apache 
Xerces for XML parsing.

I don't know what recovery would mean in RDF/XML, because XML recovery alone 
isn't enough (e.g. striping).

h4. JSON-LD

As with RDF/XML, the JSON is parsed first, then the RDF processing is done. 
Jena uses a 3rd party JSON-LD system, {{jsonld-java}}.


(Inaccurate labels and flags removed - the request is not about performance nor 
is it an easyfix.)


> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>
> This issue is more or less related to this other one 
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm 
> using RIOT to validate them and dump the valid ones into ntriples files. The 
> problem is that calling RIOT on each file is not going to cut it. The 
> overhead is significant enough that this operation is just too slow (hours). 
> So I've tried to call RIOT only once on all files together using
> {noformat}
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is, that it's still 
> dumping invalid files to the .nt output file. I'm downloading these files 
> from the Internet, so I'm not going to fix them myself, I just want to skip 
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad 
> data, and I'm not asking for this. I'm suggesting however to add an 
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same 
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise 
> any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option 
> please be added to RIOT?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
