On 25/08/11 07:48, Paolo Castagna wrote:
Hi Monika, (Hi Ian),
Ian has already answered your question.
However, I want to add a similar use case we have, relating to errors and
malformed RDF input files. When loading large RDF files we typically use
N-Triples or N-Quads, and we want to continue parsing the file even if there
are a few errors (i.e. invalid lines).
We use RIOT and, although there is no feature to tell the parser to ignore an
error, skip the line and continue parsing, it is not expensive to construct a
LangNQuads object for each line of your input. So, this is what we often do:
    String line = ...   // one line of the N-Quads input
    // profile and sink are assumed to be set up once, elsewhere
    Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(line);
    LangNQuads parser = new LangNQuads(tokenizer, profile, sink);
    parser.parse();
You can then catch any exception and continue processing the next line.
This is also what we do when we write MapReduce jobs, for example here [1] or
here [2]. (*)
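The catch-and-continue pattern described above can be sketched without any
RIOT dependency. This is only a minimal illustration: the class
LenientLineLoader and the interface LineParser are hypothetical names invented
for this example. LineParser stands in for "build a tokenizer and a LangNQuads
parser for this line and call parse()".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RIOT: each line is parsed on its own,
// so one bad line cannot abort the whole load.
final class LenientLineLoader {

    // Stand-in for the per-line parser construction shown in the thread.
    // Any exception thrown here means the line was rejected.
    interface LineParser {
        void parse(String line) throws Exception;
    }

    // Feeds every non-blank line to the parser and returns the 1-based
    // numbers of the lines that failed.
    static List<Integer> load(Iterable<String> lines, LineParser parser) {
        List<Integer> badLines = new ArrayList<>();
        int lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // blank lines are skipped, not counted as errors
            }
            try {
                parser.parse(line);
            } catch (Exception e) {
                // Report the problem and carry on with the next line.
                System.err.println("Skipping line " + lineNumber + ": " + e.getMessage());
                badLines.add(lineNumber);
            }
        }
        return badLines;
    }
}
```

A caller would pass a lambda that constructs the parser for the given line and
runs it. The returned line numbers can then be logged as warnings instead of
stopping the load.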
Maybe it's not that difficult to add a feature to RIOT's LangNQuads parser to
report errors, skip to the next line, and continue parsing. However, I think
this is close to impossible for the RDF/XML or Turtle serializations.
The recovery also needs to be incorporated in the tokenizer (e.g.
missing closing ").
For N-Triples and N-Quads, I think the best way is to use text processing
(regexes, Perl, etc., or Java) on the input to check for basic structural
validity before passing it on to RIOT. Otherwise, tricky cases such as a
missing closing " would need to be caught in the lexer, making it more
complicated and potentially slower.
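As a rough illustration of such a pre-check, here is a minimal Java sketch
using only java.util.regex. The class name NQuadsPreCheck and the method
looksValid are made up for this example, and the pattern is deliberately
coarse: it checks only the overall shape of a line (terms, balanced quotes,
terminating dot) and does not validate IRIs, escapes or language tags against
the grammar.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter, not part of RIOT: a coarse structural check
// for one N-Triples / N-Quads line.
final class NQuadsPreCheck {

    // Loose approximations of the terms, not the official grammar.
    private static final String IRI = "<[^<>\"\\s]*>";
    private static final String BNODE = "_:[A-Za-z0-9_-]+";
    // Quoted string with backslash escapes, then optional datatype or language tag.
    private static final String LITERAL =
        "\"(?:[^\"\\\\]|\\\\.)*\"(?:\\^\\^" + IRI + "|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?";

    private static final String SUBJECT = "(?:" + IRI + "|" + BNODE + ")";
    private static final String OBJECT = "(?:" + IRI + "|" + BNODE + "|" + LITERAL + ")";
    private static final String GRAPH = "(?:" + IRI + "|" + BNODE + ")";

    // subject predicate object [graph] '.', with an optional trailing comment.
    private static final Pattern STATEMENT = Pattern.compile(
        "\\s*" + SUBJECT + "\\s+" + IRI + "\\s+" + OBJECT
        + "(?:\\s+" + GRAPH + ")?\\s*\\.\\s*(?:#.*)?");

    // Blank lines and comment-only lines are accepted as harmless.
    private static final Pattern BLANK_OR_COMMENT = Pattern.compile("\\s*(?:#.*)?");

    // Returns true if the line has the basic shape of a triple or quad.
    static boolean looksValid(String line) {
        return BLANK_OR_COMMENT.matcher(line).matches()
            || STATEMENT.matcher(line).matches();
    }
}
```

Lines for which looksValid returns false can be logged and dropped, so that
only structurally plausible lines reach the parser. A literal with a missing
closing " fails the check because the LITERAL pattern requires both quotes.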
It's sort of doable for Turtle. Recovery could be to skip to the next DOT.
RDF/XML: it's nearly impossible, because the error may be in the XML
structure, which is processed by the XML parser, not the RDF/XML parser.
It would need help from, and possibly quite tight integration with, the
XML parser itself.
Andy
Paolo
[1] https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/FirstMapper.java
[2] https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/io/QuadRecordReader.java
(*)
By the way, if someone wants to help me remove the bottleneck caused by the
fact that I am using a single reducer in the first MapReduce job of tdbloader3,
or has ideas on how it could be done, let me know.
Monika Solanki wrote:
Is it possible to check if the incoming data is legal RDF before reading
into the model? I do not want my program to throw an error via
RDFDefaultErrorHandler if the incoming data is illegal RDF. I only want
a warning to be issued and the program should continue execution. If
there are any supporting examples, that would be very helpful.
Thanks,
Monika