Thanks Andy. I have a parser that works on String, but this time I want to do it right and make it streaming and plug it into Jena at the low level.
It seems that I should be able to reuse some code from TokenizerText. I understand StreamRDF is used to sink the triples, but what about ParserProfile? I see LangTurtleBase uses it:

    org.apache.jena.iri.IRI iri = profile.makeIRI(iriStr, currLine, currCol) ;

How do I construct an instance of ParserProfile? Or is there an alternative way to construct IRIs etc.?

Martynas

On Mon, May 11, 2015 at 2:44 PM, Andy Seaborne <a...@apache.org> wrote:
> On 10/05/15 21:48, Martynas Jusevičius wrote:
>>
>> Hey all,
>>
>> I want to refactor my RDF/POST parser into a Jena-compatible reader.
>> An example of the format can be found here:
>> http://www.lsrn.org/semweb/rdfpost.html#sec-examples
>>
>> The documentation suggests implementing ReaderRIOT interface:
>>
>> https://github.com/apache/jena/blob/master/jena-arq/src-examples/arq/examples/riot/ExRIOT_5.java
>>
>> However, if I look at (what I think is) existing readers such as
>> Turtle for example, they do not seem to implement ReaderRIOT:
>>
>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/riot/lang/LangTurtleBase.java
>>
>> What is the explanation for that?
>
>
> Hi Martynas,
>
> It is historical - the Turtle derived parsers emerged with the RiotReader
> interface and some code is/was around that used that interface.
>
> ReaderRIOTLang is the cross-over code from the proper interface ReaderRIOT
> to RiotReader. RiotReader is a fixed set of parsers.
>
> This can be sorted out in Jena3.
>
>>
>> Do I need to tokenize the InputStream myself or is there some
>> machinery I can reuse?
>
>
> The Turtle-world tokenizer is TokenizerText. It is Turtle term specific.
>
> Any tokenizing for a new language is often, in my experience, very sensitive
> to the language details.
>
> If you are used to javacc, and performance isn't critical at scale, that's a
> good tool.
>
> RIOT uses custom I/O for speed; Jena used to have a javacc parser for Turtle
> but Turtle is sufficiently simple that a hand-written parser is doable. A
> hand-written tokenizer is for speed at scale (big file - about x2 faster than
> basic javacc tokenizing) but you need large input to make it worthwhile.
> NTriples dumps of databases make it worthwhile.
>
> If you do rdfpost -> Turtle (string manipulation), then you can parse the
> Turtle as normal. Downside: Error messages may be confusing as they refer
> to the Turtle, not the input string.
>
> Splitting up the query string, with all the HTTP escaping rules, can be done
> with library code (see FusekiLib.parseQueryString [no longer used, but it
> works without consuming the body, unlike the servlet operations which
> combine form and query string processing]) and probably lots of better code
> examples on the web.
>
> Andy
>>
>>
>> Martynas
>> graphityhq.com
>>
>
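
For the query-string splitting discussed above, here is a minimal sketch of how the form-encoded body could be split and decoded in a streaming fashion using only the JDK. FormPairReader and its parse method are hypothetical names invented for illustration; they are not part of Jena or Fuseki, and this is not how FusekiLib.parseQueryString is implemented. The field names in main are illustrative placeholders only. The sketch assumes UTF-8 and stops at decoded name/value pairs; turning those pairs into triples for a StreamRDF is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical helper (not Jena API): streams decoded name/value pairs from a form-encoded body. */
public class FormPairReader {

    /**
     * Reads the body one character at a time and hands each pair to the sink
     * as soon as its terminating '&' (or end of input) is seen, so the whole
     * body is never held in memory. Pass a buffered Reader for real input.
     */
    public static void parse(Reader in, BiConsumer<String, String> sink) throws IOException {
        StringBuilder pair = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '&') {
                emit(pair, sink);
                pair.setLength(0);
            } else {
                pair.append((char) c);
            }
        }
        emit(pair, sink); // the last pair has no trailing '&'
    }

    /** Splits "name=value" on the first '=' and percent-decodes both halves. */
    private static void emit(StringBuilder pair, BiConsumer<String, String> sink) throws IOException {
        if (pair.length() == 0) {
            return; // skip empty segments such as the middle of "a=1&&b=2"
        }
        String s = pair.toString();
        int eq = s.indexOf('=');
        String name = eq < 0 ? s : s.substring(0, eq);
        String value = eq < 0 ? "" : s.substring(eq + 1);
        // URLDecoder also maps '+' to a space; it throws IllegalArgumentException
        // on malformed percent escapes, which a real reader would report as a parse error.
        sink.accept(URLDecoder.decode(name, "UTF-8"), URLDecoder.decode(value, "UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder field names, chosen only to show the decoding.
        List<String> seen = new ArrayList<>();
        parse(new StringReader("su=http%3A%2F%2Fexample.org%2Fs&ol=Hello+World"),
              (name, value) -> seen.add(name + " -> " + value));
        seen.forEach(System.out::println);
    }
}
```

A callback is used rather than a returned Map so that pair order and repeated names are preserved, which an order-sensitive format would presumably need.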