Thank you Andy.
Andy Seaborne wrote:
Missed the important part ....
Any blank node written as _:label will be subject to label scope rules,
that is, per file, and not bNode preserving (that's why TDB does its
own thing).
The tokenizer knows <_:xyz> "URIs", which create bNodes with xyz as
the internal label.
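Something along these lines should then work (an untested sketch based on
the description above; the classes are org.openjena.riot.tokens.Tokenizer /
TokenizerFactory and com.hp.hpl.jena.graph.Node):

  // Sketch: a <_:xyz> "URI" should come back as a bNode with internal label xyz
  Tokenizer tokenizer = TokenizerFactory.makeTokenizerString("<_:xyz>");
  Node node = tokenizer.next().asNode();
  System.out.println(node.getBlankNodeLabel());   // expected: xyz, per the above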
Is ":" a legal character in the xyz part of the bNode internal label?
I am asking because the generated bNode internal labels seem to have ":"
in them, and if I use RIOT's Tokenizer there is a problem, I think.
For example:
1 AnonId id = new AnonId("foo:bar");
2 Node node1 = Node_Blank.createAnon(id);
3 String str = NodeFmtLib.serialize(node1);
4 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
5 assertTrue (tokenizer.hasNext());
6 assertEquals("[BNODE:foo]", tokenizer.next().toString());
7 assertTrue (tokenizer.hasNext());
8 assertEquals("[PREFIXED_NAME::bar]", tokenizer.next().toString());
9 assertFalse (tokenizer.hasNext());
At line 6, I would expect [BNODE:foo:bar] instead.
Now, I am looking at Token{Input|Output}Stream, TSV{Input|Output} and
OutputLangUtils.
Paolo
Andy
On 13/06/11 21:33, Andy Seaborne wrote:
On 13/06/11 16:55, Paolo Castagna wrote:
Andy Seaborne wrote:
On 26/05/11 15:37, Laurent Pellegrino wrote:
Hi all,
I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
String. Now, I have to perform the reverse operation: from the String
I want to create the Node. Is there a class and method to do that from
the ARQ library?
It seems that NodecLib.decode(...) does the trick, but it is in the TDB
library and I am not sure whether it works with any output from
FmtUtils.stringForNode(...).
Kind Regards,
Laurent
There are ways to reverse the process - too many in fact.
Simple: SSE.parseNode: String -> Node
It uses a javacc parser, so the overall efficiency isn't ideal.
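For example (a rough sketch; SSE is com.hp.hpl.jena.sparql.sse.SSE, and a
prefixed name would need the right prefix mapping in scope to round-trip):

  // Sketch: round-trip a Node through FmtUtils and SSE.parseNode
  Node n1 = Node.createURI("http://example/x");
  String s = FmtUtils.stringForNode(n1);   // "<http://example/x>"
  Node n2 = SSE.parseNode(s);              // String -> Node
  // n1.equals(n2) should hold here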
But RIOT is in the process of reworking I/O for efficiency; the input
side is the area that is most finished. The tokenizer will do what you
want.
What's missing in RIOT is Node-to-stream writing without using
FmtUtils -- this is OutputLangUtils, which is unfinished. FmtUtils
creates intermediate strings when the output could go straight to a
stream, avoiding a copy and the temporary object allocation.
The Tokenizer is:
interface Tokenizer extends Iterator<Token>
and see org.openjena.riot.tokens.TokenizerFactory
especially if you have a sequence of them to parse ... like a TSV
file. But you will have to manage newlines yourself, as to the tokenizer
they are whitespace like anything else.
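Something like this, as a sketch (the exact TokenType check may need
adjusting; the point is that term boundaries, not newlines, carry meaning):

  // Sketch: pull RDF terms out of a string one token at a time
  String s = "<http://example/s> <http://example/p> \"o\" .";
  Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(s);
  while (tokenizer.hasNext()) {
      Token t = tokenizer.next();
      if (t.getType() == TokenType.DOT)   // end of this group of terms
          break;
      Node n = t.asNode();
      System.out.println(n);
  }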
There is some stuff in my scratch area for streams of tuples of RDF
terms and variables:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
TokenInputStream and TokenOutputStream might be useful.
Unlike TSV, a tuple of terms is a number of RDF terms terminated by a
DOT (not a newline).
This could be useful to JENA-44, JENA-45 and JENA-69
Hi,
I am looking at the code to serialize bindings (in relation to JENA-44
and JENA-45) and I would like to reuse as much as I can of what is already
available in RIOT (and/or help to add what's missing, once I understand
what the right thing to do is).
I am having a few problems with blank nodes.
This is a snippet of code which explains my problem:
1 Node node1 = Node_Blank.createAnon();
2 String str = NodeFmtLib.serialize(node1);
3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
4 Token token = tokenizer.next();
5 Node node2 = token.asNode();
6 assertEquals(node1, node2);
I have two different problems.
In the case where the blank node id starts with a digit, the assertion at
line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
but was:<1c7b85b4>".
If the blank node id is a negative number (i.e. it starts with a '-'),
I get a RiotParseException: "org.openjena.riot.RiotParseException:
[line: 1, col: 3 ] Blank node label does not start with alphabetic or
_ :-" from TokenizerText.java line 1067.
Setting onlySafeBNodeLabels to true might help.
Because TDB does not use the tokenizer for decode, the raw path may be
buggy.
See OutputLangUtils because that has the prospect of streaming.
We may need to switch to OutputStream, but there is OutStreamUTF8 if UTF-8
encoding by standard Java is costly.
What I am trying to do is to rewrite the BindingSerializer in the patch
for JENA-44. These are the signatures of the two methods I am
implementing:
public void serialize(Binding b, DataOutputStream out) throws
IOException
public Binding deserialize(DataInputStream in) throws IOException
What's wrong with TokenOutputStream, which even does some buffering?
Binding -> Nodes (you're only writing the RDF term values); beware of
missing bindings. See the TSV output format that Laurent has been looking
at.
DataOutputStream can only write 16-bit lengths for strings, so you end up
using write(byte[]) and much of the point of DataOutputStream is lost. It
seems better to me to use our own internal interface and map to whatever
mechanism is most appropriate, with round-tripping between
TokenOutputStream and TokenInputStream then being tested.
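(For reference -- the 16-bit limit is DataOutputStream.writeUTF's unsigned
length prefix; out and str are placeholders in this sketch:)

  byte[] buf = str.getBytes("UTF-8");
  out.writeInt(buf.length);   // 32-bit length prefix: no 64k limit
  out.write(buf);             // raw UTF-8 bytes
  // vs. out.writeUTF(str), whose unsigned 16-bit length prefix
  // fails once the encoded string exceeds 65535 bytes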
At the moment, I am assuming all the bindings written in the same file
have the same variables; I am writing them only once at the beginning of
the file and after that I am serializing binding values only:
for (Var var : vars) {
    Node node = b.get(var);
    byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
Whether this is faster than converting to UTF-8 directly into the
stream will need testing, but it's a point optimization. For now, it's
the design that matters.
    out.writeInt(buf.length);
    out.write(buf);
}
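For the deserialize side, an untested sketch of the matching read loop,
reusing the RIOT tokenizer to parse each term back (and ignoring the
missing-bindings issue for now):

  // Sketch: read length-prefixed UTF-8 strings, parse each back into a Node
  Map<Var, Node> map = new HashMap<Var, Node>();
  for (Var var : vars) {
      int len = in.readInt();
      byte[] buf = new byte[len];
      in.readFully(buf);                    // DataInputStream
      String str = new String(buf, "UTF-8");
      Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
      map.put(var, tokenizer.next().asNode());
  }
  // ... then build the Binding from the (Var, Node) pairs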
Should I try to use OutputLangUtils instead? And Writer(s) instead of
DataOutputStream(s)?
Thanks,
Paolo
I'm keen that we create a single solid I/O layer so it can be tested and
optimized, then shared amongst all the code doing I/O-related things.
Nodec is an interface specialized for ByteBuffers, aimed at file I/O, not
stream I/O. File I/O can be random access.
Andy