Hi,
The number of items isn't known at the start - it's streaming. The RDF Patch
format has a "begin" marker and "commit"/"abort" markers, which are meant to
map onto a transaction. If the stream ends without a "commit", that is a
system abort and all the items after the "begin" are rejected. The idea is
that the receiver can decide whether to buffer the items to determine
completeness, or to connect to a database transaction where the changes are
streamed through and an abort (marker or system-generated) stops them from
becoming visible.
Chunks of a known maximum size could be used, but a begin-commit pair usually
brackets related items (application determined), whereas chunking does not
know what is related.
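
As a rough illustration of the buffering option, here is a minimal Python
sketch. It is not Jena code: the tuple-based rows, the apply_patch function
and the in-memory set used as the "store" are all hypothetical. The row codes
are borrowed from the RDF Patch markers (TX = begin, TC = commit, TA = abort,
A = add, D = delete), but this is not a parser for the real syntax.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical row shape: ("TX",), ("A", s, p, o), ("D", s, p, o), ("TC",), ("TA",)
Row = Tuple[str, ...]


def apply_patch(rows: Iterable[Row], store: Set[Tuple[str, ...]]) -> Tuple[int, bool]:
    """Apply only committed transactions to `store`.

    Returns (number of committed transactions, True if the stream ended
    inside an open transaction, i.e. a system abort).
    """
    pending: Optional[List[Row]] = None  # buffered changes, None when no transaction is open
    committed = 0
    for row in rows:
        op = row[0]
        if op == "TX":
            if pending is not None:
                raise ValueError("begin inside an open transaction")
            pending = []
        elif op in ("A", "D"):
            if pending is None:
                raise ValueError("change outside a transaction")
            pending.append(row)
        elif op == "TC":
            if pending is None:
                raise ValueError("commit without begin")
            # Only now do the buffered changes become visible.
            for kind, *triple in pending:
                if kind == "A":
                    store.add(tuple(triple))
                else:
                    store.discard(tuple(triple))
            pending = None
            committed += 1
        elif op == "TA":
            if pending is None:
                raise ValueError("abort without begin")
            pending = None  # explicit abort: drop the buffered changes
        else:
            raise ValueError(f"unknown row type: {op!r}")
    # End of stream with an open transaction is treated as a system abort:
    # whatever is still buffered is never applied.
    return committed, pending is not None
```

Feeding it a stream that is cut off after a second "TX" leaves the store with
only the changes from the first, committed transaction, and the second element
of the result tells the caller that the stream ended incomplete. Streaming
into a real database transaction would replace the buffer with the database's
own commit/rollback.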
Andy
On 2024/06/20 07:30:35 Max Ulidtko wrote:
> Hi Rob,
>
> Your (de)serialization protocol is insecure if it must rely on EOF
> detection for correctness. Sorry to say this to you, but no matter the
> number of workarounds and hacks, it will break in amusing ways.
>
> An important design difference here: Thrift statically knows, at
> compile / codegen time, how many fields of which width to expect next.
> If I understand correctly, the formats of Jena are defined simply as
> concatenations of Thrift Compact (or ProtoBuf) object serializations.
> This implies a protocol design flaw.
>
> Even as a "Jena outsider" who has not checked the code, what I
> immediately know about Jena's stream receiver logic: it has no idea
> whatsoever, even at runtime (!), how many more objects to expect. Thus
> you think you need the EOF checks so badly.
>
> The classic ways to address this flaw are all about adding appropriate
> framing. Two basic options for the sender:
>
> 1. When it has the luxury of knowing the array size in advance: send a
>    "stream header" telling this size to the receiver. Examples: HTTP
>    'Content-Length' header, 'File Size' header field in BMP images.
> 2. Otherwise, it must add signalling of "this is the last item" /in
>    the layer of your protocol/. Examples: FIN flag in TCP, IEND tag in
>    PNG images.
>
> In Thrift IDL, option 1 might look like:
>
> > struct RDF_StreamHeader {
> >   1: required i32 protoVersion
> >   2: required i32 nStreamItems
> > }
>
> And option 2 might look like:
>
> > union RDF_StreamRowVariant {
> >   1: RDF_PrefixDecl prefixDecl
> >   2: RDF_Triple triple
> >   3: RDF_Quad quad
> > }
> >
> > struct RDF_StreamRow {
> >   1: required RDF_StreamRowVariant datum
> >   2: required bool isLastDatumInStream
> > }
>
> HTH & Best Regards
> Max
>
>
> On Wed, June 19 2024 at 08:42:38 +0000, Rob @ DNR
> <[email protected]> wrote:
> > Hey All
> >
> > Over in the Apache Jena [1] project one of the RDF serialisations we
> > support is based upon a binary encoding [2] of RDF using Thrift [3].
> > We use the Thrift compact protocol and have various implementations
> > of our high-level reader [4] and writer [5] interfaces wrapped around
> > those.
> >
> > However, it was recently noticed that when presented with completely
> > junk data our readers would silently return. And more worryingly,
> > taking an otherwise valid Thrift data stream and arbitrarily
> > truncating it would produce some valid output and silently ignore the
> > trailing malformed data thus losing some data.
> >
> > Upon debugging this was determined to be due to our own EOF exception
> > check at [6]. Historically this appears to have been added due to
> > THRIFT-5022 which was fixed in your codebase back in Nov 2019 [7].
> > However, even with that fix calling protocol.getTransport().isOpen()
> > isn’t a usable check for EOF as even if EOF has been reached this
> > can still return true as the underlying stream might not have been
> > closed yet. Similarly, protocol.getTransport().peek() doesn’t seem
> > a reliable indicator of whether EOF has been reached (it just defers
> > to isOpen() in the default implementation), and we don’t know
> > whether the underlying InputStream is buffered or not (though we do
> > try to enforce that in our wrapper code).
> >
> > So, we’ve currently ended up with this draft PR [8] which feels
> > like a complete hack as I’m basically examining the Exception stack
> > trace to see where in the read attempt we were when we failed and use
> > that as an indicator of malformed input vs true EOF.
> >
> > I feel like we’re missing something obvious about how best to use
> > the Thrift Java APIs. Any guidance/suggestions on how to
> > differentiate between EOF and malformed inputs better would be much
> > appreciated!
> >
> > Or if it’s preferable to handle this as a JIRA Issue, let me know
> > and I can get that filed.
> >
> > Cheers,
> >
> > Rob
> >
> > [1]: <https://jena.apache.org/>
> > [2]: <https://jena.apache.org/documentation/io/rdf-binary.html>
> > [3]:
> > <https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift>
> > [4]:
> > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185>
> > [5]:
> > <https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java>
> > [6]:
> > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179>
> > [7]:
> > <https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd>
> > [8]:
> > <https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4>
>
>