Hi, The number of items isn't known at the start - it's streaming. The RDF Patch format has a "begin" marker and a "commit"/"abort" marker which are supposed to be connected to a transaction. If the stream ends with no "commit", it is a system abort and all the items after the "begin" rejected. The idea being is that the receiver can decide it needs to buffer to determine completeness or connect to a database transaction where the changes can be streamed through and an abort (marker or system generated) used to stop them becoming visible.
Chunks of a known maximum size could be used but the begin-commit usually will be on related items (application determined) whereas chunking does not know what is related. Andy On 2024/06/20 07:30:35 Max Ulidtko wrote: > Hi Rob, > > Your (de)serialization protocol is insecure if it must rely on EOF > detection for correctness. Sorry to say this to you, but no matter > amount of workarounds and hacks, such will break in amusing ways. > > An important design difference here: Thrift statically knows, at > compile / codegen time, how many fields of which width to expect next. > If I understand correctly, the formats of Jena are defined as simply > concatenations of Thrift Compact (or ProtoBuf) object serializations. > This implies a protocol design flaw. > > Even as a "Jena outsider" and having not checked the code, what I > immediately know about Jena streams receiver logic: it has no idea > whatsoever, even at runtime (!), how many more objects to expect. Thus > you think you need the EOF checks so badly. > > The classic ways to address this flaw are all about adding appropriate > framing. Two basic options for the sender: > When it has the luxury of knowing the array size in advance: send a > "stream header" telling this size to the receiver. Examples: HTTP > 'Content-Length' header, 'File Size' header field in BMP > images.Otherwise, it must add signalling of "this is the last item" /in > the layer of your protocol/. Examples: FIN flag in TCP, IEND tag in PNG > images. > In Thrift IDL, option 1 might look like: > > > struct RDF_StreamHeader { > > 1: required i32 protoVersion > > 2: required i32 nStreamItems > > } > > And option 2 might look like: > > > union RDF_StreamRowVariant { > > 1: RDF_PrefixDecl prefixDecl > > 2: RDF_Triple triple > > 3: RDF_Quad quad > > } > > struct RDF_StreamRow { > > 1: required RDF_StreamRowVariant datum > > 2: required bool isLastDatumInStream > > } > > HTH & Best Regards > Max > > > On Wed, June 19 2024 at 08:42:38 +0000, Rob @ DNR > <rve...@dotnetrdf.org> wrote: > > Hey All > > > > Over in the Apache Jena [1] project one of the RDF serialisations we > > support is based upon a binary encoding [2] of RDF using Thrift [3]. > > We use the Thrift compact protocol and have various implementations > > of our high-level reader [4] and writer [5] interfaces wrapped around > > those. > > > > However, it was recently noticed that when presented with completely > > junk data our readers would silently return. And more worryingly, > > taking an otherwise valid Thrift data stream and arbitrarily > > truncating it would produce some valid output and silently ignore the > > trailing malformed data thus losing some data. > > > > Upon debugging this was determined to be due to our own EOF exception > > check at [6]. Historically this appears to have been added due to > > THRIFT-5022 which was fixed in your codebase back in Nov 2019 [7]. > > However, even with that fix calling protocol.getTransport().isOpen() > > isn’t a usable check for EOF as even if EOF has been reached this > > can still return true as the underlying stream might not have been > > closed yet. Similarly, protocol.getTransport().peek() doesn’t seem > > a reliable indicator of whether EOF has been reached (it just defers > > to isOpen() in the default implementation), and we don’t know > > whether the underlying InputStream is buffered or not (though we do > > try to enforce that in our wrapper code). > > > > So, we’ve currently ended up with this draft PR [8] which feels > > like a complete hack as I’m basically examining the Exception stack > > trace to see where in the read attempt we were when we failed and use > > that as an indicator of malformed input vs true EOF. > > > > I feel like we’re missing something obvious about how best to use > > the Thrift Java APIs, any guidance/suggestions on how to do > > differentiate between EOF and malformed inputs better would be much > > appreciated! > > > > Or if it’s preferable to handle this as a JIRA Issue let me know > > and I can get that filed > > > > Cheers, > > > > Rob > > > > [1]: https://jena.apache.org <https://jena.apache.org/> > > [2]: <https://jena.apache.org/documentation/io/rdf-binary.html> > > [3]: > > <https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift> > > [4]: > > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185> > > [5]: > > <https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java> > > [6]: > > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179> > > [7]: > > <https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd> > > [8]: > > <https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4> > >