Hi,

The number of items isn't known at the start - it's streaming. The RDF Patch 
format has a "begin" marker and a "commit"/"abort" marker, which are meant to 
map onto a transaction. If the stream ends with no "commit", that is a 
system abort and all the items after the "begin" are rejected. The idea is 
that the receiver can either buffer until it can determine completeness, or 
connect to a database transaction where the changes are streamed through and 
an abort (marker or system generated) stops them becoming visible.
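
A minimal sketch of the buffering option, in Python for brevity. The row kinds 
"begin", "item", "commit" and "abort" are stand-ins for the markers described 
above, not the actual RDF Patch syntax, and `apply_stream` is a hypothetical 
name, not anything in Jena:

```python
from typing import Iterable, List, Tuple

# (kind, payload); the kind labels are hypothetical, chosen for illustration.
Row = Tuple[str, str]


def apply_stream(rows: Iterable[Row]) -> Tuple[List[str], bool]:
    """Buffer items per transaction and expose them only on "commit".

    Returns (visible_items, clean). clean is False when the stream ended
    inside an open transaction, which is treated as a system abort.
    """
    visible: List[str] = []
    pending: List[str] = []
    in_tx = False

    for kind, payload in rows:
        if kind == "begin":
            if in_tx:
                raise ValueError("begin inside an open transaction")
            in_tx = True
            pending = []
        elif kind == "item":
            if not in_tx:
                raise ValueError("item outside a transaction")
            pending.append(payload)
        elif kind == "commit":
            if not in_tx:
                raise ValueError("commit without begin")
            visible.extend(pending)  # changes become visible only here
            pending = []
            in_tx = False
        elif kind == "abort":
            if not in_tx:
                raise ValueError("abort without begin")
            pending = []  # explicit abort: drop everything since begin
            in_tx = False
        else:
            raise ValueError(f"unknown row kind: {kind!r}")

    # End of stream with a transaction still open means no commit arrived:
    # the pending items are never added to the visible list.
    return visible, not in_tx
```

With this shape a truncated stream loses only the uncommitted tail, and the 
caller is told about it through the second return value rather than having to 
guess from an EOF.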

Chunks of a known maximum size could be used, but a begin-commit pair usually 
brackets related items (application determined), whereas chunking does not 
know what is related.

    Andy

On 2024/06/20 07:30:35 Max Ulidtko wrote:
> Hi Rob,
> 
> Your (de)serialization protocol is insecure if it must rely on EOF 
> detection for correctness. Sorry to say this to you, but no matter the 
> amount of workarounds and hacks, such a design will break in amusing ways.
> 
> An important design difference here: Thrift statically knows, at 
> compile / codegen time, how many fields of which width to expect next. 
> If I understand correctly, the formats of Jena are defined as simply 
> concatenations of Thrift Compact (or ProtoBuf) object serializations. 
> This implies a protocol design flaw.
> 
> Even as a "Jena outsider" and having not checked the code, what I 
> immediately know about Jena streams receiver logic: it has no idea 
> whatsoever, even at runtime (!), how many more objects to expect. Thus 
> you think you need the EOF checks so badly.
> 
> The classic ways to address this flaw are all about adding appropriate 
> framing. Two basic options for the sender:
> 
> 1. When it has the luxury of knowing the array size in advance: send a 
>    "stream header" telling this size to the receiver. Examples: HTTP 
>    'Content-Length' header, 'File Size' header field in BMP images.
> 2. Otherwise, it must add signalling of "this is the last item" /in 
>    the layer of your protocol/. Examples: FIN flag in TCP, IEND tag in 
>    PNG images.
> 
> In Thrift IDL, option 1 might look like:
> 
> > struct RDF_StreamHeader {
> > 1: required i32 protoVersion
> > 2: required i32 nStreamItems
> > }
> 
> And option 2 might look like:
> 
> > union RDF_StreamRowVariant {
> > 1: RDF_PrefixDecl   prefixDecl
> > 2: RDF_Triple       triple
> > 3: RDF_Quad         quad
> > }
> > struct RDF_StreamRow {
> > 1: required RDF_StreamRowVariant datum
> > 2: required bool isLastDatumInStream
> > }
> 
> HTH & Best Regards
> Max
> 
> 
> On Wed, June 19 2024 at 08:42:38 +0000, Rob @ DNR 
> <rve...@dotnetrdf.org> wrote:
> > Hey All
> > 
> > Over in the Apache Jena [1] project one of the RDF serialisations we 
> > support is based upon a binary encoding [2] of RDF using Thrift [3].  
> > We use the Thrift compact protocol and have various implementations 
> > of our high-level reader [4] and writer [5] interfaces wrapped around 
> > those.
> > 
> > However, it was recently noticed that when presented with completely 
> > junk data our readers would silently return.  And more worryingly, 
> > taking an otherwise valid Thrift data stream and arbitrarily 
> > truncating it would produce some valid output and silently ignore the 
> > trailing malformed data thus losing some data.
> > 
> > Upon debugging this was determined to be due to our own EOF exception 
> > check at [6].  Historically this appears to have been added due to 
> > THRIFT-5022 which was fixed in your codebase back in Nov 2019 [7].  
> > However, even with that fix calling protocol.getTransport().isOpen() 
> > isn't a usable check for EOF as even if EOF has been reached this 
> > can still return true as the underlying stream might not have been 
> > closed yet.  Similarly, protocol.getTransport().peek() doesn't seem 
> > a reliable indicator of whether EOF has been reached (it just defers 
> > to isOpen() in the default implementation), and we don't know 
> > whether the underlying InputStream is buffered or not (though we do 
> > try to enforce that in our wrapper code).
> > 
> > So, we've currently ended up with this draft PR [8] which feels 
> > like a complete hack as I'm basically examining the Exception stack 
> > trace to see where in the read attempt we were when we failed and using 
> > that as an indicator of malformed input vs true EOF.
> > 
> > I feel like we're missing something obvious about how best to use 
> > the Thrift Java APIs, any guidance/suggestions on how to 
> > differentiate between EOF and malformed inputs better would be much 
> > appreciated!
> > 
> > Or if it's preferable to handle this as a JIRA Issue let me know 
> > and I can get that filed.
> > 
> > Cheers,
> > 
> > Rob
> > 
> > [1]: <https://jena.apache.org/>
> > [2]: <https://jena.apache.org/documentation/io/rdf-binary.html>
> > [3]: 
> > <https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift>
> > [4]: 
> > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185>
> > [5]: 
> > <https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java>
> > [6]: 
> > <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179>
> > [7]: 
> > <https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd>
> > [8]: 
> > <https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4>
> 
> 
