Hey All

Over in the Apache Jena [1] project one of the RDF serialisations we support is 
based upon a binary encoding [2] of RDF using Thrift [3].  We use the Thrift 
compact protocol and have various implementations of our high-level reader [4] 
and writer [5] interfaces wrapped around those.

However, it was recently noticed that when presented with completely junk data 
our readers would silently return.  And more worryingly, taking an otherwise 
valid Thrift data stream and arbitrarily truncating it would produce some valid 
output and silently ignore the trailing malformed data thus losing some data.

Upon debugging this was determined to be due to our own EOF exception check at 
[6].  Historically this appears to have been added due to THRIFT-5022 which was 
fixed in your codebase back in Nov 2019 [7].  However, even with that fix 
calling protocol.getTransport().isOpen() isn’t a usable check for EOF as even 
if EOF has been reached this can still return true as the underlying stream 
might not have been closed yet.  Similarly, protocol.getTransport().peek() 
doesn’t seem a reliable indicator of whether EOF has been reached (it just 
defers to isOpen() in the default implementation), and we don’t know whether 
the underlying InputStream is buffered or not (though we do try to enforce that 
in our wrapper code).

So, we’ve currently ended up with this draft PR [8] which feels like a complete 
hack as I’m basically examining the Exception stack trace to see where in the 
read attempt we were when we failed and use that as an indicator of malformed 
input vs true EOF.

I feel like we’re missing something obvious about how best to use the Thrift 
Java APIs, any guidance/suggestions on how to do differentiate between EOF and 
malformed inputs better would be much appreciated!

Or if it’s preferable to handle this as a JIRA Issue let me know and I can get 
that filed

Cheers,

Rob

[1]: https://jena.apache.org
[2]: https://jena.apache.org/documentation/io/rdf-binary.html
[3]: 
https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift
[4]: 
https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185
[5]: 
https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java
[6]: 
https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179
[7]: https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd
[8]: 
https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4

Reply via email to