Right, but this requires you to store state in the protocol (currently, protocols *can* do this, but it is not a requirement), and it also mandates an internal buffer *at least* as large as the largest complex object you are going to serialize. There is no way to tell a socket API to go back and fill in 4 bytes earlier in the sequence, so the whole serialized object has to sit in application-space memory before it is written to the transport. Not ideal if you plan to serialize very large containers. This was a problem we consciously considered and wanted to avoid.
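For contrast, here is roughly the shape of the current scheme. This is an illustrative sketch, not the actual TBinaryProtocol code (the names are simplified, and byte order and error handling are ignored), but it shows why the write path never needs to buffer:

  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>

  enum TType : uint8_t { T_STOP = 0, T_I32 = 8, T_STRING = 11 };

  struct Protocol {
    std::vector<uint8_t> wire;  // stand-in for a write-only socket
    size_t rpos = 0;            // read cursor, used by skip() below

    void writeRaw(const void* p, size_t n) {
      const uint8_t* b = static_cast<const uint8_t*>(p);
      wire.insert(wire.end(), b, b + n);
    }
    void writeFieldBegin(int16_t fieldId, TType type) {
      writeRaw(&fieldId, 2);  // field identifier first...
      writeRaw(&type, 1);     // ...then one byte of type metadata
    }
    void writeI32(int32_t v) { writeRaw(&v, 4); }
    void writeString(const std::string& s) {
      int32_t len = static_cast<int32_t>(s.size());
      writeRaw(&len, 4);      // a single string's length is knowable
      writeRaw(s.data(), s.size());
    }
    void readRaw(void* p, size_t n) {
      std::memcpy(p, &wire[rpos], n);
      rpos += n;
    }
  };

  // A reader that hits an unknown field-id skips by switching on the
  // type-id; the writer never had to compute the field's byte size.
  void skip(Protocol& in, TType type) {
    if (type == T_I32) {
      in.rpos += 4;
    } else if (type == T_STRING) {
      int32_t len;
      in.readRaw(&len, 4);
      in.rpos += len;
    }
    // structs and containers recurse over (type, value) pairs to T_STOP
  }

Every write streams straight out, one value at a time, with nothing held back to patch later.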
If you really do "prefer 2D array of arrays," then you truly are a C
programmer at heart, and we may just have to chalk this up as having
different opinions about simplicity. =)

Honestly, though, this debate seems silly to me. Whether it's a
field-size or a type-identifier, it's still an extra byte of metadata
(or more, if the length doesn't fit in a byte) that the protocol needs
to encode in addition to field-identifiers. The only difference is that
one describes the application content (field-type), while the other
describes the serialized data.

Cheers,
mcslee

-----Original Message-----
From: Mayan Moudgill [mailto:[email protected]]
Sent: Monday, May 03, 2010 11:00 PM
To: [email protected]; Mark Slee
Subject: Re: heterogeneous collections

I wouldn't have worried too much about the encoding; you can do it
using pretty much the same write*() interface, with the following
caveats:

- all writes go to an "array" (generally, I prefer a 2D array of arrays)
- writeFieldBegin pushes the offset within the array and advances by
  4 bytes
- writeFieldEnd pops the offset, subtracts it from the current offset,
  and writes that value (= field bytes) at the saved offset

I think you overstate the "simplicity" of the current scheme.
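Concretely, the writer would look something like this. It's an untested
sketch: a single flat buffer stands in for the 2D array-of-arrays, and
the names are made up.

  #include <cstdint>
  #include <cstring>
  #include <stack>
  #include <vector>

  struct SizePrefixWriter {
    std::vector<uint8_t> buf;    // application-space buffer, not a socket
    std::stack<size_t> pending;  // offsets of 4-byte size placeholders

    void writeRaw(const void* p, size_t n) {
      const uint8_t* b = static_cast<const uint8_t*>(p);
      buf.insert(buf.end(), b, b + n);
    }
    void writeFieldBegin(int16_t fieldId) {
      writeRaw(&fieldId, 2);
      pending.push(buf.size());    // push the offset of the size slot
      buf.resize(buf.size() + 4);  // advance past the 4-byte placeholder
    }
    void writeFieldEnd() {
      size_t at = pending.top();
      pending.pop();
      uint32_t size = static_cast<uint32_t>(buf.size() - (at + 4));
      std::memcpy(&buf[at], &size, 4);  // backpatch the field's byte count
    }
    // buf can be handed to the transport only after the outermost
    // writeFieldEnd() has run; a socket can't seek back to fill it in.
  };

On the read side, skipping an unknown field then reduces to advancing
past the prefixed byte count.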
Mark Slee wrote:

> The protocol scheme was written the way it was because it was very
> simple, transparent, straightforward to implement, safe to version
> changes, and reasonably defensive.
>
> - Field identifiers are necessary for versioning
> - Type identifiers are necessary so that we know how to skip fields
>   that we don't recognize
>
> Therefore, the protocol sends a field identifier, then a type
> identifier, then the data.
>
> We could have used field-size instead of a type identifier. That
> simplifies the skip read-operation, but comes at the cost of making
> the write operation much more complicated. It means that if you are
> serializing a complex type, you have to first compact the whole thing
> down to determine its total size in bytes, then write out that header.
> This leads to internal-buffering code in the protocols; not fun when
> you're dealing with containers-of-containers. Even simple cases are
> awkward: I can't know the byte-length of a list of strings without
> actually iterating over all the strings first.
>
> So, using the type-identifier system keeps the TProtocol interface
> incredibly flat and obvious: just serial calls to read/write simple
> values, always one at a time.
>
> The checking of types for known field ids is just basic defensiveness,
> protection against someone changing the type of a field but forgetting
> to update its identifier. We don't generate errors because this is
> considered the same class of occurrence as an invalid field identifier.
>
> I totally agree with you about Thrift *seeming* like a partial attempt
> to implement dynamic RPC. This was basically my point -- I know they
> look similar -- so I do happily excuse people for thinking this. =)
>
> Cheers,
> mcslee
>
> -----Original Message-----
> From: Mayan Moudgill [mailto:[email protected]]
> Sent: Monday, May 03, 2010 9:38 PM
> To: [email protected]; Mark Slee
> Subject: Re: heterogeneous collections
>
> If the goal of Thrift was to transport strongly typed data, then it
> raises the question: why was the protocol scheme currently in use
> adopted?
>
> Clearly, if the data is typed, with the types agreed to at both ends,
> then NO type information needs to be exchanged (other than your
> container lengths, which you may not consider type information). If
> the data is typed, but there may be disagreement over the type at the
> receiver, then you have to send the complete type information along
> with the data. TBinaryProtocol does neither.
>
> Thrift doesn't have strict typing: the stated goal of Thrift is to
> support versioning, where the transmitted type and the receiver type
> are permitted to differ by the addition or deletion of fields. This
> means that the only information that needs to be transmitted is the
> field-id and field-size; either the field is a known one, in which
> case full type info is known, or the field is unknown, in which case
> the number of bytes to skip is known.
>
> Assuming that Thrift were intended to be strongly typed, the only
> reason to actually transmit as much type information as
> TBinaryProtocol actually does is that, implicitly, Thrift is also
> allowing for the type of fields to be changed. Was this intended to
> account for the case where a field was deleted, then reused [which
> does raise the question: what happens if the reused type is the same
> as the original?]; if so, there may very well have been different and
> better ways of doing this.
>
> Other than that, the only reason I can come up with is that this was
> some kind of type-checking half-measure to ensure correctness. But the
> default behavior on a type-mismatch appears to be to discard the
> field, not generate some kind of error.
>
> So, given the implementation of TBinaryProtocol, people could be
> excused for thinking that it's a partial attempt to implement a
> dynamically typed RPC.
>
> Mark Slee wrote:
>
>>>>If, however, you're encoding the data for demarshalling at the
>>>>server, it sounds like you want a different RPC framework.
>>
>>I'm going to slightly hijack the conversation to wax philosophic for a
>>minute here. I think this statement roughly captures my sentiment here.
>>
>>One of Thrift's early goals was basically to do just one thing, but do
>>it very simply and efficiently across lots of platforms. That thing is
>>*strongly-typed* RPC and data-serialization. All of the components
>>were essentially designed under the assumption that they would always
>>be strongly-typed, and that they should always map to something
>>efficient and obvious in a language like C++.
>>
>>Now, a lot of the things Thrift does are very *similar* to other sorts
>>of interesting mechanisms: data-serialization, marshalling,
>>containering, and whatnot. I think it can be very tempting to look at
>>these similarities, analyze the distance between the two things, and
>>decide that since that distance looks pretty crossable, we should just
>>build a bridge to connect the two.
>>
>>My fear is that in the long run this turns a small, neat island into a
>>complicated mess of bridges. If you find the right viewing angle and
>>it's not a foggy day, you can sometimes still see the little island
>>underneath the bridges, but this Thrift thing definitely looks like
>>bridgework, not an island.
>>
>>In the long term, my personal bias is that this is bad for Thrift.
>>Most people interested in building these features need them to solve
>>specific problems and only care about one or two target languages. If
>>we do a lot of this, we end up with a patchwork set of variable
>>feature-lists that are inconsistent across languages. The Thrift
>>"brand" will invariably move away from "simple, lightweight, lets you
>>do the same thing in all programming languages" towards "a bit
>>complicated, does some things in some languages."
>>
>>Part of the idea of Thrift's modular transport/protocol design was
>>that it would make it easy for people to implement custom
>>extensions/modifications to the system *outside of the core project.*
>>Want to sub in your own weird encoding/transport/whatever? No problem,
>>just write a TProtocol. Think other folks will be into it? Cool, post
>>it online and send an email to the thrift-user@ list. Turns out lots
>>of folks want to use it? Then maybe we should incorporate it.
>>
>>For better or worse, I really think simple things like "how many
>>source files appear to be in this tarball?" can matter a lot for
>>software adoption. Even if a project is just 10 easy-to-read files at
>>its core, when you have to locate those 10 files amongst 40 files of
>>extensions and add-ons, and the default make configuration builds
>>everything, the project starts feeling like a complicated, awkward
>>thing to deal with, and we engineers start getting that itchy feeling
>>of "I can't possibly understand this entire thing, surely it is too
>>complicated and slow, why don't we just write our own from scratch."
>>
>>I don't expect everyone to agree with this, and the direction of the
>>project is ultimately at the discretion of the developers most
>>actively working on it, but when it comes to things like dynamic or
>>heterogeneous containers, my opinion is that they just shouldn't be a
>>core part of a strongly-typed software project with stated simplicity
>>goals.
>>
>>Cheers,
>>Mark
>>
>>-----Original Message-----
>>From: Mayan Moudgill [mailto:[email protected]]
>>Sent: Monday, May 03, 2010 10:03 AM
>>To: [email protected]; [email protected]
>>Subject: Re: heterogeneous collections
>>
>>The idea of marshalling to strings seems somewhat counter-productive;
>>after all, you're marshalling the data using Thrift, and it then gets
>>sent to a server, which demarshalls it. Now, on top of that, you're
>>adding another layer of marshalling.
>>
>>A similar thing happens in Cassandra (except that they use binary
>>instead of strings), but at least in Cassandra the user-marshalled
>>data is uninterpreted at the server - it only handles the data as an
>>uninterpreted blob, so the marshalling/demarshalling is confined to
>>the client. [I still wonder how version control is managed - does
>>everyone end up rolling their own?]
>>
>>If, however, you're encoding the data for demarshalling at the server,
>>it sounds like you want a different RPC framework. For instance, do
>>you really need the version flexibility that is provided by Thrift?
>>Are your types fixed at source & destination? Do you need a leaner
>>transport? In fact, why did you pick Thrift in the first place?
>>
>>Apropos the discussion on scalar/string compression in
>>https://issues.apache.org/jira/browse/THRIFT-110
>>I'm curious: if a particular application would tend to compress better
>>using a different algo than the one(s) provided, what happens?
>>
>>>On Mon, May 3, 2010 at 7:09 AM, Bryan Duxbury <[email protected]> wrote:
>>>
>>>>There is already a totally viable workaround, though - make a Union
>>>>of the types you want in your collection, and then make the field
>>>>list<YourUnion>. You get basically all the capabilities with very
>>>>few drawbacks, plus the ability to include multiple logical "types"
>>>>in the collection, not just physical types. Of course, if you
>>>>literally need "any" possible object to go into the collection, then
>>>>this won't do it for you.
>>>
>>>Thanks for the suggestion, Bryan.
>>>
>>>I'm experimenting with marshalling my values to strings (I only deal
>>>with basic types such as int32, int64, strings) right now. If that
>>>doesn't work, I'll go with your suggestion.
>>>
>>>alex
