Right, but this requires you to store state in the protocol (currently, protocols *can* do this, but it is not a requirement), and it also mandates an internal buffer *at least* as large as the largest complex object you are going to serialize. There is no way to tell a socket API to go back and fill in 4 bytes earlier in the sequence, so the whole serialized object has to sit in application-space memory before it is written to the transport. Not ideal if you plan to serialize very large containers. This was a problem we consciously considered and wanted to avoid.
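For contrast, here is roughly the shape of the current scheme. This is an illustrative sketch, not the actual TBinaryProtocol code (the names are simplified, and byte order and error handling are ignored), but it shows why the write path never needs to buffer:

  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>

  enum TType : uint8_t { T_STOP = 0, T_I32 = 8, T_STRING = 11 };

  struct Protocol {
    std::vector<uint8_t> wire;  // stand-in for a write-only socket
    size_t rpos = 0;            // read cursor, used by skip() below

    void writeRaw(const void* p, size_t n) {
      const uint8_t* b = static_cast<const uint8_t*>(p);
      wire.insert(wire.end(), b, b + n);
    }
    void writeFieldBegin(int16_t fieldId, TType type) {
      writeRaw(&fieldId, 2);  // field identifier first...
      writeRaw(&type, 1);     // ...then one byte of type metadata
    }
    void writeI32(int32_t v) { writeRaw(&v, 4); }
    void writeString(const std::string& s) {
      int32_t len = static_cast<int32_t>(s.size());
      writeRaw(&len, 4);      // a single string's length is knowable
      writeRaw(s.data(), s.size());
    }
    void readRaw(void* p, size_t n) {
      std::memcpy(p, &wire[rpos], n);
      rpos += n;
    }
  };

  // A reader that hits an unknown field-id skips by switching on the
  // type-id; the writer never had to compute the field's byte size.
  void skip(Protocol& in, TType type) {
    if (type == T_I32) {
      in.rpos += 4;
    } else if (type == T_STRING) {
      int32_t len;
      in.readRaw(&len, 4);
      in.rpos += len;
    }
    // structs and containers recurse over (type, value) pairs to T_STOP
  }

Every write streams straight out, one value at a time, with nothing held back to patch later.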
If you really do "prefer 2D array of arrays," then you truly are a C
programmer at heart, and we may just have to chalk this up as having
different opinions about simplicity. =)

Honestly, though, this debate seems silly to me. Whether it's a
field-size or a type-identifier, it's still an extra byte of metadata
(or more, if the length doesn't fit in a byte) that the protocol needs
to encode in addition to field-identifiers. The only difference is that
one describes the application content (field-type), while the other
describes the serialized data.

Cheers,
mcslee

-----Original Message-----
From: Mayan Moudgill [mailto:[email protected]]
Sent: Monday, May 03, 2010 11:00 PM
To: [email protected]; Mark Slee
Subject: Re: heterogeneous collections

I wouldn't have worried too much about the encoding; you can do it
using pretty much the same write*() interface, with the following
caveats:

- all writes go to an "array" (generally, I prefer a 2D array of arrays)
- writeFieldBegin pushes the offset within the array and advances by
  4 bytes
- writeFieldEnd pops the offset, subtracts it from the current offset,
  and writes that value (= field bytes) at the saved offset

I think you overstate the "simplicity" of the current scheme.
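Concretely, the writer would look something like this. It's an untested
sketch: a single flat buffer stands in for the 2D array-of-arrays, and
the names are made up.

  #include <cstdint>
  #include <cstring>
  #include <stack>
  #include <vector>

  struct SizePrefixWriter {
    std::vector<uint8_t> buf;    // application-space buffer, not a socket
    std::stack<size_t> pending;  // offsets of 4-byte size placeholders

    void writeRaw(const void* p, size_t n) {
      const uint8_t* b = static_cast<const uint8_t*>(p);
      buf.insert(buf.end(), b, b + n);
    }
    void writeFieldBegin(int16_t fieldId) {
      writeRaw(&fieldId, 2);
      pending.push(buf.size());    // push the offset of the size slot
      buf.resize(buf.size() + 4);  // advance past the 4-byte placeholder
    }
    void writeFieldEnd() {
      size_t at = pending.top();
      pending.pop();
      uint32_t size = static_cast<uint32_t>(buf.size() - (at + 4));
      std::memcpy(&buf[at], &size, 4);  // backpatch the field's byte count
    }
    // buf can be handed to the transport only after the outermost
    // writeFieldEnd() has run; a socket can't seek back to fill it in.
  };

On the read side, skipping an unknown field then reduces to advancing
past the prefixed byte count.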
Mark Slee wrote:

> The protocol scheme was written the way it was because it was very
> simple, transparent, straightforward to implement, safe to version
> changes, and reasonably defensive.
>
> - Field identifiers are necessary for versioning
> - Type identifiers are necessary so that we know how to skip fields
>   that we don't recognize
>
> Therefore, the protocol sends a field identifier, then a type
> identifier, then the data.
>
> We could have used field-size instead of a type identifier. That
> simplifies the skip read-operation, but comes at the cost of making
> the write operation much more complicated. It means that if you are
> serializing a complex type, you have to first compact the whole thing
> down to determine its total size in bytes, then write out that header.
> This leads to internal-buffering code in the protocols; not fun when
> you're dealing with containers-of-containers. Even simple cases are
> awkward: I can't know the byte-length of a list of strings without
> actually iterating over all the strings first.
>
> So, using the type-identifier system keeps the TProtocol interface
> incredibly flat and obvious: just serial calls to read/write simple
> values, always one at a time.
>
> The checking of types for known field ids is just basic defensiveness,
> protection against someone changing the type of a field but forgetting
> to update its identifier. We don't generate errors because this is
> considered the same class of occurrence as an invalid field identifier.
>
> I totally agree with you about Thrift *seeming* like a partial attempt
> to implement dynamic RPC. This was basically my point -- I know they
> look similar -- so I do happily excuse people for thinking this. =)
>
> Cheers,
> mcslee
>
> -----Original Message-----
> From: Mayan Moudgill [mailto:[email protected]]
> Sent: Monday, May 03, 2010 9:38 PM
> To: [email protected]; Mark Slee
> Subject: Re: heterogeneous collections
>
> If the goal of Thrift was to transport strongly typed data, then it
> raises the question: why was the protocol scheme currently in use
> adopted?
>
> Clearly, if the data is typed, with the types agreed to at both ends,
> then NO type information needs to be exchanged (other than your
> container lengths, which you may not consider type information). If
> the data is typed, but there may be disagreement over the type at the
> receiver, then you have to send the complete type information along
> with the data. TBinaryProtocol does neither.
>
> Thrift doesn't have strict typing: the stated goal of Thrift is to
> support versioning, where the transmitted type and the receiver type
> are permitted to differ by the addition or deletion of fields. This
> means that the only information that needs to be transmitted is the
> field-id and field-size; either the field is a known one, in which
> case full type info is known, or the field is unknown, in which case
> the number of bytes to skip is known.
>
> Assuming that Thrift were intended to be strongly typed, the only
> reason to actually transmit as much type information as
> TBinaryProtocol actually does is that, implicitly, Thrift is also
> allowing for the type of fields to be changed. Was this intended to
> account for the case where a field was deleted, then reused [which
> does raise the question: what happens if the reused type is the same
> as the original?]; if so, there may very well have been different and
> better ways of doing this.
>
> Other than that, the only reason I can come up with is that this was
> some kind of type-checking half-measure to ensure correctness. But the
> default behavior on a type-mismatch appears to be to discard the
> field, not generate some kind of error.
>
> So, given the implementation of TBinaryProtocol, people could be
> excused for thinking that it's a partial attempt to implement a
> dynamically typed RPC.
>
> Mark Slee wrote:
>
>>>>If, however, you're encoding the data for demarshalling at the
>>>>server, it sounds like you want a different RPC framework.
>>
>>I'm going to slightly hijack the conversation to wax philosophic for a
>>minute here. I think this statement roughly captures my sentiment here.
>>
>>One of Thrift's early goals was basically to do just one thing, but do
>>it very simply and efficiently across lots of platforms. That thing is
>>*strongly-typed* RPC and data-serialization. All of the components
>>were essentially designed under the assumption that they would always
>>be strongly-typed, and that they should always map to something
>>efficient and obvious in a language like C++.
>>
>>Now, a lot of the things Thrift does are very *similar* to other sorts
>>of interesting mechanisms: data-serialization, marshalling,
>>containering, and whatnot. I think it can be very tempting to look at
>>these similarities, analyze the distance between the two things, and
>>decide that since that distance looks pretty crossable, we should just
>>build a bridge to connect the two.
>>
>>My fear is that in the long run this turns a small, neat island into a
>>complicated mess of bridges. If you find the right viewing angle and
>>it's not a foggy day, you can sometimes still see the little island
>>underneath the bridges, but this Thrift thing definitely looks like
>>bridgework, not an island.
>>
>>In the long term, my personal bias is that this is bad for Thrift.
>>Most people interested in building these features need them to solve
>>specific problems and only care about one or two target languages. If
>>we do a lot of this, we end up with a patchwork set of variable
>>feature-lists that are inconsistent across languages. The Thrift
>>"brand" will invariably move away from "simple, lightweight, lets you
>>do the same thing in all programming languages" towards "a bit
>>complicated, does some things in some languages."
>>
>>Part of the idea of Thrift's modular transport/protocol design was
>>that it would make it easy for people to implement custom
>>extensions/modifications to the system *outside of the core project.*
>>Want to sub in your own weird encoding/transport/whatever? No problem,
>>just write a TProtocol. Think other folks will be into it? Cool, post
>>it online and send an email to the thrift-user@ list. Turns out lots
>>of folks want to use it? Then maybe we should incorporate it.
>>
>>For better or worse, I really think simple things like "how many
>>source files appear to be in this tarball?" can matter a lot for
>>software adoption. Even if a project is just 10 easy-to-read files at
>>its core, when you have to locate those 10 files amongst 40 files of
>>extensions and add-ons, and the default make configuration builds
>>everything, the project starts feeling like a complicated, awkward
>>thing to deal with, and we engineers start getting that itchy feeling
>>of "I can't possibly understand this entire thing, surely it is too
>>complicated and slow, why don't we just write our own from scratch."
>>
>>I don't expect everyone to agree with this, and the direction of the
>>project is ultimately at the discretion of the developers most
>>actively working on it, but when it comes to things like dynamic or
>>heterogeneous containers, my opinion is that they just shouldn't be a
>>core part of a strongly-typed software project with stated simplicity
>>goals.
>>
>>Cheers,
>>Mark
>>
>>-----Original Message-----
>>From: Mayan Moudgill [mailto:[email protected]]
>>Sent: Monday, May 03, 2010 10:03 AM
>>To: [email protected]; [email protected]
>>Subject: Re: heterogeneous collections
>>
>>The idea of marshalling to strings seems somewhat counter-productive;
>>after all, you're marshalling the data using Thrift, and it then gets
>>sent to a server, which demarshalls it. Now, on top of that, you're
>>adding another layer of marshalling.
>>
>>A similar thing happens in Cassandra (except that they use binary
>>instead of strings), but at least in Cassandra the user-marshalled
>>data is uninterpreted at the server - it only handles the data as an
>>uninterpreted blob, so the marshalling/demarshalling is confined to
>>the client. [I still wonder how version control is managed - does
>>everyone end up rolling their own?]
>>
>>If, however, you're encoding the data for demarshalling at the server,
>>it sounds like you want a different RPC framework. For instance, do
>>you really need the version flexibility that is provided by Thrift?
>>Are your types fixed at source & destination? Do you need a leaner
>>transport? In fact, why did you pick Thrift in the first place?
>>
>>Apropos the discussion on scalar/string compression in
>>https://issues.apache.org/jira/browse/THRIFT-110
>>I'm curious: if a particular application would tend to compress better
>>using a different algo than the one(s) provided, what happens?
>>
>>>On Mon, May 3, 2010 at 7:09 AM, Bryan Duxbury <[email protected]> wrote:
>>>
>>>>There is already a totally viable workaround, though - make a Union
>>>>of the types you want in your collection, and then make the field
>>>>list<YourUnion>. You get basically all the capabilities with very
>>>>few drawbacks, plus the ability to include multiple logical "types"
>>>>in the collection, not just physical types. Of course, if you
>>>>literally need "any" possible object to go into the collection, then
>>>>this won't do it for you.
>>>
>>>Thanks for the suggestion, Bryan.
>>>
>>>I'm experimenting with marshalling my values to strings (I only deal
>>>with basic types such as int32, int64, strings) right now. If that
>>>doesn't work, I'll go with your suggestion.
>>>
>>>alex
