I wouldn't have worried too much about the encoding; you can do it using
pretty much the same write*() interface with the following caveats:
- all writes are to an "array" (generally, I prefer a 2D array of arrays)
- writeFieldBegin pushes the current offset within the array and advances by 4 bytes
- writeFieldEnd pops that offset, subtracts it from the current offset, and
writes the result (= field bytes) back at the saved offset.
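For concreteness, a minimal C++ sketch of that back-patching idea; the buffer class, method names, and the 4-byte length slot are illustrative assumptions, not part of Thrift's TProtocol API:

    #include <cstdint>
    #include <cstring>
    #include <stack>
    #include <vector>

    // Illustrative only: reserve a 4-byte length slot at writeFieldBegin()
    // and back-patch it at writeFieldEnd(). A stack of saved offsets makes
    // nested fields (containers-of-containers) work the same way.
    class LengthPrefixBuffer {
    public:
      void writeFieldBegin() {
        slots_.push(buf_.size());                       // remember where the slot starts
        buf_.insert(buf_.end(), 4, static_cast<uint8_t>(0)); // reserve 4 bytes, advance
      }

      void writeFieldEnd() {
        size_t slot = slots_.top();
        slots_.pop();
        // Payload bytes written since the slot; counting the slot itself is
        // an equally workable convention.
        uint32_t len = static_cast<uint32_t>(buf_.size() - slot - 4);
        std::memcpy(&buf_[slot], &len, sizeof(len));    // host byte order for brevity
      }

      void writeBytes(const void* data, size_t n) {     // stands in for writeI32(), etc.
        const uint8_t* p = static_cast<const uint8_t*>(data);
        buf_.insert(buf_.end(), p, p + n);
      }

    private:
      std::vector<uint8_t> buf_;
      std::stack<size_t> slots_;
    };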
I think you overstate the "simplicity" of the current scheme.
Mark Slee wrote:
The protocol scheme was written the way it was because it was very
simple, transparent, straightforward to implement, safe to version
changes, and reasonably defensive.
- Field identifiers are necessary for versioning.
- Type identifiers are necessary so that we know how to skip fields that
  we don't recognize.
- Therefore, the protocol sends a field identifier, then a type
  identifier, then the data.
We could have used field-size instead of a type identifier. That
simplifies the skip operation on the read side, but comes at the cost of making
the write operation much more complicated. It means that if you are
serializing a complex type, you have to first compact the whole thing
down to determine its total size in bytes, then write out that header.
This leads to internal-buffering code in the protocols, not fun when
you're dealing with containers-of-containers. Even simple cases are
awkward: I can't know the byte-length of a list of strings without
actually iterating over all the strings first.
So, using the type-identifier system keeps the TProtocol interface
incredibly flat and obvious, just serial calls to read/write simple
values, always one at a time.
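For illustration, this is roughly what that flat, serial write sequence looks like through Thrift's C++ TProtocol; the struct and field names here are made up, and the header path and exact signatures may vary between Thrift versions:

    #include <thrift/protocol/TProtocol.h>   // path differs in older releases
    #include <cstdint>
    #include <string>
    #include <vector>
    using namespace apache::thrift::protocol;

    // No sizes are computed up front: just one flat call per value.
    void writeExample(TProtocol& out, int32_t id,
                      const std::vector<std::string>& tags) {
      out.writeStructBegin("Example");

      out.writeFieldBegin("id", T_I32, 1);     // field name, type id, field id
      out.writeI32(id);
      out.writeFieldEnd();

      out.writeFieldBegin("tags", T_LIST, 2);
      out.writeListBegin(T_STRING, static_cast<uint32_t>(tags.size()));
      for (const std::string& t : tags) {      // element type + count, not byte size
        out.writeString(t);
      }
      out.writeListEnd();
      out.writeFieldEnd();

      out.writeFieldStop();                    // marks the end of the fields
      out.writeStructEnd();
    }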
The checking of types for known field ids is just basic defensiveness,
protection against someone changing the type of a field but forgetting
to update its identifier. We don't generate errors because this is
considered the same class of occurrence as an invalid field identifier.
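As a sketch, the generated read loop has roughly this shape: a known field id with the expected wire type is read, anything else is skipped via the protocol's skip(); the struct and field names below are invented for the example.

    #include <thrift/protocol/TProtocol.h>
    #include <cstdint>
    #include <string>
    using namespace apache::thrift::protocol;

    // A known id whose wire type matches is read; an unknown id, or a known
    // id with an unexpected type, is skipped rather than treated as an error.
    void readExample(TProtocol& in, int32_t& id_out, std::string& name_out) {
      std::string fname;
      TType ftype;
      int16_t fid;

      in.readStructBegin(fname);
      while (true) {
        in.readFieldBegin(fname, ftype, fid);
        if (ftype == T_STOP) break;

        switch (fid) {
          case 1:
            if (ftype == T_I32) in.readI32(id_out);
            else in.skip(ftype);              // type changed but id reused: drop it
            break;
          case 2:
            if (ftype == T_STRING) in.readString(name_out);
            else in.skip(ftype);
            break;
          default:
            in.skip(ftype);                   // unknown field id: skip by type
            break;
        }
        in.readFieldEnd();
      }
      in.readStructEnd();
    }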
I totally agree with you about Thrift *seeming* like a partial attempt
to implement dynamic RPC. This was basically my point -- I know they
look similar -- so I do happily excuse people for thinking this. =)
Cheers,
mcslee
-----Original Message-----
From: Mayan Moudgill [mailto:[email protected]]
Sent: Monday, May 03, 2010 9:38 PM
To: [email protected]; Mark Slee
Subject: Re: heterogeneous collections
If the goal of Thrift was to transport strongly typed data, then it begs
the question: why was the current protocol scheme adopted?
Clearly, if the data is typed with the types agreed to at both
ends, then NO type information needs to be exchanged (other than your
container lengths - which you may not consider type information). If the
data is typed, but there may be disagreement over the type at the
receiver, then you have to send the complete type information along with
the data. TBinaryProtocol does neither.
Thrift doesn't have strict typing: the stated goal of Thrift is to
support versioning, where the transmitted type and the receiver type are
permitted to differ by the addition or deletion of fields. This means
that the only information that needs to be transmitted is field-id and
field-size; either the field is a known one, in which case full type
info is known, or the field is unknown, in which case the number of
bytes to skip is known.
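A minimal sketch of what that field-id + field-size scheme could look like on the read side; this is not what TBinaryProtocol does, and the 2-byte id / 4-byte size layout and helper names are assumptions made only to illustrate the argument:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct Cursor {
      const uint8_t* p;
      const uint8_t* end;
    };

    static bool readU16(Cursor& c, uint16_t& v) {
      if (c.end - c.p < 2) return false;
      std::memcpy(&v, c.p, 2); c.p += 2; return true;
    }
    static bool readU32(Cursor& c, uint32_t& v) {
      if (c.end - c.p < 4) return false;
      std::memcpy(&v, c.p, 4); c.p += 4; return true;
    }

    // Each field is (id, size, bytes). A known id's type comes from the IDL;
    // an unknown id is skipped by advancing size bytes.
    void readFields(Cursor& c, int32_t& known_field_1) {
      uint16_t fid;
      uint32_t fsize;
      while (readU16(c, fid) && readU32(c, fsize)) {
        if (static_cast<size_t>(c.end - c.p) < fsize) break;   // truncated input
        if (fid == 1 && fsize == sizeof(int32_t)) {
          std::memcpy(&known_field_1, c.p, sizeof(int32_t));   // known id: type known
        }
        c.p += fsize;                                          // otherwise: skip
      }
    }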
Assuming that Thrift were intended to be strongly typed, the only reason
to transmit as much type information as TBinaryProtocol actually does is
that Thrift is implicitly also allowing the type of fields to be
changed. Was this intended to account for the case where a
field was deleted, then reused [which does beg the question - what
happens if the reused type is the same as the original?]; if so, there
may very well have been different and better ways of doing this.
Other than that, the only reason I can come up with is that this was some
kind of type-checking half-measure to ensure correctness. But the
default behavior on a type-mismatch appears to be to discard the field,
not generate some kind of error.
So, given the implementation of TBinaryProtocol, people could be excused
for thinking that it's a partial attempt to implement a dynamically typed
RPC.
Mark Slee wrote:
If, however, you're encoding the data for demarshalling at the server,
it sounds like you want a different RPC framework.
I'm going to slightly hijack the conversation to wax philosophic for a
minute here. I think this statement roughly captures my sentiment here.
One of Thrift's early goals was basically to do just one thing, but do
it very simply and efficiently across lots of platforms. That thing is
*strongly-typed* RPC and data-serialization. All of the components were
essentially designed under the assumption that they would always be
strongly-typed, and that they should always map to something efficient
and obvious in a language like C++.
Now, a lot of the things Thrift does are very *similar* to other
sorts of interesting mechanisms: data-serialization, marshalling,
containering, and whatnot. I think it can be very tempting to look at
these similarities, analyze the distance between the two things, and
decide that, since the distance looks pretty crossable, we should just
build a bridge to connect the two.
My fear is that in the long run this turns a small, neat island into
a complicated mess of bridges. If you find the right viewing angle and
it's not a foggy day, you can sometimes still see the little island
underneath the bridges, but this Thrift thing definitely looks like
bridgework, not an island.
In the long term, my personal bias is that this is bad for Thrift. Most
people interested in building these features need them to solve specific
problems and only care about one or two target languages. If we do a lot
of this, we end up with a patchwork set of variable feature-lists that
are inconsistent across languages. The Thrift "brand" will invariably
move away from "simple, lightweight, lets you do the same thing in all
programming languages" towards "a bit complicated, does some things in
some languages."
Part of the idea of Thrift's modular transport/protocol design
was that it would make it easy for people to implement custom
extensions/modifications to the system *outside of the core project.*
Want to sub in your own weird encoding/transport/whatever? No problem,
just write a TProtocol. Think other folks will be into it? Cool, post
it online and send an email to the thrift-user@ list. Turns out lots of
folks want to use it? Then maybe we should incorporate it.
For better or worse, I really think simple things like "how many source
files appear to be in this tarball?" can matter a lot for software
adoption. Even if a project is just 10 easy-to-read files at its core,
when you have to locate those 10 files amongst 40 files of extensions
and add-ons, and the default make configuration builds everything, the
project starts feeling like a complicated, awkward thing to deal with,
and us engineers start getting that itchy feeling of "I can't possibly
understand this entire thing, surely it is too complicated and slow, why
don't we just write our own from scratch."
I don't expect everyone to agree with this, and the direction of the
project is ultimately at the behest of the developers most actively
working on it, but when it comes to things like dynamic or heterogeneous
containers, my opinion is that they just shouldn't be a core part of a
strongly-typed software project with stated simplicity goals.
Cheers,
Mark
-----Original Message-----
From: Mayan Moudgill [mailto:[email protected]]
Sent: Monday, May 03, 2010 10:03 AM
To: [email protected]; [email protected]
Subject: Re: heterogeneous collections
The idea of marshalling to strings seems somewhat counter-productive;
after all, you're marshalling the data using Thrift, sending it to a
server, which then demarshalls it. Now, on top of that, you're adding
another layer of marshalling.
A similar thing happens in Cassandra (except that they use binary
instead of strings), but at least in Cassandra the user-marshalled data
is uninterpreted at the server - it only handles the data as an opaque
blob, so the marshalling/demarshalling is confined to the client [I
still wonder how version control is managed - does everyone end up
rolling their own?]
If, however, you're encoding the data for demarshalling at the server,
it sounds like you want a different RPC framework. For instance, do you
really need the version flexibility that is provided by Thrift? Are your
types fixed at source & destination? Do you need a leaner transport? In
fact, why did you pick Thrift in the first place?
Apropos the discussion on scalar/string compression in
https://issues.apache.org/jira/browse/THRIFT-110, I'm curious: if a
particular application would tend to compress better using a different
algo than the one(s) provided, what happens?
On Mon, May 3, 2010 at 7:09 AM, Bryan Duxbury <[email protected]> wrote:
There is already a totally viable workaround, though - make a Union of the
types you want in your collection, and then make the field list<YourUnion>.
You get basically all the capabilities with very few drawbacks, plus the
ability to include multiple logical "types" in the collection, not just
physical types. Of course, if you literally need "any" possible object
to go into the collection, then this won't do it for you.
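For reference, a rough sketch of that union workaround; the IDL in the comment and the type names (AnyValue, Bag) are hypothetical, and the structs shown are hand-written stand-ins for what the generator would emit:

    // Assumed IDL (hypothetical names):
    //   union AnyValue { 1: i32 int_val; 2: i64 long_val; 3: string str_val }
    //   struct Bag { 1: list<AnyValue> items }
    // Real generated code also carries __isset flags, readers/writers, and
    // __set_* helpers whose exact shape depends on the generator version.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct AnyValue {          // stand-in for the generated union type
      int32_t int_val = 0;
      int64_t long_val = 0;
      std::string str_val;
    };

    struct Bag {               // stand-in for the generated struct
      std::vector<AnyValue> items;
    };

    void fillBag(Bag& bag) {
      AnyValue a;
      a.int_val = 42;          // logically an "int" element

      AnyValue b;
      b.str_val = "hello";     // logically a "string" element

      bag.items.push_back(a);  // heterogeneous values in one list<AnyValue>
      bag.items.push_back(b);
    }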
Thanks for the suggestion, Bryan.
I'm experimenting with marshalling my values to strings (I only deal with
basic types such as int32, int64, strings) right now. If that doesn't
work, I'll go with your suggestion.
alex