That is pretty much how protobuf-net handles it (although the trunk does have a
"strict" mode that enforces exact matches on wire types). You'd still have the
issue of older code expecting a length-prefix and getting a start-group, so even
if both encodings are allowed during deserialization, you'd want to keep
serializing as length-prefixed unless the contract somehow specified "groups".
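To make the "accept either encoding" idea concrete, here is a toy decoder (my own sketch, not the protobuf-net implementation); the wire-type numbers are the standard ones from the protobuf encoding spec (2 = length-delimited, 3/4 = start/end group), and field-number matching of the end-group tag is omitted for brevity:

```python
VARINT, I64, LEN, SGROUP, EGROUP, I32 = 0, 1, 2, 3, 4, 5

def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def read_nested(buf, pos, wire_type):
    """Return (payload, next_pos) for a nested message encoded either as
    length-prefixed (LEN) or group-delimited (SGROUP ... EGROUP)."""
    if wire_type == LEN:
        size, pos = read_varint(buf, pos)
        return buf[pos:pos + size], pos + size
    if wire_type == SGROUP:
        start, depth = pos, 1
        while True:
            tag_start = pos
            tag, pos = read_varint(buf, pos)
            wt = tag & 7
            if wt == SGROUP:
                depth += 1
            elif wt == EGROUP:
                depth -= 1
                if depth == 0:          # matching end tag: payload ends here
                    return buf[start:tag_start], pos
            elif wt == VARINT:
                _, pos = read_varint(buf, pos)
            elif wt == LEN:
                size, pos = read_varint(buf, pos)
                pos += size
            elif wt == I64:
                pos += 8
            elif wt == I32:
                pos += 4
    raise ValueError("unexpected wire type %d" % wire_type)

# The same inner message (field 2 = varint 150) under both encodings of
# an outer field 1:
as_len   = bytes([0x0A, 0x03, 0x10, 0x96, 0x01])   # tag, length, payload
as_group = bytes([0x0B, 0x10, 0x96, 0x01, 0x0C])   # sgroup, payload, egroup

tag, pos = read_varint(as_len, 0)
payload_a, _ = read_nested(as_len, pos, tag & 7)
tag, pos = read_varint(as_group, 0)
payload_b, _ = read_nested(as_group, pos, tag & 7)
assert payload_a == payload_b == bytes([0x10, 0x96, 0x01])
```

Either way the consumer sees the same payload bytes; only the framing differs.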
Serializing groups is a lot simpler: no messy length calculation, and (if you do
it the other way) no byte-shuffling. The downside is of course the skip time; I
guess it depends whether you are commonly seeking through records without
materializing anything.

Marc

On 19 April 2010 20:57, Kenton Varda <[email protected]> wrote:
> [+jhaberman, who has brought this up before]
>
> So, the last time we added a new wire type was in 2001. This is not
> something to take lightly. Adding a new wire type means that any message
> encoded with that wire type simply cannot be parsed by any version of
> protocol buffers prior to the one where the wire type was introduced. Older
> versions will not even be able to safely skip the field; they'll just fail
> to parse the whole message. Also note that there is only room to add two
> more wire types before we run out of bits.
>
> Your wire type would also add a considerable amount of complication to the
> parser (which would now have to keep count of how many fields have been
> parsed in order to stop at the right moment) and to the serializer (which
> would now have to have a way to count fields, in addition to computing the
> byte size).
>
> I think that if we want to support streaming of large messages, the right
> thing to do is to reuse the start-group/end-group wire types. It's true
> that these don't allow you to pre-allocate space, but it seems like the
> whole point of your proposal is to support cases where you cannot
> pre-allocate space anyway, because the space would be too large. Also note
> that pre-allocating space based on user input can easily lead to security
> vulnerabilities: the message could claim that an impossibly large number
> of fields follow, causing the parser to exhaust memory trying to allocate
> space for them. Note that the protobuf implementation carefully avoids
> pre-allocating large amounts of space for exactly this reason.
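A toy sketch of the streaming difference Marc describes (helper names are mine, not any real API): the group form can write fields as they are produced, while the length-prefixed form has to materialize, or pre-measure, the payload before its prefix can be written:

```python
def write_varint(out, value):
    """Append a base-128 varint to a bytearray."""
    while True:
        b = value & 0x7F
        value >>= 7
        out.append(b | (0x80 if value else 0))
        if not value:
            return

def write_group(out, field, emit_fields):
    """Group form: stream the nested fields directly, no size needed."""
    write_varint(out, field << 3 | 3)   # start-group tag
    emit_fields(out)                    # fields go straight to the output
    write_varint(out, field << 3 | 4)   # end-group tag

def write_length_prefixed(out, field, emit_fields):
    """Length-prefixed form: the payload must exist before its prefix."""
    payload = bytearray()
    emit_fields(payload)                # materialize the whole payload first
    write_varint(out, field << 3 | 2)   # length-delimited tag
    write_varint(out, len(payload))
    out.extend(payload)

def emit(out):                          # inner message: field 2 = varint 150
    write_varint(out, 2 << 3 | 0)
    write_varint(out, 150)

g, lp = bytearray(), bytearray()
write_group(g, 1, emit)
write_length_prefixed(lp, 1, emit)
assert bytes(g)  == bytes([0x0B, 0x10, 0x96, 0x01, 0x0C])
assert bytes(lp) == bytes([0x0A, 0x03, 0x10, 0x96, 0x01])
```

The extra `bytearray` in the length-prefixed path is exactly the buffering (or second computation pass) the thread is discussing.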
> I've been resistant to resurrecting groups before, because they have
> downsides, including slower seeking and no ability to lazily parse
> sub-messages. However, here's a thought that I'm open to considering: what
> if we said that parsers should accept all sub-messages and groups encoded
> in *either* length-delimited or tag-delimited format? The serializer would
> then be able to choose between length-delimited and tag-delimited output
> for each aggregate. Stream-based serializers would always use
> tag-delimited, but serializers for in-memory objects would continue using
> length-delimited in order to allow for lazy parsing and such.
>
> I would be happy to accept this change if it can be shown that it does not
> significantly increase the compiled size of generated code. Generated code
> is already far too large as it is, so we don't want to make it worse. But
> considering that the similar change which allowed repeated primitives to
> be encoded in either packed or non-packed format did not end up increasing
> code size, it's possible that this one wouldn't either.
>
> On Sat, Apr 17, 2010 at 4:59 PM, Sebastian Markbåge <[email protected]> wrote:
>
>> SUMMARY
>>
>> I'd like to add a new wire type similar to Length-delimited for embedded
>> messages and packed repeated fields. However, instead of specifying the
>> byte length, the value would be preceded by the number of fields within
>> it.
>>
>> PROBLEM
>>
>> Protocol Buffers currently require that it be possible to calculate the
>> full byte length of a message before the serialization process begins.
>>
>> This becomes a problem for large data sets where it's costly to keep the
>> full prepared message structure in memory at one time. The data may be
>> computed at serialization time, it may be derived from other in-memory
>> data, or it may be read and derived from another source such as a
>> database.
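The constraint Sebastian is pointing at is the bottom-up size recursion: every length prefix needs the byte size of the whole subtree beneath it before the first byte can be written. A toy illustration of that first pass (the tree structure here is invented for the example):

```python
def varint_size(value):
    """Bytes needed to encode value as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def message_size(fields):
    """First pass of the classic two-pass serializer. `fields` is a list
    of (field_number, value) pairs, where value is an int (varint) or a
    nested list of fields (sub-message)."""
    total = 0
    for field, value in fields:
        if isinstance(value, list):              # nested message:
            inner = message_size(value)          # must recurse before writing
            total += varint_size(field << 3 | 2) + varint_size(inner) + inner
        else:                                    # varint scalar
            total += varint_size(field << 3 | 0) + varint_size(value)
    return total

# The whole tree must be walked (or held in memory) before any output:
msg = [(1, [(2, 150)])]
assert message_size(msg) == 5   # 1 tag byte + 1 length byte + 3 payload bytes
```

For a database-backed or generated data source, running this pass means either producing all the data twice or buffering it all, which is the cost the proposal tries to avoid.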
>> Essentially, other than storing the full message structure in memory or
>> on disk, the only solution is to calculate the message structure twice.
>> Neither is a great option for performance.
>>
>> ALTERNATIVES
>>
>> Let's say we have a message consisting of 1000 embedded large messages.
>>
>> I assume the suggested alternative is to write a custom serialization
>> format that packs the embedded Protobuf messages within it. This is
>> fairly simple.
>>
>> Now let's say we have 100 embedded messages that each contain 1000
>> nested messages. Now things get more complicated. We could keep our
>> large set of messages in a separate indexed format and perhaps reference
>> each sub-set from a standard set of Protobuf messages.
>>
>> As the messages get more complex, it becomes more difficult to maintain
>> the custom format around them.
>>
>> This essentially means that Protocol Buffers isn't suitable for large
>> sets of nested data. This may be where we start looking elsewhere.
>>
>> SOLUTION
>>
>> My solution is based on the assumption that it's fairly cheap to derive
>> the total number of items that will be in a result set without actually
>> loading all the data within it. E.g. it's easy to derive the number of
>> items returned by a result set in a relational database without loading
>> the actual data rows. We certainly don't have to load any relationship
>> tables that may contain nested data.
>>
>> Another case is a large in-memory application structure that needs to be
>> serialized before being sent over the wire. Imagine a complex graph
>> structure or a 3D drawing. The in-memory representation may be very
>> different from the serialized form. Computing that serialization format
>> twice is expensive. Duplicating it in memory is also expensive. But you
>> probably know the number of nodes or groups you'll have.
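The proposed encoding can be sketched as a recursive reader that consumes a field count instead of a byte length. The wire-type number 6 here is purely hypothetical (it is not in the protobuf spec), and only varints, byte strings, and nested count-prefixed messages are handled:

```python
VARINT, LEN, COUNT = 0, 2, 6     # COUNT is the hypothetical new wire type

def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def read_counted(buf, pos):
    """Parse a count-prefixed message: the prefix says how many fields
    follow at this level, so the parser stops after that many fields."""
    count, pos = read_varint(buf, pos)
    fields = []
    for _ in range(count):
        tag, pos = read_varint(buf, pos)
        field, wt = tag >> 3, tag & 7
        if wt == VARINT:
            value, pos = read_varint(buf, pos)
        elif wt == LEN:
            size, pos = read_varint(buf, pos)
            value, pos = bytes(buf[pos:pos + size]), pos + size
        elif wt == COUNT:                        # nested levels count only
            value, pos = read_counted(buf, pos)  # their own direct fields
        else:
            raise ValueError("wire type %d not handled in this sketch" % wt)
        fields.append((field, value))
    return fields, pos

# count = 2, then field 1 = varint 5 and field 2 = varint 150:
buf = bytes([0x02, 0x08, 0x05, 0x10, 0x96, 0x01])
assert read_counted(buf, 0) == ([(1, 5), (2, 150)], 6)
```

Note what the serializer side would need: just the field count up front, never the byte size of the subtree, which is exactly the property the proposal is after.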
>> Even if we can't derive the total number of items for every level in a
>> message tree, it's enough to know the total number of messages at the
>> first level. That will at least give us the ability to break the result
>> up into manageable chunks.
>>
>> Now we can use this fact to add another wire type, similar to
>> Length-delimited. Except instead of specifying the number of bytes in a
>> nested message or packed repeated field, we specify the number of fields
>> at the first level. Each single field within it still specifies its own
>> byte length via its wire type.
>>
>> Note: for nested messages or packed repeated fields we only need to
>> specify the number of fields directly within them. We don't have to
>> count the number of fields within further nested messages.
>>
>> OUT OF SCOPE?
>>
>> Now I realize that Protobuf isn't really designed to work with large
>> datasets like this, so this may be out of the scope of the project. I
>> thought I'd mention it since this is something I run into fairly often.
>> I would think that the case of large record sets in a relational
>> database is fairly common.
>>
>> The solution is fairly simple and versatile. It would make Protocol
>> Buffers even more useful as a de facto standard interchange format
>> within more organizations.
>>
>> The problem with this approach is that it's not as easy to skip ahead
>> over an entire nested message without parsing it. For example, suppose
>> you wanted to load the nth message within a set of repeated fields and
>> the messages themselves used this new wire type. Personally, I don't run
>> into this very often, because you usually need some data within the
>> message to know whether you can skip it. You can't always assume that
>> information will be at the top, so you end up parsing the message
>> anyway. And if skipping does matter, you can restrict this option to the
>> first level only.
>>
>> There's always a trade-off between serialization and deserialization
>> costs.
>> This addition would give us one more optimization route.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Protocol Buffers" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/protobuf?hl=en.

--
Regards,

Marc
