[+jhaberman who has brought this up before] So, the last time we added a new wire type was in 2001. This is not something to take lightly. Adding a new wire type means that any message encoded with that wire type simply cannot be parsed by any version of protocol buffers prior to the one where the wire type was introduced. Older versions will not even be able to safely skip the field; they'll just fail to parse the whole message. Also note that there is only room to add two more wire types before we run out of bits.
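For reference, the bit-level reason there is so little headroom: a field's key is a varint whose low three bits are the wire type. A minimal sketch (my own illustration, not protobuf's actual code):

```python
# My own illustration, not protobuf's code: a field's key is the varint
# (field_number << 3) | wire_type, so the wire type occupies 3 bits and
# there can never be more than 8 wire types in total.

def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def make_tag(field_number, wire_type):
    assert 0 <= wire_type <= 7, "only 3 bits are reserved for the wire type"
    return encode_varint((field_number << 3) | wire_type)

# Types 0-5 are taken (varint, 64-bit, length-delimited, start-group,
# end-group, 32-bit), so only values 6 and 7 remain unassigned.
```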
Your wire type would also add a considerable amount of complication to the parser (which would now have to keep count of how many fields have been parsed in order to stop at the right moment) and to the serializer (which would now need a way to count fields, in addition to computing the byte size).

I think that if we want to support streaming of large messages, the right thing to do is to reuse the startgroup/endgroup wire types. It's true that these don't allow you to pre-allocate space, but the whole point of your proposal seems to be to support cases where you cannot pre-allocate space anyway, because the space would be too large. Also note that pre-allocating space based on user input can easily lead to security vulnerabilities -- the message could claim that an impossibly large number of fields follow, causing the parser to exhaust memory trying to allocate space for them. The protobuf implementation carefully avoids pre-allocating large amounts of space for exactly this reason.

I've been resistant to resurrecting groups before, because they have downsides, including slower seeking and no ability to lazily parse sub-messages. However, here's a thought that I'm open to considering: what if we said that parsers should accept all sub-messages and groups encoded in *either* length-delimited or tag-delimited format? The serializer would then be able to choose between length-delimited and tag-delimited output for each aggregate. Stream-based serializers would always use tag-delimited, but serializers for in-memory objects would continue using length-delimited in order to allow for lazy parsing and such.

I would be happy to accept this change if it can be shown that it does not significantly increase the compiled size of the generated code. Generated code is already far too large as it is, so we don't want to make it worse.
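To make the "accept either framing" idea concrete, here is a rough sketch (the helper names are mine; real protobuf parsers are generated code) of the same payload framed both ways, and a reader that accepts either:

```python
# Sketch of accepting either framing for a sub-message; helper names
# are mine, not protobuf's. Wire types as assigned by the encoding:
VARINT, I64, LEN, SGROUP, EGROUP, I32 = 0, 1, 2, 3, 4, 5

def encode_varint(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def tag(field_number, wire_type):
    return encode_varint((field_number << 3) | wire_type)

def read_varint(buf, pos):
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

# The same inner payload framed two ways as field number 1:
payload = tag(2, VARINT) + encode_varint(150)          # an inner field
length_delimited = tag(1, LEN) + encode_varint(len(payload)) + payload
tag_delimited = tag(1, SGROUP) + payload + tag(1, EGROUP)

def read_submessage(buf):
    """Return the payload of field 1, whichever framing the writer chose."""
    key, pos = read_varint(buf, 0)
    field_number, wire_type = key >> 3, key & 7
    if wire_type == LEN:        # size known up front: cheap skip, lazy parse
        size, pos = read_varint(buf, pos)
        return buf[pos:pos + size]
    if wire_type == SGROUP:     # stream until the matching end-group tag
        end = buf.rindex(tag(field_number, EGROUP))  # simplistic: ignores nesting
        return buf[pos:end]
    raise ValueError("unexpected wire type")
```

The trade-off discussed above falls out directly: the length-delimited form lets a reader skip or lazily parse the sub-message, while the tag-delimited form lets a writer stream it without ever computing its byte size.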
But considering that the similar change which allowed repeated primitives to be encoded in either packed or non-packed format did not end up increasing code size, it's possible that this wouldn't either.

On Sat, Apr 17, 2010 at 4:59 PM, Sebastian Markbåge <[email protected]> wrote:

> SUMMARY
>
> I'd like to add a new wire type similar to Length-delimited for embedded messages and packed repeated fields. However, instead of specifying the byte length, the value will be preceded by the number of fields within it.
>
> PROBLEM
>
> Protocol Buffers currently require that it be possible to calculate the full byte length of a message before the serialization process.
>
> This becomes a problem for large data sets where it's costly to keep the full prepared message structure in memory at one time. This data may be computed at serialization time, it may be derived from other in-memory data, or it may be read and derived from another source such as a database.
>
> Essentially, other than storing the full message structure in memory or on disk, the only solution is to calculate the message structure twice. Neither is a great option for performance.
>
> ALTERNATIVES
>
> Let's say we have a message consisting of 1000 embedded large messages.
>
> I would assume the suggested alternative is to write a custom serialization format that packs the embedded Protobuf messages within it. This is fairly simple.
>
> Now let's say we have 100 embedded messages that each contain 1000 nested messages. Now things get more complicated. We could keep our large set of messages in a separate indexed format and perhaps reference each sub-set from a standard set of Protobuf messages.
>
> As the messages get more complex, it becomes more difficult to maintain the custom format around them.
>
> This essentially means that Protocol Buffers isn't suitable for large sets of nested data. This may be where we start looking elsewhere.
> SOLUTION
>
> My solution is based on the assumption that it's fairly cheap to derive the total number of items that are going to be within a result set without actually loading all the data within it. E.g. it's easy to derive the number of items returned by a result set in a relational database without loading the actual data rows. We certainly don't have to load any relationship tables that may contain nested data.
>
> Another case is if you have a large in-memory application structure that needs to be serialized before being sent over the wire. Imagine a complex graph structure or 3D drawing. The in-memory representation may be very different from the serialized form. Computing that serialization format twice is expensive. Duplicating it in memory is also expensive. But you probably know the number of nodes or groups you'll have.
>
> Even if we can't derive the total number of items for every level in a message tree, it's enough to know the total number of messages at the first levels. That will at least give us the ability to break the result up into manageable chunks.
>
> Now we can use this fact to add another wire type, similar to Length-delimited. Except instead of specifying the number of bytes in a nested message or packed repeated fields, we specify the number of fields at the first level. Each field within it still specifies its own byte length by its wire type.
>
> Note: for nested messages or packed repeated fields we only need to specify the number of fields directly within it. We don't have to count the number of fields within further nested messages.
>
> OUT OF SCOPE?
>
> Now I realize that Protobuf isn't really designed to work with large datasets like this. So this may be out of scope for the project. I thought I'd mention it since this is something I run into fairly often. I would think that the case of large record sets in a relational database is fairly common.
> The solution is fairly simple and versatile. It would make Protocol Buffers even more useful as a de facto standard interchange format within more organizations.
>
> The problem with this approach is that it's not as easy to skip ahead over an entire nested message without parsing it -- for example, if you wanted to load the nth message within a set of repeated fields and the messages themselves use this new wire type. Personally, I don't run into this very often, because you usually need some data within the message to know whether you can skip it. You can't always assume that information will be at the top. So you end up parsing the message. Even if you do, you can just use this option at the first level.
>
> There's always a trade-off between serialization and deserialization costs. This addition would give us one more optimization route.
>
> --
> You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
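For concreteness, the count-prefixed framing proposed in the quoted message might be sketched as follows; the wire-type number (6) and all helper names here are hypothetical, not part of protobuf:

```python
# Hypothetical sketch of the proposed count-delimited framing; the wire
# type number (6) and helper names are invented for illustration only.
COUNT_DELIMITED = 6

def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def tag(field_number, wire_type):
    return encode_varint((field_number << 3) | wire_type)

def write_count_delimited(field_number, count, encoded_fields):
    """Stream an aggregate prefixed by its field count, not its byte size.

    The serializer needs only the number of top-level fields up front
    (e.g. from a SQL COUNT(*)); each field is emitted as it is produced,
    so the whole message never has to sit in memory.
    """
    yield tag(field_number, COUNT_DELIMITED)
    yield encode_varint(count)
    emitted = 0
    for chunk in encoded_fields:
        emitted += 1
        yield chunk
    assert emitted == count, "promised count must match fields emitted"
```

A database-backed serializer could pass the cursor's row count as `count` and a generator of row encodings as `encoded_fields`, which is exactly the streaming scenario the proposal describes.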
