That is pretty much how protobuf-net handles it (although the trunk does have a
"strict" mode that enforces exact matches on wire types). You'd still have the
issue of older code expecting a length-prefix and getting a start-group, so even
if both encodings are allowed during deserialization, you'd want to keep
serializing as length-prefixed unless the contract somehow specified "groups".
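To make the "accept either encoding" idea concrete, here is a toy decoder (my own sketch, not the protobuf-net implementation); the wire-type numbers are the standard ones from the protobuf encoding spec (2 = length-delimited, 3/4 = start/end group), and field-number matching of the end-group tag is omitted for brevity:

```python
VARINT, I64, LEN, SGROUP, EGROUP, I32 = 0, 1, 2, 3, 4, 5

def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def read_nested(buf, pos, wire_type):
    """Return (payload, next_pos) for a nested message encoded either as
    length-prefixed (LEN) or group-delimited (SGROUP ... EGROUP)."""
    if wire_type == LEN:
        size, pos = read_varint(buf, pos)
        return buf[pos:pos + size], pos + size
    if wire_type == SGROUP:
        start, depth = pos, 1
        while True:
            tag_start = pos
            tag, pos = read_varint(buf, pos)
            wt = tag & 7
            if wt == SGROUP:
                depth += 1
            elif wt == EGROUP:
                depth -= 1
                if depth == 0:          # matching end tag: payload ends here
                    return buf[start:tag_start], pos
            elif wt == VARINT:
                _, pos = read_varint(buf, pos)
            elif wt == LEN:
                size, pos = read_varint(buf, pos)
                pos += size
            elif wt == I64:
                pos += 8
            elif wt == I32:
                pos += 4
    raise ValueError("unexpected wire type %d" % wire_type)

# The same inner message (field 2 = varint 150) under both encodings of
# an outer field 1:
as_len   = bytes([0x0A, 0x03, 0x10, 0x96, 0x01])   # tag, length, payload
as_group = bytes([0x0B, 0x10, 0x96, 0x01, 0x0C])   # sgroup, payload, egroup

tag, pos = read_varint(as_len, 0)
payload_a, _ = read_nested(as_len, pos, tag & 7)
tag, pos = read_varint(as_group, 0)
payload_b, _ = read_nested(as_group, pos, tag & 7)
assert payload_a == payload_b == bytes([0x10, 0x96, 0x01])
```

Either way the consumer sees the same payload bytes; only the framing differs.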
Serializing groups is a lot simpler: no messy length calculation, and (if you do
it the other way) no byte-shuffling. The downside is of course the skip time; I
guess it depends whether you are commonly seeking through records without
materializing anything.

Marc

On 19 April 2010 20:57, Kenton Varda <[email protected]> wrote:
> [+jhaberman, who has brought this up before]
>
> So, the last time we added a new wire type was in 2001. This is not
> something to take lightly. Adding a new wire type means that any message
> encoded with that wire type simply cannot be parsed by any version of
> protocol buffers prior to the one where the wire type was introduced. Older
> versions will not even be able to safely skip the field; they'll just fail
> to parse the whole message. Also note that there is only room to add two
> more wire types before we run out of bits.
>
> Your wire type would also add a considerable amount of complication to the
> parser (which would now have to keep count of how many fields have been
> parsed in order to stop at the right moment) and to the serializer (which
> would now have to have a way to count fields, in addition to computing the
> byte size).
>
> I think that if we want to support streaming of large messages, the right
> thing to do is to reuse the start-group/end-group wire types. It's true
> that these don't allow you to pre-allocate space, but it seems like the
> whole point of your proposal is to support cases where you cannot
> pre-allocate space anyway, because the space would be too large. Also note
> that pre-allocating space based on user input can easily lead to security
> vulnerabilities: the message could claim that an impossibly large number
> of fields follow, causing the parser to exhaust memory trying to allocate
> space for them. Note that the protobuf implementation carefully avoids
> pre-allocating large amounts of space for exactly this reason.
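A toy sketch of the streaming difference Marc describes (helper names are mine, not any real API): the group form can write fields as they are produced, while the length-prefixed form has to materialize, or pre-measure, the payload before its prefix can be written:

```python
def write_varint(out, value):
    """Append a base-128 varint to a bytearray."""
    while True:
        b = value & 0x7F
        value >>= 7
        out.append(b | (0x80 if value else 0))
        if not value:
            return

def write_group(out, field, emit_fields):
    """Group form: stream the nested fields directly, no size needed."""
    write_varint(out, field << 3 | 3)   # start-group tag
    emit_fields(out)                    # fields go straight to the output
    write_varint(out, field << 3 | 4)   # end-group tag

def write_length_prefixed(out, field, emit_fields):
    """Length-prefixed form: the payload must exist before its prefix."""
    payload = bytearray()
    emit_fields(payload)                # materialize the whole payload first
    write_varint(out, field << 3 | 2)   # length-delimited tag
    write_varint(out, len(payload))
    out.extend(payload)

def emit(out):                          # inner message: field 2 = varint 150
    write_varint(out, 2 << 3 | 0)
    write_varint(out, 150)

g, lp = bytearray(), bytearray()
write_group(g, 1, emit)
write_length_prefixed(lp, 1, emit)
assert bytes(g)  == bytes([0x0B, 0x10, 0x96, 0x01, 0x0C])
assert bytes(lp) == bytes([0x0A, 0x03, 0x10, 0x96, 0x01])
```

The extra `bytearray` in the length-prefixed path is exactly the buffering (or second computation pass) the thread is discussing.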
> I've been resistant to resurrecting groups before, because they have
> downsides, including slower seeking and no ability to lazily parse
> sub-messages. However, here's a thought that I'm open to considering: what
> if we said that parsers should accept all sub-messages and groups encoded
> in *either* length-delimited or tag-delimited format? The serializer would
> then be able to choose between length-delimited and tag-delimited output
> for each aggregate. Stream-based serializers would always use
> tag-delimited, but serializers for in-memory objects would continue using
> length-delimited in order to allow for lazy parsing and such.
>
> I would be happy to accept this change if it can be shown that it does not
> significantly increase the compiled size of generated code. Generated code
> is already far too large as it is, so we don't want to make it worse. But
> considering that the similar change which allowed repeated primitives to
> be encoded in either packed or non-packed format did not end up increasing
> code size, it's possible that this one wouldn't either.
>
> On Sat, Apr 17, 2010 at 4:59 PM, Sebastian Markbåge <[email protected]> wrote:
>
>> SUMMARY
>>
>> I'd like to add a new wire type similar to Length-delimited for embedded
>> messages and packed repeated fields. However, instead of specifying the
>> byte length, the value would be preceded by the number of fields within
>> it.
>>
>> PROBLEM
>>
>> Protocol Buffers currently require that it be possible to calculate the
>> full byte length of a message before the serialization process begins.
>>
>> This becomes a problem for large data sets where it's costly to keep the
>> full prepared message structure in memory at one time. The data may be
>> computed at serialization time, it may be derived from other in-memory
>> data, or it may be read and derived from another source such as a
>> database.
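The constraint Sebastian is pointing at is the bottom-up size recursion: every length prefix needs the byte size of the whole subtree beneath it before the first byte can be written. A toy illustration of that first pass (the tree structure here is invented for the example):

```python
def varint_size(value):
    """Bytes needed to encode value as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def message_size(fields):
    """First pass of the classic two-pass serializer. `fields` is a list
    of (field_number, value) pairs, where value is an int (varint) or a
    nested list of fields (sub-message)."""
    total = 0
    for field, value in fields:
        if isinstance(value, list):              # nested message:
            inner = message_size(value)          # must recurse before writing
            total += varint_size(field << 3 | 2) + varint_size(inner) + inner
        else:                                    # varint scalar
            total += varint_size(field << 3 | 0) + varint_size(value)
    return total

# The whole tree must be walked (or held in memory) before any output:
msg = [(1, [(2, 150)])]
assert message_size(msg) == 5   # 1 tag byte + 1 length byte + 3 payload bytes
```

For a database-backed or generated data source, running this pass means either producing all the data twice or buffering it all, which is the cost the proposal tries to avoid.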
>> Essentially, other than storing the full message structure in memory or
>> on disk, the only solution is to calculate the message structure twice.
>> Neither is a great option for performance.
>>
>> ALTERNATIVES
>>
>> Let's say we have a message consisting of 1000 embedded large messages.
>>
>> I assume the suggested alternative is to write a custom serialization
>> format that packs the embedded Protobuf messages within it. This is
>> fairly simple.
>>
>> Now let's say we have 100 embedded messages that each contain 1000
>> nested messages. Now things get more complicated. We could keep our
>> large set of messages in a separate indexed format and perhaps reference
>> each sub-set from a standard set of Protobuf messages.
>>
>> As the messages get more complex, it becomes more difficult to maintain
>> the custom format around them.
>>
>> This essentially means that Protocol Buffers isn't suitable for large
>> sets of nested data. This may be where we start looking elsewhere.
>>
>> SOLUTION
>>
>> My solution is based on the assumption that it's fairly cheap to derive
>> the total number of items that will be in a result set without actually
>> loading all the data within it. E.g. it's easy to derive the number of
>> items returned by a result set in a relational database without loading
>> the actual data rows. We certainly don't have to load any relationship
>> tables that may contain nested data.
>>
>> Another case is a large in-memory application structure that needs to be
>> serialized before being sent over the wire. Imagine a complex graph
>> structure or a 3D drawing. The in-memory representation may be very
>> different from the serialized form. Computing that serialization format
>> twice is expensive. Duplicating it in memory is also expensive. But you
>> probably know the number of nodes or groups you'll have.
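The proposed encoding can be sketched as a recursive reader that consumes a field count instead of a byte length. The wire-type number 6 here is purely hypothetical (it is not in the protobuf spec), and only varints, byte strings, and nested count-prefixed messages are handled:

```python
VARINT, LEN, COUNT = 0, 2, 6     # COUNT is the hypothetical new wire type

def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def read_counted(buf, pos):
    """Parse a count-prefixed message: the prefix says how many fields
    follow at this level, so the parser stops after that many fields."""
    count, pos = read_varint(buf, pos)
    fields = []
    for _ in range(count):
        tag, pos = read_varint(buf, pos)
        field, wt = tag >> 3, tag & 7
        if wt == VARINT:
            value, pos = read_varint(buf, pos)
        elif wt == LEN:
            size, pos = read_varint(buf, pos)
            value, pos = bytes(buf[pos:pos + size]), pos + size
        elif wt == COUNT:                        # nested levels count only
            value, pos = read_counted(buf, pos)  # their own direct fields
        else:
            raise ValueError("wire type %d not handled in this sketch" % wt)
        fields.append((field, value))
    return fields, pos

# count = 2, then field 1 = varint 5 and field 2 = varint 150:
buf = bytes([0x02, 0x08, 0x05, 0x10, 0x96, 0x01])
assert read_counted(buf, 0) == ([(1, 5), (2, 150)], 6)
```

Note what the serializer side would need: just the field count up front, never the byte size of the subtree, which is exactly the property the proposal is after.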
>> Even if we can't derive the total number of items for every level in a
>> message tree, it's enough to know the total number of messages at the
>> first level. That will at least give us the ability to break the result
>> up into manageable chunks.
>>
>> Now we can use this fact to add another wire type, similar to
>> Length-delimited. Except instead of specifying the number of bytes in a
>> nested message or packed repeated field, we specify the number of fields
>> at the first level. Each single field within it still specifies its own
>> byte length via its wire type.
>>
>> Note: for nested messages or packed repeated fields we only need to
>> specify the number of fields directly within them. We don't have to
>> count the number of fields within further nested messages.
>>
>> OUT OF SCOPE?
>>
>> Now I realize that Protobuf isn't really designed to work with large
>> datasets like this, so this may be out of the scope of the project. I
>> thought I'd mention it since this is something I run into fairly often.
>> I would think that the case of large record sets in a relational
>> database is fairly common.
>>
>> The solution is fairly simple and versatile. It would make Protocol
>> Buffers even more useful as a de facto standard interchange format
>> within more organizations.
>>
>> The problem with this approach is that it's not as easy to skip ahead
>> over an entire nested message without parsing it. For example, suppose
>> you wanted to load the nth message within a set of repeated fields and
>> the messages themselves used this new wire type. Personally, I don't run
>> into this very often, because you usually need some data within the
>> message to know whether you can skip it. You can't always assume that
>> information will be at the top, so you end up parsing the message
>> anyway. And if skipping does matter, you can restrict this option to the
>> first level only.
>>
>> There's always a trade-off between serialization and deserialization
>> costs.
>> This addition would give us one more optimization route.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Protocol Buffers" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/protobuf?hl=en.

--
Regards,

Marc
