On Wed, Mar 10, 2010 at 7:48 AM, Marc Gravell <marc.grav...@gmail.com> wrote:
> The variable-length encoding allows multiple representations of the
> same value - for example, 1 could be written as:
>
> 0000 0001
>
> or it could be (I believe):
>
> 1000 0001 1000 0000 0000 0000
>
> Is there anything in the core (or other) implementations that would
> object to this second form?

I haven't checked Python, but at least for Java and C++ this would parse
correctly. (You can try this out for yourself by just producing the data
and calling CodedInputStream::ReadVarint32.)

> In particular, the scenario I'm thinking about is where a message is
> known to be pretty deep - for example:
>
> A
>   B
>     C
>       D
>       D
>     C
>       D
>       D
>
> At the moment, my code leaves an optimistic single-byte dummy value
> for the prefix, and then backfills this value when the length is known
> (i.e. after writing this subtree), shuffling the data if needed.
>
> I'm toying with making this voluntarily lossy; for example it might
> (in some cases) decide to leave a longer (2-5 byte) dummy value, and
> write the alternative form if the value turns out to be small (i.e. if
> actually only 4 bytes were written). This would reduce the number of
> times I need to block-copy the data (noting that it might have to copy
> the same data multiple times - potentially every non-root object might
> be more than 127 bytes).

The way this is handled in the C++ and Java implementations (again, I don't
know about Python) is to compute the byte size up front and store the sizes
internally, so that the lengths of the submessages are already known. When
serializing, the containing message just reads the cached size and writes
the correct length delimiter. You commonly need to know the total size that
will be written anyway, in order to make sure the output buffer has
sufficient space, so computing the byte size up front usually doesn't incur
any extra cost. In fact, we used to use the approach that your current code
does: leave a single-byte dummy value, backfill, and shuffle if necessary.
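To make the cached-size approach concrete, here's a minimal sketch in Python. The `Msg` class and helper names are made up for illustration - this is not generated protobuf code, and field tags are omitted for brevity. The point is the two passes: sizes are computed and cached bottom-up once, then serialization is a single forward write with no backfilling or shuffling.

```python
def varint_size(n):
    """Number of bytes n occupies as a canonical base-128 varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

def write_varint(out, n):
    """Append the canonical varint encoding of n to a bytearray."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)

class Msg:
    """Toy message: some payload bytes plus nested submessages (hypothetical)."""
    def __init__(self, payload=b"", children=()):
        self.payload = payload
        self.children = list(children)
        self._cached_size = None

    def byte_size(self):
        # Pass 1: compute sizes bottom-up and cache them.
        if self._cached_size is None:
            size = len(self.payload)
            for child in self.children:
                child_size = child.byte_size()
                size += varint_size(child_size) + child_size
            self._cached_size = size
        return self._cached_size

    def serialize(self, out):
        # Pass 2: every length is already known, so just write forward.
        out += self.payload
        for child in self.children:
            write_varint(out, child.byte_size())
            child.serialize(out)

msg = Msg(b"A", [Msg(b"B", [Msg(b"C" * 200)])])
buf = bytearray()
total = msg.byte_size()   # also tells you how big the output buffer must be
msg.serialize(buf)
print(total, len(buf))    # the two always agree
```

Note that `byte_size()` doubles as the buffer-sizing call, which is why the up-front computation is essentially free in practice.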
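And for the padded-prefix idea itself, the non-canonical form is easy to demonstrate with a standalone varint round-trip. These helpers are my own sketch, not the library's API - a real reader would be CodedInputStream, as above - but the wire-level behavior is the same: a small value padded out with continuation bytes decodes to the same value.

```python
def encode_padded_varint(value, width):
    """Encode value as a base-128 varint padded to exactly `width` bytes
    (the non-canonical form; width must be >= the canonical size)."""
    out = bytearray()
    for _ in range(width - 1):
        out.append((value & 0x7F) | 0x80)  # continuation bit forced on
        value >>= 7
    out.append(value & 0x7F)               # final byte, continuation clear
    return bytes(out)

def decode_varint(data):
    """Standard little-endian base-128 varint decode; returns (value, length)."""
    value, shift = 0, 0
    for i, b in enumerate(data):
        value |= (b & 0x7F) << shift
        if not (b & 0x80):
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

print(decode_varint(encode_padded_varint(1, 1)))  # (1, 1) - canonical 0x01
print(decode_varint(encode_padded_varint(1, 3)))  # (1, 3) - 0x81 0x80 0x00
print(decode_varint(encode_padded_varint(4, 5)))  # (4, 5) - a 5-byte hole backfilled with 4
```

The five-byte case is exactly the "leave a larger hole" scheme: reserve a fixed-width prefix, then backfill the real length in padded form instead of shuffling the data.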
Switching to the cached-size approach was a pretty significant performance
gain.

Leaving a larger hole would work fine, but I'd be wary of the extra bloat
to the encoding size - in the event that you have a large number of small
submessages, the length delimiters would chew up a lot of space.

> I'm tempted to spike it (to gauge the performance benefit), but I
> don't want to waste my time if this is going to make it incompatible
> with the core implementations (i.e. if they would actively spot this
> and cause an error). And if it *is* valid, would it be possible to
> make it explicit in the encoding spec? (Or indeed, if it *isn't* valid,
> make it explicit in the encoding spec.)
>
> Thanks,
>
> Marc Gravell

--
You received this message because you are subscribed to the Google Groups
"Protocol Buffers" group.
To post to this group, send email to proto...@googlegroups.com.
To unsubscribe from this group, send email to
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/protobuf?hl=en.