On Wed, Mar 10, 2010 at 7:48 AM, Marc Gravell <marc.grav...@gmail.com> wrote:

> The variable-length encoding allows multiple representations of the
> same value - for example, 1 could be written as:
>
>  0000 0001
>
> or it could be (I believe):
>
>  1000 0001    1000 0000    0000 0000
>
> Is there anything in the core (or other) implementations that would
> object to this second form?
>

I haven't checked Python, but at least for Java and C++ this would parse
correctly. (You can try this out for yourself by producing the data and
calling CodedInputStream::ReadVarint32.)
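To see why both forms decode to the same value, here's a minimal Python sketch of a base-128 varint decoder (an illustration, not the official implementation): the decoder just accumulates 7-bit groups until it hits a byte whose continuation bit is clear, so padding with `1000 0000` bytes adds nothing to the result.

```python
def read_varint(data):
    """Decode a little-endian base-128 varint; returns (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, b in enumerate(data):
        result |= (b & 0x7F) << shift   # low 7 bits are payload
        if not (b & 0x80):              # high bit clear: this is the last byte
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")

# The canonical one-byte form and the padded three-byte form of 1
# decode to the same value:
assert read_varint(bytes([0x01])) == (1, 1)
assert read_varint(bytes([0x81, 0x80, 0x00])) == (1, 3)
```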


> In particular, the scenario I'm thinking about is where a message is
> known to be pretty deep - for example:
>
>  A
>  > B
>     > C
>        > D
>        > D
>     > C
>        > D
>        > D
>
> At the moment, my code leaves an optimistic single-byte dummy-value
> for the prefix, and then backfills this value when the length is known
> (i.e. after writing this subtree), shuffling the data if needed.
>
> I'm toying with making this voluntarily lossy; for example it might
> (in some cases) decide to leave a longer (2-5) dummy value, and write
> the alternative form if the value turns out to be small (i.e. if
> actually only 4 bytes were written). This would reduce the number of
> times I need to block-copy the data (noting that it might have to copy
> the same data multiple times - potentially every non-root object might
> be more than 127 bytes).
>

The way this is handled in the C++ and Java implementations (again, don't
know about python) is to compute the byte-size up front, and store the sizes
internally so that the lengths of the submessages are already known. When
serializing, the containing message just gets the cached size and writes the
correct length delimiter. You commonly need to know the total size that will
be written in order to make sure the output buffer has sufficient space, so
computing the byte size up-front usually doesn't incur any extra cost.
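As a rough illustration of that two-pass scheme (hypothetical names, not protobuf's actual API): a first pass computes and caches every submessage's size bottom-up, then serialization proceeds top-down, writing each length delimiter from the cache before its payload - no backfilling or shuffling.

```python
def varint_size(value):
    """Bytes needed for the canonical varint encoding of value."""
    size = 1
    while value > 0x7F:
        value >>= 7
        size += 1
    return size

def encode_varint(value, out):
    """Append the canonical varint encoding of value to `out`."""
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return

class Node:
    """Toy nested message: one tag byte plus length-delimited children."""
    def __init__(self, children=()):
        self.children = list(children)
        self._cached_size = None

    def byte_size(self):
        # First pass (bottom-up): compute and cache every subtree's size.
        size = 1  # pretend each node carries a single tag byte
        for child in self.children:
            n = child.byte_size()
            size += varint_size(n) + n
        self._cached_size = size
        return size

    def serialize(self, out):
        # Second pass (top-down): every length delimiter is already
        # known before its payload is written.
        out.append(0x0A)  # placeholder tag byte
        for child in self.children:
            encode_varint(child._cached_size, out)
            child.serialize(out)

# The A > B > C > D tree from the question:
d = lambda: Node()
c = lambda: Node([d(), d()])
root = Node([Node([c(), c()])])   # A contains B; B contains two Cs
total = root.byte_size()          # one up-front sizing pass
buf = bytearray()
root.serialize(buf)
assert len(buf) == total          # the cached sizes line up exactly
```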

In fact, we used to use the same approach as your current code: leave a
single-byte dummy value, backfill, and shuffle if necessary. Switching to
the cached-size approach was a pretty significant performance gain. Leaving
a larger hole would work fine, but I'd be wary of the extra bloat to the
encoding size - in the event that you have a large number of small
submessages, the length delimiters would chew up a lot of space.
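For completeness, the larger-hole idea could be sketched as follows - `write_padded_varint` is a hypothetical helper, and the key point is that the padding bytes carry the continuation bit, so the hole is filled in place as one valid (if non-canonical) varint:

```python
def write_padded_varint(value, width):
    """Encode value in exactly `width` varint bytes, setting the
    continuation bit on every byte but the last so the padding is
    part of a single valid (non-canonical) varint."""
    out = bytearray()
    for i in range(width):
        out.append((value & 0x7F) | (0x80 if i < width - 1 else 0))
        value >>= 7
    if value:
        raise ValueError("value does not fit in the requested width")
    return bytes(out)

# Reserve a 5-byte hole, write the subtree, then backfill in place -
# no block-copy even when the payload turns out to be short:
buf = bytearray()
hole = len(buf)
buf += b"\x00" * 5                 # optimistic fixed-width hole
buf += b"pretend subtree bytes"    # ... serialize the submessage ...
length = len(buf) - hole - 5
buf[hole:hole + 5] = write_padded_varint(length, 5)
```

The trade-off is exactly the bloat mentioned above: every small submessage pays the full 5 bytes for a length that would normally fit in one.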


>
> I'm tempted to spike it (to gauge the performance benefit), but I
> don't want to waste my time if this is going to make it incompatible
> with the core implementations (i.e. if they would actively spot this
> and cause an error). And if it *is* valid, would it be possible to
> make it explicit in the encoding spec? (or indeed, if it *isn't* valid
> make it explicit in the encoding spec).
>
> Thanks,
>
> Marc Gravell
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to proto...@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com
> .
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
>
>
