[protobuf] Suggesting a new Wire Type which encodes length by nested field count rather than bytes

Sebastian Markbåge Sat, 17 Apr 2010 17:54:00 -0700

SUMMARY

I’d like to add a new wire type similar to Length-delimited for
embedded messages and packed repeated fields. However, instead of
specifying the byte-length, the value will be preceded by the number
of fields within it.


PROBLEM

Protocol Buffers currently require that it’s possible to calculate the
full byte length of a message before the serialization process.

This becomes a problem for large data sets where it’s costly to keep
the full prepared message structure in memory all at once. This data may
be computed at serialization time, it may be derived from other
in-memory data, or it may be read and derived from another source such
as a database.

Essentially, other than storing the full message structure in memory
or on disk, the only solution is to compute the message structure
twice. Neither is a great option for performance.
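To make the constraint concrete, here is a minimal Python sketch of how a length-delimited field (wire type 2) is written today. The helper names `encode_varint` and `write_length_delimited` are made up for illustration and are not part of any Protobuf library. The point is that the byte length goes on the wire before the payload, so the whole payload has to exist (or be measured in a separate pass) first.

```python
import io
from typing import BinaryIO


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def write_length_delimited(out: BinaryIO, field_number: int, payload: bytes) -> None:
    """Write one length-delimited field: tag, byte length, then payload.

    The payload must already be fully serialized, because its byte
    length is written before any of its bytes.
    """
    out.write(encode_varint((field_number << 3) | 2))  # wire type 2
    out.write(encode_varint(len(payload)))
    out.write(payload)


buf = io.BytesIO()
write_length_delimited(buf, 1, b"abc")
encoded = buf.getvalue()
```

With 1000 large embedded messages inside one outer message, `payload` for the outer field is the concatenation of all of them, which is exactly the buffering (or double computation) described above.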

ALTERNATIVES

Let’s say we have a message consisting of 1000 embedded large
messages.

I would assume the suggested alternative is to write a custom
serialization format that packs the embedded Protobuf messages within
it. This is fairly simple.

Now let’s say we have 100 embedded messages that each contains 1000
nested messages. Now things get more complicated. We could keep our
large set of messages in a separate indexed format and perhaps
reference each sub-set from a standard set of Protobuf messages.

As the messages get more complex, it becomes more difficult to
maintain the custom format around them.

This essentially means that Protocol Buffers isn’t suitable for large
sets of nested data. This may be where we start looking elsewhere.

SOLUTION

My solution is based on the assumption that it’s fairly cheap to
derive the total number of items in a result set without actually
loading all the data within it. E.g. it’s easy to
derive the number of items returned by a result set in a relational
database without loading the actual data rows. We certainly don’t have
to load any relationship tables that may contain nested data.
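As a small illustration of that assumption, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The `items` table and the `count_then_stream` helper are hypothetical. It gets the row count with one query and then returns a cursor that yields rows lazily, so the rows never have to be held in memory together.

```python
import sqlite3

# Hypothetical example data: a small table standing in for a large result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items (payload) VALUES (?)",
    [("a",), ("b",), ("c",)],
)


def count_then_stream(connection: sqlite3.Connection):
    """Return the row count and a cursor that yields the rows one by one."""
    (count,) = connection.execute("SELECT COUNT(*) FROM items").fetchone()
    rows = connection.execute("SELECT id, payload FROM items ORDER BY id")
    return count, rows


count, rows = count_then_stream(conn)
```

In a real system the two queries would need to see the same snapshot of the data (for example by running in one transaction), otherwise the declared count and the streamed rows could disagree.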

Another case is if you have a large in-memory application structure
that needs to be serialized before being sent over the wire. Imagine a
complex graph structure or 3D drawing. The in-memory representation
may be very different from the serialized form. Computing that
serialization format twice is expensive. Duplicating it in memory is
also expensive. But you probably know the number of nodes or groups
you’ll have.

Even if we can’t derive the total number of items for every level in a
message tree, it’s enough to know the total number of messages at the
first levels. That will at least give us the ability to break the
result up into manageable chunks.

Now we can use this fact to add another wire type, similar to Length-
delimited. Except instead of specifying the number of bytes in a
nested message or packed repeated fields, we specify the number of
fields at the first level. Each single field within it still specifies
its own byte length by its wire type.

Note: For nested messages or packed repeated fields we only need to
specify the number of fields directly within it. We don’t have to
count the number of fields within further nested messages.
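A rough Python sketch of what this could look like on the wire follows. Everything here is an assumption made for illustration: the wire type number 6 is not assigned by Protobuf, the function names are invented, and only varint (0) and length-delimited (2) child fields are handled. The writer emits the tag and the field count, then streams the child fields from an iterator without ever knowing the total byte size.

```python
import io
from typing import BinaryIO, Iterable, List, Tuple

WIRE_VARINT = 0
WIRE_LEN = 2
WIRE_COUNT = 6  # hypothetical number for the proposed wire type

Field = Tuple[int, int, object]  # (field number, wire type, value)


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def read_varint(inp: BinaryIO) -> int:
    result, shift = 0, 0
    while True:
        byte = inp.read(1)
        if not byte:
            raise EOFError("truncated varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7


def write_count_delimited(out: BinaryIO, field_number: int, count: int,
                          fields: Iterable[Field]) -> None:
    """Write tag and field count, then stream the direct child fields."""
    out.write(encode_varint((field_number << 3) | WIRE_COUNT))
    out.write(encode_varint(count))
    written = 0
    for number, wire_type, value in fields:
        out.write(encode_varint((number << 3) | wire_type))
        if wire_type == WIRE_VARINT:
            out.write(encode_varint(value))
        elif wire_type == WIRE_LEN:
            out.write(encode_varint(len(value)))
            out.write(value)
        else:
            raise ValueError("wire type not handled in this sketch")
        written += 1
    if written != count:  # detected only after the data has been written
        raise ValueError("declared field count does not match fields written")


def read_count_delimited(inp: BinaryIO) -> Tuple[int, List[Field]]:
    """Read one count-delimited field and return its number and children."""
    tag = read_varint(inp)
    if tag & 7 != WIRE_COUNT:
        raise ValueError("not a count-delimited field")
    children: List[Field] = []
    for _ in range(read_varint(inp)):
        child_tag = read_varint(inp)
        number, wire_type = child_tag >> 3, child_tag & 7
        if wire_type == WIRE_VARINT:
            value = read_varint(inp)
        elif wire_type == WIRE_LEN:
            value = inp.read(read_varint(inp))
        else:
            raise ValueError("wire type not handled in this sketch")
        children.append((number, wire_type, value))
    return tag >> 3, children


def example_fields():
    # A generator, so the fields are produced one at a time while writing.
    yield (1, WIRE_VARINT, 150)
    yield (2, WIRE_LEN, b"hi")


buf = io.BytesIO()
write_count_delimited(buf, 1, 2, example_fields())
encoded = buf.getvalue()
decoded = read_count_delimited(io.BytesIO(encoded))
```

Each child field still carries its own size through its own wire type, so the reader knows where the nested message ends after consuming exactly `count` fields.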

OUT OF SCOPE?

Now I realize that Protobuf isn’t really designed to work with large
datasets like this. So this may be out of the scope of the project. I
thought I’d mention it since this is something I run into fairly
often. I would think that the case of large record sets in a
relational database is fairly common.

The solution is fairly simple. It makes Protocol Buffers more
versatile and even more useful as a de facto standard interchange
format within more organizations.

The problem with this approach is that it’s not as easy to skip ahead
over an entire nested message without parsing it, for example if you
wanted to load the nth message within a set of repeated fields and the
messages themselves use this new wire type. Personally, I don’t run
into this very often because you usually need some data within the
message to know whether you can skip it. You can’t always assume that
information will be at the top, so you end up parsing the message
anyway. Even if you do need to skip, you can use this option at the
first level only.
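The difference in skipping cost can be sketched in Python as follows. As before, wire type 6 and the function names are assumptions for illustration only; wire types 0, 1, 2 and 5 are the existing ones. A length-delimited value is skipped with a single seek, while a count-delimited value requires reading the header of every direct child field.

```python
import io
from typing import BinaryIO

WIRE_VARINT = 0
WIRE_FIXED64 = 1
WIRE_LEN = 2
WIRE_FIXED32 = 5
WIRE_COUNT = 6  # hypothetical number for the proposed wire type


def read_varint(inp: BinaryIO) -> int:
    result, shift = 0, 0
    while True:
        byte = inp.read(1)
        if not byte:
            raise EOFError("truncated varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7


def skip_value(inp: BinaryIO, wire_type: int) -> int:
    """Skip one field value.

    Returns how many nested field headers had to be parsed to do so.
    """
    if wire_type == WIRE_VARINT:
        read_varint(inp)
        return 0
    if wire_type == WIRE_FIXED64:
        inp.seek(8, io.SEEK_CUR)
        return 0
    if wire_type == WIRE_FIXED32:
        inp.seek(4, io.SEEK_CUR)
        return 0
    if wire_type == WIRE_LEN:
        # One jump, regardless of what the payload contains.
        inp.seek(read_varint(inp), io.SEEK_CUR)
        return 0
    if wire_type == WIRE_COUNT:
        # No byte length is available, so walk every direct child field.
        parsed = 0
        for _ in range(read_varint(inp)):
            child_tag = read_varint(inp)
            parsed += 1 + skip_value(inp, child_tag & 7)
        return parsed
    raise ValueError("wire type not handled in this sketch")


# The same two child fields, once behind a byte length and once behind a
# field count, each followed by a marker byte to show where skipping stops.
length_stream = io.BytesIO(b"\x05\x08\x96\x01\x12\x00\xff")
count_stream = io.BytesIO(b"\x02\x08\x96\x01\x12\x02hi\xff")
headers_for_length = skip_value(length_stream, WIRE_LEN)
headers_for_count = skip_value(count_stream, WIRE_COUNT)
```

If the children of a count-delimited message are themselves ordinary length-delimited messages, each of them is still skipped with one seek, so the extra cost is proportional to the number of direct children only.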

There’s always a trade-off between serialization and deserialization
costs. This addition would give us one more optimization route.

