[protobuf] Parsing a stream of oneof messages

Joel Holveck Fri, 30 Mar 2018 20:44:32 -0700

I'm dealing with a large stream of messages.  Each message can be of type A, 
B, or C.  Each message is a few hundred bytes, and the stream may contain 
millions of messages.


I know that the traditional advice for reading a bunch of concatenated 
messages 
<https://developers.google.com/protocol-buffers/docs/techniques#streaming> is 
to have a size followed by a message, but I'd need both a type (to tell if I 
have A, B, or C) and a size.  It seemed reasonable to encode the type and 
size as varints, making them the same wire format as if I'd defined a 
container message.  That can be a message with a single oneof.  I've 
defined a container for these that looks like this:

message Container {
  oneof contents {
    A a = 1;
    B b = 2;
    C c = 3;
  }
}
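
To make the wire-format equivalence concrete, here's a minimal sketch of the bytes that precede each inner message: a tag varint (field number shifted left by 3, OR'd with wire type 2 for length-delimited) followed by a size varint.  The helper names (make_tag, append_varint, frame_header) are made up for illustration and aren't part of protobuf.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Wire type 2 marks a length-delimited field (embedded messages, strings, bytes).
constexpr uint32_t kLengthDelimited = 2;

// Hypothetical helper: combine a field number and a wire type into a tag value.
constexpr uint32_t make_tag(uint32_t field_number, uint32_t wire_type) {
  return (field_number << 3) | wire_type;
}

// Hypothetical helper: append `value` to `out` as a base-128 varint
// (7 payload bits per byte, high bit set on every byte except the last).
inline void append_varint(uint64_t value, std::string *out) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Hypothetical helper: the header bytes for an inner message of
// `payload_size` bytes stored in Container field `field_number`.
inline std::string frame_header(uint32_t field_number, uint64_t payload_size) {
  assert(field_number >= 1);  // protobuf field numbers start at 1
  std::string header;
  append_varint(make_tag(field_number, kLengthDelimited), &header);
  append_varint(payload_size, &header);
  return header;
}
```

With field numbers 1 through 3, the tags come out as 0x0A, 0x12, and 0x1A.  All of them are below 128, so each tag fits in a single byte.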

Since all of the Parse methods will read an unbounded amount of data, my 
main loop has to parse the tag and size, put a bound on the stream, and 
then dispatch to the inner message's parser.  Here's a rough outline of 
what I do:

extern void process_item(const Container &container);

bool process_stream(google::protobuf::io::ZeroCopyInputStream *input) {
  Container container;
  while (true) {
    google::protobuf::io::CodedInputStream decoder(input);
    auto tagpair = decoder.ReadTagWithCutoffNoLastTag(127);
    if (!tagpair.second)
      return false;
    uint32_t tag = tagpair.first;
    if (tag == 0)
      return true;
    int size;
    if (!decoder.ReadVarintSizeAsInt(&size))
      return false;

    decoder.PushLimit(size);
    switch (tag) {
      case (Container::kAFieldNumber << 3 | 2):
        if (!container.mutable_a()->ParseFromCodedStream(&decoder))
          return false;
        break;
      case (Container::kBFieldNumber << 3 | 2):
        if (!container.mutable_b()->ParseFromCodedStream(&decoder))
          return false;
        break;
      case (Container::kCFieldNumber << 3 | 2):
        if (!container.mutable_c()->ParseFromCodedStream(&decoder))
          return false;
        break;
      default:
        return false;
    }
    if (!decoder.ConsumedEntireMessage() || decoder.BytesUntilLimit() != 0) {
      return false;
    }
    
    process_item(container);
  }
}


This is essentially a stripped-down version of the loop in 
Container::MergePartialFromCodedStream, without support for unknown fields 
and other exceptional situations.

It seems like there would be an easier way to do this.  I haven't come up 
with anything I'm fond of, though.  Some ideas I have require copying the 
input stream around (undesirable; zero-copy is useful for performance 
here), or fail if a size's varint crosses the boundary from 
ZeroCopyInputStream::Next, or have similar problems.
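
To illustrate the boundary problem, here's a rough sketch of what a hand-rolled reader would need in order to survive a varint that is split across two buffers: the decoder has to carry its partial value from one chunk to the next.  VarintAccumulator is a hypothetical class, written without any protobuf dependency; it isn't something the library provides, and I'm not suggesting it as the fix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes a single varint that may arrive in pieces,
// e.g. one piece per buffer handed back by successive Next() calls.
class VarintAccumulator {
 public:
  // Consumes bytes from [data, data + size) until the varint completes,
  // the input runs out, or the varint is found to be malformed.
  // Returns the number of bytes consumed from this chunk.
  size_t Feed(const uint8_t *data, size_t size) {
    size_t used = 0;
    while (used < size && !done_ && !failed_) {
      if (shift_ > 63) {  // more than 10 bytes: not a valid 64-bit varint
        failed_ = true;
        break;
      }
      uint8_t byte = data[used++];
      value_ |= static_cast<uint64_t>(byte & 0x7F) << shift_;
      shift_ += 7;
      if ((byte & 0x80) == 0) done_ = true;  // high bit clear: last byte
    }
    return used;
  }

  bool done() const { return done_; }
  bool failed() const { return failed_; }

  // Only meaningful once done() is true.
  uint64_t value() const {
    assert(done_);
    return value_;
  }

 private:
  uint64_t value_ = 0;
  int shift_ = 0;
  bool done_ = false;
  bool failed_ = false;
};
```

The caller would feed each buffer in turn and stop as soon as done() returns true; the return value of Feed says how many bytes of the current buffer belonged to the varint, so the rest can be handed to the message parser.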

Anybody have suggestions for improvements?

(I'm also simplifying the above code snippet: the stream is actually 
self-describing, prefixed by a FileDescriptorSet, for legacy reasons.  That 
means that all this is going through the reflection interface.  Also, I 
have a visitor that gets dispatched with the inner object based on 
OneofContentsCase, rather than a single object that takes a Container.  
Finally, we're currently using Protobuf 2.5.0, so some features - like 
those provided by oneof - are implemented by my code.  That's also why you 
see me using one Container through the whole loop: since I don't have 
arenas in 2.5.0, it's more efficient for me to reuse one object and let it 
manage the cached members.  Indeed, doing this with oneof but not using 
arenas in 3.5.1 would cause a lot of unnecessary memory churn, since oneof 
doesn't reuse objects like optional and repeated, contrary to the note in 
the C++ docs 
<https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#MessageLite.Clear.details>
.)

Cheerio,
Piquan

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to protobuf+unsubscr...@googlegroups.com.
To post to this group, send email to protobuf@googlegroups.com.
Visit this group at https://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.
