Re: Slicing support in Python
Hi Kenton and Petar,

Sorry I haven't been able to reply for a few days; I've been so swamped this week. Hopefully I'll be able to conjure up an intelligent reply tomorrow :)

Cheers,
Alek Storm
[Fwd: Re: Streaming]
Thanks very much Jon (see below). You make good points, and I like the approach you describe. I am still thinking, however, that there is power in the ability for message instances to write and parse themselves from a stream.

A message instance could be passed a stream object which chains back to the network connection from which bytes are being received. A stop-flag-based parsing mechanism could be passed this buffer object, and would handle reading the stream and initializing its properties, exiting when the serialization of that message instance stopped. At this point a new message instance could be created and the process repeated. The type of message doing the parsing could vary from message to message, even with the serializations being sent and received back to back.

This mechanism would work regardless of the field types being streamed. A message type consisting solely of varint fields, whose length is determined while reading the varint's value, would support streaming no differently than any other message type.

The solution also seems to support every requirement supported by the original buffer type. Messages serialized to a buffer could just as easily be initialized from that buffer as they could from the string contained by the buffer:

    m1 = Message()
    buffer = Buffer()
    [...] (initialize instance vars)
    m1.SerializeToBuffer(buffer)

    m2 = Message()
    m2.ParseFromBuffer(buffer)

produces the same result as:

    m2 = Message()
    bytes = m1.SerializeToString()
    m2.ParseFromString(bytes)

The string-based parse would ignore the stop bit, parsing the entire string. The buffer-based parse would stop when the stop bit was reached, producing the same result. Handling of concatenated serializations is supported through repeated calls to parse from the buffer:

    m1 = Message()
    [...] (initialize instance vars)
    m2 = Message()
    [...] (initialize instance vars)

    buffer = Buffer()
    m1.SerializeToBuffer(buffer)
    m2.SerializeToBuffer(buffer)

    m3 = Message()
    m3.ParseFromBuffer(buffer)
    m3.ParseFromBuffer(buffer)

would produce the same result as:

    m3 = Message()
    m3.ParseFromString(m1.SerializeToString() + m2.SerializeToString())

As long as an unused, and never-to-be-used, field number is used to generate the stop bit's key, I don't believe there are any incompatibilities between buffer-based message marshalling and the existing string-based code. A very easy usage:

    # Sending side
    for message in messages:
        message.SerializeToBuffer(buffer)

    # Receiving side
    for msgtype in types:
        message = msgtype()
        message.ParseFromBuffer(buffer)

Unless I've overlooked something, it seems like stream-based marshalling and unmarshalling is powerful, simple, and completely compatible with all existing code. But there is a very real chance I've overlooked something... (A sketch of what this framing might look like follows Jon's reply below.)

- Shane

-------- Forwarded Message --------
From: Jon Skeet [EMAIL PROTECTED]
To: Shane Green [EMAIL PROTECTED]
Subject: Re: Streaming
Date: Fri, 5 Dec 2008 08:19:41 +0000

2008/12/5 Shane Green [EMAIL PROTECTED]:

> Thanks Jon. Those are good points. I rather liked the self-delimiting
> nature of fields, and thought this method would bring that feature up to
> the message level without breaking any of the existing capabilities. So
> my goal was a message which could truly be streamed; perhaps even sent
> without knowing its own size up front. Perhaps I overlooked something?

Currently the PB format requires that you know the size of each submessage before you send it. You don't need to know the size of the whole message, as it's assumed to be the entire size of the datastream.
It's unfortunate that you do need to provide the whole message to the output stream, though, unless you want to manually serialize the individual fields.

My goal was slightly different: I wanted to be able to stream a sequence of messages. The most obvious use case (in my view) is a log. Write out a massive log file as a sequence of entries, and you can read it back in one entry at a time. It's not designed to help stream a single huge message, though.

> Would you mind if I resent my questions to the group? I lack confidence
> and wanted to make sure I wasn't overlooking something ridiculous, but am
> thinking that the exchange would be informative.

Absolutely. Feel free to quote anything I've written if you think it helps.

> Also, how are you serializing and parsing messages as if they are
> repeated fields of a container message? Is there a fair bit of parsing or
> work being done outside the standard protocol-buffer APIs?

There's not a lot of work, to be honest. On the parsing side the main difficulty is getting a type-safe delegate to read a message from the stream. The writing side is trivial. Have a look at the code:
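(Jon's code itself isn't quoted in this digest. As a separate illustration, here is a minimal Python sketch of the stop-flag framing Shane proposes above. Nothing in it is protobuf API: serialize_to_buffer, parse_from_buffer, and the reserved stop-field number are all invented for illustration, and groups, wire types 3 and 4, aren't handled.)

    import io

    STOP_FIELD_NUMBER = 536870911            # 2^29 - 1, assumed never used
    STOP_TAG = (STOP_FIELD_NUMBER << 3) | 0  # stop key, wire type 0 (varint)

    def _write_varint(out, value):
        # Base-128 varint, least-significant seven bits first.
        while True:
            bits = value & 0x7F
            value >>= 7
            out.write(bytes([(bits | 0x80) if value else bits]))
            if not value:
                return

    def _read_varint(stream):
        # Decode a base-128 varint from the stream.
        shift = result = 0
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("truncated varint")
            result |= (byte[0] & 0x7F) << shift
            if not byte[0] & 0x80:
                return result
            shift += 7

    def serialize_to_buffer(message, stream):
        # Message bytes, then the stop key with a dummy varint value so
        # the marker is itself a well-formed field.
        stream.write(message.SerializeToString())
        _write_varint(stream, STOP_TAG)
        _write_varint(stream, 0)

    def parse_from_buffer(message, stream):
        # Copy well-formed fields into a side buffer until the stop key
        # appears, then hand them to the normal parser.
        body = io.BytesIO()
        while True:
            tag = _read_varint(stream)
            if tag == STOP_TAG:
                _read_varint(stream)            # discard the dummy value
                message.ParseFromString(body.getvalue())
                return
            _write_varint(body, tag)
            wire_type = tag & 7
            if wire_type == 0:                  # varint
                _write_varint(body, _read_varint(stream))
            elif wire_type == 1:                # fixed64
                body.write(stream.read(8))
            elif wire_type == 2:                # length-delimited
                length = _read_varint(stream)
                _write_varint(body, length)
                body.write(stream.read(length))
            elif wire_type == 5:                # fixed32
                body.write(stream.read(4))
            else:
                raise ValueError("unsupported wire type %d" % wire_type)

Note that the receiver has to understand the wire format well enough to skip every field type; that is exactly the work the length-prefix approach in the next message avoids.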
Re: [Fwd: Re: Streaming]
It's quite easy to write a helper function that reads/writes delimited messages (delimited by size or by end tag). For example, here's one for writing a length-delimited message:

    bool WriteMessage(const Message& message, ZeroCopyOutputStream* output) {
      CodedOutputStream coded_out(output);
      return coded_out.WriteVarint32(message.ByteSize()) &&
             message.SerializeWithCachedSizes(&coded_out);
    }

and here's one for reading one message:

    bool ReadMessage(ZeroCopyInputStream* input, Message* message) {
      CodedInputStream coded_in(input);
      uint32 size;
      if (!coded_in.ReadVarint32(&size)) return false;
      CodedInputStream::Limit limit = coded_in.PushLimit(size);
      if (!message->ParseFromCodedStream(&coded_in)) return false;
      if (!coded_in.ExpectAtEnd()) return false;
      coded_in.PopLimit(limit);
      return true;
    }

(I haven't tested the above, so it may contain minor errors.) We could add these as methods of the Message class.

Note, though, that for many applications this kind of streaming is too simplistic. For example, the above will not allow you to efficiently seek to an arbitrary message in the stream, since at the very least you have to read the sizes of all messages before it to find it. It's also not very robust in the face of data corruption -- if any of the sizes are corrupted, the whole stream is unreadable. So you may find you want to do something more complicated, depending on your app. But anything more complicated is really beyond the scope of the protocol buffer library.
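For comparison with the C++ helpers above, here is the same length-prefix framing sketched against the public Python API; write_message and read_message are invented names for this sketch, not library methods:

    def write_message(stream, message):
        # Varint length prefix followed by the serialized message.
        data = message.SerializeToString()
        size = len(data)
        while True:
            bits = size & 0x7F
            size >>= 7
            stream.write(bytes([(bits | 0x80) if size else bits]))
            if not size:
                break
        stream.write(data)

    def read_message(stream, message):
        # Returns False on a clean end of stream, True after one message.
        size = shift = 0
        byte = stream.read(1)
        if not byte:
            return False                       # clean EOF between messages
        while True:
            size |= (byte[0] & 0x7F) << shift
            if not byte[0] & 0x80:
                break
            shift += 7
            byte = stream.read(1)
            if not byte:
                raise EOFError("truncated length prefix")
        data = stream.read(size)
        if len(data) != size:
            raise EOFError("truncated message")
        message.ParseFromString(data)
        return True

    # Jon's log-file case then takes a few lines on each side
    # (LogEntry and process are hypothetical):
    #   for entry in entries: write_message(log, entry)
    #   entry = LogEntry()
    #   while read_message(log, entry): process(entry)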
Re: Slicing support in Python
On Wed, Dec 3, 2008 at 5:32 AM, Kenton Varda [EMAIL PROTECTED] wrote:

> Sorry, I think you misunderstood. The C++ parsers generated by protoc
> (with optimize_for = SPEED) are an order of magnitude faster than the
> dynamic *C++* parser (used with optimize_for = CODE_SIZE and
> DynamicMessage). The Python parser is considerably slower than either of
> them, but that's beside the point. Your decoupled parser which produces a
> tag/value tree will be at least as slow as the existing C++ dynamic
> parser, probably slower (since it sounds like it would use some sort of
> dictionary structure rather than flat classes/structs).

Oh, I forgot we have two C++ parsers. The method I described uses the generated (SPEED) parser, so it should be a great deal quicker. It just outputs a tree instead of a message, leaving the smart object creation to Python. Run this backwards when serializing, and you get another advantage: you can easily swap out the function that converts the tree into serialized protobuf for one that outputs XML, JSON, etc.

> You can already easily write encoders and decoders for alternative
> formats using reflection.

Honestly, I think using reflection for something as basic as changing the output format is hackish and could get ugly. Reflection should only be used in certain circumstances, e.g. generating message objects, because it exposes the internals. There's a chance we could change how Protocol Buffers works under the hood in a way that screws up an XML outputter, which wouldn't happen if we just exposed a clean interface.

> > Let's include it - it gives us a more complete list interface, there's
> > no downside, and the users can decide whether they want to use it. We
> > can't predict all possible use cases.
>
> Ah, yes, the old "Why not?" argument. :) Actually, I far prefer the
> opposite argument: if you aren't sure if someone will want a feature,
> don't include it. There is always a downside to including a feature.
> Even if people choose not to use it, it increases code size, maintenance
> burden, memory usage, and interface complexity. Worse yet, if people do
> use it, then we're permanently stuck with it, whether we like it or not.
> We can't change it later, even if we decide it's wrong. For example, we
> may decide later -- based on an actual use case, perhaps -- that it would
> really have been better if remove() compared elements by content rather
> than by identity, so that you could remove a message from a repeated
> field by constructing an identical message and then calling remove().
> But we wouldn't be able to change it. We'd have to instead add a
> different method like removeByValue(), which would be ugly and add even
> more complexity. Protocol Buffers got where they are by stubbornly
> refusing the vast majority of feature suggestions. :)

Ha, I thought you might say that. It's a good philosophy, and I completely understand where you're coming from. So I concede that point, and it all boils down to complete interface vs. compact interface. But just for the record, I'm pretty sure Python's list remove() method compares by value and doesn't have a counterpart that compares by identity, so there would be no reason to include a compare-by-identity method in protobuf repeated fields.

> That said, you do have a good point that the interface should be similar
> to standard Python lists if possible. But given the other problems that
> prevent this, it seems like a moot point.

Okay, you place more value on a compact interface. So are we keeping remove() for scalar values?
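(A quick standalone check of the point about remove(): Python's list.remove() compares by equality, so no reference to the original object is needed. Msg here is just a stand-in class, not a generated message.)

    class Msg:
        def __init__(self, x):
            self.x = x
        def __eq__(self, other):
            return isinstance(other, Msg) and self.x == other.x

    items = [Msg(1), Msg(2)]
    items.remove(Msg(2))        # removes the element equal to Msg(2)
    assert len(items) == 1      # no identity match was required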
I think their interfaces should be consistent, but I don't think you find that as important.

On Wed, Dec 3, 2008 at 10:25 AM, Petar Petrov [EMAIL PROTECTED] wrote:

> It's not that simple. We would also like to improve performance at least
> in MergeFrom/CopyFrom/ParseASCII/IsInitialized.

Okay. So let's say we have a pure-C++ parser with a Python wrapper. This brings us back to getting slicing to work in C++ with no garbage collector. Kenton, could you elaborate on what you meant earlier by ownership problems specific to the C++ version? I can't really see anything that would affect PB repeated fields that isn't taken care of by handing the user control over allocation and deallocation of the field elements.

> Currently each composite field has a reference to its parent. This makes
> it impossible to add the same composite to two different repeated
> composite fields. The .add() method guarantees that this never happens.

Is there anything wrong with having a list of parents? I'm guessing I'm being naive - would speed be affected too much by that?

> I think protobuf's repeated composite fields aren't and shouldn't be
> equivalent to python lists.

Okay, that's cleared up now. Thanks.

Cheers,
Alek Storm
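(To illustrate the ownership point discussed above: in the Python API, elements of a repeated composite field are created in place with add(), so each element has exactly one parent. A minimal sketch, assuming a hypothetical generated type Playlist with a repeated Song field named songs:)

    playlist = Playlist()
    song = playlist.songs.add()   # created inside playlist; its parent is fixed
    song.title = "example"

    other = Playlist()
    # The same Song object cannot also be appended to other.songs;
    # instead its contents are copied into a fresh element:
    other.songs.add().CopyFrom(song)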