Re: encoding of embedded messages and repeated elements
Yes, "groups" are never going to fully go away. But we recommend against using them in new code.

On Thu, Jun 25, 2009 at 9:13 AM, Piotr Findeisen wrote:
> Hi!
>
> On Jun 24, 8:42 pm, Kenton Varda wrote:
> > The end-tag approach is more efficient than your idea -- it's faster (no
> > need to count elements at all) and it takes no more space (no need to write
> > a count, which makes up for the extra space taken by the end tag).
> > But in any case, the encoding is not something we can change at this point,
> > since protocol buffers is nothing without backwards-compatibility.
>
> As I read the code of C++ protobuf deserializer I found it supports
> end-tag approach using END_GROUP constant -- or I just misunderstood
> the code and/or this thread?
>
> From my experiments it looks like I can stream messages one by one
> separating them with END_GROUP tag, but -- again from comments in the
> code -- it's deprecated. If "protocol buffers is nothing without
> backwards-compatibility", can I assume that existing and future
> implementation of C++ and also Java/Python deserializers will support
> this approach?
>
> best regards,
> Piotr

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---
Re: encoding of embedded messages and repeated elements
Hi!

On Jun 24, 8:42 pm, Kenton Varda wrote:
> The end-tag approach is more efficient than your idea -- it's faster (no
> need to count elements at all) and it takes no more space (no need to write
> a count, which makes up for the extra space taken by the end tag).
> But in any case, the encoding is not something we can change at this point,
> since protocol buffers is nothing without backwards-compatibility.

As I read the code of C++ protobuf deserializer I found it supports
end-tag approach using END_GROUP constant -- or I just misunderstood
the code and/or this thread?

From my experiments it looks like I can stream messages one by one
separating them with END_GROUP tag, but -- again from comments in the
code -- it's deprecated. If "protocol buffers is nothing without
backwards-compatibility", can I assume that existing and future
implementation of C++ and also Java/Python deserializers will support
this approach?

best regards,
Piotr
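A minimal sketch of the end-tag ("group") framing discussed above, assuming the standard protobuf wire types 3 (START_GROUP) and 4 (END_GROUP). It hand-encodes the bytes rather than using the protobuf library, and the helper names (encode_varint, make_tag, write_group) are made up for illustration. The point it shows is that the writer never needs the size of the group up front, so fields can be emitted as they are produced.

```python
START_GROUP = 3  # wire type of the opening tag
END_GROUP = 4    # wire type of the closing tag
VARINT = 0


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def make_tag(field_number, wire_type):
    """A tag is the field number and wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def write_group(out, field_number, encoded_fields):
    """Append a group to `out`, consuming already-encoded fields lazily."""
    out += make_tag(field_number, START_GROUP)
    for chunk in encoded_fields:  # no size or count is needed beforehand
        out += chunk
    out += make_tag(field_number, END_GROUP)


def example_fields():
    """Hypothetical content: field 3 = 120 and field 4 = 0, both varints."""
    yield make_tag(3, VARINT) + encode_varint(120)
    yield make_tag(4, VARINT) + encode_varint(0)


stream = bytearray()
write_group(stream, 1, example_fields())
print(stream.hex())  # 0b187820000c
```

The first byte 0b is the START_GROUP tag for field 1 and the last byte 0c is the matching END_GROUP tag; everything between them is the group's content.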
Re: encoding of embedded messages and repeated elements
The end-tag approach is more efficient than your idea -- it's faster (no need to count elements at all) and it takes no more space (no need to write a count, which makes up for the extra space taken by the end tag).

But in any case, the encoding is not something we can change at this point, since protocol buffers is nothing without backwards-compatibility. And yes, some existing parsers do, in fact, take advantage of the ability to skip over messages without parsing them, and there are many features that people are considering implementing (like lazy parsing) which would need this.

It actually turns out that pre-computing the size of the embedded message does not take very long compared to actually writing it.

On Wed, Jun 24, 2009 at 12:57 AM, etorri wrote:
> Does some existing parser actually implement that skipping feature?
>
> There would not be any need for an end-tag. Let's assume that there
> would be two different tags
>
> 2 - Length_Delimited, which could contain a packed list of bytes
>     (string, memory block) or other types where the parser needs to know
>     what is packed inside (no tags)
> 6 - Group or Element_Delimited - which would be like Length_Delimited
>     but have the number of elements that follow that belong to this field
>
> So for an example message where the first field is a group
>
> (1,6),3 - field numbered 1 of the message, type 6 = Group and 3
>           elements that follow belong to this group
> (1,2),5,"Hello" - field number 1 of the embedded message would be a
>           string
> (3,1),120 - field nr 3 of the embedded message, varint of value 120
> (4,1),0 - field nr 4 of the embedded message, varint of value 0
> (2,2),5,"World" - field nr 2 of the message
>
> this would be the encoding of the following TheMessage
>
> message Embedded {
>   required string greeting = 1;
>   optional int32 useless = 2;
>   required int32 good = 3;
>   required int32 evil = 4;
> }
>
> message TheMessage {
>   required Embedded e = 1;
>   required string target = 2;
> }
>
> So in this case there would not be need for an end tag. When
> constructing the message it should be relatively easy to count the
> number of embedded elements instead of knowing how much space they
> occupy. This would enable streaming/serializing the elements
> recursively out one by one.
>
> On Jun 23, 9:07 pm, Kenton Varda wrote:
> > The advantage of writing the length is that a parser can skip the entire
> > sub-message easily without having to parse its contents. Otherwise, we
> > would probably use the "group" encoding for sub-messages, where a special
> > end tag marks the end of the message.
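For comparison with the proposed element-count encoding, here is a rough sketch of how the quoted TheMessage example looks in the existing length-delimited encoding, with the embedded size computed in a first pass and the bytes written in a second. It is hand-written Python, not the protobuf library; the function names are hypothetical, and it assumes field numbers below 16 (one-byte tags) and non-negative integer values.

```python
def varint_size(value):
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def embedded_size(greeting, good, evil):
    """First pass: byte size of Embedded, without producing any output."""
    return (1 + varint_size(len(greeting)) + len(greeting)  # field 1, string
            + 1 + varint_size(good)                         # field 3, varint
            + 1 + varint_size(evil))                        # field 4, varint


def write_the_message(greeting, good, evil, target):
    """Second pass: write TheMessage front to back, sizes already known."""
    g = greeting.encode("utf-8")
    t = target.encode("utf-8")
    out = bytearray()
    out += encode_varint((1 << 3) | 2)                 # TheMessage.e tag
    out += encode_varint(embedded_size(g, good, evil))  # its byte length
    out += encode_varint((1 << 3) | 2) + encode_varint(len(g)) + g
    out += encode_varint((3 << 3) | 0) + encode_varint(good)
    out += encode_varint((4 << 3) | 0) + encode_varint(evil)
    out += encode_varint((2 << 3) | 2) + encode_varint(len(t)) + t
    return bytes(out)


encoded = write_the_message("Hello", 120, 0, "World")
print(embedded_size(b"Hello", 120, 0))  # 11
print(encoded.hex())  # 0a0b0a0548656c6c6f187820001205576f726c64
```

The size pass only adds up lengths and never copies payload bytes, which is consistent with the remark that pre-computing the size is cheap relative to writing. Once sizes are known, the output can still be produced strictly front to back.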
Re: encoding of embedded messages and repeated elements
Does some existing parser actually implement that skipping feature?

There would not be any need for an end-tag. Let's assume that there would be two different tags

2 - Length_Delimited, which could contain a packed list of bytes (string, memory block) or other types where the parser needs to know what is packed inside (no tags)
6 - Group or Element_Delimited - which would be like Length_Delimited but have the number of elements that follow that belong to this field

So for an example message where the first field is a group

(1,6),3 - field numbered 1 of the message, type 6 = Group and 3 elements that follow belong to this group
(1,2),5,"Hello" - field number 1 of the embedded message would be a string
(3,1),120 - field nr 3 of the embedded message, varint of value 120
(4,1),0 - field nr 4 of the embedded message, varint of value 0
(2,2),5,"World" - field nr 2 of the message

this would be the encoding of the following TheMessage

message Embedded {
  required string greeting = 1;
  optional int32 useless = 2;
  required int32 good = 3;
  required int32 evil = 4;
}

message TheMessage {
  required Embedded e = 1;
  required string target = 2;
}

So in this case there would not be need for an end tag. When constructing the message it should be relatively easy to count the number of embedded elements instead of knowing how much space they occupy. This would enable streaming/serializing the elements recursively out one by one.

On Jun 23, 9:07 pm, Kenton Varda wrote:
> The advantage of writing the length is that a parser can skip the entire
> sub-message easily without having to parse its contents. Otherwise, we
> would probably use the "group" encoding for sub-messages, where a special
> end tag marks the end of the message.
Re: encoding of embedded messages and repeated elements
The advantage of writing the length is that a parser can skip the entire sub-message easily without having to parse its contents. Otherwise, we would probably use the "group" encoding for sub-messages, where a special end tag marks the end of the message.

On Tue, Jun 23, 2009 at 9:06 AM, etorri wrote:
> Hello,
>
> The "length delimited" encoding basically tells that the following N
> bytes belong to this field. Wouldn't it be easier to instead use the
> number of elements that belong to the embedded message (repeated
> element)?
>
> Now (as far as I have understood) the message needs to be built from
> fragments and then collected together as the lengths are not known
> beforehand and it would be expensive to calculate the byte-length of
> the embedded message.
>
> Instead, it would be relatively inexpensive to calculate just the
> number of following elements that make the embedded message before
> starting to encode it.
>
> This would enable streaming of PB or encoding and sending the elements
> right as they are encoded.
>
> Sorry if I misunderstood something. I have just started looking at PB.
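The difference in skipping cost can be sketched as follows. This is an illustrative hand-written reader, not code from any protobuf implementation; the names read_varint and skip_field are assumptions, and the wire-type numbers are the standard ones. A length-delimited field is skipped with a single jump, whereas a group has to be walked field by field until its end tag is found.

```python
VARINT, FIXED64, LENGTH_DELIMITED, START_GROUP, END_GROUP, FIXED32 = range(6)


def read_varint(buf, pos):
    """Decode a varint at `pos`; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_field(buf, pos, wire_type, field_number):
    """Return the position just past a field whose tag was already read."""
    if wire_type == VARINT:
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == FIXED64:
        return pos + 8
    if wire_type == FIXED32:
        return pos + 4
    if wire_type == LENGTH_DELIMITED:
        length, pos = read_varint(buf, pos)
        return pos + length  # one jump, contents never inspected
    if wire_type == START_GROUP:
        while True:  # every nested field must be visited
            tag, pos = read_varint(buf, pos)
            inner_type, inner_number = tag & 0x07, tag >> 3
            if inner_type == END_GROUP:
                if inner_number != field_number:
                    raise ValueError("mismatched END_GROUP tag")
                return pos
            pos = skip_field(buf, pos, inner_type, inner_number)
    raise ValueError("unexpected wire type %d" % wire_type)


# The same two inner fields (3 = 120, 4 = 0), framed both ways as field 1.
inner = bytes([0x18, 0x78, 0x20, 0x00])
length_form = bytes([0x0A, len(inner)]) + inner
group_form = bytes([0x0B]) + inner + bytes([0x0C])

tag, pos = read_varint(length_form, 0)
end_length = skip_field(length_form, pos, tag & 0x07, tag >> 3)
tag, pos = read_varint(group_form, 0)
end_group = skip_field(group_form, pos, tag & 0x07, tag >> 3)
print(end_length, end_group)  # 6 6
```

Both framings take six bytes here, but only the length-prefixed one lets the reader jump over the payload without decoding any tags inside it.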
Re: encoding of embedded messages and repeated elements
OK, thanks. It is as it is. (Just looking at the feasibility of implementing PB in Ada for my own projects :-)

On Jun 23, 6:17 pm, Alek Storm wrote:
> Hi etorri,
>
> Embedded messages and strings have the exact same wire format. When parsing
> a message, it's impossible to know whether you're parsing one or the other,
> and since strings have to be encoded using their length in bytes, we can't
> do something different for embedded messages.
>
> Cheers,
> Alek
>
> On Tue, Jun 23, 2009 at 9:06 AM, etorri wrote:
> > Hello,
> >
> > The "length delimited" encoding basically tells that the following N
> > bytes belong to this field. Wouldn't it be easier to instead use the
> > number of elements that belong to the embedded message (repeated
> > element)?
> >
> > Now (as far as I have understood) the message needs to be built from
> > fragments and then collected together as the lengths are not known
> > beforehand and it would be expensive to calculate the byte-length of
> > the embedded message.
> >
> > Instead, it would be relatively inexpensive to calculate just the
> > number of following elements that make the embedded message before
> > starting to encode it.
> >
> > This would enable streaming of PB or encoding and sending the elements
> > right as they are encoded.
> >
> > Sorry if I misunderstood something. I have just started looking at PB.
>
> --
> Alek Storm
Re: encoding of embedded messages and repeated elements
Hi etorri,

Embedded messages and strings have the exact same wire format. When parsing a message, it's impossible to know whether you're parsing one or the other, and since strings have to be encoded using their length in bytes, we can't do something different for embedded messages.

Cheers,
Alek

On Tue, Jun 23, 2009 at 9:06 AM, etorri wrote:
> Hello,
>
> The "length delimited" encoding basically tells that the following N
> bytes belong to this field. Wouldn't it be easier to instead use the
> number of elements that belong to the embedded message (repeated
> element)?
>
> Now (as far as I have understood) the message needs to be built from
> fragments and then collected together as the lengths are not known
> beforehand and it would be expensive to calculate the byte-length of
> the embedded message.
>
> Instead, it would be relatively inexpensive to calculate just the
> number of following elements that make the embedded message before
> starting to encode it.
>
> This would enable streaming of PB or encoding and sending the elements
> right as they are encoded.
>
> Sorry if I misunderstood something. I have just started looking at PB.

--
Alek Storm
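The shared wire format can be illustrated with a small hand-encoding sketch. It does not use the protobuf library; encode_varint and encode_length_delimited are hypothetical helpers, and the field numbers and values are made up for the example. Both a string and an embedded message end up as a tag with wire type 2, a byte length, and then the payload.

```python
LENGTH_DELIMITED = 2  # wire type used for strings, bytes and sub-messages


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_length_delimited(field_number, payload):
    """Tag, then payload length in bytes, then the payload itself."""
    tag = (field_number << 3) | LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Field 1 holding the string "Hello".
as_string = encode_length_delimited(1, "Hello".encode("utf-8"))

# Field 1 holding an embedded message whose only content is field 3 = 120.
inner = encode_varint((3 << 3) | 0) + encode_varint(120)
as_message = encode_length_delimited(1, inner)

print(as_string.hex())   # 0a0548656c6c6f
print(as_message.hex())  # 0a021878
```

Both outputs start with the same tag byte 0a. Nothing on the wire says which one is a string, so a parser that lacks the schema can only treat either of them as an opaque run of bytes of the stated length.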
encoding of embedded messages and repeated elements
Hello,

The "length delimited" encoding basically tells that the following N bytes belong to this field. Wouldn't it be easier to instead use the number of elements that belong to the embedded message (repeated element)?

Now (as far as I have understood) the message needs to be built from fragments and then collected together as the lengths are not known beforehand and it would be expensive to calculate the byte-length of the embedded message.

Instead, it would be relatively inexpensive to calculate just the number of following elements that make the embedded message before starting to encode it.

This would enable streaming of PB or encoding and sending the elements right as they are encoded.

Sorry if I misunderstood something. I have just started looking at PB.