Re: Java deserialization - any best practices for performances?
The best way to think of it is:

Builder : Java Message :: C++ Message : const C++ Message

As far as performance goes, it is a common mistake to equate C/C++ heap allocation costs with Java heap allocation costs. In the common case, allocations in Java are just a few instructions, comparable to stack allocations in C/C++. What normally gets you in Java is the initialization cost, and in this particular scenario there is no way around that.

If you are worried, you could benchmark the difference between constantly allocating builders as you go vs. starting with an array of N builders (allocating the array would be done outside of the benchmark). I am sure it will prove enlightening.

On 7/24/09, Kenton Varda wrote:
> On Thu, Jul 23, 2009 at 7:15 PM, alopecoid wrote:
>
>> Hmm... that strikes me as strange. I understand that the Message
>> objects are immutable, but the Builders are as well? I thought that
>> they would work more along the lines of String and StringBuilder,
>> where String is obviously immutable and StringBuilder is mutable/
>> reusable.
>
> The point is that it's the Message object that contains all the stuff
> allocated by the Builder, and therefore none of that stuff can actually be
> reused. (When you call build(), nothing is copied -- it just returns the
> object that it has been working on.) So reusing the builder itself is kind
> of useless, because it's just a trivial object containing one pointer (to
> the message object it is working on constructing).
>
>> But while we're on the subject, I have been looking for some rough
>> benchmarks comparing the performance of Protocol Buffers in Java
>> versus C++. Do you (the collective you) have any [rough] idea as to
>> how they compare performance wise? I am thinking more in terms of
>> batch-style processing (disk I/O, parsing centric) rather than RPC
>> centric usage patterns. Any experiences you can share would be great.
>
> I have some benchmarks that IIRC show that Java parsing and serialization is
> roughly half the speed of C++. As I recall a lot of the speed difference is
> from UTF-8 decoding/encoding -- in C++ we just leave the bytes encoded, but
> in Java we need to decode them in order to construct standard String
> objects.
>
> I've been planning to release these benchmarks publicly but it will take
> some work and there's a lot of higher-priority stuff to do. :/ (I think
> Jon Skeet did get the Java side of the benchmarks into SVN but there's no
> C++ equivalent yet.)

--
Sent from my mobile device

Chris

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---
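A rough sketch of the benchmark suggested in this message: allocate a fresh builder per iteration versus reuse builders from an array created before timing starts. `ToyBuilder`, `newPool`, and the other names are hypothetical stand-ins, not protobuf classes, and the timing loop is deliberately naive. JIT warm-up, escape analysis (which may remove the allocation entirely), and GC activity can all skew the numbers, so treat any result as indicative only.

```java
final class AllocationBenchmark {
    /** Hypothetical stand-in for a generated Builder; not a protobuf class. */
    static final class ToyBuilder {
        int id;
        long sum;

        ToyBuilder set(int id) {
            this.id = id;
            this.sum = id * 31L;
            return this;
        }

        void reset() {
            id = 0;
            sum = 0;
        }
    }

    /** Allocates a fresh builder on every iteration. */
    static long allocateEachTime(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += new ToyBuilder().set(i).sum;
        }
        return total;
    }

    /** Reuses builders from an array that was allocated before timing. */
    static long reusePreallocated(ToyBuilder[] pool, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            ToyBuilder b = pool[i % pool.length];
            b.reset();
            total += b.set(i).sum;
        }
        return total;
    }

    /** Builds the array of N builders outside of the timed region. */
    static ToyBuilder[] newPool(int size) {
        ToyBuilder[] pool = new ToyBuilder[size];
        for (int i = 0; i < size; i++) {
            pool[i] = new ToyBuilder();
        }
        return pool;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        ToyBuilder[] pool = newPool(1024);

        // Warm-up rounds so the JIT has compiled both paths before timing.
        for (int round = 0; round < 5; round++) {
            allocateEachTime(n);
            reusePreallocated(pool, n);
        }

        long t0 = System.nanoTime();
        long a = allocateEachTime(n);
        long t1 = System.nanoTime();
        long b = reusePreallocated(pool, n);
        long t2 = System.nanoTime();

        System.out.printf("fresh:  %d ms (checksum %d)%n", (t1 - t0) / 1_000_000, a);
        System.out.printf("pooled: %d ms (checksum %d)%n", (t2 - t1) / 1_000_000, b);
    }
}
```

For numbers worth acting on, a proper harness and real generated message classes would be needed; this only shows the shape of the comparison.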
Re: Java deserialization - any best practices for performances?
On Thu, Jul 23, 2009 at 7:15 PM, alopecoid wrote:

> Hmm... that strikes me as strange. I understand that the Message
> objects are immutable, but the Builders are as well? I thought that
> they would work more along the lines of String and StringBuilder,
> where String is obviously immutable and StringBuilder is mutable/
> reusable.

The point is that it's the Message object that contains all the stuff allocated by the Builder, and therefore none of that stuff can actually be reused. (When you call build(), nothing is copied -- it just returns the object that it has been working on.) So reusing the builder itself is kind of useless, because it's just a trivial object containing one pointer (to the message object it is working on constructing).

> But while we're on the subject, I have been looking for some rough
> benchmarks comparing the performance of Protocol Buffers in Java
> versus C++. Do you (the collective you) have any [rough] idea as to
> how they compare performance wise? I am thinking more in terms of
> batch-style processing (disk I/O, parsing centric) rather than RPC
> centric usage patterns. Any experiences you can share would be great.

I have some benchmarks that IIRC show that Java parsing and serialization is roughly half the speed of C++. As I recall a lot of the speed difference is from UTF-8 decoding/encoding -- in C++ we just leave the bytes encoded, but in Java we need to decode them in order to construct standard String objects.

I've been planning to release these benchmarks publicly but it will take some work and there's a lot of higher-priority stuff to do. :/ (I think Jon Skeet did get the Java side of the benchmarks into SVN but there's no C++ equivalent yet.)
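To make the "one pointer" description concrete, here is a toy sketch of a builder that hands off the object it was filling in rather than copying it. `ToyMessage` and its `Builder` are hypothetical and are not the generated protobuf code; they only model the behaviour described in this message, where `build()` returns the message under construction and the builder cannot be used afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical message type; immutable from the caller's point of view. */
final class ToyMessage {
    private final List<String> values = new ArrayList<>();

    private ToyMessage() {}

    /** Read-only view; callers cannot modify the message through it. */
    List<String> getValuesList() {
        return Collections.unmodifiableList(values);
    }

    static Builder newBuilder() {
        return new Builder();
    }

    /** The builder's only state is a reference to the message being built. */
    static final class Builder {
        private ToyMessage result = new ToyMessage();

        Builder addValue(String value) {
            if (result == null) {
                throw new IllegalStateException("build() has already been called");
            }
            result.values.add(value);
            return this;
        }

        /** Hands over the message itself; nothing is copied. */
        ToyMessage build() {
            if (result == null) {
                throw new IllegalStateException("build() has already been called");
            }
            ToyMessage built = result;
            result = null; // drop the reference so the message stays immutable
            return built;
        }
    }

    public static void main(String[] args) {
        Builder builder = newBuilder().addValue("a").addValue("b");
        ToyMessage message = builder.build();
        System.out.println(message.getValuesList());
    }
}
```

In this model, everything worth reusing (the list and its contents) now belongs to the built message, so a reused builder would have to allocate a new message and a new list anyway.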
Re: Java deserialization - any best practices for performances?
Hi Kenton,

Thanks for your reply.

> You can't continue to use a Builder after calling build(). Even if we made
> it so you could, it would be building an entirely new object, not reusing
> the old one. We can't make it reuse the old one because that would break
> the immutability guarantee of message objects.

Hmm... that strikes me as strange. I understand that the Message objects are immutable, but the Builders are as well? I thought that they would work more along the lines of String and StringBuilder, where String is obviously immutable and StringBuilder is mutable/reusable.

> But seriously, object allocation with a modern generational garbage
> collector is extremely cheap, especially for objects that don't stick around
> very long. So I don't think there's much to gain here.

While I agree that object allocation is relatively cheap in Java, I have noticed that if you generate a lot of garbage, you also have to spend some time tweaking the garbage collector settings to avoid long or frequent garbage collection pauses. I know that there has been a lot of recent work done in Java 7 (and experimentally in Java 6) to avoid this, but I haven't had the opportunity to test this yet.

In fact, I find that this is often the real difference in performance between Java and C++ in the cases where C++ seems to perform significantly faster: different object allocation practices (but more importantly, implementation/design choices). I don't know how well this holds true across a spectrum of different usage patterns; my experience has been more on the large-scale data processing side of things. And don't get me wrong, I'm actually one of the few people (out of my closest colleagues) who think that data processing can and should be done in Java over C++, but that's another discussion entirely :)

But while we're on the subject, I have been looking for some rough benchmarks comparing the performance of Protocol Buffers in Java versus C++. Do you (the collective you) have any [rough] idea as to how they compare performance wise? I am thinking more in terms of batch-style processing (disk I/O, parsing centric) rather than RPC centric usage patterns. Any experiences you can share would be great.

Thanks!
Re: Java deserialization - any best practices for performances?
On Thu, Jul 23, 2009 at 12:32 AM, alopecoid wrote:

> Hi,
>
> I haven't actually used the Java protobuf API, but it seems to me from
> the quick occasional glance that this isn't entirely true. I mean,
> specifically in response to the code snippet posted in the original
> message, I would possibly:
>
> 1. Reuse the Builder object by calling its clear() method. This would
> save from the need to create a new Builder object for each iteration
> of the outermost loop.

You can't continue to use a Builder after calling build(). Even if we made it so you could, it would be building an entirely new object, not reusing the old one. We can't make it reuse the old one because that would break the immutability guarantee of message objects. Reusing the actual builder object is not that useful since it's only a very small object containing a pointer to a message object.

> 2. Iterate over the repeated field using the get*Count() and
> get*(index) methods instead of the get*List() method. I'm not sure if this
> would save anything, but depending on how things are implemented in
> the generated code, this could save from allocating a new List object.

Won't save anything; we still need a list object internally.

But seriously, object allocation with a modern generational garbage collector is extremely cheap, especially for objects that don't stick around very long. So I don't think there's much to gain here.
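A toy illustration of why the two access styles can end up reading the same internal list. `ToyRepeated` is hypothetical and assumes the list is wrapped once at construction and then returned as-is. That assumption is consistent with the "we still need a list object internally" remark, but it is not taken from the generated code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for a repeated field; not a protobuf class. */
final class ToyRepeated {
    private final List<Integer> values;

    ToyRepeated(List<Integer> source) {
        // Assumption: the read-only wrapper is created once, not per access.
        this.values = Collections.unmodifiableList(new ArrayList<>(source));
    }

    /** Returns the stored list directly; no new List is allocated here. */
    List<Integer> getValuesList() {
        return values;
    }

    int getValuesCount() {
        return values.size();
    }

    int getValues(int index) {
        return values.get(index);
    }

    /** Iterates with for-each (this does allocate a small Iterator). */
    static long sumViaList(ToyRepeated message) {
        long sum = 0;
        for (int v : message.getValuesList()) {
            sum += v;
        }
        return sum;
    }

    /** Iterates by count and index over the same underlying list. */
    static long sumViaIndex(ToyRepeated message) {
        long sum = 0;
        for (int i = 0; i < message.getValuesCount(); i++) {
            sum += message.getValues(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ToyRepeated message = new ToyRepeated(Arrays.asList(1, 2, 3));
        System.out.println(sumViaList(message) + " " + sumViaIndex(message));
    }
}
```

Under that assumption, the only allocation the indexed style avoids is the short-lived iterator, which is exactly the kind of object a generational collector handles cheaply.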
Re: Java deserialization - any best practices for performances?
Hi,

I haven't actually used the Java protobuf API, but it seems to me from the quick occasional glance that this isn't entirely true. I mean, specifically in response to the code snippet posted in the original message, I would possibly:

1. Reuse the Builder object by calling its clear() method. This would save from the need to create a new Builder object for each iteration of the outermost loop.

2. Iterate over the repeated field using the get*Count() and get*(index) methods instead of the get*List() method. I'm not sure if this would save anything, but depending on how things are implemented in the generated code, this could save from allocating a new List object.

Also, might "bytes" type fields perform better than any "string" type fields that you may have in your particular data set? I'm not sure, but it might be worth benchmarking.

On Jul 18, 9:22 pm, Kenton Varda wrote:
> On Fri, Jul 17, 2009 at 8:13 PM, Alex Black wrote:
>
> > When I write out messages using C++ I'm careful to clear messages and
> > re-use them, is there something equivalent on the java side when
> > reading those same messages in?
>
> No. Sorry. This just doesn't fit at all with the Java library's design,
> and even if it did, you cannot reuse Java String objects, which often
> account for most of the memory usage. However, memory allocation is cheaper
> in Java than in C++, so there's less to gain from it.
>
> > My code looks like:
> >
> >     CodedInputStream stream = CodedInputStream.newInstance(inputStream);
> >
> >     while ( !stream.isAtEnd() )
> >     {
> >         MyMessage.Builder builder = MyMessage.newBuilder();
> >         stream.readMessage(builder, null);
> >         MyMessage myMessage = builder.build();
> >
> >         for ( MessageValue messageValue : myMessage.getValuesList() )
> >         {
> >             ..
> >         }
> >     }
> >
> > I'm passing 150 messages each with 1000 items, so presumably memory is
> > allocated 150 times for each of the messages...
> >
> > - Alex
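For the "bytes" versus "string" question, a minimal sketch of the extra work a string-typed field implies on the Java side: decoding UTF-8 into a new `String`, as opposed to just copying the raw bytes. `StringVsBytes` and its methods are hypothetical helpers operating on a plain byte array. They do not use the protobuf API and only illustrate where the decoding cost mentioned elsewhere in this thread would come from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringVsBytes {
    /** "string"-style read: the payload is UTF-8 decoded into a new String. */
    static String readAsString(byte[] wire, int offset, int length) {
        return new String(wire, offset, length, StandardCharsets.UTF_8);
    }

    /** "bytes"-style read: the payload is copied as-is, with no decoding. */
    static byte[] readAsBytes(byte[] wire, int offset, int length) {
        return Arrays.copyOfRange(wire, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] wire = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        int rounds = 2_000_000;
        long sink = 0; // consumed below so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += readAsString(wire, 0, wire.length).length();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += readAsBytes(wire, 0, wire.length).length;
        }
        long t2 = System.nanoTime();

        System.out.printf("decode to String: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("copy raw bytes:   %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("sink = " + sink);
    }
}
```

Whether the difference matters depends on how much of the data set is text and on whether the application ends up needing `String` values anyway, in which case the decode is only deferred.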
Re: Java deserialization - any best practices for performances?
On Fri, Jul 17, 2009 at 8:13 PM, Alex Black wrote:

> When I write out messages using C++ I'm careful to clear messages and
> re-use them, is there something equivalent on the java side when
> reading those same messages in?

No. Sorry. This just doesn't fit at all with the Java library's design, and even if it did, you cannot reuse Java String objects, which often account for most of the memory usage. However, memory allocation is cheaper in Java than in C++, so there's less to gain from it.

> My code looks like:
>
>     CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
>     while ( !stream.isAtEnd() )
>     {
>         MyMessage.Builder builder = MyMessage.newBuilder();
>         stream.readMessage(builder, null);
>         MyMessage myMessage = builder.build();
>
>         for ( MessageValue messageValue : myMessage.getValuesList() )
>         {
>             ..
>         }
>     }
>
> I'm passing 150 messages each with 1000 items, so presumably memory is
> allocated 150 times for each of the messages...
>
> - Alex
Re: Java deserialization - any best practices for performances?
Hi Alek,

Can you elaborate a bit on what you mean by lazy parsing?

I think what I want is to be able to *reuse* my objects, specifically the instances of MyMessage, instead of allocating new ones each time through the loop. This is analogous to what my C++ code does when writing out messages: it re-uses the same message object, clearing it between uses.

On Jul 18, 12:25 am, Alek Storm wrote:
> I think what you want is lazy parsing, which unfortunately isn't available
> yet. You could always read bytes off the stream in chunks, or write your
> own CodedInputStream to skip to the end of each message every time it sees a
> length.
>
> Alek
>
> On Fri, Jul 17, 2009 at 8:13 PM, Alex Black wrote:
>
> > When I write out messages using C++ I'm careful to clear messages and
> > re-use them, is there something equivalent on the java side when
> > reading those same messages in?
> >
> > My code looks like:
> >
> >     CodedInputStream stream = CodedInputStream.newInstance(inputStream);
> >
> >     while ( !stream.isAtEnd() )
> >     {
> >         MyMessage.Builder builder = MyMessage.newBuilder();
> >         stream.readMessage(builder, null);
> >         MyMessage myMessage = builder.build();
> >
> >         for ( MessageValue messageValue : myMessage.getValuesList() )
> >         {
> >             ..
> >         }
> >     }
> >
> > I'm passing 150 messages each with 1000 items, so presumably memory is
> > allocated 150 times for each of the messages...
> >
> > - Alex
Re: Java deserialization - any best practices for performances?
I think what you want is lazy parsing, which unfortunately isn't available yet. You could always read bytes off the stream in chunks, or write your own CodedInputStream to skip to the end of each message every time it sees a length.

Alek

On Fri, Jul 17, 2009 at 8:13 PM, Alex Black wrote:
> When I write out messages using C++ I'm careful to clear messages and
> re-use them, is there something equivalent on the java side when
> reading those same messages in?
>
> My code looks like:
>
>     CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
>     while ( !stream.isAtEnd() )
>     {
>         MyMessage.Builder builder = MyMessage.newBuilder();
>         stream.readMessage(builder, null);
>         MyMessage myMessage = builder.build();
>
>         for ( MessageValue messageValue : myMessage.getValuesList() )
>         {
>             ..
>         }
>     }
>
> I'm passing 150 messages each with 1000 items, so presumably memory is
> allocated 150 times for each of the messages...
>
> - Alex
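One way to picture the "skip to the end of each message every time it sees a length" idea, as a sketch only. It assumes each message on the stream is preceded by its size as a base-128 varint, and it works on a plain `InputStream` rather than extending `CodedInputStream`. `FrameSkipper` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class FrameSkipper {
    /** Reads a varint length prefix, or returns -1 at a clean end of stream. */
    static int readLength(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                if (shift == 0) {
                    return -1; // no more frames
                }
                throw new EOFException("stream ended inside a length prefix");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                if (result < 0) {
                    throw new IOException("negative frame length");
                }
                return result;
            }
        }
        throw new IOException("length prefix longer than 5 bytes");
    }

    /** Skips exactly n bytes, or throws if the stream ends first. */
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                // skip() may legitimately return 0; fall back to a single read.
                if (in.read() < 0) {
                    throw new EOFException("stream ended inside a frame");
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Walks the stream frame by frame without parsing any payload. */
    static int countFrames(InputStream in) throws IOException {
        int frames = 0;
        int length;
        while ((length = readLength(in)) >= 0) {
            skipFully(in, length);
            frames++;
        }
        return frames;
    }

    /** Convenience overload for in-memory data. */
    static int countFrames(byte[] data) {
        try {
            return countFrames(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {3, 'a', 'b', 'c', 0, 2, 'x', 'y'};
        System.out.println(countFrames(data) + " frames");
    }
}
```

A real lazy parser would hand back the byte range of each frame instead of just counting, so that only the messages actually needed are decoded later.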
Java deserialization - any best practices for performances?
When I write out messages using C++ I'm careful to clear messages and re-use them. Is there something equivalent on the Java side when reading those same messages in?

My code looks like:

    CodedInputStream stream = CodedInputStream.newInstance(inputStream);

    while ( !stream.isAtEnd() )
    {
        MyMessage.Builder builder = MyMessage.newBuilder();
        stream.readMessage(builder, null);
        MyMessage myMessage = builder.build();

        for ( MessageValue messageValue : myMessage.getValuesList() )
        {
            ..
        }
    }

I'm passing 150 messages each with 1000 items, so presumably memory is allocated 150 times for each of the messages...

- Alex