Re: Java deserialization - any best practices for performances?

2009-07-24 Thread Christopher Smith

The best way to think of it is:

Builder : Java Message :: C++ Message : const C++ Message

As far as performance goes, it is a common mistake to equate C/C++
heap allocation costs with Java heap allocation costs. In the common
case, an allocation in Java is just a few instructions, comparable to
a stack allocation in C/C++. What normally gets you in Java is the
initialization cost, and in this particular scenario there is no way
around that.

If you are worried, you could benchmark the difference between
constantly allocating builders as you go vs. starting with an array of
N builders (allocating the array would be done outside of the
benchmark). I am sure it will prove enlightening.
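
A rough sketch of that comparison is below. FakeBuilder is a hypothetical
stand-in, not the generated protobuf class, and the sizes (150 and 1000)
are taken from the original post. Timing by hand with System.nanoTime() is
only indicative; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated builder; NOT the protobuf API.
final class FakeBuilder {
    private final List<Integer> values = new ArrayList<>();

    FakeBuilder add(int v) {
        values.add(v);
        return this;
    }

    int size() {
        return values.size();
    }
}

final class AllocationBenchmark {
    // Variant 1: allocate a fresh builder on every iteration.
    static long allocateAsYouGo(int n, int itemsPerBuilder) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            FakeBuilder b = new FakeBuilder();
            for (int j = 0; j < itemsPerBuilder; j++) {
                b.add(j);
            }
            total += b.size();
        }
        return total;
    }

    // Variant 2: fill builders that were allocated before timing started.
    static long usePreallocated(FakeBuilder[] pool, int itemsPerBuilder) {
        long total = 0;
        for (FakeBuilder b : pool) {
            for (int j = 0; j < itemsPerBuilder; j++) {
                b.add(j);
            }
            total += b.size();
        }
        return total;
    }

    // Builds the array of N builders; called outside the timed region.
    static FakeBuilder[] newPool(int n) {
        FakeBuilder[] pool = new FakeBuilder[n];
        for (int i = 0; i < n; i++) {
            pool[i] = new FakeBuilder();
        }
        return pool;
    }

    public static void main(String[] args) {
        int n = 150;
        int items = 1000;

        // Warm up so the JIT has compiled both variants before timing.
        for (int w = 0; w < 50; w++) {
            allocateAsYouGo(n, items);
            usePreallocated(newPool(n), items);
        }

        long t0 = System.nanoTime();
        long a = allocateAsYouGo(n, items);
        long t1 = System.nanoTime();

        FakeBuilder[] pool = newPool(n); // not timed
        long t2 = System.nanoTime();
        long b = usePreallocated(pool, items);
        long t3 = System.nanoTime();

        System.out.printf("allocate as you go: %d ns (checksum %d)%n", t1 - t0, a);
        System.out.printf("preallocated array: %d ns (checksum %d)%n", t3 - t2, b);
    }
}
```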


On 7/24/09, Kenton Varda  wrote:
> On Thu, Jul 23, 2009 at 7:15 PM, alopecoid  wrote:
>
>> Hmm... that strikes me as strange. I understand that the Message
>> objects are immutable, but the Builders are as well? I thought that
>> they would work more along the lines of String and StringBuilder,
>> where String is obviously immutable and StringBuilder is mutable/
>> reusable.
>
>
> The point is that it's the Message object that contains all the stuff
> allocated by the Builder, and therefore none of that stuff can actually be
> reused.  (When you call build(), nothing is copied -- it just returns the
> object that it has been working on.)  So reusing the builder itself is kind
> of useless, because it's just a trivial object containing one pointer (to
> the message object it is working on constructing).
>
>
>> But while we're on the subject, I have been looking for some rough
>> benchmarks comparing the performance of Protocol Buffers in Java
>> versus C++. Do you (the collective you) have any [rough] idea as to
>> how they compare performance wise? I am thinking more in terms of
>> batch-style processing (disk I/O, parsing centric) rather than RPC
>> centric usage patterns. Any experiences you can share would be great.
>
>
> I have some benchmarks that IIRC show that Java parsing and serialization is
> roughly half the speed of C++.  As I recall a lot of the speed difference is
> from UTF-8 decoding/encoding -- in C++ we just leave the bytes encoded, but
> in Java we need to decode them in order to construct standard String
> objects.
>
> I've been planning to release these benchmarks publicly but it will take
> some work and there's a lot of higher-priority stuff to do.  :/  (I think
> Jon Skeet did get the Java side of the benchmarks into SVN but there's no
> C++ equivalent yet.)
>

-- 
Sent from my mobile device

Chris

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---



Re: Java deserialization - any best practices for performances?

2009-07-24 Thread Kenton Varda
On Thu, Jul 23, 2009 at 7:15 PM, alopecoid  wrote:

> Hmm... that strikes me as strange. I understand that the Message
> objects are immutable, but the Builders are as well? I thought that
> they would work more along the lines of String and StringBuilder,
> where String is obviously immutable and StringBuilder is mutable/
> reusable.


The point is that it's the Message object that contains all the stuff
allocated by the Builder, and therefore none of that stuff can actually be
reused.  (When you call build(), nothing is copied -- it just returns the
object that it has been working on.)  So reusing the builder itself is kind
of useless, because it's just a trivial object containing one pointer (to
the message object it is working on constructing).
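
To illustrate, here is a minimal sketch of that design. SketchMessage and
its Builder are hypothetical classes written for this example, not the
generated protobuf code. The builder's only state is a reference to the
message it is filling in; build() hands that same object over without
copying, after which the builder can no longer be used.

```java
// Hypothetical classes illustrating the design; NOT generated protobuf code.
final class SketchMessage {
    private String name = "";
    private int id;

    // Only the nested Builder may create or mutate instances.
    private SketchMessage() {}

    String getName() {
        return name;
    }

    int getId() {
        return id;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        // The builder's only state: a pointer to the message under construction.
        private SketchMessage result = new SketchMessage();

        Builder setName(String name) {
            checkNotBuilt();
            result.name = name;
            return this;
        }

        Builder setId(int id) {
            checkNotBuilt();
            result.id = id;
            return this;
        }

        // Returns the object being worked on; nothing is copied.
        SketchMessage build() {
            checkNotBuilt();
            SketchMessage built = result;
            result = null; // dropping the pointer keeps the built message immutable
            return built;
        }

        private void checkNotBuilt() {
            if (result == null) {
                throw new IllegalStateException("build() has already been called");
            }
        }
    }
}
```

Everything of any size (the strings, the field storage) lives in the
message, so keeping the Builder around would save one small allocation
at most.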


> But while we're on the subject, I have been looking for some rough
> benchmarks comparing the performance of Protocol Buffers in Java
> versus C++. Do you (the collective you) have any [rough] idea as to
> how they compare performance wise? I am thinking more in terms of
> batch-style processing (disk I/O, parsing centric) rather than RPC
> centric usage patterns. Any experiences you can share would be great.


I have some benchmarks that IIRC show that Java parsing and serialization is
roughly half the speed of C++.  As I recall a lot of the speed difference is
from UTF-8 decoding/encoding -- in C++ we just leave the bytes encoded, but
in Java we need to decode them in order to construct standard String
objects.
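
The difference can be sketched with the JDK alone. Utf8FieldSketch and
its method names are made up for this example and are not part of the
protobuf library; the sketch only contrasts decoding a UTF-8 payload into
a String with keeping the same payload as raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; NOT part of the protobuf library.
final class Utf8FieldSketch {
    // A "string" field in Java: the UTF-8 bytes are decoded into a new String.
    static String decodeAsString(byte[] wire, int offset, int length) {
        return new String(wire, offset, length, StandardCharsets.UTF_8);
    }

    // Keeping the payload encoded: a plain copy of the bytes, no decoding step.
    static byte[] keepAsBytes(byte[] wire, int offset, int length) {
        return Arrays.copyOfRange(wire, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] wire = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        String decoded = decodeAsString(wire, 0, wire.length);
        byte[] raw = keepAsBytes(wire, 0, wire.length);
        System.out.println(decoded.length() + " chars decoded from " + raw.length + " bytes");
    }
}
```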

I've been planning to release these benchmarks publicly but it will take
some work and there's a lot of higher-priority stuff to do.  :/  (I think
Jon Skeet did get the Java side of the benchmarks into SVN but there's no
C++ equivalent yet.)




Re: Java deserialization - any best practices for performances?

2009-07-23 Thread alopecoid

Hi Kenton,

Thanks for your reply.

> You can't continue to use a Builder after calling build().  Even if we made
> it so you could, it would be building an entirely new object, not reusing
> the old one.  We can't make it reuse the old one because that would break
> the immutability guarantee of message objects.

Hmm... that strikes me as strange. I understand that the Message
objects are immutable, but the Builders are as well? I thought that
they would work more along the lines of String and StringBuilder,
where String is obviously immutable and StringBuilder is mutable/
reusable.

> But seriously, object allocation with a modern generational garbage
> collector is extremely cheap, especially for objects that don't stick around
> very long.  So I don't think there's much to gain here.

While I agree that object allocation is relatively cheap in Java, I
have noticed that if you generate a lot of garbage, you have to also
spend some time tweaking the garbage collector settings to avoid long/
frequent garbage collection pauses. I know that there has been a lot
of recent work done in Java 7 (and experimentally in Java 6) to avoid
this, but I haven't had the opportunity to test this yet. In fact, I
find that often times this is the real difference in performance
between Java and C++ in the cases where C++ seems to perform
significantly faster... different object allocation practices (but
more importantly, implementation/design choices). I don't know how
well this holds true for a spectrum of different usage patterns, but
my experience has been more from the large scale data processing side
of things. And don't get me wrong, I'm actually one of the few people
(out of my closest colleagues) who think that data processing can and
should be done in Java over C++, but that's another discussion
entirely :)

But while we're on the subject, I have been looking for some rough
benchmarks comparing the performance of Protocol Buffers in Java
versus C++. Do you (the collective you) have any [rough] idea as to
how they compare performance wise? I am thinking more in terms of
batch-style processing (disk I/O, parsing centric) rather than RPC
centric usage patterns. Any experiences you can share would be great.

Thanks!



Re: Java deserialization - any best practices for performances?

2009-07-23 Thread Kenton Varda
On Thu, Jul 23, 2009 at 12:32 AM, alopecoid  wrote:

>
> Hi,
>
> I haven't actually used the Java protobuf API, but it seems to me from
> the quick occasional glance that this isn't entirely true. I mean,
> specifically in response to the code snippet posted in the original
> message, I would possibly:
>
> 1. Reuse the Builder object by calling its clear() method. This would
> save from the need to create a new Builder object for each iteration
> of the outermost loop.


You can't continue to use a Builder after calling build().  Even if we made
it so you could, it would be building an entirely new object, not reusing
the old one.  We can't make it reuse the old one because that would break
the immutability guarantee of message objects.

Reusing the actual builder object is not that useful since it's only a very
small object containing a pointer to a message object.


> 2. Iterate over the repeated field using the get*Count() and
> get*(index) methods instead of the get*List() method. I'm not sure if
> this would save anything, but depending on how things are implemented
> in the generated code, this could save from allocating a new List
> object.


Won't save anything; we still need a list object internally.

But seriously, object allocation with a modern generational garbage
collector is extremely cheap, especially for objects that don't stick around
very long.  So I don't think there's much to gain here.




Re: Java deserialization - any best practices for performances?

2009-07-23 Thread alopecoid

Hi,

I haven't actually used the Java protobuf API, but it seems to me from
the quick occasional glance that this isn't entirely true. I mean,
specifically in response to the code snippet posted in the original
message, I would possibly:

1. Reuse the Builder object by calling its clear() method. This would
save from the need to create a new Builder object for each iteration
of the outermost loop.

2. Iterate over the repeated field using the get*Count() and
get*(index) methods instead of the get*List() method. I'm not sure if
this would save anything, but depending on how things are implemented
in the generated code, this could save from allocating a new List
object.

Also, might "bytes" type fields perform better than any "string" type
fields that you may have in your particular data set? I'm not sure,
but it might be worth benchmarking.

On Jul 18, 9:22 pm, Kenton Varda  wrote:
> On Fri, Jul 17, 2009 at 8:13 PM, Alex Black  wrote:
>
> > When I write out messages using C++ I'm careful to clear messages and
> > re-use them, is there something equivalent on the java side when
> > reading those same messages in?
>
> No.  Sorry.  This just doesn't fit at all with the Java library's design,
> and even if it did, you cannot reuse Java String objects, which often
> account for most of the memory usage.  However, memory allocation is cheaper
> in Java than in C++, so there's less to gain from it.
>
>
>
> > My code looks like:
>
> > CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
> > while ( !stream.isAtEnd() )
> > {
> >     MyMessage.Builder builder = MyMessage.newBuilder();
> >     stream.readMessage(builder, null);
> >     MyMessage myMessage = builder.build();
>
> >     for ( MessageValue messageValue : myMessage.getValuesList() )
> >     {
> >        ..
> >     }
> > }
>
> > I'm passing 150 messages each with 1000 items, so presumably memory is
> > allocated 150 times for each of the messages...
>
> > - Alex



Re: Java deserialization - any best practices for performances?

2009-07-18 Thread Kenton Varda
On Fri, Jul 17, 2009 at 8:13 PM, Alex Black  wrote:

>
> When I write out messages using C++ I'm careful to clear messages and
> re-use them, is there something equivalent on the java side when
> reading those same messages in?


No.  Sorry.  This just doesn't fit at all with the Java library's design,
and even if it did, you cannot reuse Java String objects, which often
account for most of the memory usage.  However, memory allocation is cheaper
in Java than in C++, so there's less to gain from it.


>
>
> My code looks like:
>
> CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
> while ( !stream.isAtEnd() )
> {
>     MyMessage.Builder builder = MyMessage.newBuilder();
>     stream.readMessage(builder, null);
>     MyMessage myMessage = builder.build();
>
>     for ( MessageValue messageValue : myMessage.getValuesList() )
>     {
>        ..
>     }
> }
>
> I'm passing 150 messages each with 1000 items, so presumably memory is
> allocated 150 times for each of the messages...
>
> - Alex




Re: Java deserialization - any best practices for performances?

2009-07-18 Thread Alex Black

Hi Alek, can you elaborate a bit on what you mean by lazy parsing?

I think what I want is to be able to *reuse* my objects, specifically
the instances of MyMessage, instead of allocating new ones each time
through the loop.  This is analogous to what my C++ code does when
writing out messages: it re-uses the same message object, clearing it
between uses.

On Jul 18, 12:25 am, Alek Storm  wrote:
> I think what you want is lazy parsing, which unfortunately isn't available
> yet.  You could always read bytes off the stream in chunks, or write your
> own CodedInputStream to skip to the end of each message every time it sees a
> length.
>
> Alek
>
> On Fri, Jul 17, 2009 at 8:13 PM, Alex Black  wrote:
>
> > When I write out messages using C++ I'm careful to clear messages and
> > re-use them, is there something equivalent on the java side when
> > reading those same messages in?
>
> > My code looks like:
>
> > CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
> > while ( !stream.isAtEnd() )
> > {
> >     MyMessage.Builder builder = MyMessage.newBuilder();
> >     stream.readMessage(builder, null);
> >     MyMessage myMessage = builder.build();
>
> >     for ( MessageValue messageValue : myMessage.getValuesList() )
> >     {
> >        ..
> >     }
> > }
>
> > I'm passing 150 messages each with 1000 items, so presumably memory is
> > allocated 150 times for each of the messages...
>
> > - Alex



Re: Java deserialization - any best practices for performances?

2009-07-17 Thread Alek Storm
I think what you want is lazy parsing, which unfortunately isn't available
yet.  You could always read bytes off the stream in chunks, or write your
own CodedInputStream to skip to the end of each message every time it sees a
length.
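
A rough sketch of the skipping idea, using only the JDK, is below. It
assumes each message on the stream is preceded by its length as a
base-128 varint; FrameSkipper and its methods are hypothetical names for
this example, not part of CodedInputStream.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; NOT part of the protobuf library.
final class FrameSkipper {
    // Reads a base-128 varint length prefix; returns -1 on a clean end of stream.
    static int readLength(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) {
                    return -1; // no more frames
                }
                throw new EOFException("truncated length prefix");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                if (result < 0) {
                    throw new IOException("length prefix out of range");
                }
                return result;
            }
        }
        throw new IOException("length prefix too long");
    }

    // Skips exactly n bytes; loops because InputStream.skip may skip fewer.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                // skip() made no progress: read one byte to detect end of stream.
                if (in.read() == -1) {
                    throw new EOFException("truncated frame");
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }

    // Walks the stream frame by frame without parsing any payload.
    static int countFrames(InputStream in) throws IOException {
        int count = 0;
        int length;
        while ((length = readLength(in)) != -1) {
            skipFully(in, length);
            count++;
        }
        return count;
    }
}
```

Instead of skipping, the same loop could read each payload into a byte
array and parse it only when it is actually needed.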

Alek

On Fri, Jul 17, 2009 at 8:13 PM, Alex Black  wrote:

>
> When I write out messages using C++ I'm careful to clear messages and
> re-use them, is there something equivalent on the java side when
> reading those same messages in?
>
> My code looks like:
>
> CodedInputStream stream = CodedInputStream.newInstance(inputStream);
>
> while ( !stream.isAtEnd() )
> {
>     MyMessage.Builder builder = MyMessage.newBuilder();
>     stream.readMessage(builder, null);
>     MyMessage myMessage = builder.build();
>
>     for ( MessageValue messageValue : myMessage.getValuesList() )
>     {
>        ..
>     }
> }
>
> I'm passing 150 messages each with 1000 items, so presumably memory is
> allocated 150 times for each of the messages...
>
> - Alex




Java deserialization - any best practices for performances?

2009-07-17 Thread Alex Black

When I write out messages using C++ I'm careful to clear messages and
re-use them, is there something equivalent on the java side when
reading those same messages in?

My code looks like:

CodedInputStream stream = CodedInputStream.newInstance(inputStream);

while ( !stream.isAtEnd() )
{
    MyMessage.Builder builder = MyMessage.newBuilder();
    stream.readMessage(builder, null);
    MyMessage myMessage = builder.build();

    for ( MessageValue messageValue : myMessage.getValuesList() )
    {
       ..
    }
}

I'm passing 150 messages each with 1000 items, so presumably memory is
allocated 150 times for each of the messages...

- Alex