[ https://issues.apache.org/jira/browse/AVRO-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384287#comment-16384287 ]
Tobi Vollebregt commented on AVRO-2147: --------------------------------------- Nice, thank you! > Proto to Avro serialization is unnecessarily slow due to repeated schema > creation > --------------------------------------------------------------------------------- > > Key: AVRO-2147 > URL: https://issues.apache.org/jira/browse/AVRO-2147 > Project: Avro > Issue Type: Improvement > Components: java > Affects Versions: 1.8.1, 1.8.2 > Reporter: Tobi Vollebregt > Assignee: Tobi Vollebregt > Priority: Major > Labels: java, optimization, protobuf > Fix For: 1.9.0 > > Attachments: AVRO-2147.patch, TestProtobufPerf.java, test.proto > > > Hi, > I discovered that proto to avro serialization is unnecessarily slow in > certain cases due to repeated schema creation. Specifically, this slowness > shows when serializing protocol buffer messages that contain nested protocol > buffer messages that contain enums with many possible values. Some profiling > showed this is due to the {{Schema}} objects for the nested message/enum not > being cached in this case. > An example that reproduces this is to add the following to {{test.proto}}: > {{message Foo {}} > {{ ...}} > {{ optional MessageWithLargeEnum bar = 21;}} > {{}}} > {{message MessageWithLargeEnum {}} > {{ optional LargeEnum enum = 1;}} > {{}}} > {{enum LargeEnum {}} > {{ AA = 1;}} > {{ AB = 2;}} > {{ AC = 3;}} > {{ ...}} > {{ ZZ = 676;}} > {{}}} > Then, a test like the following will exhibit the slow behavior: > {{@Test public void perf() throws Exception {}} > {{ Foo.Builder builder = Foo.newBuilder();}} > {{ builder.setInt32(0);}} > {{ builder.setInt64(2);}} > {{ builder.setUint32(3);}} > {{ builder.setUint64(4);}} > {{ builder.setSint32(5);}} > {{ builder.setSint64(6);}} > {{ builder.setFixed32(7);}} > {{ builder.setFixed64(8);}} > {{ builder.setSfixed32(9);}} > {{ builder.setSfixed64(10);}} > {{ builder.setFloat(1.0F);}} > {{ builder.setDouble(2.0);}} > {{ builder.setBool(true);}} > {{ builder.setString("foo");}} > {{ builder.setBytes(ByteString.copyFromUtf8("bar"));}} > {{ builder.setEnum(org.apache.avro.protobuf.Test.A.X);}} > {{ builder.addIntArray(27);}} > {{ builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}} > {{ > builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}} > {{ Foo objToConvert = builder.build();}} > {{ Schema schema = ProtobufData.get().getSchema(Foo.class);}} > {{ ByteArrayOutputStream bao = new ByteArrayOutputStream();}} > {{ Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}} > {{ ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}} > {{ GenericDatumReader gdr = new GenericDatumReader(schema, schema);}} > {{ BinaryDecoder d = null;}} > {{ long startTime = System.nanoTime();}} > {{ for (int i = 0; i < 1000000; ++i) {}} > {{ bao.reset();}} > {{ w.write(objToConvert, e);}} > {{ e.flush();}} > {{ d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}} > {{ gdr.read(null, d);}} > \{{ }}} > {{ long endTime = System.nanoTime();}} > {{ System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " > ms");}} > {{}}} > I will attach a patch that optimizes this. > With the attached patch this test reports a runtime of about 4 seconds, while > the runtime without the patch is 30+ seconds, so this is an 7.5-8x > improvement for this particular test enum. -- This message was sent by Atlassian JIRA (v7.6.3#76005)