[jira] [Updated] (FLINK-34635) Clear successful records from the batch in JDBC connector
[ https://issues.apache.org/jira/browse/FLINK-34635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sai Sharath Dandi updated FLINK-34635:
--------------------------------------
    Affects Version/s: 1.18.1

> Clear successful records from the batch in JDBC connector
> ---------------------------------------------------------
>
>                 Key: FLINK-34635
>                 URL: https://issues.apache.org/jira/browse/FLINK-34635
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / JDBC
>    Affects Versions: 1.18.1
>            Reporter: Sai Sharath Dandi
>            Priority: Minor
>
> Currently, when batch execution fails in the JDBC connector, the whole batch
> is retried, which is unnecessary. We should clear the records that were
> successful in the
> [SimpleBatchStatementExecutor|https://github.com/apache/flink-connector-jdbc/blob/main/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/internal/executor/SimpleBatchStatementExecutor.java]:
> {code:java}
> @Override
> public void executeBatch() throws SQLException {
>     if (!batch.isEmpty()) {
>         for (T r : batch) {
>             parameterSetter.accept(st, r);
>             st.addBatch();
>         }
>         st.executeBatch();
>         // --> catch the exception and clear successful records from the batch here
>         batch.clear();
>     }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (FLINK-34635) Clear successful records from the batch in JDBC connector
[ https://issues.apache.org/jira/browse/FLINK-34635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sai Sharath Dandi updated FLINK-34635:
--------------------------------------
    Priority: Minor  (was: Major)

> Clear successful records from the batch in JDBC connector
> ---------------------------------------------------------
>
>                 Key: FLINK-34635
>                 URL: https://issues.apache.org/jira/browse/FLINK-34635
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / JDBC
>            Reporter: Sai Sharath Dandi
>            Priority: Minor
>
> Currently, when batch execution fails in the JDBC connector, the whole batch
> is retried, which is unnecessary. We should clear the records that were
> successful in the
> [SimpleBatchStatementExecutor|https://github.com/apache/flink-connector-jdbc/blob/main/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/internal/executor/SimpleBatchStatementExecutor.java]:
> {code:java}
> @Override
> public void executeBatch() throws SQLException {
>     if (!batch.isEmpty()) {
>         for (T r : batch) {
>             parameterSetter.accept(st, r);
>             st.addBatch();
>         }
>         st.executeBatch();
>         // --> catch the exception and clear successful records from the batch here
>         batch.clear();
>     }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-34635) Clear successful records from the batch in JDBC connector
Sai Sharath Dandi created FLINK-34635:
-----------------------------------------

             Summary: Clear successful records from the batch in JDBC connector
                 Key: FLINK-34635
                 URL: https://issues.apache.org/jira/browse/FLINK-34635
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / JDBC
            Reporter: Sai Sharath Dandi


Currently, when batch execution fails in the JDBC connector, the whole batch is retried, which is unnecessary. We should clear the records that were successful in the [SimpleBatchStatementExecutor|https://github.com/apache/flink-connector-jdbc/blob/main/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/internal/executor/SimpleBatchStatementExecutor.java]:

{code:java}
@Override
public void executeBatch() throws SQLException {
    if (!batch.isEmpty()) {
        for (T r : batch) {
            parameterSetter.accept(st, r);
            st.addBatch();
        }
        st.executeBatch();
        // --> catch the exception and clear successful records from the batch here
        batch.clear();
    }
}
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
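[Editor's note] One way the suggestion in the description could be implemented is by catching `java.sql.BatchUpdateException` and using its reported update counts to drop the already-successful records, so that only the failed remainder is retried. The sketch below is illustrative only, not the actual Flink change; `BatchRetryHelper` and `recordsToRetry` are hypothetical names.

```java
import java.sql.BatchUpdateException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: after a failed executeBatch(), compute which records
 * still need to be retried from the BatchUpdateException's update counts.
 */
public final class BatchRetryHelper {

    /**
     * Returns the records that should remain in the batch for retry.
     *
     * Drivers that continue after a failure report one entry per batched
     * statement: a non-negative count or Statement.SUCCESS_NO_INFO (-2)
     * means success, Statement.EXECUTE_FAILED (-3) means failure. Drivers
     * that stop at the first failure return a shorter array; every record
     * at or beyond its length was never executed and must be retried.
     */
    public static <T> List<T> recordsToRetry(List<T> batch, BatchUpdateException e) {
        int[] counts = e.getUpdateCounts();
        List<T> remaining = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            boolean succeeded = i < counts.length && counts[i] != Statement.EXECUTE_FAILED;
            if (!succeeded) {
                remaining.add(batch.get(i));
            }
        }
        return remaining;
    }
}
```

In `executeBatch()` this would replace the unconditional `batch.clear()`: catch the `BatchUpdateException`, replace the batch contents with `recordsToRetry(batch, e)`, and rethrow.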
[jira] [Commented] (FLINK-34403) VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
[ https://issues.apache.org/jira/browse/FLINK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817115#comment-17817115 ]

Sai Sharath Dandi commented on FLINK-34403:
-------------------------------------------

Sorry, I couldn't get back here in time. Thanks [~mapohl] for the fix!

> VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
> ---------------------------------------------------------
>
>                 Key: FLINK-34403
>                 URL: https://issues.apache.org/jira/browse/FLINK-34403
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.20.0
>            Reporter: Benchao Li
>            Assignee: Matthias Pohl
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.20.0
>
> After FLINK-33611 was merged, the misc test on GHA cannot pass due to an out
> of memory error, throwing the following exceptions:
> {code:java}
> Error: 05:43:21 05:43:21.768 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.98 s <<< FAILURE! -- in org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest
> Error: 05:43:21 05:43:21.773 [ERROR] org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple -- Time elapsed: 40.97 s <<< ERROR!
> Feb 07 05:43:21 org.apache.flink.util.FlinkRuntimeException: Error in serialization.
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:327)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:162)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:1007)
> Feb 07 05:43:21 	at org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:56)
> Feb 07 05:43:21 	at org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:45)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:61)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.LocalExecutor.getJobGraph(LocalExecutor.java:104)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.LocalExecutor.execute(LocalExecutor.java:81)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2440)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2421)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollectWithClient(DataStream.java:1495)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1382)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1367)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.validateRow(ProtobufTestHelper.java:66)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:89)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:76)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:71)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple(VeryBigPbRowToProtoTest.java:37)
> Feb 07 05:43:21 	at java.lang.reflect.Method.invoke(Method.java:498)
> Feb 07 05:43:21 Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Self-suppression not permitted
> Feb 07 05:43:21 	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> Feb 07 05:43:21 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:323)
> Feb 07 05:43:21 	... 18 more
> Feb 07 05:43:21 Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
> Feb 07 05:43:21 	at java.lang.Throwable.addSuppressed(Throwable.java:1072)
> Feb 07 05:43:21 	at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:556)
> Feb 07 05:43:21 	at org.apache.flink.util.InstantiationUtil.writeObjectToConfig(InstantiationUtil.java:486)
> Feb 07 05:43:21 	at
[jira] [Commented] (FLINK-34403) VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
[ https://issues.apache.org/jira/browse/FLINK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815422#comment-17815422 ]

Sai Sharath Dandi commented on FLINK-34403:
-------------------------------------------

The test was added for an extreme case that would fail without the changes made in the PR. If we reduce the test data size, the test case would pass even without the changes in the PR. I've tried hard to make a test that meets all the requirements and also stays within the Azure pipelines heap size limit, but I'm not sure why it has started failing after passing earlier. FWIW, the test passes comfortably in my local environment at a much larger size than what was merged into the codebase. I can remove this test altogether if it is not possible to increase the heap size for the tests, as the PR already achieved its goal.

> VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
> ---------------------------------------------------------
>
>                 Key: FLINK-34403
>                 URL: https://issues.apache.org/jira/browse/FLINK-34403
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.20.0
>            Reporter: Benchao Li
>            Priority: Major
>              Labels: test-stability
>
> After FLINK-33611 was merged, the misc test on GHA cannot pass due to an out
> of memory error, throwing the following exceptions:
> {code:java}
> Error: 05:43:21 05:43:21.768 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.98 s <<< FAILURE! -- in org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest
> Error: 05:43:21 05:43:21.773 [ERROR] org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple -- Time elapsed: 40.97 s <<< ERROR!
> Feb 07 05:43:21 org.apache.flink.util.FlinkRuntimeException: Error in serialization.
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:327)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:162)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:1007)
> Feb 07 05:43:21 	at org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:56)
> Feb 07 05:43:21 	at org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:45)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:61)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.LocalExecutor.getJobGraph(LocalExecutor.java:104)
> Feb 07 05:43:21 	at org.apache.flink.client.deployment.executors.LocalExecutor.execute(LocalExecutor.java:81)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2440)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2421)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollectWithClient(DataStream.java:1495)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1382)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1367)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.validateRow(ProtobufTestHelper.java:66)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:89)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:76)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:71)
> Feb 07 05:43:21 	at org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple(VeryBigPbRowToProtoTest.java:37)
> Feb 07 05:43:21 	at java.lang.reflect.Method.invoke(Method.java:498)
> Feb 07 05:43:21 Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Self-suppression not permitted
> Feb 07 05:43:21 	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> Feb 07 05:43:21 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> Feb 07 05:43:21 	at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:323)
> Feb 07 05:43:21 	... 18 more
> Feb 07 05:43:21 Caused by: java.lang.IllegalArgumentException:
[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812045#comment-17812045 ]

Sai Sharath Dandi commented on FLINK-33817:
-------------------------------------------

[~libenchao] [~maosuhan], Gentle ping on this ticket, as there is a very high performance impact here for the Protobuf format.

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -----------------------------------------------------------------
>
>                 Key: FLINK-33817
>                 URL: https://issues.apache.org/jira/browse/FLINK-33817
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> *Background*
>
> The current Protobuf format
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
> always sets ReadDefaultValues=False when using the Proto3 version. This can
> cause severe performance degradation for large Protobuf schemas with OneOf
> fields, as the entire generated code needs to be executed during
> deserialization even when certain fields are not present in the data and all
> the subsequent nested fields could be skipped. Proto3 supports hasXXX()
> methods for checking field presence for non-primitive types since Proto
> version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0].
> In internal performance benchmarks at our company, we've seen almost a 10x
> difference in performance for one of our real production use cases when
> allowing ReadDefaultValues=False with the proto3 version. The exact
> difference depends on the schema complexity and data payload, but we should
> allow users to set readDefaultValues=False in general.
>
> *Solution*
>
> Support using ReadDefaultValues=False with the Proto3 version. We need to be
> careful to check for field presence only on non-primitive types, if
> ReadDefaultValues is false and the version used is Proto3.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
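[Editor's note] The presence rule stated in the solution above can be encoded as a tiny predicate, useful for reasoning about when the codegen should emit a hasXXX() guard. This is a sketch with hypothetical names, not Flink's actual codegen API.

```java
/**
 * Hypothetical encoding of the rule from FLINK-33817: under proto3, the
 * generated deserializer should guard a field with its hasXXX() check only
 * when readDefaultValues is false AND the field is non-primitive (message,
 * oneof member, etc.). Primitive proto3 fields carry no presence bit, so
 * they are always read.
 */
public final class PresenceRule {
    public static boolean emitHasCheckProto3(boolean readDefaultValues, boolean isPrimitive) {
        return !readDefaultValues && !isPrimitive;
    }
}
```

The payoff described in the ticket comes from the `true` branch: skipping an absent message field skips all of its nested fields' conversion code.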
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804434#comment-17804434 ]

Sai Sharath Dandi commented on FLINK-33611:
-------------------------------------------

[~libenchao] Thanks for the feedback, I agree with you on the points mentioned. Let me update the pull request as discussed above.

> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
>
> Flink serializes and deserializes protobuf format data by calling the decode
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the
> ability to split the generated code to improve the performance for large
> Protobuf schemas. However, this is still not sufficient for some larger
> protobuf schemas, as the generated code exceeds the Java constant pool size
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and
> we see errors like "Too many constants" when trying to compile the generated
> code.
>
> *Solution*
>
> Since the split code functionality is already in place, the main proposal
> here is to reuse variable names across different split method scopes. This
> will greatly reduce the constant pool size. Another optimization is to split
> the last code segment only when its size exceeds the split threshold.
> Currently, the last segment of the generated code is always split, which can
> lead to too many split methods and thus exceed the constant pool size limit.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
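[Editor's note] The variable-name-reuse proposal above can be sketched as a small allocator that resets its counter at each split-method boundary, so the same handful of names (and thus the same Utf8 constant-pool entries) is shared by every scope. This is an illustrative helper under assumed names, not Flink's actual code splitter.

```java
/**
 * Hypothetical scoped name allocator: instead of generating globally unique
 * locals (var0, var1, ... growing with the schema), reset numbering at the
 * start of every split method. With N split methods of at most K locals
 * each, the class file then contains K distinct name constants, not N*K.
 */
public final class ScopedNameAllocator {
    private int next = 0;

    /** Called when codegen enters a new split-method scope. */
    public void enterScope() {
        next = 0;
    }

    /** Returns the next local-variable name within the current scope. */
    public String newName() {
        return "var" + (next++);
    }
}
```

Per the later comments on this ticket, local variable names may not actually dominate the constant pool for large classes, but the scheme remains harmless and cheap.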
[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803389#comment-17803389 ]

Sai Sharath Dandi commented on FLINK-33817:
-------------------------------------------

[~maosuhan] Gentle ping on this.

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -----------------------------------------------------------------
>
>                 Key: FLINK-33817
>                 URL: https://issues.apache.org/jira/browse/FLINK-33817
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>
> *Background*
>
> The current Protobuf format
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
> always sets ReadDefaultValues=False when using the Proto3 version. This can
> cause severe performance degradation for large Protobuf schemas with OneOf
> fields, as the entire generated code needs to be executed during
> deserialization even when certain fields are not present in the data and all
> the subsequent nested fields could be skipped. Proto3 supports hasXXX()
> methods for checking field presence for non-primitive types since Proto
> version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0].
> In internal performance benchmarks at our company, we've seen almost a 10x
> difference in performance for one of our real production use cases when
> allowing ReadDefaultValues=False with the proto3 version. The exact
> difference depends on the schema complexity and data payload, but we should
> allow users to set readDefaultValues=False in general.
>
> *Solution*
>
> Support using ReadDefaultValues=False with the Proto3 version. We need to be
> careful to check for field presence only on non-primitive types, if
> ReadDefaultValues is false and the version used is Proto3.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803388#comment-17803388 ]

Sai Sharath Dandi edited comment on FLINK-33611 at 1/5/24 3:03 AM:
-------------------------------------------------------------------

[~libenchao], Thanks for the suggestion to add the test case to the pull request. While working on an appropriate test case, I discovered something interesting: the local variable names are somehow not part of the Java constant pool for large schemas. I believe Java triggers some optimizations internally and stores the variable names elsewhere when the code size becomes too large, so I'm not sure whether reusing variable names has any impact on supporting large schemas. Perhaps it can reduce the work needed for the Java compiler to rewrite variable names and result in faster compile times, but I haven't conducted any experiments on that aspect.

Apart from that, the code change that reduces the number of split methods has the most impact on supporting large schemas, as I found in my experiments that method names are always included in the constant pool even when the code size is too large. In fact, this is the main cause of the "Too many constants" compilation errors.

That being said, I would still prefer to keep the changes that reuse variable names, since the change itself is non-intrusive, harmless, and can only improve compilation performance. Please let me know your thoughts.

was (Author: JIRAUSER298466):
@libenchao, Thanks for the suggestion to add the test case to the pull request. While working on making an appropriate test case, I discovered something interesting. The local variable names are somehow not part of the Java constant pool for large schemas. I believe Java is triggering some optimizations internally and storing the variable names elsewhere when the code size becomes too large, therefore I'm not sure if reusing variable names has any impact on supporting large schemas. Perhaps, it can reduce the work needed for the Java compiler to rewrite variable names and result in faster compile times but I haven't conducted any experiment on that aspect. Apart from that, making the code change to reduce too many split methods has the most impact in supporting large schemas as I found that method names are always included in the constant pool even when the code size is too large from my experiment. In fact, this is the main reason which causes compilation errors with "too many constants error". With that being said, I would still prefer to keep the changes to reuse variable names since the change itself is non-intrusive, harmless, and can only improve the performance for compilation. Please let me know your thoughts

> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
>
> Flink serializes and deserializes protobuf format data by calling the decode
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the
> ability to split the generated code to improve the performance for large
> Protobuf schemas. However, this is still not sufficient for some larger
> protobuf schemas, as the generated code exceeds the Java constant pool size
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and
> we see errors like "Too many constants" when trying to compile the generated
> code.
>
> *Solution*
>
> Since the split code functionality is already in place, the main proposal
> here is to reuse variable names across different split method scopes. This
> will greatly reduce the constant pool size. Another optimization is to split
> the last code segment only when its size exceeds the split threshold.
> Currently, the last segment of the generated code is always split, which can
> lead to too many split methods and thus exceed the constant pool size limit.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803388#comment-17803388 ]

Sai Sharath Dandi commented on FLINK-33611:
-------------------------------------------

@libenchao, Thanks for the suggestion to add the test case to the pull request. While working on an appropriate test case, I discovered something interesting: the local variable names are somehow not part of the Java constant pool for large schemas. I believe Java triggers some optimizations internally and stores the variable names elsewhere when the code size becomes too large, so I'm not sure whether reusing variable names has any impact on supporting large schemas. Perhaps it can reduce the work needed for the Java compiler to rewrite variable names and result in faster compile times, but I haven't conducted any experiments on that aspect.

Apart from that, the code change that reduces the number of split methods has the most impact on supporting large schemas, as I found in my experiments that method names are always included in the constant pool even when the code size is too large. In fact, this is the main cause of the "Too many constants" compilation errors.

That being said, I would still prefer to keep the changes that reuse variable names, since the change itself is non-intrusive, harmless, and can only improve compilation performance. Please let me know your thoughts.

> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
>
> Flink serializes and deserializes protobuf format data by calling the decode
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the
> ability to split the generated code to improve the performance for large
> Protobuf schemas. However, this is still not sufficient for some larger
> protobuf schemas, as the generated code exceeds the Java constant pool size
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and
> we see errors like "Too many constants" when trying to compile the generated
> code.
>
> *Solution*
>
> Since the split code functionality is already in place, the main proposal
> here is to reuse variable names across different split method scopes. This
> will greatly reduce the constant pool size. Another optimization is to split
> the last code segment only when its size exceeds the split threshold.
> Currently, the last segment of the generated code is always split, which can
> lead to too many split methods and thus exceed the constant pool size limit.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802335#comment-17802335 ]

Sai Sharath Dandi edited comment on FLINK-33611 at 1/3/24 10:15 PM:
--------------------------------------------------------------------

[~libenchao] All identifier names in the code are part of the constant pool, including local variable names. You can use the javap tool on a simple class file to examine the constant pool contents ([ref|https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool]). Here's an example class and its constant pool contents obtained with javap:
{code:java}
public class Hello {
    public void sayHello1() {
        Integer a1 = 1;
        int b = 2;
        String c = "hi";
    }

    public void sayHello2() {
        Integer a2 = 3;
        int b = 2;
        String c = "hello";
    }
}
{code}
{code:java}
Constant pool:
   #1 = Methodref          #6.#25         // java/lang/Object."<init>":()V
   #2 = Methodref          #26.#27        // java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   #3 = String             #28            // hi
   #4 = String             #29            // hello
   #5 = Class              #30            // com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
   #6 = Class              #31            // java/lang/Object
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello;
  #14 = Utf8               sayHello1
  #15 = Utf8               a1
  #16 = Utf8               Ljava/lang/Integer;
  #17 = Utf8               b
  #18 = Utf8               I
  #19 = Utf8               c
  #20 = Utf8               Ljava/lang/String;
  #21 = Utf8               sayHello2
  #22 = Utf8               a2
  #23 = Utf8               SourceFile
  #24 = Utf8               Hello.java
  #25 = NameAndType        #7:#8          // "<init>":()V
  #26 = Class              #32            // java/lang/Integer
  #27 = NameAndType        #33:#34        // valueOf:(I)Ljava/lang/Integer;
  #28 = Utf8               hi
  #29 = Utf8               hello
  #30 = Utf8               com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
  #31 = Utf8               java/lang/Object
  #32 = Utf8               java/lang/Integer
  #33 = Utf8               valueOf
  #34 = Utf8               (I)Ljava/lang/Integer;
{code}
As we can see from the above example, local variable names are part of the constant pool.

was (Author: JIRAUSER298466):
[~libenchao] All identifier names in the code are part of the constant pool, including local variable names. You can use the javap tool on a simple class file to examine the constant pool contents ([ref|https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool]). Here's an example class and its constant pool contents obtained with javap:
{code:java}
public class Hello {
    public void sayHello1() {
        Integer a1;
        int b;
        String c;
    }

    public void sayHello2() {
        Integer a2;
        int b;
        String c;
    }
}
{code}
{code:java}
Constant pool:
   #1 = Methodref          #6.#25         // java/lang/Object."<init>":()V
   #2 = Methodref          #26.#27        // java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   #3 = String             #28            // hi
   #4 = String             #29            // hello
   #5 = Class              #30            // com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
   #6 = Class              #31            // java/lang/Object
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello;
  #14 = Utf8               sayHello1
  #15 = Utf8               a1
  #16 = Utf8               Ljava/lang/Integer;
  #17 = Utf8               b
  #18 = Utf8               I
  #19 = Utf8               c
  #20 = Utf8               Ljava/lang/String;
  #21 = Utf8               sayHello2
  #22 = Utf8               a2
  #23 = Utf8               SourceFile
  #24 = Utf8               Hello.java
  #25 = NameAndType        #7:#8          // "<init>":()V
  #26 = Class              #32            // java/lang/Integer
  #27 = NameAndType        #33:#34        // valueOf:(I)Ljava/lang/Integer;
  #28 = Utf8               hi
  #29 = Utf8               hello
  #30 = Utf8               com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
  #31 = Utf8               java/lang/Object
  #32 = Utf8               java/lang/Integer
  #33 = Utf8               valueOf
  #34 = Utf8
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802335#comment-17802335 ] Sai Sharath Dandi commented on FLINK-33611: --- [~libenchao] All identifier names in the code are part of the constant pool including local variable names. You can use the javap tool on a simple class file to examine the constant pool contents - [ref|[https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool].] Here's an example class and it's constant pool content obtained with javap - {code:java} public class Hello { public void sayHello1() { Integer a1; int b; String c; } public void sayHello2() { Integer a2; int b; String c; } } {code} {code:java} Constant pool: #1 = Methodref #6.#25 // java/lang/Object."":()V #2 = Methodref #26.#27 // java/lang/Integer.valueOf:(I)Ljava/lang/Integer; #3 = String #28 // hi #4 = String #29 // hello #5 = Class #30 // com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello #6 = Class #31 // java/lang/Object #7 = Utf8 #8 = Utf8 ()V #9 = Utf8 Code #10 = Utf8 LineNumberTable #11 = Utf8 LocalVariableTable #12 = Utf8 this #13 = Utf8 Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello; #14 = Utf8 sayHello1 #15 = Utf8 a1 #16 = Utf8 Ljava/lang/Integer; #17 = Utf8 b #18 = Utf8 I #19 = Utf8 c #20 = Utf8 Ljava/lang/String; #21 = Utf8 sayHello2 #22 = Utf8 a2 #23 = Utf8 SourceFile #24 = Utf8 Hello.java #25 = NameAndType #7:#8 // "":()V #26 = Class #32 // java/lang/Integer #27 = NameAndType #33:#34 // valueOf:(I)Ljava/lang/Integer; #28 = Utf8 hi #29 = Utf8 hello #30 = Utf8 com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello #31 = Utf8 java/lang/Object #32 = Utf8 java/lang/Integer #33 = Utf8 valueOf #34 = Utf8 (I)Ljava/lang/Integer; {code} As we can see from the above example, local variable names are part of the constant pool > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: 
https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the Java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since the split code functionality is already introduced, the > main proposal here is to reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to split the last code segment only when its size > exceeds the split threshold limit. Currently, the last segment of the generated > code is always split, which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
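The constant-pool saving from reusing variable names can be illustrated with a small, self-contained sketch. The naming scheme below (`field0_`, `fieldValue`) is made up for illustration and is not the actual generated naming:

```java
import java.util.HashSet;
import java.util.Set;

// Each distinct local-variable name adds a Utf8 entry to the class constant pool,
// so unique per-field names grow with the schema while a name reused across
// split-method scopes contributes a single entry regardless of schema size.
class NameReuseEstimate {
    // Distinct names when every field gets its own identifier in the generated code.
    static int uniqueNames(int fields) {
        Set<String> pool = new HashSet<>();
        for (int i = 0; i < fields; i++) {
            pool.add("field" + i + "_"); // e.g. field0_, field1_, ...
        }
        return pool.size();
    }

    // Distinct names when each split method reuses the same identifier in its own scope.
    static int reusedNames(int fields) {
        Set<String> pool = new HashSet<>();
        for (int i = 0; i < fields; i++) {
            pool.add("fieldValue"); // same name, separate method scopes
        }
        return pool.size();
    }

    public static void main(String[] args) {
        System.out.println(uniqueNames(44_368)); // grows with the schema
        System.out.println(reusedNames(44_368)); // stays at 1
    }
}
```

Since Java scopes locals per method, reusing one name across split methods is safe and collapses thousands of Utf8 entries into one.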
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799963#comment-17799963 ] Sai Sharath Dandi commented on FLINK-33611: --- Yes, it is quite large indeed, but it's one of the most important use cases in my company that we need to support with Flink. Please help us review the solution/PR and let me know if there are any concerns. Thanks! > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959 ] Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:09 PM: -- Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using the {{PbFormatUtils.isSimpleType()}} [method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java]. Also, we can add user guidance to only enable this setting when using a Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. was (Author: JIRAUSER298466): Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using {{PbFormatUtils.isSimpleType()}} [method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. 
> Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
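The presence-check rule discussed in this thread can be sketched as follows. This is an illustration of the proposal, not the actual Flink codegen: `isSimpleType` here is a simplified stand-in for `PbFormatUtils.isSimpleType()`, and the scalar-marked-`optional` case from protobuf 3.15+ is not modeled:

```java
// Hypothetical decision sketch: with proto3 and readDefaultValues=false, emit a
// hasXxx() presence guard only for non-primitive (message) fields, since plain
// proto3 primitives have no hasXxx() unless declared optional.
class PresenceGuardSketch {
    enum Syntax { PROTO2, PROTO3 }

    // True for scalar fields, false for message/enum fields (simplified stand-in).
    static boolean isSimpleType(String fieldType) {
        switch (fieldType) {
            case "int32": case "int64": case "bool":
            case "float": case "double": case "string": case "bytes":
                return true;
            default:
                return false;
        }
    }

    static boolean emitPresenceCheck(Syntax syntax, boolean readDefaultValues, String fieldType) {
        if (readDefaultValues) {
            return false; // defaults fill absent fields, no guard needed
        }
        // proto2 exposes hasXxx() broadly; proto3 (since 3.15) exposes it for message fields.
        return syntax == Syntax.PROTO2 || !isSimpleType(fieldType);
    }

    public static void main(String[] args) {
        System.out.println(emitPresenceCheck(Syntax.PROTO3, false, "SomeMessage")); // true
        System.out.println(emitPresenceCheck(Syntax.PROTO3, false, "int64"));       // false
    }
}
```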
[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959 ] Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:08 PM: -- Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using PbFormatUtils.isSimpleType [method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. was (Author: JIRAUSER298466): Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using PbFormatUtils.isSimpleType [method|[https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java].] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. 
> Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959 ] Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:08 PM: -- Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using {{PbFormatUtils.isSimpleType()}} [method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. was (Author: JIRAUSER298466): Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using PbFormatUtils.isSimpleType [method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. 
> Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959 ] Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:07 PM: -- Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional. We can distinguish the primitive types using PbFormatUtils.isSimpleType [method|[https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java].] We can add user guidance to only enable this setting when using Protobuf version newer than 3.15. The performance difference should also be called out explicitly in the documentation so that the user can make an informed decision. was (Author: JIRAUSER298466): Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. 
In fact, the hasXXX() method is supported for Primitive types also if defined as optional > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959 ] Sai Sharath Dandi commented on FLINK-33817: --- Yes, it is supported for message fields as well (although I'm not exactly sure which version). See this [documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3] for more details. In fact, the hasXXX() method is supported for Primitive types also if defined as optional > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. 
The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799584#comment-17799584 ] Sai Sharath Dandi commented on FLINK-33817: --- [~libenchao] , Can you please take a look at this ticket and assign it to me? > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. 
We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33817: -- Description: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance for one of our real production usecase when allowing to set ReadDefaultValues=False with proto3 version. The exact difference in performance depends on the schema complexity and data payload but we should allow user to set readDefaultValue=False in general. *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 was: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. 
This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance for one of our real production usecase when allowing to set ReadDefaultValues=False with proto3 version. The exact difference in performance depends on the schema complexity and data payload but we should allow readDefaultValue=False in general. *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. 
Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance for one of our real production usecase when > allowing to set ReadDefaultValues=False with proto3 version. The exact > difference in performance depends on the schema complexity and data payload > but we should allow user to set readDefaultValue=False in general. > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
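A minimal sketch of the decode shape the description argues for, using made-up `Person`/`Address` stand-ins rather than real protobuf-generated classes:

```java
// With readDefaultValues=false, the nested conversion is skipped entirely for
// absent fields; with readDefaultValues=true, every field is converted from a
// default-filled instance. Skipping the nested work is where the reported
// performance win for large schemas with OneOf fields comes from.
class DecodeShapeSketch {
    static class Address { String city = "default-city"; }

    static class Person {
        Address address; // null when the field is absent on the wire
        boolean hasAddress() { return address != null; }        // proto3 >= 3.15 style presence
        Address getAddress() { return address != null ? address : new Address(); } // default instance
    }

    static Object convertCity(Person p, boolean readDefaultValues) {
        if (readDefaultValues) {
            return p.getAddress().city;   // always runs the nested conversion
        }
        if (p.hasAddress()) {
            return p.getAddress().city;   // convert only when the field is present
        }
        return null;                      // absent: skip the nested subtree entirely
    }

    public static void main(String[] args) {
        Person absent = new Person();
        System.out.println(convertCity(absent, true));  // default-city
        System.out.println(convertCity(absent, false)); // null
    }
}
```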
[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560 ] Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 9:23 PM: - [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in Java. The constant pool includes many kinds of entries, but if we count only the identifier names and assume the generated code uses 2 identifiers per field in the schema (one for protobuf and one for rowdata), then as a rough estimate there cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names, etc. was (Author: JIRAUSER298466): [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers(one for protobuf and one for rowdata) the generated code used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. 
FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
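The back-of-envelope bound from the comment above can be written out directly:

```java
// Rough capacity estimate: ~2 identifier (Utf8) entries per schema field against
// a 65536-entry constant pool caps the schema at about 32768 fields, before
// counting method names, class names, and the pool's other entry kinds.
class ConstantPoolBound {
    static int maxFields(int poolLimit, int identifiersPerField) {
        return poolLimit / identifiersPerField;
    }

    public static void main(String[] args) {
        System.out.println(maxFields(65536, 2)); // 32768
    }
}
```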
[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560 ] Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 9:22 PM: - [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers(one for protobuf and one for rowdata) the generated code used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. was (Author: JIRAUSER298466): [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers in the generated code used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. 
FLINK-32650 introduced the > ability to split the generated code to improve performance for large > Protobuf schemas. However, this is still not sufficient for some > larger Protobuf schemas, as the generated code exceeds the Java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since the code-split functionality is already in place, the > main proposal here is to reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to split the last code segment only when its size > exceeds the split threshold. Currently, the last segment of the generated > code is always split, which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
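The back-of-the-envelope bound from the comment above can be sketched as follows. The class and helper names are illustrative, not Flink code; only the constants (the 65536-entry constant pool limit and the assumed 2 identifiers per field) come from the discussion.

```java
// The JVM stores constant_pool_count in an unsigned 16-bit field, so a
// single class file can hold at most 65536 constant pool entries. If the
// generated decode/encode code consumes roughly 2 identifier entries per
// schema field, the field count is bounded by 65536 / 2. Purely
// illustrative: real generated code also spends entries on split method
// names, class names, string literals, etc., so the true limit is lower.
public class ConstantPoolEstimate {
    static final int CONSTANT_POOL_LIMIT = 65536;

    /** Upper bound on schema fields for a given identifier cost per field. */
    static int maxFields(int identifiersPerField) {
        return CONSTANT_POOL_LIMIT / identifiersPerField;
    }

    public static void main(String[] args) {
        System.out.println(maxFields(2)); // 32768, matching the estimate above
    }
}
```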
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799583#comment-17799583 ] Sai Sharath Dandi commented on FLINK-33611: --- Fwiw, I've run into this issue with a real production use-case schema at my company that has about 44,368 fields in the Protobuf schema > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the > ability to split the generated code to improve performance for large > Protobuf schemas. However, this is still not sufficient for some > larger Protobuf schemas, as the generated code exceeds the Java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since the code-split functionality is already in place, the > main proposal here is to reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to split the last code segment only when its size > exceeds the split threshold. Currently, the last segment of the generated > code is always split, which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560 ] Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 8:18 PM: - [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers in the generated code used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. was (Author: JIRAUSER298466): [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. 
However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
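The variable-name reuse proposal can be illustrated with a minimal sketch (hypothetical method and variable names, not actual Flink codegen output). Identifiers recorded in the class file each cost a CONSTANT_Utf8 entry (for locals, when the compiler emits local-variable debug info), so giving every split method uniquely numbered locals scales the constant pool with the number of splits, while reusing one name per scope keeps that cost constant.

```java
// Illustration only: two "split" methods as codegen might emit them.
// Unique per-method names (fieldValue0, fieldValue1, ...) add one fresh
// identifier per split method; reusing the same local name in each scope
// is legal Java, because the scopes are independent, and adds it once.
public class SplitMethodNaming {
    // Before: uniquely named locals, one fresh identifier per split.
    static String decodeSplit0(String in) { String fieldValue0 = in; return fieldValue0; }
    static String decodeSplit1(String in) { String fieldValue1 = in; return fieldValue1; }

    // After: the same local name reused across split method scopes.
    static String decodeSplit0Reused(String in) { String tmp = in; return tmp; }
    static String decodeSplit1Reused(String in) { String tmp = in; return tmp; }

    public static void main(String[] args) {
        // Behavior is identical either way; only the identifier count shrinks.
        System.out.println(decodeSplit0("a").equals(decodeSplit0Reused("a")));
    }
}
```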
[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560 ] Sai Sharath Dandi commented on FLINK-33611: --- [~libenchao] , The constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] is 65536 entries in java. The constant pool size includes a lot of things but if we count only the identifier names and assume there are 2 identifiers used for each field in the schema for rough estimation. There cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is lower than that because we did not include split method names, class names etc.. > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. 
One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33817: -- Description: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3. This can cause severe performance degradation for large Protobuf schemas with OneOf fields, as the entire generated code is executed during deserialization even when certain fields are absent from the input and all of their nested fields could be skipped. Proto3 supports hasXXX() methods for checking field presence for non-primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In our company's internal performance benchmarks, we've seen an almost 10x performance difference for one of our real production use cases when ReadDefaultValues=False is allowed with proto3. The exact difference in performance depends on the schema complexity and data payload, but we should allow readDefaultValues=False in general. *Solution* Support ReadDefaultValues=False when using Proto3. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and the version used is Proto3 was: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. 
This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance when allowing to set ReadDefaultValues=False with proto3 version *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. 
In > our company's internal performance benchmarks, we've seen an almost 10x > performance difference for one of our real production use cases when > ReadDefaultValues=False is allowed with proto3. The exact > difference in performance depends on the schema complexity and data payload, > but we should allow readDefaultValues=False in general. > > *Solution* > > Support ReadDefaultValues=False when using Proto3. We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and the version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
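The proposed presence check can be sketched as below. The message interface and field names are made up for illustration; in real generated code the check would call the hasXxx() accessors that protobuf generates for non-primitive fields (available since v3.15).

```java
// Sketch: with readDefaultValues=false on proto3, gate the conversion of
// a non-primitive field behind its presence check, so an absent subtree
// is skipped entirely instead of being materialized with default values.
public class PresenceAwareConvert {
    /** Minimal stand-in for a generated protobuf message class. */
    interface PersonProto {
        boolean hasAddress();    // generated presence check (message field)
        String getAddressCity(); // flattened nested accessor for the sketch
    }

    /** Returns the nested value, or null without touching the subtree. */
    static String convertAddressCity(PersonProto p) {
        if (!p.hasAddress()) {
            return null; // skip all nested field conversion code
        }
        return p.getAddressCity();
    }

    public static void main(String[] args) {
        PersonProto absent = new PersonProto() {
            public boolean hasAddress() { return false; }
            // Would only run if the presence check were skipped:
            public String getAddressCity() { throw new IllegalStateException(); }
        };
        System.out.println(convertAddressCity(absent)); // null: subtree skipped
    }
}
```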
[jira] [Updated] (FLINK-33611) Support Large Protobuf Schemas
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33611: -- Summary: Support Large Protobuf Schemas (was: Add the ability to reuse variable names across different split method scopes) > Support Large Protobuf Schemas > -- > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Assignee: Sai Sharath Dandi >Priority: Major > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
[ https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33817: -- Description: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance when allowing to set ReadDefaultValues=False with proto3 version *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 was: *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. 
Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In internal benchmarks in our company, we've seen almost 10x difference in performance when allowing to set ReadDefaultValues=False with proto3 version *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 > Allow ReadDefaultValues = False for non primitive types on Proto3 > - > > Key: FLINK-33817 > URL: https://issues.apache.org/jira/browse/FLINK-33817 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > *Background* > > The current Protobuf format > [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] > always sets ReadDefaultValues=False when using Proto3 version. This can > cause severe performance degradation for large Protobuf schemas with OneOf > fields as the entire generated code needs to be executed during > deserialization even when certain fields are not present in the data to be > deserialized and all the subsequent nested Fields can be skipped. Proto3 > supports hasXXX() methods for checking field presence for non primitive types > since Proto version > [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In > the internal performance benchmarks in our company, we've seen almost 10x > difference in performance when allowing to set ReadDefaultValues=False with > proto3 version > > *Solution* > > Support using ReadDefaultValues=False when using Proto3 version. 
We need to > be careful to check for field presence only on non-primitive types if > ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3
Sai Sharath Dandi created FLINK-33817: - Summary: Allow ReadDefaultValues = False for non primitive types on Proto3 Key: FLINK-33817 URL: https://issues.apache.org/jira/browse/FLINK-33817 Project: Flink Issue Type: Improvement Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Affects Versions: 1.18.0 Reporter: Sai Sharath Dandi *Background* The current Protobuf format [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java] always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In internal benchmarks in our company, we've seen almost 10x difference in performance when allowing to set ReadDefaultValues=False with proto3 version *Solution* Support using ReadDefaultValues=False when using Proto3 version. We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33611) Add the ability to reuse variable names across different split method scopes
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33611: -- Description: h3. Background Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we can see errors like "Too many constants" when trying to compile the generated code. *Solution* Since we already have the split code functionality already introduced, the main proposal here is to now reuse the variable names across different split method scopes. This will greatly reduce the constant pool size. One more optimization is to only split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods and thus exceed the constant pool size limit was: h3. Background Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we can see errors like "Too many constants" when trying to compile the generated code. 
*Solution* Since we already have the split code functionality already introduced, the main proposal here is to now reuse the variable names across different split method scopes. This will greatly reduce the constant pool size. One more optimization is to only split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods. > Add the ability to reuse variable names across different split method scopes > > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. 
Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)
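The second optimization above amounts to a policy change, sketched here with illustrative names and threshold values (not the actual codegen knobs): today the trailing code segment is always extracted into its own method, while the proposal extracts it only when it exceeds the split threshold on its own.

```java
// Illustration: count how many helper methods a list of code segments
// produces. Interior segments are always extracted; the final segment is
// extracted unconditionally under today's behavior, but under the
// proposed policy only when its own size exceeds the threshold.
public class LastSegmentSplitPolicy {
    static int methodCount(int[] segmentLengths, int threshold, boolean alwaysSplitLast) {
        int methods = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            boolean isLast = (i == segmentLengths.length - 1);
            if (!isLast || alwaysSplitLast || segmentLengths[i] > threshold) {
                methods++; // this segment becomes its own split method
            }
        }
        return methods;
    }

    public static void main(String[] args) {
        int[] segments = {4000, 4000, 12}; // tiny trailing segment
        System.out.println(methodCount(segments, 4000, true));  // 3: today
        System.out.println(methodCount(segments, 4000, false)); // 2: proposed
    }
}
```

Fewer extracted methods means fewer method-name entries in the constant pool, which is exactly the resource the "Too many constants" error exhausts.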
[jira] [Updated] (FLINK-33611) Add the ability to reuse variable names across different split method scopes
[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Sharath Dandi updated FLINK-33611: -- Description: h3. Background Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we can see errors like "Too many constants" when trying to compile the generated code. *Solution* Since we already have the split code functionality already introduced, the main proposal here is to now reuse the variable names across different split method scopes. This will greatly reduce the constant pool size. One more optimization is to only split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods. was: h3. Background Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we can see errors like "Too many constants" when trying to compile the generated code. 
*Solution* Since we already have the split code functionality already introduced, the main proposal here is to now use different variable names across different split method scopes. This will greatly reduce the constant pool size. One more optimization is to only split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods. > Add the ability to reuse variable names across different split method scopes > > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.18.0 >Reporter: Sai Sharath Dandi >Priority: Major > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33611) Add the ability to reuse variable names across different split method scopes
Sai Sharath Dandi created FLINK-33611: - Summary: Add the ability to reuse variable names across different split method scopes Key: FLINK-33611 URL: https://issues.apache.org/jira/browse/FLINK-33611 Project: Flink Issue Type: Improvement Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Affects Versions: 1.18.0 Reporter: Sai Sharath Dandi h3. Background Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we can see errors like "Too many constants" when trying to compile the generated code. *Solution* Since we already have the split code functionality already introduced, the main proposal here is to now use different variable names across different split method scopes. This will greatly reduce the constant pool size. One more optimization is to only split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-9161) Support STRUCT syntax to create named STRUCT in SQL
[ https://issues.apache.org/jira/browse/FLINK-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677996#comment-17677996 ] Sai Sharath Dandi commented on FLINK-9161: -- Was able to get it working using a cast inside the row like below for anyone interested {code:java} SELECT CAST(ROW(cast(PersonTable.person.name as string), cast(PersonTable.person.age as int)) as ROW< new_name string, new_age int > ) as new_person FROM (VALUES (CAST(ROW('Bob', 10) as ROW< name string, age int>)),(CAST(ROW('Alice', 20) as ROW< name string, age int>))) AS PersonTable(person); {code} > Support STRUCT syntax to create named STRUCT in SQL > --- > > Key: FLINK-9161 > URL: https://issues.apache.org/jira/browse/FLINK-9161 > Project: Flink > Issue Type: New Feature > Components: Table SQL / API >Reporter: Shuyi Chen >Priority: Major > Labels: auto-unassigned > > As discussed in [calcite dev mailing > list|https://mail-archives.apache.org/mod_mbox/calcite-dev/201804.mbox/%3cCAMZk55avGNmp1vXeJwA1B_a8bGyCQ9ahxmE=R=6fklpf7jt...@mail.gmail.com%3e], > we want add support for adding named structure construction in SQL, e.g., > {code:java} > SELECT STRUCT(a as first_name, b as last_name, STRUCT(c as zip code, d as > street, e as state) as address) as record FROM example_table > {code} > This would require adding necessary change in Calcite first. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-9161) Support STRUCT syntax to create named STRUCT in SQL
[ https://issues.apache.org/jira/browse/FLINK-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676305#comment-17676305 ] Sai Sharath Dandi commented on FLINK-9161: -- [~leonard] ROW does not support creating nested type out of a nested type/array type input. {code:java} SELECT CAST(ROW(person.name, person.age) as ROW< new_name string, new_age int > ) as new_person FROM (VALUES (CAST(ROW('Bob', 10) as ROW< name string, age int>)),(CAST(ROW('Alice', 20) as ROW< name string, age int>))) AS PersonTable(person); > [ERROR] Could not execute SQL statement. Reason: org.apache.flink.sql.parser.impl.ParseException: Encountered "." at line 1, column 23. Was expecting one of: ")" ... "," ... {code} Is there any workaround solution here? > Support STRUCT syntax to create named STRUCT in SQL > --- > > Key: FLINK-9161 > URL: https://issues.apache.org/jira/browse/FLINK-9161 > Project: Flink > Issue Type: New Feature > Components: Table SQL / API >Reporter: Shuyi Chen >Priority: Major > Labels: auto-unassigned > > As discussed in [calcite dev mailing > list|https://mail-archives.apache.org/mod_mbox/calcite-dev/201804.mbox/%3cCAMZk55avGNmp1vXeJwA1B_a8bGyCQ9ahxmE=R=6fklpf7jt...@mail.gmail.com%3e], > we want add support for adding named structure construction in SQL, e.g., > {code:java} > SELECT STRUCT(a as first_name, b as last_name, STRUCT(c as zip code, d as > street, e as state) as address) as record FROM example_table > {code} > This would require adding necessary change in Calcite first. -- This message was sent by Atlassian Jira (v8.20.10#820010)