[jira] [Updated] (FLINK-34635) Clear successful records from the batch in JDBC connector

2024-03-09 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-34635:
--
Affects Version/s: 1.18.1

> Clear successful records from the batch in JDBC connector
> -
>
> Key: FLINK-34635
> URL: https://issues.apache.org/jira/browse/FLINK-34635
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / JDBC
>Affects Versions: 1.18.1
>Reporter: Sai Sharath Dandi
>Priority: Minor
>
> Currently, when batch execution fails in the JDBC connector, the whole batch 
> is retried, which is unnecessary. We should clear the records that were 
> already applied successfully in the 
> [SimpleBatchStatementExecutor|https://github.com/apache/flink-connector-jdbc/blob/main/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/internal/executor/SimpleBatchStatementExecutor.java]
> {code:java}
>  @Override
>     public void executeBatch() throws SQLException {
>         if (!batch.isEmpty()) {
>             for (T r : batch) {
>                 parameterSetter.accept(st, r);
>                 st.addBatch();
>             }
>             st.executeBatch();
>             // --> catch the exception and clear the successful records from the batch here
>             batch.clear();
>         }
>     }{code}
>  
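A minimal sketch of the proposed handling (a sketch under assumptions, not the
final design: the surrounding class and fields are from the snippet above, and
the use of BatchUpdateException is a suggestion): JDBC reports per-statement
outcomes through java.sql.BatchUpdateException, so the executor could keep only
the failed records for the next attempt.

{code:java}
// Hypothetical variant of SimpleBatchStatementExecutor#executeBatch();
// requires java.sql.BatchUpdateException, java.sql.Statement,
// java.util.ArrayList and java.util.List.
@Override
public void executeBatch() throws SQLException {
    if (!batch.isEmpty()) {
        for (T r : batch) {
            parameterSetter.accept(st, r);
            st.addBatch();
        }
        try {
            st.executeBatch();
            batch.clear();
        } catch (BatchUpdateException e) {
            // getUpdateCounts() holds one entry per batched statement. Drivers
            // that stop at the first failure return a shorter array; drivers
            // that keep going mark failed entries with Statement.EXECUTE_FAILED.
            int[] counts = e.getUpdateCounts();
            List<T> remaining = new ArrayList<>();
            for (int i = 0; i < batch.size(); i++) {
                if (i >= counts.length || counts[i] == Statement.EXECUTE_FAILED) {
                    remaining.add(batch.get(i));
                }
            }
            // Keep only the records that did not succeed, then rethrow so the
            // writer's retry logic re-executes just those.
            batch.clear();
            batch.addAll(remaining);
            throw e;
        }
    }
}{code}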





[jira] [Updated] (FLINK-34635) Clear successful records from the batch in JDBC connector

2024-03-09 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-34635:
--
Priority: Minor  (was: Major)

> Clear successful records from the batch in JDBC connector
> -
>
> Key: FLINK-34635
> URL: https://issues.apache.org/jira/browse/FLINK-34635
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / JDBC
>Reporter: Sai Sharath Dandi
>Priority: Minor





[jira] [Created] (FLINK-34635) Clear successful records from the batch in JDBC connector

2024-03-09 Thread Sai Sharath Dandi (Jira)
Sai Sharath Dandi created FLINK-34635:
-

 Summary: Clear successful records from the batch in JDBC connector
 Key: FLINK-34635
 URL: https://issues.apache.org/jira/browse/FLINK-34635
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: Sai Sharath Dandi







[jira] [Commented] (FLINK-34403) VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM

2024-02-13 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817115#comment-17817115
 ] 

Sai Sharath Dandi commented on FLINK-34403:
---

Sorry, I couldn't get back here in time. Thanks [~mapohl] for the fix!

> VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
> -
>
> Key: FLINK-34403
> URL: https://issues.apache.org/jira/browse/FLINK-34403
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.20.0
>Reporter: Benchao Li
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> After FLINK-33611 was merged, the misc test on GHA cannot pass due to an out 
> of memory error, throwing the following exception:
> {code:java}
> Error: 05:43:21 05:43:21.768 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 40.98 s <<< FAILURE! -- in 
> org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest
> Error: 05:43:21 05:43:21.773 [ERROR] 
> org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple -- Time 
> elapsed: 40.97 s <<< ERROR!
> Feb 07 05:43:21 org.apache.flink.util.FlinkRuntimeException: Error in 
> serialization.
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:327)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:162)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:1007)
> Feb 07 05:43:21   at 
> org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:56)
> Feb 07 05:43:21   at 
> org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:45)
> Feb 07 05:43:21   at 
> org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:61)
> Feb 07 05:43:21   at 
> org.apache.flink.client.deployment.executors.LocalExecutor.getJobGraph(LocalExecutor.java:104)
> Feb 07 05:43:21   at 
> org.apache.flink.client.deployment.executors.LocalExecutor.execute(LocalExecutor.java:81)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2440)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2421)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.datastream.DataStream.executeAndCollectWithClient(DataStream.java:1495)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1382)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1367)
> Feb 07 05:43:21   at 
> org.apache.flink.formats.protobuf.ProtobufTestHelper.validateRow(ProtobufTestHelper.java:66)
> Feb 07 05:43:21   at 
> org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:89)
> Feb 07 05:43:21   at 
> org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:76)
> Feb 07 05:43:21   at 
> org.apache.flink.formats.protobuf.ProtobufTestHelper.rowToPbBytes(ProtobufTestHelper.java:71)
> Feb 07 05:43:21   at 
> org.apache.flink.formats.protobuf.VeryBigPbRowToProtoTest.testSimple(VeryBigPbRowToProtoTest.java:37)
> Feb 07 05:43:21   at java.lang.reflect.Method.invoke(Method.java:498)
> Feb 07 05:43:21 Caused by: java.util.concurrent.ExecutionException: 
> java.lang.IllegalArgumentException: Self-suppression not permitted
> Feb 07 05:43:21   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> Feb 07 05:43:21   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> Feb 07 05:43:21   at 
> org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:323)
> Feb 07 05:43:21   ... 18 more
> Feb 07 05:43:21 Caused by: java.lang.IllegalArgumentException: 
> Self-suppression not permitted
> Feb 07 05:43:21   at 
> java.lang.Throwable.addSuppressed(Throwable.java:1072)
> Feb 07 05:43:21   at 
> org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:556)
> Feb 07 05:43:21   at 
> org.apache.flink.util.InstantiationUtil.writeObjectToConfig(InstantiationUtil.java:486)
> Feb 07 05:43:21   at 
> 

[jira] [Commented] (FLINK-34403) VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM

2024-02-07 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815422#comment-17815422
 ] 

Sai Sharath Dandi commented on FLINK-34403:
---

The test was added for an extreme case that would fail without the changes made 
in the PR; if we reduce the test data size, the test case would pass even 
without those changes. I tried hard to craft a test that meets all the 
requirements and also fits within the Azure pipelines heap size limit, but I'm 
not sure why it has started failing after passing earlier. FWIW, the test 
passes comfortably in my local environment at a much larger data size than what 
was merged into the codebase. I can remove this test altogether if it is not 
possible to increase the heap size for the tests, as the PR already achieved 
its goal.

> VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM
> -
>
> Key: FLINK-34403
> URL: https://issues.apache.org/jira/browse/FLINK-34403
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.20.0
>Reporter: Benchao Li
>Priority: Major
>  Labels: test-stability

[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2024-01-29 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812045#comment-17812045
 ] 

Sai Sharath Dandi commented on FLINK-33817:
---

[~libenchao] [~maosuhan], gentle ping on this ticket, as it addresses a very 
high performance impact in the Protobuf format.

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always sets ReadDefaultValues=False when using the Proto3 version. This can 
> cause severe performance degradation for large Protobuf schemas with OneOf 
> fields, as the entire generated code needs to be executed during 
> deserialization even when certain fields are not present in the data, in 
> which case all the subsequent nested fields could be skipped. Proto3 
> supports hasXXX() methods for checking field presence for non-primitive 
> types since Protobuf version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> internal performance benchmarks at our company, we've seen almost a 10x 
> difference in performance for one of our real production use cases when 
> allowing ReadDefaultValues=False with the Proto3 version. The exact 
> difference in performance depends on the schema complexity and data payload, 
> but we should allow users to set ReadDefaultValues=False in general.
>  
> *Solution*
>  
> Support using ReadDefaultValues=False with the Proto3 version. We need to 
> be careful to check for field presence only on non-primitive types when 
> ReadDefaultValues is false and the version used is Proto3.
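A hedged illustration of the intended difference (all names below are invented
for the example; the real converter emits flattened generated code): with
ReadDefaultValues=False on Proto3 >= 3.15, the generated code can guard each
non-primitive access with the hasXXX() presence check and skip absent subtrees
entirely.

{code:java}
// Sketch for a hypothetical nested message field `address`; MyProtoMessage
// and convertAddress(...) stand in for generated protobuf and converter code.
static Object convertAddressField(MyProtoMessage message, boolean readDefaultValues) {
    if (!readDefaultValues && !message.hasAddress()) {
        // Field absent on the wire: skip the whole nested conversion,
        // including every nested field of the sub-message.
        return null;
    }
    // Present (or defaults requested): descend into the sub-message.
    return convertAddress(message.getAddress());
}{code}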





[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2024-01-08 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804434#comment-17804434
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

[~libenchao] Thanks for the feedback; I agree with the points mentioned. Let me 
update the pull request as discussed above.

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes Protobuf format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve the performance for large 
> Protobuf schemas. However, this is still not sufficient for some larger 
> Protobuf schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] 
> and we see errors like "Too many constants" when trying to compile the 
> generated code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main 
> proposal here is to reuse the variable names across different split 
> method scopes. This will greatly reduce the constant pool size. One more 
> optimization is to split the last code segment only when its size exceeds 
> the split threshold. Currently, the last segment of the generated code is 
> always split, which can lead to too many split methods and thus exceed the 
> constant pool size limit.
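A toy sketch of the name-reuse idea (class and method names are invented):
local variable names are scoped per method, so every split method can reuse
the same identifiers verbatim, and the class file then stores each name as a
single Utf8 constant pool entry instead of one entry per generated variable.

{code:java}
public class SplitNameReuse {
    // Before the change, codegen would mint fresh names per method
    // (field0_, field1_, field2_, field3_, ...), each costing a pool entry.
    // With reuse, both methods share the identifiers local0 and local1.
    int decodeSplit0(int[] data) {
        int local0 = data[0];
        int local1 = data[1];
        return local0 + local1;
    }

    int decodeSplit1(int[] data) {
        int local0 = data[2];
        int local1 = data[3];
        return local0 * local1;
    }
}{code}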





[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2024-01-04 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803389#comment-17803389
 ] 

Sai Sharath Dandi commented on FLINK-33817:
---

[~maosuhan] Gentle ping on this.

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

2024-01-04 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803388#comment-17803388
 ] 

Sai Sharath Dandi edited comment on FLINK-33611 at 1/5/24 3:03 AM:
---

[~libenchao], thanks for the suggestion to add the test case to the pull 
request. While working on an appropriate test case, I discovered something 
interesting: the local variable names are somehow not part of the Java constant 
pool for large schemas. I believe Java triggers some optimizations internally 
and stores the variable names elsewhere when the code size becomes too large, 
so I'm not sure whether reusing variable names has any impact on supporting 
large schemas. Perhaps it can reduce the work the Java compiler needs to do to 
rewrite variable names and so give faster compile times, but I haven't run any 
experiments on that aspect. Apart from that, the change that reduces the number 
of split methods has the most impact on supporting large schemas: in my 
experiment, method names are always included in the constant pool even when the 
code size is very large, and this is in fact the main cause of the "Too many 
constants" compilation errors.

With that being said, I would still prefer to keep the changes that reuse 
variable names, since the change itself is non-intrusive, harmless, and can 
only improve compilation performance. Please let me know your thoughts.


was (Author: JIRAUSER298466):
@libenchao, Thanks for the suggestion to add the test case to the pull request. 
While working on making an appropriate test case, I discovered something 
interesting. The local variable names are somehow not part of the Java constant 
pool for large schemas. I believe Java is triggering some optimizations 
internally and storing the variable names elsewhere when the code size becomes 
too large, therefore I'm not sure if reusing variable names has any impact on 
supporting large schemas. Perhaps, it can reduce the work needed for the Java 
compiler to rewrite variable names and result in faster compile times but I 
haven't conducted any experiment on that aspect. Apart from that, making the 
code change to reduce too many split methods has the most impact in supporting 
large schemas as I found that method names are always included in the constant 
pool even when the code size is too large from my experiment. In fact, this is 
the main reason which causes compilation errors with "too many constants error"

With that being said, I would still prefer to keep the changes to reuse 
variable names since the change itself is non-intrusive, harmless, and can only 
improve the performance for compilation. Please let me know your thoughts

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available





[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2024-01-04 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803388#comment-17803388
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

@libenchao, thanks for the suggestion to add the test case to the pull request. 
While working on an appropriate test case, I discovered something interesting: 
the local variable names are somehow not part of the Java constant pool for 
large schemas. I believe Java triggers some optimizations internally and stores 
the variable names elsewhere when the code size becomes too large, so I'm not 
sure whether reusing variable names has any impact on supporting large schemas. 
Perhaps it can reduce the work the Java compiler needs to do to rewrite 
variable names and so give faster compile times, but I haven't run any 
experiments on that aspect. Apart from that, the change that reduces the number 
of split methods has the most impact on supporting large schemas: in my 
experiment, method names are always included in the constant pool even when the 
code size is very large, and this is in fact the main cause of the "Too many 
constants" compilation errors.

With that being said, I would still prefer to keep the changes that reuse 
variable names, since the change itself is non-intrusive, harmless, and can 
only improve compilation performance. Please let me know your thoughts.

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available





[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

2024-01-03 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802335#comment-17802335
 ] 

Sai Sharath Dandi edited comment on FLINK-33611 at 1/3/24 10:15 PM:


[~libenchao] All identifier names in the code are part of the constant pool, 
including local variable names. You can use the javap tool on a simple class 
file to examine the constant pool contents - 
[ref|https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool].
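(For reproduction, assuming a plain JDK toolchain: compile with 
{{javac -g Hello.java}} so the LocalVariableTable, and with it the local 
variable names, is emitted into the class file, then dump the pool with 
{{javap -v Hello}}.)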

 

Here's an example class and its constant pool content obtained with javap - 

 

 
{code:java}
public class Hello {

  public void sayHello1() {
Integer a1 = 1;
int b = 2;
String c = "hi";
  }

  public void sayHello2() {
Integer a2 = 3;
int b = 2;
String c = "hello";
  }
}{code}
{code:java}
Constant pool:
   #1 = Methodref          #6.#25         // java/lang/Object."<init>":()V
   #2 = Methodref          #26.#27        // 
java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   #3 = String             #28            // hi
   #4 = String             #29            // hello
   #5 = Class              #30            // 
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
   #6 = Class              #31            // java/lang/Object
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               
Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello;
  #14 = Utf8               sayHello1
  #15 = Utf8               a1
  #16 = Utf8               Ljava/lang/Integer;
  #17 = Utf8               b
  #18 = Utf8               I
  #19 = Utf8               c
  #20 = Utf8               Ljava/lang/String;
  #21 = Utf8               sayHello2
  #22 = Utf8               a2
  #23 = Utf8               SourceFile
  #24 = Utf8               Hello.java
  #25 = NameAndType        #7:#8          // "<init>":()V
  #26 = Class              #32            // java/lang/Integer
  #27 = NameAndType        #33:#34        // valueOf:(I)Ljava/lang/Integer;
  #28 = Utf8               hi
  #29 = Utf8               hello
  #30 = Utf8               
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
  #31 = Utf8               java/lang/Object
  #32 = Utf8               java/lang/Integer
  #33 = Utf8               valueOf
  #34 = Utf8               (I)Ljava/lang/Integer; {code}
 

 

As we can see from the above example, local variable names are part of the 
constant pool.

 


was (Author: JIRAUSER298466):
[~libenchao] All identifier names in the code are part of the constant pool 
including local variable names. You can use the javap tool on a simple class 
file to examine the constant pool contents - 
[ref|[https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool].]

 

Here's an example class and it's constant pool content obtained with javap - 

 

 
{code:java}
public class Hello {

  public void sayHello1() {
Integer a1;
int b;
String c;
  }

  public void sayHello2() {
Integer a2;
int b;
String c;
  }
} {code}
{code:java}
Constant pool:
   #1 = Methodref          #6.#25         // java/lang/Object."<init>":()V
   #2 = Methodref          #26.#27        // 
java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   #3 = String             #28            // hi
   #4 = String             #29            // hello
   #5 = Class              #30            // 
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
   #6 = Class              #31            // java/lang/Object
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               
Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello;
  #14 = Utf8               sayHello1
  #15 = Utf8               a1
  #16 = Utf8               Ljava/lang/Integer;
  #17 = Utf8               b
  #18 = Utf8               I
  #19 = Utf8               c
  #20 = Utf8               Ljava/lang/String;
  #21 = Utf8               sayHello2
  #22 = Utf8               a2
  #23 = Utf8               SourceFile
  #24 = Utf8               Hello.java
  #25 = NameAndType        #7:#8          // "<init>":()V
  #26 = Class              #32            // java/lang/Integer
  #27 = NameAndType        #33:#34        // valueOf:(I)Ljava/lang/Integer;
  #28 = Utf8               hi
  #29 = Utf8               hello
  #30 = Utf8               
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
  #31 = Utf8               java/lang/Object
  #32 = Utf8               java/lang/Integer
  #33 = Utf8               valueOf
  #34 = Utf8               

[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2024-01-03 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802335#comment-17802335
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

[~libenchao] All identifier names in the code are part of the constant pool, 
including local variable names. You can use the javap tool on a simple class 
file to examine the constant pool contents - 
[ref|https://blogs.oracle.com/javamagazine/post/java-class-file-constant-pool].

 

Here's an example class and its constant pool content obtained with javap - 

 

 
{code:java}
public class Hello {

  public void sayHello1() {
Integer a1;
int b;
String c;
  }

  public void sayHello2() {
Integer a2;
int b;
String c;
  }
} {code}
{code:java}
Constant pool:
   #1 = Methodref          #6.#25         // java/lang/Object."<init>":()V
   #2 = Methodref          #26.#27        // 
java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   #3 = String             #28            // hi
   #4 = String             #29            // hello
   #5 = Class              #30            // 
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
   #6 = Class              #31            // java/lang/Object
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               
Lcom/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello;
  #14 = Utf8               sayHello1
  #15 = Utf8               a1
  #16 = Utf8               Ljava/lang/Integer;
  #17 = Utf8               b
  #18 = Utf8               I
  #19 = Utf8               c
  #20 = Utf8               Ljava/lang/String;
  #21 = Utf8               sayHello2
  #22 = Utf8               a2
  #23 = Utf8               SourceFile
  #24 = Utf8               Hello.java
  #25 = NameAndType        #7:#8          // "<init>":()V
  #26 = Class              #32            // java/lang/Integer
  #27 = NameAndType        #33:#34        // valueOf:(I)Ljava/lang/Integer;
  #28 = Utf8               hi
  #29 = Utf8               hello
  #30 = Utf8               
com/uber/athena/athenax/connector/kafka/formats/protobuf/deserialize/Hello
  #31 = Utf8               java/lang/Object
  #32 = Utf8               java/lang/Integer
  #33 = Utf8               valueOf
  #34 = Utf8               (I)Ljava/lang/Integer; {code}
 

 

As we can see from the above example, local variable names are part of the 
constant pool.

 

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available





[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799963#comment-17799963
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

Yes, it is quite large indeed, but it's one of the most important use cases in 
my company that we need to support with Flink. Please help us review the 
solution/PR and let me know if there are any concerns. Thanks!

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available





[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959
 ] 

Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:09 PM:
--

Yes, it is supported for message fields as well (although I'm not exactly sure 
from which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for primitive 
types as well if they are defined as optional.

We can distinguish the primitive types using the {{PbFormatUtils.isSimpleType()}} 
[method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java].
 Also, we can add user guidance to only enable this setting when using a 
Protobuf version newer than 3.15. The performance difference should also be 
called out explicitly in the documentation so that users can make an informed 
decision. 
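A minimal sketch of that decision rule (assuming {{PbFormatUtils.isSimpleType()}} 
takes the field's LogicalType; the wiring into the codegen is omitted and the 
helper below is hypothetical): a presence guard is generated only when default 
values are disabled and the field is non-primitive.

{code:java}
import org.apache.flink.formats.protobuf.util.PbFormatUtils;
import org.apache.flink.table.types.logical.LogicalType;

final class PresenceCheckRule {
    // Proto3 primitive (simple) fields expose no hasXXX() accessor unless
    // declared optional, so only non-primitive fields get the guard.
    static boolean needsPresenceCheck(boolean readDefaultValues, LogicalType fieldType) {
        return !readDefaultValues && !PbFormatUtils.isSimpleType(fieldType);
    }
}{code}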


was (Author: JIRAUSER298466):
Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using {{PbFormatUtils.isSimpleType()}} 
[method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959
 ] 

Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:08 PM:
--

Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using PbFormatUtils.isSimpleType 
[method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 


was (Author: JIRAUSER298466):
Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using PbFormatUtils.isSimpleType 
[method|[https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java].]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959
 ] 

Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:08 PM:
--

Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using {{PbFormatUtils.isSimpleType()}} 
[method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 


was (Author: JIRAUSER298466):
Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using PbFormatUtils.isSimpleType 
[method|https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Comment Edited] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959
 ] 

Sai Sharath Dandi edited comment on FLINK-33817 at 12/22/23 10:07 PM:
--

Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional.

 

We can distinguish the primitive types using PbFormatUtils.isSimpleType 
[method|[https://github.com/apache/flink/blob/master/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/util/PbFormatUtils.java].]
 We can add user guidance to only enable this setting when using Protobuf 
version newer than 3.15. The performance difference should also be called out 
explicitly in the documentation so that the user can make an informed decision. 


was (Author: JIRAUSER298466):
Yes, it is supported for message fields as well (although I'm not exactly sure 
which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for Primitive 
types also if defined as optional 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-22 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799959#comment-17799959
 ] 

Sai Sharath Dandi commented on FLINK-33817:
---

Yes, it is supported for message fields as well (although I'm not exactly sure 
from which version). See this 
[documentation|https://protobuf.dev/programming-guides/field_presence/#how-to-enable-explicit-presence-in-proto3]
 for more details. In fact, the hasXXX() method is supported for primitive 
types as well if they are defined as optional. 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major





[jira] [Commented] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799584#comment-17799584
 ] 

Sai Sharath Dandi commented on FLINK-33817:
---

[~libenchao], can you please take a look at this ticket and assign it to me? 

> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always forces ReadDefaultValues=true when using proto3. This can cause 
> severe performance degradation for large Protobuf schemas with OneOf fields, 
> as the entire generated code needs to be executed during deserialization even 
> when certain fields are absent from the data and all of their nested fields 
> could otherwise be skipped. Proto3 has supported hasXXX() methods for checking 
> field presence on non-primitive types since Protobuf version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> internal performance benchmarks at our company, we've seen almost a 10x 
> performance difference for one of our real production use cases when allowing 
> ReadDefaultValues=False with proto3. The exact difference depends on the 
> schema complexity and data payload, but we should allow 
> readDefaultValues=False in general.
>  
> *Solution*
>  
> Support ReadDefaultValues=False when using proto3. We need to be careful to 
> check for field presence only on non-primitive types if ReadDefaultValues is 
> false and the version used is proto3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-21 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33817:
--
Description: 
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal performance benchmarks at our company, we've seen almost a 10x 
performance difference for one of our real production use cases when allowing 
ReadDefaultValues=False with proto3. The exact difference depends on the schema 
complexity and data payload, but we should allow users to set 
readDefaultValues=False in general.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.

  was:
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal performance benchmarks at our company, we've seen almost a 10x 
performance difference for one of our real production use cases when allowing 
ReadDefaultValues=False with proto3. The exact difference depends on the schema 
complexity and data payload, but we should allow readDefaultValues=False in 
general.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.


> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always forces ReadDefaultValues=true when using proto3. This can cause 
> severe performance degradation for large Protobuf schemas with OneOf fields, 
> as the entire generated code needs to be executed during deserialization even 
> when certain fields are absent from the data and all of their nested fields 
> could otherwise be skipped. Proto3 has supported hasXXX() methods for checking 
> field presence on non-primitive types since Protobuf version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> internal performance benchmarks at our company, we've seen almost a 10x 
> performance difference for one of our real production use cases when allowing 
> ReadDefaultValues=False with proto3. The exact difference depends on the 
> schema complexity and data payload, but we should allow users to set 
> readDefaultValues=False in general.
>  
> *Solution*
>  
> Support ReadDefaultValues=False when using proto3. We need to be careful to 
> check for field presence only on non-primitive types if ReadDefaultValues is 
> false and the version used is proto3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560
 ] 

Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 9:23 PM:
-

[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in Java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume the 
generated code uses 2 identifiers per field in the schema (one for the protobuf 
object and one for the rowdata object), there cannot be more than 65536/2 = 
32768 fields in the Protobuf schema. Of course, the actual limit is lower than 
that because we did not count split method names, class names, etc. 


was (Author: JIRAUSER298466):
[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume the 
generated code uses 2 identifiers per field in the schema (one for the protobuf 
object and one for the rowdata object), there cannot be more than 65536/2 = 
32768 fields in the Protobuf schema. Of course, the actual limit is lower than 
that because we did not count split method names, class names, etc. 
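
For what it's worth, the bound is trivial arithmetic (a sketch under the 
2-identifiers-per-field assumption from the comment above):
{code:java}
// Back-of-the-envelope bound implied by the class-file format: constant
// pool indexes are 16-bit, giving roughly 65536 entries per class.
int constantPoolLimit = 65536;
int identifiersPerField = 2;  // assumed: one protobuf + one rowdata name
System.out.println(constantPoolLimit / identifiersPerField);  // 32768
{code}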

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.
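
As a hedged sketch of the reuse idea (hypothetical code, not the actual Flink 
codegen output; MyProto and convert are made up): each distinct identifier name 
becomes one CONSTANT_Utf8 entry in the class file's constant pool, so if every 
split method reuses the same local names, each name is stored once no matter 
how many methods use it.
{code:java}
// Before (sketch): uniquely numbered locals per field, so the identifier
// count grows with schema size:
//   Object field17_pb = message.getField17(); ...
// After (sketch): the same names are reused in every split method, adding
// no new constant pool entries per field.
private void decodeSplit0(MyProto message, GenericRowData row) {
    Object pbValue = message.getField0();
    row.setField(0, convert(pbValue));
}

private void decodeSplit1(MyProto message, GenericRowData row) {
    Object pbValue = message.getField1(); // same name, separate scope
    row.setField(1, convert(pbValue));
}
{code}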



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560
 ] 

Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 9:22 PM:
-

[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume the 
generated code uses 2 identifiers per field in the schema (one for the protobuf 
object and one for the rowdata object), there cannot be more than 65536/2 = 
32768 fields in the Protobuf schema. Of course, the actual limit is lower than 
that because we did not count split method names, class names, etc. 


was (Author: JIRAUSER298466):
[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume there 
are 2 identifiers in the generated code for each field in the schema, there 
cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, 
the actual limit is lower than that because we did not count split method 
names, class names, etc. 

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799583#comment-17799583
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

FWIW, I've run into this issue with one of our real production use-case schemas 
at my company, which has about 44,368 fields in the Protobuf schema.

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560
 ] 

Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 8:18 PM:
-

[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume there 
are 2 identifiers in the generated code for each field in the schema, there 
cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, 
the actual limit is lower than that because we did not count split method 
names, class names, etc. 


was (Author: JIRAUSER298466):
[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume there 
are 2 identifiers used for each field in the schema, there cannot be more than 
65536/2 = 32768 fields in the Protobuf schema. Of course, the actual limit is 
lower than that because we did not count split method names, class names, etc. 

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

2023-12-21 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799560#comment-17799560
 ] 

Sai Sharath Dandi commented on FLINK-33611:
---

[~libenchao], The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool includes many kinds of entries, 
but for a rough estimate, if we count only identifier names and assume there 
are 2 identifiers used for each field in the schema, there cannot be more than 
65536/2 = 32768 fields in the Protobuf schema. Of course, the actual limit is 
lower than that because we did not count split method names, class names, etc. 

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>  Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-20 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33817:
--
Description: 
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal performance benchmarks at our company, we've seen almost a 10x 
performance difference for one of our real production use cases when allowing 
ReadDefaultValues=False with proto3. The exact difference depends on the schema 
complexity and data payload, but we should allow readDefaultValues=False in 
general.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.

  was:
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal performance benchmarks at our company, we've seen almost a 10x 
performance difference when allowing ReadDefaultValues=False with proto3.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.


> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always forces ReadDefaultValues=true when using proto3. This can cause 
> severe performance degradation for large Protobuf schemas with OneOf fields, 
> as the entire generated code needs to be executed during deserialization even 
> when certain fields are absent from the data and all of their nested fields 
> could otherwise be skipped. Proto3 has supported hasXXX() methods for checking 
> field presence on non-primitive types since Protobuf version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> internal performance benchmarks at our company, we've seen almost a 10x 
> performance difference for one of our real production use cases when allowing 
> ReadDefaultValues=False with proto3. The exact difference depends on the 
> schema complexity and data payload, but we should allow 
> readDefaultValues=False in general.
>  
> *Solution*
>  
> Support ReadDefaultValues=False when using proto3. We need to be careful to 
> check for field presence only on non-primitive types if ReadDefaultValues is 
> false and the version used is proto3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33611) Support Large Protobuf Schemas

2023-12-15 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33611:
--
Summary: Support Large Protobuf Schemas  (was: Add the ability to reuse 
variable names across different split method scopes)

> Support Large Protobuf Schemas
> --
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Assignee: Sai Sharath Dandi
>Priority: Major
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-13 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33817:
--
Description: 
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal performance benchmarks at our company, we've seen almost a 10x 
performance difference when allowing ReadDefaultValues=False with proto3.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.

  was:
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal benchmarks at our company, we've seen almost a 10x performance 
difference when allowing ReadDefaultValues=False with proto3.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.


> Allow ReadDefaultValues = False for non primitive types on Proto3
> -
>
> Key: FLINK-33817
> URL: https://issues.apache.org/jira/browse/FLINK-33817
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always forces ReadDefaultValues=true when using proto3. This can cause 
> severe performance degradation for large Protobuf schemas with OneOf fields, 
> as the entire generated code needs to be executed during deserialization even 
> when certain fields are absent from the data and all of their nested fields 
> could otherwise be skipped. Proto3 has supported hasXXX() methods for checking 
> field presence on non-primitive types since Protobuf version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> internal performance benchmarks at our company, we've seen almost a 10x 
> performance difference when allowing ReadDefaultValues=False with proto3.
>  
> *Solution*
>  
> Support ReadDefaultValues=False when using proto3. We need to be careful to 
> check for field presence only on non-primitive types if ReadDefaultValues is 
> false and the version used is proto3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33817) Allow ReadDefaultValues = False for non primitive types on Proto3

2023-12-13 Thread Sai Sharath Dandi (Jira)
Sai Sharath Dandi created FLINK-33817:
-

 Summary: Allow ReadDefaultValues = False for non primitive types 
on Proto3
 Key: FLINK-33817
 URL: https://issues.apache.org/jira/browse/FLINK-33817
 Project: Flink
  Issue Type: Improvement
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.18.0
Reporter: Sai Sharath Dandi


*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always forces ReadDefaultValues=true when using proto3. This can cause severe 
performance degradation for large Protobuf schemas with OneOf fields, as the 
entire generated code needs to be executed during deserialization even when 
certain fields are absent from the data and all of their nested fields could 
otherwise be skipped. Proto3 has supported hasXXX() methods for checking field 
presence on non-primitive types since Protobuf version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
internal benchmarks at our company, we've seen almost a 10x performance 
difference when allowing ReadDefaultValues=False with proto3.

 

*Solution*

 

Support ReadDefaultValues=False when using proto3. We need to be careful to 
check for field presence only on non-primitive types if ReadDefaultValues is 
false and the version used is proto3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33611) Add the ability to reuse variable names across different split method scopes

2023-11-21 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33611:
--
Description: 
h3. Background

Flink serializes and deserializes protobuf-format data by calling the decode or 
encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to parse 
byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to 
split the generated code to improve performance for large Protobuf schemas. 
However, this is still not sufficient for some larger protobuf schemas, as the 
generated code exceeds the Java constant pool size 
[limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
see errors like "Too many constants" when trying to compile the generated code. 

*Solution*

Since the code-splitting functionality is already in place, the main proposal 
here is to reuse variable names across different split method scopes. This will 
greatly reduce the constant pool size. A further optimization is to split the 
last code segment only when its size exceeds the split threshold. Currently, 
the last segment of the generated code is always split, which can lead to too 
many split methods and thus exceed the constant pool size limit.

  was:
h3. Background

Flink serializes and deserializes protobuf-format data by calling the decode or 
encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to parse 
byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to 
split the generated code to improve performance for large Protobuf schemas. 
However, this is still not sufficient for some larger protobuf schemas, as the 
generated code exceeds the Java constant pool size 
[limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
see errors like "Too many constants" when trying to compile the generated code. 

*Solution*

Since the code-splitting functionality is already in place, the main proposal 
here is to reuse variable names across different split method scopes. This will 
greatly reduce the constant pool size. A further optimization is to split the 
last code segment only when its size exceeds the split threshold. Currently, 
the last segment of the generated code is always split, which can lead to too 
many split methods.


> Add the ability to reuse variable names across different split method scopes
> 
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33611) Add the ability to reuse variable names across different split method scopes

2023-11-21 Thread Sai Sharath Dandi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33611:
--
Description: 
h3. Background

Flink serializes and deserializes protobuf-format data by calling the decode or 
encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to parse 
byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to 
split the generated code to improve performance for large Protobuf schemas. 
However, this is still not sufficient for some larger protobuf schemas, as the 
generated code exceeds the Java constant pool size 
[limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
see errors like "Too many constants" when trying to compile the generated code. 

*Solution*

Since the code-splitting functionality is already in place, the main proposal 
here is to reuse variable names across different split method scopes. This will 
greatly reduce the constant pool size. A further optimization is to split the 
last code segment only when its size exceeds the split threshold. Currently, 
the last segment of the generated code is always split, which can lead to too 
many split methods.

  was:
h3. Background

Flink serializes and deserializes protobuf-format data by calling the decode or 
encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to parse 
byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to 
split the generated code to improve performance for large Protobuf schemas. 
However, this is still not sufficient for some larger protobuf schemas, as the 
generated code exceeds the Java constant pool size 
[limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
see errors like "Too many constants" when trying to compile the generated code. 

*Solution*

Since the code-splitting functionality is already in place, the main proposal 
here is to use different variable names across different split method scopes. 
This will greatly reduce the constant pool size. A further optimization is to 
split the last code segment only when its size exceeds the split threshold. 
Currently, the last segment of the generated code is always split, which can 
lead to too many split methods.


> Add the ability to reuse variable names across different split method scopes
> 
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.18.0
>Reporter: Sai Sharath Dandi
>Priority: Major
>
> h3. Background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large Protobuf 
> schemas. However, this is still not sufficient for some larger protobuf 
> schemas, as the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
> see errors like "Too many constants" when trying to compile the generated 
> code. 
> *Solution*
> Since the code-splitting functionality is already in place, the main proposal 
> here is to reuse variable names across different split method scopes. This 
> will greatly reduce the constant pool size. A further optimization is to split 
> the last code segment only when its size exceeds the split threshold. 
> Currently, the last segment of the generated code is always split, which can 
> lead to too many split methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33611) Add the ability to reuse variable names across different split method scopes

2023-11-21 Thread Sai Sharath Dandi (Jira)
Sai Sharath Dandi created FLINK-33611:
-

 Summary: Add the ability to reuse variable names across different 
split method scopes
 Key: FLINK-33611
 URL: https://issues.apache.org/jira/browse/FLINK-33611
 Project: Flink
  Issue Type: Improvement
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.18.0
Reporter: Sai Sharath Dandi


h3. Background

Flink serializes and deserializes protobuf-format data by calling the decode or 
encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to parse 
byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to 
split the generated code to improve performance for large Protobuf schemas. 
However, this is still not sufficient for some larger protobuf schemas, as the 
generated code exceeds the Java constant pool size 
[limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] and we 
see errors like "Too many constants" when trying to compile the generated code. 

*Solution*

Since the code-splitting functionality is already in place, the main proposal 
here is to use different variable names across different split method scopes. 
This will greatly reduce the constant pool size. A further optimization is to 
split the last code segment only when its size exceeds the split threshold. 
Currently, the last segment of the generated code is always split, which can 
lead to too many split methods.
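
A minimal sketch of the second optimization (hypothetical splitter logic with 
made-up names, not the actual Flink code splitter; methodCounter and 
splitMethods are assumed fields of the surrounding generator class):
{code:java}
// Split out earlier segments unconditionally, but keep the final segment
// inline when it is small; this avoids emitting one extra split method
// (and its constant pool entries) per generated class.
private void emitSegment(StringBuilder body, String segment,
                         boolean isLast, int splitThreshold) {
    if (isLast && segment.length() <= splitThreshold) {
        body.append(segment);                     // small tail stays inline
    } else {
        String name = "split" + methodCounter++;  // assumed counter field
        body.append(name).append("();\n");        // call site
        splitMethods.put(name, segment);          // assumed Map field,
                                                  // emitted later as a method
    }
}
{code}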



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-9161) Support STRUCT syntax to create named STRUCT in SQL

2023-01-17 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677996#comment-17677996
 ] 

Sai Sharath Dandi commented on FLINK-9161:
--

For anyone interested, I was able to get it working by adding a cast inside the 
ROW, like below:
{code:java}
SELECT
  CAST(ROW(CAST(PersonTable.person.name AS STRING),
           CAST(PersonTable.person.age AS INT))
       AS ROW<new_name STRING, new_age INT>) AS new_person
FROM (VALUES
        (CAST(ROW('Bob', 10) AS ROW<name STRING, age INT>)),
        (CAST(ROW('Alice', 20) AS ROW<name STRING, age INT>)))
     AS PersonTable(person); {code}

> Support STRUCT syntax to create named STRUCT in SQL
> ---
>
> Key: FLINK-9161
> URL: https://issues.apache.org/jira/browse/FLINK-9161
> Project: Flink
>  Issue Type: New Feature
>  Components: Table SQL / API
>Reporter: Shuyi Chen
>Priority: Major
>  Labels: auto-unassigned
>
> As discussed in [calcite dev mailing 
> list|https://mail-archives.apache.org/mod_mbox/calcite-dev/201804.mbox/%3cCAMZk55avGNmp1vXeJwA1B_a8bGyCQ9ahxmE=R=6fklpf7jt...@mail.gmail.com%3e],
>  we want to add support for named structure construction in SQL, e.g., 
> {code:java}
> SELECT STRUCT(a as first_name, b as last_name, STRUCT(c as zip code, d as
> street, e as state) as address) as record FROM example_table
> {code}
> This would require adding necessary change in Calcite first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-9161) Support STRUCT syntax to create named STRUCT in SQL

2023-01-12 Thread Sai Sharath Dandi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676305#comment-17676305
 ] 

Sai Sharath Dandi commented on FLINK-9161:
--

[~leonard] ROW does not support creating a nested type from a nested-type or 
array-type input:
{code:java}
SELECT CAST(ROW(person.name, person.age) as ROW< new_name string, new_age int > 
) as new_person FROM (VALUES (CAST(ROW('Bob', 10) as ROW< name string, age 
int>)),(CAST(ROW('Alice', 20) as ROW< name string, age int>))) AS 
PersonTable(person);
>
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.sql.parser.impl.ParseException: Encountered "." at line 1, 
column 23.
Was expecting one of:
    ")" ...
    "," ... {code}
Is there any workaround here?

> Support STRUCT syntax to create named STRUCT in SQL
> ---
>
> Key: FLINK-9161
> URL: https://issues.apache.org/jira/browse/FLINK-9161
> Project: Flink
>  Issue Type: New Feature
>  Components: Table SQL / API
>Reporter: Shuyi Chen
>Priority: Major
>  Labels: auto-unassigned
>
> As discussed in [calcite dev mailing 
> list|https://mail-archives.apache.org/mod_mbox/calcite-dev/201804.mbox/%3cCAMZk55avGNmp1vXeJwA1B_a8bGyCQ9ahxmE=R=6fklpf7jt...@mail.gmail.com%3e],
>  we want to add support for named structure construction in SQL, e.g., 
> {code:java}
> SELECT STRUCT(a as first_name, b as last_name, STRUCT(c as zip code, d as
> street, e as state) as address) as record FROM example_table
> {code}
> This would require adding necessary change in Calcite first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)