[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

Sai Sharath Dandi (Jira) Thu, 04 Jan 2024 19:04:49 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803388#comment-17803388
 ]


Sai Sharath Dandi edited comment on FLINK-33611 at 1/5/24 3:03 AM:
-------------------------------------------------------------------

[~libenchao], thanks for the suggestion to add a test case to the pull 
request. While working on an appropriate test case, I discovered something 
interesting: for large schemas, the local variable names do not appear in the 
Java constant pool. I believe Java applies some internal optimization and 
stores the variable names elsewhere when the code size becomes too large, so 
I'm not sure whether reusing variable names has any impact on supporting large 
schemas. It may reduce the work the Java compiler needs to do to rewrite 
variable names and so shorten compile times, but I haven't run any experiment 
on that. The change that reduces the number of split methods has the most 
impact on supporting large schemas: in my experiment, method names were always 
included in the constant pool, even when the code size was too large. In fact, 
the excess split methods are the main cause of the "Too many constants" 
compilation errors.

That said, I would still prefer to keep the variable name reuse, since the 
change is non-intrusive and harmless, and can only improve compilation 
performance. Please let me know your thoughts.
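
One way to check this kind of observation is to read the constant pool size 
straight from a compiled class file. The sketch below relies only on the 
class-file header layout (magic number, minor and major version, then 
constant_pool_count as an unsigned 16-bit value, which is why the count cannot 
exceed 65535). The class and method names are hypothetical and are not part of 
Flink or the pull request.

```java
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConstantPoolCount {

    /** Reads constant_pool_count from the header of a compiled class file. */
    public static int constantPoolCount(byte[] classFile) {
        ByteBuffer buf = ByteBuffer.wrap(classFile); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort(); // minor_version
        buf.getShort(); // major_version
        // constant_pool_count is a u2, so mask to get the unsigned value.
        return buf.getShort() & 0xFFFF;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ConstantPoolCount <path-to-.class-file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("constant_pool_count = " + constantPoolCount(bytes));
    }
}
```

Running it against the class produced for a generated decoder, with and 
without a given codegen change, would show whether that change moves the count 
at all.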


> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes Protobuf data by calling the decode or 
> encode method in GeneratedProtoToRow_XXX.java, which is generated by codegen 
> to parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the 
> ability to split the generated code to improve performance for large 
> Protobuf schemas. However, this is still not sufficient for some larger 
> Protobuf schemas: the generated code exceeds the Java constant pool size 
> [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool], and 
> compiling it fails with errors like "Too many constants".
> *Solution*
> Since the code-splitting functionality is already in place, the main 
> proposal is to reuse variable names across the scopes of different split 
> methods, which should greatly reduce the constant pool size. A further 
> optimization is to split the last code segment only when its size exceeds 
> the split threshold. Currently, the last segment of the generated code is 
> always split, which can lead to too many split methods and thus exceed the 
> constant pool size limit.
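
The two proposals in the description can be illustrated with a toy generator. 
This is a minimal sketch under stated assumptions, not the Flink codegen: the 
class SplitSketch, its generate method, the read(i) calls in the emitted text, 
and the use of a field count as a stand-in for the split threshold are all 
hypothetical. It only counts split methods and distinct local variable names, 
which are proxies; it does not measure a real constant pool.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitSketch {

    /** Hypothetical holder for what the toy generator produced. */
    public static final class Result {
        public final List<String> splitMethodBodies = new ArrayList<>();
        public final Set<String> distinctLocalNames = new HashSet<>();
        public String inlineTail = "";
    }

    /**
     * @param fieldCount      number of fields to emit code for
     * @param fieldsPerMethod stand-in for the split threshold
     * @param reuseNames      restart local variable numbering in each segment
     * @param alwaysSplitLast wrap the trailing segment in a method even if small
     */
    public static Result generate(
            int fieldCount, int fieldsPerMethod, boolean reuseNames, boolean alwaysSplitLast) {
        Result result = new Result();
        StringBuilder segment = new StringBuilder();
        int fieldsInSegment = 0;
        for (int i = 0; i < fieldCount; i++) {
            // With reuse, names repeat across method scopes: v0, v1, ... in each one.
            String name = "v" + (reuseNames ? fieldsInSegment : i);
            result.distinctLocalNames.add(name);
            segment.append("Object ").append(name)
                    .append(" = read(").append(i).append(");\n");
            fieldsInSegment++;
            if (fieldsInSegment == fieldsPerMethod) {
                // Segment reached the threshold: move it into its own split method.
                result.splitMethodBodies.add(segment.toString());
                segment.setLength(0);
                fieldsInSegment = 0;
            }
        }
        if (alwaysSplitLast && segment.length() > 0) {
            result.splitMethodBodies.add(segment.toString());
        } else {
            // Trailing segment is below the threshold: keep it inline.
            result.inlineTail = segment.toString();
        }
        return result;
    }

    public static void main(String[] args) {
        Result before = generate(10, 4, false, true);
        Result after = generate(10, 4, true, false);
        System.out.println("split methods: " + before.splitMethodBodies.size()
                + " -> " + after.splitMethodBodies.size());
        System.out.println("distinct local names: " + before.distinctLocalNames.size()
                + " -> " + after.distinctLocalNames.size());
    }
}
```

In this toy setup, each split method avoided is one fewer method name that 
must be stored in the class file, which is the effect the second optimization 
targets.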



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
