[ https://issues.apache.org/jira/browse/FLINK-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164310#comment-17164310 ]
Benchao Li commented on FLINK-18202: ------------------------------------

Hi all, sorry about the late reply. I would like to resume the discussion, and it would be great if we could make it into 1.12. There are a few key points we should discuss before we proceed to the next step.

# 1. Schema derivation
Currently all native formats in Flink derive the table schema from the DDL. However, for ProtoBuf there is another choice: we can derive the table schema from the proto definition.
## pros
The table schema and the proto definition are in one-to-one correspondence, which would save users a lot of work writing the DDL. Sometimes the proto definition may be really complex, e.g. hundreds of columns (we have also seen proto definitions with more than a thousand fields).
## cons
There is no native way to do this; we may need to change the planner a little to achieve it.

# 2. How do users configure the format? (proto string vs compiled proto class)
## proto string
Just like Avro before, users could provide a schema string, and the format would derive the table schema from it and also generate the runtime converter from it. However, for ProtoBuf I did not find an easy way to do this. As @liufangliang said, ProtoBuf only supports dynamic compilation for C++. Also, there are cases where this leads to an inconvenient user experience, namely when there is more than one proto file, which I assume would be a common case.
## compiled proto class
This is the easiest way. Users compile the proto file, provide the generated class on the job classpath, and tell us the fully qualified class name. We can derive the runtime converter from the compiled class (and also the table schema, if we decide to derive the table schema from the format in #1).

# 3. Which ProtoBuf API would we use? (DynamicMessage vs native methods)
## DynamicMessage
DynamicMessage already provides a way to (de)serialize dynamically, without knowing the proto definition before runtime. Its performance is a little worse than calling native methods; however, that is acceptable in my opinion. (De)serialization is only one step in the whole pipeline, and in most cases it only accounts for a small percentage of the time.
## native methods
This is a little hard to do for now. We would need to introduce a code generation capability to formats, so that we can generate the native method calls in the format.

# 4. Which version of ProtoBuf do we want to support? (Or: what is the compatibility story?)
I think we can just use the latest version of proto3. The reasons are:
## proto2 vs proto3
These are just protocol versions, not proto compiler/runtime versions. The proto3 toolchain can support proto2 and proto3 at the same time.
## compatibility
I think we can rely on ProtoBuf's own compatibility guarantees. To make it clear, there are three versions we need to take care of:
1. the data version (data written by other programs with some proto definition)
2. the ProtoBuf compiler version
3. the protobuf-java dependency version

We cannot control #1 in most cases. For #2, if we decide to let users provide the compiled class, then the version is decided by the users. For #3, I think we can just use the latest version. So we only need to care about the versions in #2/#3; we can document clearly that users should use a specific ProtoBuf compiler version, and the compatibility is then handled by ProtoBuf itself.

There are some other details we may need to consider; however, I think we need to discuss the points above and reach consensus before proceeding to the next step. What do you think? Feel free to join the discussion and the next steps.
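To make the "compiled proto class" option concrete, here is a minimal sketch of what the DDL could look like. This is purely illustrative: the option names (`protobuf.message-class-name`) and the message `com.example.SimpleUser` are hypothetical, since the format does not exist yet.

```sql
-- Hypothetical DDL sketch. Assumes a class com.example.SimpleUser was
-- generated by protoc from:
--   message SimpleUser {
--     int64 uid = 1;
--     string name = 2;
--   }
-- and placed on the job classpath. Option names are illustrative only.
CREATE TABLE pb_source (
  uid BIGINT,
  `name` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',
  'format' = 'protobuf',
  'protobuf.message-class-name' = 'com.example.SimpleUser'
);
```

Note the one-to-one mapping between the proto fields and the DDL columns; with schema derivation from #1, the column list could be omitted entirely.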
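For #3, the DynamicMessage path can be sketched as follows. This is not the format implementation, just a self-contained demonstration of the ProtoBuf Java API it would rely on: a descriptor is built at runtime (a real format would get it from the compiled class or a descriptor file), and messages are (de)serialized without any generated code. The message name `User` and field layout are made up for the example.

```java
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;

public class DynamicMessageSketch {

    // Build a descriptor for "message User { string name = 1; }" at runtime,
    // mimicking what a format could do without any compiled classes.
    static Descriptors.Descriptor userDescriptor() throws Exception {
        DescriptorProtos.FileDescriptorProto file =
            DescriptorProtos.FileDescriptorProto.newBuilder()
                .setName("user.proto")
                .setSyntax("proto3")
                .addMessageType(
                    DescriptorProtos.DescriptorProto.newBuilder()
                        .setName("User")
                        .addField(
                            DescriptorProtos.FieldDescriptorProto.newBuilder()
                                .setName("name")
                                .setNumber(1)
                                .setLabel(DescriptorProtos.FieldDescriptorProto.Label.LABEL_OPTIONAL)
                                .setType(DescriptorProtos.FieldDescriptorProto.Type.TYPE_STRING)))
                .build();
        Descriptors.FileDescriptor fd =
            Descriptors.FileDescriptor.buildFrom(file, new Descriptors.FileDescriptor[0]);
        return fd.findMessageTypeByName("User");
    }

    // Serialize and then parse back a value, entirely via DynamicMessage.
    static String roundTrip(String name) throws Exception {
        Descriptors.Descriptor descriptor = userDescriptor();
        Descriptors.FieldDescriptor nameField = descriptor.findFieldByName("name");

        byte[] bytes = DynamicMessage.newBuilder(descriptor)
            .setField(nameField, name)
            .build()
            .toByteArray();

        DynamicMessage parsed = DynamicMessage.parseFrom(descriptor, bytes);
        return (String) parsed.getField(nameField);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("flink"));
    }
}
```

The per-record cost here is the reflective `getField`/`setField` calls, which is exactly the overhead that a code-generated "native method" path would avoid.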
> Introduce Protobuf format
> -------------------------
>
> Key: FLINK-18202
> URL: https://issues.apache.org/jira/browse/FLINK-18202
> Project: Flink
> Issue Type: New Feature
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table SQL / API
> Reporter: Benchao Li
> Priority: Major
> Attachments: image-2020-06-15-17-18-03-182.png
>
> PB[1] is a very famous and widely used (de)serialization framework. The ML[2] also has some discussions about this. It's a useful feature.
> This issue may need some design work, or a FLIP.
> [1] [https://developers.google.com/protocol-buffers]
> [2] [http://apache-flink.147419.n8.nabble.com/Flink-SQL-UDF-td3725.html]

-- This message was sent by Atlassian Jira (v8.3.4#803005)