[ https://issues.apache.org/jira/browse/FLINK-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164310#comment-17164310 ]
Benchao Li commented on FLINK-18202: ------------------------------------

Hi all, sorry about the late reply. I would like to resume the discussion, and it would be great if we could make it into 1.12. There are a few key points we should discuss before we proceed to the next step.

# 1. Schema derivation
Currently all native formats in Flink derive the table schema from the DDL. However, for ProtoBuf there is another choice: we can derive the table schema from the proto definition.
## pros
The table schema and the proto definition are in one-to-one correspondence, which would save users a lot of work writing the DDL. Sometimes the proto definition may be really complex, e.g. hundreds of columns (we have also seen proto definitions with more than a thousand fields).
## cons
There is no native way to do this; we may need to change the planner a little to achieve it.

# 2. How do users configure the format? (proto string vs compiled proto class)
## proto string
Just like Avro before, users could provide a schema string, and the format would derive the table schema from it and also generate the runtime converter from it. However, for ProtoBuf I did not find an easy way to do this. As @liufangliang said, ProtoBuf only supports dynamic compilation for C++. Also, there are cases where this leads to an inconvenient user experience, namely when there is more than one proto file, which I assume would be a common case.
## compiled proto class
This is the easiest way. Users compile the proto file, provide the generated class on the job classpath, and tell us the fully qualified class name. We can derive the runtime converter from the compiled class (and also the table schema, if we decide to derive the table schema from the format in #1).

# 3. Which ProtoBuf API would we use? (DynamicMessage vs native methods)
## DynamicMessage
DynamicMessage already provides a way to (de)serialize dynamically, without knowing the proto definition before runtime. Its performance is a little worse than calling native methods; however, that is acceptable in my opinion. (De)serialization is only one step in the whole pipeline, and in most cases it only accounts for a small percentage of the time.
## native methods
This is a little hard to do for now. We would need to introduce a code generation capability to formats, so that we can generate the native method calls in the format.

# 4. Which version of ProtoBuf do we want to support? (Or: what is the compatibility story?)
I think we can just use the latest version of proto3. The reasons are:
## proto2 vs proto3
These are just protocol versions, not proto compiler/runtime versions. The proto3 toolchain can support proto2 and proto3 at the same time.
## compatibility
I think we can rely on ProtoBuf's own compatibility guarantees. To make it clear, there are three versions we need to take care of:
1. the data version (data written by other programs with some proto definition)
2. the ProtoBuf compiler version
3. the protobuf-java dependency version

We cannot control #1 in most cases. For #2, if we decide to let users provide the compiled class, then the version is decided by the users. For #3, I think we can just use the latest version. So we only need to care about the versions in #2/#3; we can document clearly that users should use a specific ProtoBuf compiler version, and the compatibility is then handled by ProtoBuf itself.

There are some other details we may need to consider; however, I think we need to discuss the points above and reach consensus before proceeding to the next step. What do you think? Feel free to join the discussion and the next steps.
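To make the "compiled proto class" option concrete, here is a minimal sketch of what the DDL could look like. This is purely illustrative: the option names (`protobuf.message-class-name`) and the message `com.example.SimpleUser` are hypothetical, since the format does not exist yet.

```sql
-- Hypothetical DDL sketch. Assumes a class com.example.SimpleUser was
-- generated by protoc from:
--   message SimpleUser {
--     int64 uid = 1;
--     string name = 2;
--   }
-- and placed on the job classpath. Option names are illustrative only.
CREATE TABLE pb_source (
  uid BIGINT,
  `name` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',
  'format' = 'protobuf',
  'protobuf.message-class-name' = 'com.example.SimpleUser'
);
```

Note the one-to-one mapping between the proto fields and the DDL columns; with schema derivation from #1, the column list could be omitted entirely.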
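For #3, the DynamicMessage path can be sketched as follows. This is not the format implementation, just a self-contained demonstration of the ProtoBuf Java API it would rely on: a descriptor is built at runtime (a real format would get it from the compiled class or a descriptor file), and messages are (de)serialized without any generated code. The message name `User` and field layout are made up for the example.

```java
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;

public class DynamicMessageSketch {

    // Build a descriptor for "message User { string name = 1; }" at runtime,
    // mimicking what a format could do without any compiled classes.
    static Descriptors.Descriptor userDescriptor() throws Exception {
        DescriptorProtos.FileDescriptorProto file =
            DescriptorProtos.FileDescriptorProto.newBuilder()
                .setName("user.proto")
                .setSyntax("proto3")
                .addMessageType(
                    DescriptorProtos.DescriptorProto.newBuilder()
                        .setName("User")
                        .addField(
                            DescriptorProtos.FieldDescriptorProto.newBuilder()
                                .setName("name")
                                .setNumber(1)
                                .setLabel(DescriptorProtos.FieldDescriptorProto.Label.LABEL_OPTIONAL)
                                .setType(DescriptorProtos.FieldDescriptorProto.Type.TYPE_STRING)))
                .build();
        Descriptors.FileDescriptor fd =
            Descriptors.FileDescriptor.buildFrom(file, new Descriptors.FileDescriptor[0]);
        return fd.findMessageTypeByName("User");
    }

    // Serialize and then parse back a value, entirely via DynamicMessage.
    static String roundTrip(String name) throws Exception {
        Descriptors.Descriptor descriptor = userDescriptor();
        Descriptors.FieldDescriptor nameField = descriptor.findFieldByName("name");

        byte[] bytes = DynamicMessage.newBuilder(descriptor)
            .setField(nameField, name)
            .build()
            .toByteArray();

        DynamicMessage parsed = DynamicMessage.parseFrom(descriptor, bytes);
        return (String) parsed.getField(nameField);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("flink"));
    }
}
```

The per-record cost here is the reflective `getField`/`setField` calls, which is exactly the overhead that a code-generated "native method" path would avoid.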
> Introduce Protobuf format
> -------------------------
>
> Key: FLINK-18202
> URL: https://issues.apache.org/jira/browse/FLINK-18202
> Project: Flink
> Issue Type: New Feature
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table SQL / API
> Reporter: Benchao Li
> Priority: Major
> Attachments: image-2020-06-15-17-18-03-182.png
>
> PB[1] is a very famous and widely used (de)serialization framework. The ML[2] also has some discussions about this. It's a useful feature.
> This issue may need some design work, or a FLIP.
> [1] [https://developers.google.com/protocol-buffers]
> [2] [http://apache-flink.147419.n8.nabble.com/Flink-SQL-UDF-td3725.html]

-- This message was sent by Atlassian Jira (v8.3.4#803005)