[jira] [Commented] (FLINK-18202) Introduce Protobuf format

Suhan Mao (Jira) Wed, 09 Dec 2020 02:19:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246418#comment-17246418
 ]


Suhan Mao commented on FLINK-18202:
-----------------------------------

I will give an introduction to the three approaches:
1. DynamicMessage and Descriptor API
2. Codegen to use classes and builders compiled by protoc
3. CodedInputStream/CodedOutputStream
 
h2. Performance
I also added the JSON format to give a better point of comparison for speed.
The proto data contains 200 fields.
|Implementation|Deserialize Speed|Serialize Speed|
|json|105s|106s|
|#1 DynamicMessage and Descriptor API|166s|69s|
|#2 Codegen to use classes and builders compiled by protoc|40s|According to Benchao's experiment, slightly faster than #1, maybe 60s|
|#3 CodedInputStream/CodedOutputStream|Not implemented, maybe 20-30s|23s|
 
As you can see, in deserialization #2 is about 4 times faster than #1, so I 
think the efficiency matters.
Many companies probably use JSON as their data format. If #1 is slower than 
JSON, users may see a performance degradation, which is sometimes not 
acceptable. If we can improve performance substantially compared to JSON, 
users will prefer the pb format and benefit in both computation and 
storage/traffic. In many cases, the pb format takes only 40% of the space of 
JSON in compressed format.
 
Deserialization for #3 is not implemented because the code is much harder to 
write and involves a lot of low-level pb APIs, and I'm very concerned about 
compatibility. Also, the performance gain is not as big as for serialization.
 
h2. Implementation & Compatibility
#1
It is very easy and clear to implement. We can use the official protobuf 
DynamicMessage API, so it will be stable across different pb library versions 
and pb syntaxes.
#2
We need to use codegen to support different types, including map/array/nested 
message. It will be a little more difficult to implement than #1, but that is 
not a big problem because Flink uses code generation a lot.
In the generated code, we also use the official compiled Java pb API to get 
the data, so it will also be stable across different pb library versions and 
pb syntaxes.
#3
This is the most efficient way to encode/decode streams, and it eliminates all 
intermediate object creation and field copying.
But the biggest problem is compatibility, because we will use a very low-level 
API and "replay" the logic of what the pb framework does.
If the pb framework changes some internal logic, it is hard to stay aligned 
with the official framework logic.
Also, correctness is not guaranteed across different pb library versions, 
because the semantics of the low-level API may change over time.
This is the biggest problem, and I don't see a good solution for it.
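
To make the "replay" concern in #3 concrete, here is a minimal sketch of what hand-written encoding looks like at the wire level. It is not the proposed implementation and does not use CodedInputStream/CodedOutputStream; the class and method names are hypothetical, and only a single int32 field with varint encoding is covered.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not taken from the Flink or protobuf code base.
final class VarintSketch {

    // Writes a base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes one int32 field: tag = (fieldNumber << 3) | wireType, with wire type 0 (varint).
    static byte[] encodeInt32Field(int fieldNumber, int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (long) fieldNumber << 3);
        // The int is sign-extended to long, so negative values take 10 bytes on the wire.
        writeVarint(out, value);
        return out.toByteArray();
    }

    // Reads a varint starting at pos[0] and advances pos[0] past it.
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("Malformed varint");
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInt32Field(1, 150);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints "08 96 01"
    }
}
```

Every type (maps, packed arrays, nested messages) would need similar hand-written logic, which is where the risk of drifting from the official framework comes from.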
 
h2. Some Other Questions
*1. java_multiple_files & java_outer_classname*
Different combinations of these options in the proto file result in different 
class layouts, which causes trouble for message class reflection, so we need 
to handle it carefully. I will read the proto definition and load the correct 
class according to the values of java_multiple_files & java_outer_classname.
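
A rough sketch of the class-name resolution this implies, assuming only top-level messages and an outer class name that has already been determined (from java_outer_classname or from the file-name default). The helper name and the example package are hypothetical.

```java
// Hypothetical helper; names are illustrative and not taken from the Flink code base.
final class PbClassNames {

    // Builds the binary class name (usable with Class.forName) of a top-level message.
    static String resolve(String javaPackage, String outerClassName,
                          boolean javaMultipleFiles, String messageName) {
        String prefix = javaPackage.isEmpty() ? "" : javaPackage + ".";
        if (javaMultipleFiles) {
            // Each top-level message is generated as its own top-level class.
            return prefix + messageName;
        }
        // Otherwise the message is a static nested class of the outer class.
        return prefix + outerClassName + "$" + messageName;
    }

    public static void main(String[] args) {
        System.out.println(resolve("com.example", "OrderProtos", false, "Order"));
        System.out.println(resolve("com.example", "OrderProtos", true, "Order"));
    }
}
```

Nested messages and the default outer class name derived from the .proto file name are left out here; a real implementation would have to cover both.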
 
*2. default values*
As you know, if the syntax is proto2, the generated pb class has bit flags 
that indicate whether a field is set. We can use the pbObject.hasXXX() method 
to check whether the field is set. This way, we can handle null information in 
Flink properly.
But if the syntax is proto3, the generated pb class has no pbObject.hasXXX() 
method and holds no bit flags, so there is no way to tell whether a field is 
set if it equals the default value. For example, if pbObject.getDim1() returns 
0, there is no way to tell whether dim1 was set to 0 or not set at all.
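
The difference can be modeled without protobuf at all. The two classes below are hand-written stand-ins for generated proto2 and proto3 classes (they are assumptions for illustration, not real generated code), and toFlinkValue is a hypothetical conversion function.

```java
// Hypothetical stand-ins for generated classes; real generated code looks different.
final class PresenceSketch {

    // proto2-like: a bit flag records whether dim1 was explicitly set.
    static final class Proto2Like {
        private int bitField0;
        private int dim1;

        void setDim1(int value) {
            dim1 = value;
            bitField0 |= 1;
        }

        boolean hasDim1() {
            return (bitField0 & 1) != 0;
        }

        int getDim1() {
            return dim1;
        }
    }

    // proto3-like: no presence information, only the value.
    static final class Proto3Like {
        private int dim1;

        void setDim1(int value) {
            dim1 = value;
        }

        int getDim1() {
            return dim1;
        }
    }

    // With presence information, an unset field can become null in the Flink row.
    static Integer toFlinkValue(Proto2Like message) {
        return message.hasDim1() ? Integer.valueOf(message.getDim1()) : null;
    }

    // Without it, "unset" and "set to 0" both come out as 0.
    static Integer toFlinkValue(Proto3Like message) {
        return message.getDim1();
    }
}
```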
 
Also, pb does not permit null in the keys/values of a map or in an array, so 
we need to generate default values for them.
Below is the conversion table from row value to pb value for null values.
|row value|pb value|
|map<string,string>(<"a", null>)|map<string,string>(("a", ""))|
|map<string,string>(<null, "a">)|map<string,string>(("", "a"))|
|map<int, int>(null, 1)|map<int, int>(0, 1)|
|map<int, int>(1, null)|map<int, int>(1, 0)|
|map<long, long>(null, 1)|map<long, long>(0, 1)|
|map<long, long>(1, null)|map<long, long>(1, 0)|
|map<bool, bool>(null, true)|map<bool, bool>(false, true)|
|map<bool, bool>(true, null)|map<bool, bool>(true, false)|
|map<string, float>("key", null)|map<string, float>("key", 0)|
|map<string, double>("key", null)|map<string, double>("key", 0)|
|map<string, enum>("key", null)|map<string, enum>("key", first_enum_element)|
|map<string, binary>("key", null)|map<string, binary>("key", ByteString.EMPTY)|
|map<string, MESSAGE>("key", null)|map<string, MESSAGE>("key", MESSAGE.getDefaultInstance())|
|array<string>(null)|array("")|
|array<int>(null)|array(0)|
|array<long>(null)|array(0)|
|array<bool>(null)|array(false)|
|array<float>(null)|array(0)|
|array<double>(null)|array(0)|
|array<enum>(null)|array(first_enum_element)|
|array<binary>(null)|array(ByteString.EMPTY)|
|array<message>(null)|array(MESSAGE.getDefaultInstance())|
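
As a sketch of how the table could be applied to array elements, assuming a hypothetical PbType enum and helper names. The enum, binary and message rows are omitted because they need protobuf classes (first enum element, ByteString.EMPTY, MESSAGE.getDefaultInstance()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; only the scalar rows of the table are covered.
final class PbNullDefaults {

    enum PbType { STRING, INT, LONG, BOOL, FLOAT, DOUBLE }

    // Returns the value that replaces null for the given element type.
    static Object defaultFor(PbType type) {
        switch (type) {
            case STRING: return "";
            case INT: return 0;
            case LONG: return 0L;
            case BOOL: return false;
            case FLOAT: return 0f;
            case DOUBLE: return 0d;
            default: throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Copies the elements, replacing each null with the type's default value.
    static List<Object> fillNulls(List<?> elements, PbType type) {
        List<Object> result = new ArrayList<>(elements.size());
        for (Object element : elements) {
            result.add(element == null ? defaultFor(type) : element);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fillNulls(Arrays.asList("a", null), PbType.STRING));
        System.out.println(fillNulls(Arrays.asList(1, null, 3), PbType.INT));
    }
}
```

The same defaultFor lookup would apply to map keys and values.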
 
*3. OneOf field*
During serialization, there is no guarantee that the Flink row fields of a 
one-of group contain at most one non-null value.
So in serialization we set each field in the order of the Flink schema, and a 
field at a higher position will override a field at a lower position in the 
same one-of group.
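
The override behavior can be sketched as follows. This models only the outcome (setting a one-of member clears whichever member was set before), using a plain map in schema order rather than a real pb builder; the names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one-of serialization order; not the actual generated code.
final class OneofSketch {

    // Returns the name of the one-of member that ends up set, or null if all are null.
    // The map must iterate in Flink schema order (for example a LinkedHashMap).
    static String winningField(Map<String, Object> oneofFieldsInSchemaOrder) {
        String winner = null;
        for (Map.Entry<String, Object> entry : oneofFieldsInSchemaOrder.entrySet()) {
            if (entry.getValue() != null) {
                // A later non-null member replaces the earlier one.
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", 1);
        row.put("b", null);
        row.put("c", "x");
        System.out.println(winningField(row)); // prints "c"
    }
}
```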
 
 
h2. Conclusion
I prefer to use #2 for the implementation. I only need to implement the 
serialization part because
deserialization code is already done.
 
 
[~libenchao]
Could you share your opinion on this?


 

> Introduce Protobuf format
> -------------------------
>
>                 Key: FLINK-18202
>                 URL: https://issues.apache.org/jira/browse/FLINK-18202
>             Project: Flink
>          Issue Type: New Feature
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table 
> SQL / API
>            Reporter: Benchao Li
>            Priority: Major
>         Attachments: image-2020-06-15-17-18-03-182.png
>
>
> PB[1] is a very famous and widely used (de)serialization framework. The ML[2] 
> also has some discussions about this. It's a useful feature.
> This issue maybe needs some designs, or a FLIP.
> [1] [https://developers.google.com/protocol-buffers]
> [2] [http://apache-flink.147419.n8.nabble.com/Flink-SQL-UDF-td3725.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
