[ https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440000#comment-16440000 ]
Jonathan Eagles commented on TEZ-3914: -------------------------------------- Essentially, the underlying serializer/deserializer for protobuf messages is the CodedOutputStream/CodedInputStream. When messages are written either with the writeDelimitedTo (read parallel is parseDelimitedFrom) or writeMessageNoTag (parallel is readMessage), first the serialized size is written and then the serialized message is written. In addition to this, Tez prepends the message type to handle message dispatching. To protect itself from remote messages, CodedInputStream has a default serialized size limit of 64 MB. When using the parseDelimitedFrom java.io.InputStream method, new underlying deserializer CodedInputStream is created using the default message size limit. This prevents larger than 64MB Tez recovery messages from being properly parsed and prevents further messages after the large message from being parsed as well since any Exception is treated the same as EOF. There are a few possible solutions to this. 1) Ensure all recovery messages are bounded in size less than 64 MB 2) Ensure messages larger than 64 MB can be successfully read 3) Ensure large messages are non-fatal and skip them. This jira takes aproach 2 since recovery data is not lost. In addition, this jira replaces the higher level writeDelimitedTo/parseDelimitedFrom with the writeMessageNoTag/readMessage. This allows control over the maximum message size as required. In addition, recover writes and reads are now more efficient as serializing/deserializing streams (CodedInputStream/CodedOutputStream) are now reused. > Recovering a large DAG fails to size limit exceeded > --------------------------------------------------- > > Key: TEZ-3914 > URL: https://issues.apache.org/jira/browse/TEZ-3914 > Project: Apache Tez > Issue Type: Bug > Reporter: Jonathan Eagles > Assignee: Jonathan Eagles > Priority: Major > Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch, > TEZ-3914.003.patch > > > A large message will be failed to parse and will be treated as recovery file > EOF. > {noformat} > 2018-04-16 15:33:59,807 WARN [Thread-2] app.RecoveryParser > (RecoveryParser.java:parseRecoveryData(771)) - Corrupt data found when trying > to read next event > com.google.protobuf.InvalidProtocolBufferException: Protocol message was too > large. May be malicious. Use CodedInputStream.setSizeLimit() to increase > the size limit. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)