[
https://issues.apache.org/jira/browse/FLINK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051461#comment-18051461
]
John Watson commented on FLINK-38904:
-------------------------------------
Note: Forcing TLS 1.2 fixes this issue.
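
The comment does not say how TLS 1.2 was forced. One common JVM-wide option is the standard JSSE system property -Djdk.tls.client.protocols=TLSv1.2. Another is to restrict the protocol on the socket itself, as in the minimal sketch below. The class and method names (Tls12Only, newTls12Socket) are hypothetical, and wiring such a socket into the binlog client or Flink CDC is not shown here.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import java.util.Arrays;

public class Tls12Only {

    /** Returns an unconnected SSLSocket that will only negotiate TLS 1.2. */
    static SSLSocket newTls12Socket() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLSv1.2");
        ctx.init(null, null, null); // default key managers, trust managers, RNG
        SSLSocket socket = (SSLSocket) ctx.getSocketFactory().createSocket();
        // Explicitly drop TLS 1.3 so the TLS 1.3 KeyUpdate path is never used.
        socket.setEnabledProtocols(new String[] {"TLSv1.2"});
        return socket;
    }

    public static void main(String[] args) throws Exception {
        try (SSLSocket socket = newTls12Socket()) {
            System.out.println(Arrays.toString(socket.getEnabledProtocols()));
        }
    }
}
```

TLS 1.2 has no KeyUpdate message, which is presumably why the hang no longer appears with this setting.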
> MySQL CDC binlog reader hangs due to TLS 1.3 KeyUpdate deadlock (potentially
> JDK-8241239)
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-38904
> URL: https://issues.apache.org/jira/browse/FLINK-38904
> Project: Flink
> Issue Type: Bug
> Components: Flink CDC
> Environment: * JDK 11.0.18 (TLS 1.3 enabled by default)
> * MySQL on AWS RDS with SSL required
> * ~137GB data processed
> Reporter: John Watson
> Priority: Major
>
> The MySQL CDC binlog reader deadlocks after processing ~137GB of data when
> using TLS 1.3. This appears to be caused by JDK bug
> [JDK-8241239|https://bugs.openjdk.org/browse/JDK-8241239] where TLS 1.3's
> KeyUpdate mechanism triggers a deadlock in SSLSocketImpl.
>
> TLS 1.3 sends KeyUpdate messages after ~137GB of data transfer (the JDK's
> default AES-GCM key usage limit of 2^37 bytes, configured via
> {{jdk.tls.keyLimits}}). The deadlock occurs as follows:
> * Reader thread receives KeyUpdate, must respond by writing new keys
> * Reader thread holds SSL lock, blocks in native {{socketWrite0()}}
> * Keepalive thread detects timeout, attempts to close connection
> * {{SSLSocketImpl.closeNotify()}} requires the same SSL lock
> * Deadlock: Reader holds lock waiting on network I/O; Keepalive waiting for
> lock
> Thread Dump:
> {code:java}
> Thread: blc-...:3306 (id=113)
> State: RUNNABLE (blocked in native socketWrite0)
> Holds: ReentrantLock@753cff5d
> Stack:
> java.net.SocketOutputStream.socketWrite0(Native Method)
> sun.security.ssl.SSLSocketOutputRecord.flush()
> sun.security.ssl.OutputRecord.changeWriteCiphers()
> sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce()
> sun.security.ssl.SSLSocketImpl.tryKeyUpdate()
> sun.security.ssl.SSLSocketImpl.decode()
> sun.security.ssl.SSLSocketImpl.readApplicationRecord()
> ...
> Thread: blc-keepalive-...:3306 (id=115)
> State: WAITING
> Waiting on: ReentrantLock@753cff5d
> Lock owner: Thread 113
> Stack:
> java.util.concurrent.locks.ReentrantLock.lock()
> sun.security.ssl.SSLSocketImpl.closeNotify()
> sun.security.ssl.TransportContext.closeNotify()
> sun.security.ssl.SSLSocketImpl.shutdownOutput()
> com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close()
> com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel()
> com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect()
> ... {code}
>
> +Steps to Reproduce+
> * Configure MySQL CDC with SSL enabled ({{requireSSL=true}}) against AWS Aurora
> * Use JDK 11 (TLS 1.3 enabled by default)
> * Process high-volume CDC workload (>137GB)
> * Observe binlog reader thread deadlock
> +Expected Behavior+
> Binlog reader continues processing indefinitely without deadlocking.
> +Actual Behavior+
> Binlog reader deadlocks after ~137GB data transfer when TLS 1.3 KeyUpdate is
> triggered. The reader thread holds the SSL lock while blocked in
> {{socketWrite0()}}, and the keepalive thread blocks forever waiting for the
> same lock to send {{close_notify}}.
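
The lock interaction described in the report can be illustrated without any TLS code. The sketch below is a simplified model, not the JDK's implementation: a ReentrantLock stands in for the SSL lock, a CountDownLatch that is never released stands in for the stalled socketWrite0() call, and tryLock with a timeout replaces the real lock() call so that the demo terminates. All names (LockHeldDuringWrite, keepaliveCanClose) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockHeldDuringWrite {

    /** Returns true if the "keepalive" side obtained the lock within timeoutMs. */
    static boolean keepaliveCanClose(long timeoutMs) throws InterruptedException {
        ReentrantLock sslLock = new ReentrantLock();          // stand-in for the SSL lock
        CountDownLatch lockTaken = new CountDownLatch(1);
        CountDownLatch writeCompletes = new CountDownLatch(1); // stand-in for the stalled write

        Thread reader = new Thread(() -> {
            sslLock.lock();
            try {
                lockTaken.countDown();
                // Models the reader blocked in a write while holding the lock.
                // Unlike a native socket write, await() is interruptible.
                writeCompletes.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sslLock.unlock();
            }
        }, "reader");
        reader.setDaemon(true);
        reader.start();
        lockTaken.await();

        // Keepalive side: the report shows a plain lock() here, which never returns.
        boolean acquired = sslLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        if (acquired) {
            sslLock.unlock();
        }

        writeCompletes.countDown(); // let the reader finish so the demo exits cleanly
        reader.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("keepalive acquired lock: " + keepaliveCanClose(200));
    }
}
```

In this model the keepalive side never gets the lock while the write is stalled, which matches the WAITING state of thread 115 in the dump.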