Yanquan Lv created FLINK-38644:
----------------------------------
Summary: Reading tables with String type as the primary key may
cause OutOfMemory Error
Key: FLINK-38644
URL: https://issues.apache.org/jira/browse/FLINK-38644
Project: Flink
Issue Type: Bug
Reporter: Yanquan Lv
When a table uses a {*}String type as the primary key{*}, {{MySqlChunkSplitter}}
employs an {*}uneven chunk-splitting algorithm{*}. Specifically, it queries the
{{min}} and {{max}} values of the key range, calculates {{chunkEnd}} based
on {{chunkStart}} and {{chunkSize}}, and compares {{chunkEnd}} with {{max}}
to determine whether to continue with the next chunk split.
However, the queries for {{min}}, {{max}}, and {{chunkEnd}} are evaluated
under *MySQL's sorting rules* (the column's collation). In contrast, the
comparison between {{chunkEnd}} and {{max}} that decides the chunk boundary
relies on {*}Java's string sorting rules{*}. MySQL's default collations are
*case-insensitive* in string comparisons, while {*}Java's string sorting is
case-sensitive{*}. This discrepancy may produce {*}incorrect chunk
boundaries{*}, which can ultimately lead to an {*}{{OutOfMemoryError}}{*}.
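The ordering mismatch can be reproduced in plain Java. The following is a minimal sketch, not code from {{MySqlChunkSplitter}}: the class and method names are hypothetical, and {{String.CASE_INSENSITIVE_ORDER}} is used only as a rough stand-in for a case-insensitive MySQL collation (real collations also handle accents, padding, and locale rules).

```java
class CollationMismatchDemo {

    // Java's natural String order compares UTF-16 code units, so case matters:
    // every uppercase ASCII letter sorts before every lowercase ASCII letter.
    static boolean javaOrderAtLeast(String a, String b) {
        return a.compareTo(b) >= 0;
    }

    // Assumption: approximates how a case-insensitive collation would order
    // simple ASCII keys. It is not an exact model of any MySQL collation.
    static boolean caseInsensitiveAtLeast(String a, String b) {
        return String.CASE_INSENSITIVE_ORDER.compare(a, b) >= 0;
    }

    public static void main(String[] args) {
        // 'd' is 100 and 'F' is 70 in ASCII, so Java ranks "d1" after "F2".
        System.out.println(javaOrderAtLeast("d1", "F2"));       // true
        // Ignoring case, 'd' sorts before 'f', so "d1" comes before "F2".
        System.out.println(caseInsensitiveAtLeast("d1", "F2")); // false
    }
}
```

The two methods disagree on the same pair of keys, which is the situation the splitter runs into when a value produced under the database ordering is checked with {{String.compareTo}}.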
For example, in MySQL, consider a set of primary key data sorted by the
database's collation rules as:
{{"a1,A2,b1,B2,c1,C2,d1,D2,e1,E2,f1,F2"}}.
Assume the {{chunkSize}} is 4. The computed {{min}} and {{max}} values would be
{{a1}} and {{F2}}.
* {*}First Chunk{*}: The calculated {{chunkEnd}} is {{B2}}.
* {*}Second Chunk{*}: The calculated {{chunkEnd}} is {{d1}}.
However, Java's lexicographical string comparison is case-sensitive, so
{{d1}} is considered *greater than* {{F2}} (since {{'d' > 'F'}} in ASCII:
100 vs. 70). As a result:
* The second chunk's {{chunkEnd}} becomes {{null}}.
* The final chunks are: {{[null, B2]}} and {{[B2, null]}}.
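Because the second chunk {{[B2, null]}} has no upper bound, it covers all
remaining rows instead of roughly {{chunkSize}} rows. A {*}TaskManager{*}
that reads this oversized chunk may have to hold far more data than intended,
potentially causing an {*}{{OutOfMemoryError}}{*}.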
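The example above can be replayed with a small simulation. This is a hedged sketch under stated assumptions, not the actual {{MySqlChunkSplitter}} logic: {{nextChunkEnd}}, {{boundedEnd}}, and {{split}} are hypothetical helpers, an in-memory sorted list stands in for the database queries, and {{String.CASE_INSENSITIVE_ORDER}} approximates the MySQL collation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChunkSplitSimulation {

    // Keys from the example, already sorted in the case-insensitive "database" order.
    static final List<String> KEYS = Arrays.asList(
            "a1", "A2", "b1", "B2", "c1", "C2", "d1", "D2", "e1", "E2", "f1", "F2");

    // Hypothetical stand-in for the chunk-end query: take up to chunkSize keys
    // starting at chunkStart (in database order) and return the last of them.
    static String nextChunkEnd(List<String> keys, String chunkStart, int chunkSize,
                               Comparator<String> dbOrder) {
        int from = 0;
        while (from < keys.size() && dbOrder.compare(keys.get(from), chunkStart) < 0) {
            from++;
        }
        int last = Math.min(from + chunkSize, keys.size()) - 1;
        return keys.get(last);
    }

    // Boundary check: an end at or beyond max means "no more bounded chunks".
    static String boundedEnd(String chunkEnd, String max, Comparator<String> boundaryOrder) {
        return boundaryOrder.compare(chunkEnd, max) >= 0 ? null : chunkEnd;
    }

    static List<String> split(List<String> keys, int chunkSize,
                              Comparator<String> dbOrder, Comparator<String> boundaryOrder) {
        if (chunkSize < 2) {
            throw new IllegalArgumentException("this sketch assumes chunkSize >= 2");
        }
        String min = keys.get(0);
        String max = keys.get(keys.size() - 1);
        List<String> chunks = new ArrayList<>();
        String chunkStart = null;
        String chunkEnd = boundedEnd(nextChunkEnd(keys, min, chunkSize, dbOrder), max, boundaryOrder);
        while (chunkEnd != null) {
            chunks.add("[" + chunkStart + ", " + chunkEnd + "]");
            chunkStart = chunkEnd;
            chunkEnd = boundedEnd(nextChunkEnd(keys, chunkStart, chunkSize, dbOrder), max, boundaryOrder);
        }
        chunks.add("[" + chunkStart + ", null]"); // last chunk is unbounded above
        return chunks;
    }

    public static void main(String[] args) {
        Comparator<String> db = String.CASE_INSENSITIVE_ORDER;
        // Boundary check with Java's case-sensitive order, as described in this issue.
        System.out.println(split(KEYS, 4, db, Comparator.naturalOrder()));
        // prints [[null, B2], [B2, null]]
        // Boundary check with the same ordering the "database" uses.
        System.out.println(split(KEYS, 4, db, db));
        // prints [[null, B2], [B2, d1], [d1, E2], [E2, null]]
    }
}
```

In this sketch, using one ordering for both the queries and the boundary check keeps every chunk near {{chunkSize}}. The mixed case stops after the first chunk and leaves all remaining keys in a single unbounded chunk.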
--
This message was sent by Atlassian Jira
(v8.20.10#820010)