[I] [Bug] [Mysql Source] Severe data skew in parallel data synchronization from a MySQL source [seatunnel]

via GitHub Wed, 22 Oct 2025 04:24:22 -0700


fxrWinters opened a new issue, #9974:
URL: https://github.com/apache/seatunnel/issues/9974


   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22)
 and found no similar issues.
   
   
   ### What happened
   
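   1. Data source: a MySQL table with more than 11 million rows. partition_column is the primary key, a string column whose values are purely numeric strings generated by the snowflake algorithm.
   2. Job: run in local mode and in separated mode, with a job parallelism of 10.
   3. Problem: after splitting by the primary key, the data is severely skewed. The logged split condition looks like ABS(MD5(`id`) % 10). The split with condition value 0 holds more than 5 million rows, while the splits with values 1 to 9 each hold a little over 600,000 rows and are fairly evenly distributed.
   
   ABS(MD5(`id`) % 10)  = 0  -->  5,000,000+ rows
   ABS(MD5(`id`) % 10)  = 1  -->  600,000+ rows
   ...
   ABS(MD5(`id`) % 10)  = 9  -->  600,000+ rows
   4. Because the split sizes are uneven, the synchronization takes a long time; in the end the whole job waits on the single thread that holds the largest split.
   5. In testing, using the CRC32 function instead of MD5 distributes the data evenly, but the current version does not support configuring the split strategy.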
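
One plausible explanation for the skew, not confirmed anywhere in this issue: MD5() returns a hexadecimal string, and when MySQL uses a string in arithmetic it converts only the leading numeric prefix, so a digest that starts with a letter (a to f) evaluates to 0 and lands in split 0. The Python sketch below simulates that behaviour under stated assumptions. `leading_int` is a simplified stand-in for MySQL's implicit cast (it ignores exponent notation and floating-point precision), the keys are hypothetical snowflake-like ids, and none of the helper names come from SeaTunnel.

```python
import hashlib
import zlib


def leading_int(text: str) -> int:
    """Simplified model of a string-to-number cast: keep the leading decimal
    digits only; a string that starts with a non-digit becomes 0."""
    digits = ""
    for ch in text:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits) if digits else 0


def md5_bucket(key: str, n: int) -> int:
    # Mirrors ABS(MD5(id) % n) under the simplified cast above.
    return abs(leading_int(hashlib.md5(key.encode()).hexdigest()) % n)


def crc32_bucket(key: str, n: int) -> int:
    # CRC32 already yields an unsigned integer, so no cast is involved.
    return zlib.crc32(key.encode()) % n


def bucket_shares(keys, bucket_fn, n):
    """Return the fraction of keys that fall into each of the n buckets."""
    counts = [0] * n
    for key in keys:
        counts[bucket_fn(key, n)] += 1
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical numeric-string keys standing in for snowflake ids.
keys = [str(1_700_000_000_000_000_000 + i * 4099) for i in range(100_000)]

md5_shares = bucket_shares(keys, md5_bucket, 10)
crc_shares = bucket_shares(keys, crc32_bucket, 10)

print("md5-prefix shares:", [round(s, 3) for s in md5_shares])
print("crc32 shares:     ", [round(s, 3) for s in crc_shares])
```

In this model, 6 of the 16 possible first hex characters are letters, so a large share of keys collapses into bucket 0 while the CRC32 buckets stay close to one tenth each, which matches the shape of the distribution reported above.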
   
   ### SeaTunnel Version
   
   2.3.11
   
   ### SeaTunnel Config
   
   ```conf
   No other special configuration; the JVM parameters xmx and xms were set to 4G
   ```
   
   ### Running Command
   
   ```shell
   /.../bin/seatunnel.sh --config /.../table -m local
   ```
   
   ### Error Exception
   
   ```log
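   No exception is thrown and the job runs to completion. Because of the data skew, the job has to wait for the thread with the largest split to finish, which takes a long time.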
   ```
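
To put a rough number on that wait, the sketch below estimates the slowdown from the approximate split sizes reported in this issue. It assumes one reader thread per split and an equal cost per row, neither of which was measured here, and the row counts are the lower bounds quoted above rather than exact figures.

```python
# Approximate per-split row counts from the issue ("5,000,000+" and "600,000+").
split_rows = [5_000_000] + [600_000] * 9

total_rows = sum(split_rows)
ideal_rows_per_split = total_rows / len(split_rows)

# With one reader per split and equal per-row cost (both assumptions),
# wall-clock time is driven by the largest split.
largest_share = max(split_rows) / total_rows
slowdown = max(split_rows) / ideal_rows_per_split

print(f"largest split holds {largest_share:.0%} of the rows")
print(f"estimated slowdown versus an even split: {slowdown:.1f}x")
```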
   
   ### Zeta or Flink or Spark Version
   
   Zeta = 2.3.11
   
   ### Java or Scala Version
   
   Java = 1.8
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

