jinkachy opened a new pull request, #9002:
URL: https://github.com/apache/seatunnel/pull/9002

   
   
   ### Purpose of this pull request
   
   This PR introduces a new character-based splitting algorithm for JDBC 
connectors when dealing with string-type columns. The traditional approach for 
splitting string-type data relies on database limit queries or mod hash 
operations, which can be inefficient for large datasets. The new algorithm uses 
character set ordering for more efficient splitting, eliminating the need for 
multiple database limit queries when MIN and MAX values are already known.
   
   The core algorithm, implemented in 
`org.apache.seatunnel.connectors.seatunnel.jdbc.source.CollationBasedSplitter`, 
works as follows:
   
   1. It treats strings as numbers in a numeral system where the base is the 
size of the character set (plus 1 to account for null/empty character)
   2. Each string is converted to a "numeral" in this system, with positions 
representing place values
   3. These numerals are then converted to decimal (BigInteger) values to 
create a numerical range
   4. The numerical range is split evenly using standard numeric splitting 
algorithms
   5. The resulting split points are converted back to string representation
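
The five steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code in `CollationBasedSplitter`: the class name `StringSplitSketch` and the methods `toNumber`, `toText`, and `splitPoints` are hypothetical, and it assumes the alphabet is the visible ASCII range ordered by code point (that is, a binary-style collation), with digit 0 reserved for "no character" padding.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and details are assumptions, not the PR's code.
public class StringSplitSketch {
    static final int FIRST = 32;   // first visible ASCII code point
    static final int LAST = 126;   // last visible ASCII code point
    // Base = alphabet size (95) + 1 for the "no character" digit 0.
    static final BigInteger BASE = BigInteger.valueOf(LAST - FIRST + 2);

    // Steps 1-3: read the string as a fixed-width base-96 numeral.
    static BigInteger toNumber(String s, int width) {
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < width; i++) {
            int digit = 0; // padding for positions past the end of the string
            if (i < s.length()) {
                char c = s.charAt(i);
                if (c < FIRST || c > LAST) {
                    throw new IllegalArgumentException("unsupported character: " + (int) c);
                }
                digit = c - FIRST + 1;
            }
            value = value.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        return value;
    }

    // Step 5: convert a number back to a string, stopping at the first
    // padding digit (a simplification chosen for this sketch).
    static String toText(BigInteger value, int width) {
        int[] digits = new int[width];
        for (int i = width - 1; i >= 0; i--) {
            BigInteger[] qr = value.divideAndRemainder(BASE);
            digits[i] = qr[1].intValue();
            value = qr[0];
        }
        StringBuilder sb = new StringBuilder();
        for (int d : digits) {
            if (d == 0) {
                break;
            }
            sb.append((char) (d - 1 + FIRST));
        }
        return sb.toString();
    }

    // Step 4: cut [min, max] into `parts` numeric ranges of equal width
    // and return the parts - 1 interior boundaries as strings.
    static List<String> splitPoints(String min, String max, int parts) {
        int width = Math.max(min.length(), max.length());
        BigInteger lo = toNumber(min, width);
        BigInteger range = toNumber(max, width).subtract(lo);
        List<String> points = new ArrayList<>();
        for (int i = 1; i < parts; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(parts));
            points.add(toText(lo.add(offset), width));
        }
        return points;
    }
}
```

With these assumptions, `splitPoints("a", "c", 2)` yields the single boundary `"b"`. Padding with digit 0 keeps a prefix such as `"a"` ordered before `"ab"`, which matches code-point ordering; a database collation that orders characters differently would need a different character-to-digit mapping.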
   
   This approach produces evenly distributed string splits without requiring 
additional database queries, significantly improving performance for large 
datasets.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR introduces a new configuration option `string_split_mode` which 
can be set to `charsetBased` to enable the new character set-based string 
splitting algorithm. Users can also specify the `collate` parameter to define a 
specific character collation order. If not specified, the database system's 
default sorting rule will be used.
   
   Currently, the implementation supports all visible ASCII characters (code 
points 32-126), which covers most common use cases for string fields typically 
composed of numbers and letters.
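
As a rough illustration of that supported range, a split key could be checked before choosing this mode. The class `VisibleAsciiCheck` and its methods are hypothetical names used only for this sketch; how the PR itself validates input is not shown here.

```java
// Hypothetical helper; not part of the PR.
public class VisibleAsciiCheck {
    static final int FIRST = 32;   // space
    static final int LAST = 126;   // tilde

    // Number of characters in the supported alphabet: 126 - 32 + 1 = 95.
    static int alphabetSize() {
        return LAST - FIRST + 1;
    }

    // True when every character lies in the visible ASCII range.
    static boolean isSupported(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < FIRST || c > LAST) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `isSupported("user_0042")` is true, while a value containing a tab or an accented letter is rejected.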
   
   **Recommendation:** Set `string_split_mode=charsetBased` when dealing with 
large datasets that require many partitions and have only string fields 
available as split keys. In these scenarios, this mode significantly reduces 
the number of database queries and improves overall performance.
   
   ### How was this patch tested?
   
   The implementation has been tested with:
   
   1. Unit tests for the `CollationBasedSplitter` class to verify the 
conversion between strings and numeric ranges
   2. Tests against different database systems (MySQL, PostgreSQL, and others) 
to verify that string-based splitting works correctly
   3. Performance comparison tests between the traditional approach and the new 
character-based approach
   
   All tests confirm that the algorithm correctly splits string-type data into 
evenly distributed chunks and provides significant performance improvements for 
large datasets.
   
   ### Check list
   
   * [x] If any new Jar binary package is added in your PR, please add a 
License Notice according to the [New License 
Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [x] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [x] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     2. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add ci label in 
[label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add e2e testcase in 
[seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update connector 
[plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)
   * [x] Update the 
[`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).

