[PR] [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm [seatunnel]

via GitHub Fri, 04 Apr 2025 13:51:30 -0700


jinkachy opened a new pull request, #9002:
URL: https://github.com/apache/seatunnel/pull/9002

### Purpose of this pull request

This PR introduces a new character-based splitting algorithm for JDBC
connectors when dealing with string-type columns. The traditional approach for
splitting string-type data relies on database limit queries or mod hash
operations, which can be inefficient for large datasets. The new algorithm uses
character set ordering for more efficient splitting, eliminating the need for
multiple database limit queries when MIN and MAX values are already known.

The core algorithm, implemented in
`org.apache.seatunnel.connectors.seatunnel.jdbc.source.CollationBasedSplitter`,
works as follows:

1. It treats strings as numbers in a numeral system where the base is the
size of the character set (plus 1 to account for null/empty character)
2. Each string is converted to a "numeral" in this system, with positions
representing place values
3. These numerals are then converted to decimal (BigInteger) values to
create a numerical range
4. The numerical range is split evenly using standard numeric splitting
algorithms
5. The resulting split points are converted back to string representation

This approach produces evenly distributed string splits without requiring
additional database queries, significantly improving performance for large
datasets.
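
The five steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the `CollationBasedSplitter` code from this PR: the names `encode`, `decode`, and `split_points` are hypothetical, the alphabet is assumed to be visible ASCII in code point order (a real collation may order characters differently), and a split point that decodes to an embedded "no character" digit is simply truncated there.

```python
# Assumed alphabet: visible ASCII (code points 32-126) in code point order.
CHARSET = [chr(cp) for cp in range(32, 127)]
BASE = len(CHARSET) + 1  # 96: digit 0 is reserved for "no character"
DIGIT = {ch: i + 1 for i, ch in enumerate(CHARSET)}


def encode(s: str, width: int) -> int:
    """Read s as a fixed-width base-96 numeral, right-padded with digit 0."""
    if len(s) > width:
        raise ValueError("string is longer than the chosen width")
    value = 0
    for pos in range(width):
        # A KeyError here means the character is outside the assumed alphabet.
        digit = DIGIT[s[pos]] if pos < len(s) else 0
        value = value * BASE + digit
    return value


def decode(value: int, width: int) -> str:
    """Turn a numeral back into a string, stopping at the first 0 digit."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    digits.reverse()  # most significant digit first
    chars = []
    for digit in digits:
        if digit == 0:
            break  # "no character": treat the rest as padding
        chars.append(CHARSET[digit - 1])
    return "".join(chars)


def split_points(min_value: str, max_value: str, num_splits: int) -> list:
    """Return up to num_splits - 1 boundaries between min_value and max_value."""
    width = max(len(min_value), len(max_value))
    lo, hi = encode(min_value, width), encode(max_value, width)
    points, last = [], min_value
    for i in range(1, num_splits):
        point = decode(lo + (hi - lo) * i // num_splits, width)
        if point > last:  # drop duplicates when the range is very narrow
            points.append(point)
            last = point
    return points


print(split_points("a", "e", 4))      # ['b', 'c', 'd']
print(split_points("aaa", "zzz", 5))  # ['fff', 'kkk', 'ppp', 'uuu']
```

Python integers are arbitrary precision, so they stand in for the `BigInteger` values mentioned in step 3. Note that only the MIN and MAX values are needed as input; the boundaries themselves are computed without further queries.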

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces a new configuration option `string_split_mode` which
can be set to `charsetBased` to enable the new character set-based string
splitting algorithm. Users can also specify the `collate` parameter to define a
specific character collation order. If not specified, the database system's
default sorting rule will be used.

Currently, the implementation supports all visible ASCII characters (code
points 32-126), which covers most common use cases for string fields typically
composed of numbers and letters.
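
As a quick way to check whether a given key falls inside that supported range, a check along these lines could be used. It is a hypothetical helper for illustration only (the name `is_supported_split_key` does not come from this PR), and how the connector itself handles unsupported characters is not described here.

```python
def is_supported_split_key(value: str) -> bool:
    """True if every character is visible ASCII (code points 32-126)."""
    return all(32 <= ord(ch) <= 126 for ch in value)


print(is_supported_split_key("user_42"))    # True
print(is_supported_split_key("naïve"))      # False: 'ï' is outside ASCII
print(is_supported_split_key("tab\there"))  # False: tab is code point 9
```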

**Recommendation:** Set `string_split_mode=charsetBased` when
dealing with large datasets that require many partitions and only have string
fields available as split keys. This mode significantly reduces the number of
database queries and improves overall performance in these scenarios.

### How was this patch tested?

The implementation has been tested with:

1. Unit tests for the `CollationBasedSplitter` class to verify the
conversion between strings and numeric ranges
2. Tests against different database systems (MySQL, PostgreSQL, and others) to
verify that string-based splitting works correctly
3. Performance comparison tests between the traditional approach and the new
character-based approach

All tests confirm that the algorithm correctly splits string-type data into
evenly distributed chunks and provides significant performance improvements for
large datasets.

### Check list

* [x] If any new Jar binary package is added in your PR, please add a License
Notice according to the [New License
Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
* [x] If necessary, please update the documentation to describe the new
feature. https://github.com/apache/seatunnel/tree/dev/docs
* [x] If you are contributing the connector code, please check that the
following files are updated:
    1. Update
    [plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
    and add new connector information in it
    2. Update the pom file of
    [seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
    3. Add ci label in
    [label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
    4. Add e2e testcase in
    [seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
    5. Update connector
    [plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)
* [x] Update the
[`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).
