xuqinghuang opened a new issue, #541: URL: https://github.com/apache/doris-flink-connector/issues/541
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues.

### Description

In the current MongoDB synchronization, the sampling parameter schema.sample-percent, which is used when initializing data, defaults to 0.2. Because this ratio is fixed, the same rule is applied whether the table is large or small:

1. For a large table, the sample is too big, which can cause performance problems in the program.
2. For a small table, the sample is too small, which can result in an incorrectly inferred structure.

### Solution

I suggest changing the logic to dynamic sampling, for example:

1. If the table is small: the sample size is automatically set to the total number of rows in the table, i.e. all data is collected.
2. If the table is large: the sample size is limited to MAX_SAMPLE_SIZE (e.g. 100,000).
3. If the table is of moderate size: sample according to the proportion specified by the user, but the sample size cannot be less than MIN_SAMPLE_SIZE or more than MAX_SAMPLE_SIZE.

This approach handles tables of different sizes flexibly while avoiding performance bottlenecks.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
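
For illustration, the dynamic sampling rule proposed in the Solution section could be sketched roughly as follows. This is only a sketch under assumptions, not the connector's actual code: the class name `DynamicSampling`, the method `sampleSize`, and the value 1,000 for MIN_SAMPLE_SIZE are hypothetical (the issue gives no value for it), and a table is treated as "small" when its row count does not exceed MIN_SAMPLE_SIZE. Only the 100,000 example for MAX_SAMPLE_SIZE comes from the issue text.

```java
// Hypothetical helper illustrating the proposed rule; not part of the connector.
final class DynamicSampling {
    // Assumed lower bound: the issue does not specify a value for MIN_SAMPLE_SIZE.
    static final long MIN_SAMPLE_SIZE = 1_000L;
    // Upper bound taken from the example value given in the issue.
    static final long MAX_SAMPLE_SIZE = 100_000L;

    private DynamicSampling() {}

    /**
     * Returns how many rows to sample for schema inference.
     *
     * @param totalCount    total number of rows in the table
     * @param samplePercent user-specified ratio in (0, 1], e.g. 0.2
     */
    static long sampleSize(long totalCount, double samplePercent) {
        if (totalCount < 0) {
            throw new IllegalArgumentException("totalCount must be >= 0");
        }
        if (samplePercent <= 0 || samplePercent > 1) {
            throw new IllegalArgumentException("samplePercent must be in (0, 1]");
        }
        // Small table: collect all rows.
        if (totalCount <= MIN_SAMPLE_SIZE) {
            return totalCount;
        }
        // Moderate or large table: apply the user's ratio, then clamp the
        // result to [MIN_SAMPLE_SIZE, MAX_SAMPLE_SIZE].
        long proportional = Math.round(totalCount * samplePercent);
        return Math.max(MIN_SAMPLE_SIZE, Math.min(MAX_SAMPLE_SIZE, proportional));
    }
}
```

With these assumed bounds and the default ratio of 0.2, a table of 500 rows is read in full, a table of 3,000 rows is sampled at the minimum of 1,000 rows, a table of 50,000 rows yields 10,000 samples, and a table of 10,000,000 rows is capped at 100,000 samples.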
