[I] [Enhancement] Logic Optimization of MongoCDC Sampled Data [doris-flink-connector]

via GitHub Thu, 09 Jan 2025 18:36:29 -0800


xuqinghuang opened a new issue, #541:
URL: https://github.com/apache/doris-flink-connector/issues/541


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Description
   
   During Mongo synchronization, the initial schema sampling ratio, 
schema.sample-percent, currently defaults to 0.2. Because this ratio is fixed, 
the same sampling logic is applied whether the table is large or small.
   
   1. On a large table, a fixed ratio samples too many documents, which can 
cause performance problems in the program.
   2. On a small table, a fixed ratio samples too few documents, which can 
result in an incorrect collected structure.
   
   ### Solution
   
   I feel that the logic can be changed to dynamic sampling, for example:
   
   1. If the table is small: the sample size is automatically set to the total 
number of documents in the table, i.e. all data is collected.
   2. If the table is large: the sample size is capped at MAX_SAMPLE_SIZE 
(e.g. 100,000).
   3. If the table is of moderate size: sample according to the proportion 
specified by the user, but the sample size cannot be less than 
MIN_SAMPLE_SIZE or more than MAX_SAMPLE_SIZE.
   
   This approach allows for flexibility in handling tables of different sizes 
while avoiding performance bottlenecks.
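
   The three rules above could be sketched roughly as follows. This is only an 
illustration, not the connector's code: the class name, the method name, the 
MIN_SAMPLE_SIZE value of 1,000, and the choice to treat any table with at most 
MIN_SAMPLE_SIZE documents as "small" are all assumptions. Only the 100,000 
example for MAX_SAMPLE_SIZE comes from the proposal itself.

```java
// Hypothetical helper illustrating the proposed dynamic sampling rules.
final class DynamicSampleSize {
    // Assumed lower bound; the proposal does not give a value.
    static final long MIN_SAMPLE_SIZE = 1_000L;
    // Example upper bound taken from the proposal.
    static final long MAX_SAMPLE_SIZE = 100_000L;

    /**
     * @param totalDocs     number of documents in the collection
     * @param samplePercent user-configured ratio (schema.sample-percent), e.g. 0.2
     * @return number of documents to sample
     */
    static long sampleSize(long totalDocs, double samplePercent) {
        if (totalDocs <= 0) {
            return 0L;
        }
        // Small table: collect everything.
        if (totalDocs <= MIN_SAMPLE_SIZE) {
            return totalDocs;
        }
        // Moderate table: follow the user's ratio, clamped to [MIN, MAX].
        long proportional = Math.round(totalDocs * samplePercent);
        long clamped = Math.max(MIN_SAMPLE_SIZE, Math.min(MAX_SAMPLE_SIZE, proportional));
        // Large table: the clamp above already caps the result at MAX_SAMPLE_SIZE.
        // Never ask for more documents than exist.
        return Math.min(clamped, totalDocs);
    }

    public static void main(String[] args) {
        long[] sizes = {500L, 2_000L, 20_000L, 10_000_000L};
        for (long n : sizes) {
            System.out.println(n + " docs -> sample " + sampleSize(n, 0.2));
        }
    }
}
```

   With these assumed bounds and the default ratio of 0.2, a 500-document 
table is read in full, a 2,000-document table is raised to the 1,000 minimum, 
a 20,000-document table is sampled at 4,000, and a 10,000,000-document table 
is capped at 100,000.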
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

