2000liux opened a new issue, #3972: URL: https://github.com/apache/incubator-seatunnel/issues/3972
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

Hello everyone,

### 1. Existing problem description

- SeaTunnel currently supports only a few very simple Transformation functions, which cannot meet complex data integration needs. This lowers the priority users give SeaTunnel when choosing a tool (flink-cdc and flink-connectors get higher priority).
- Below I describe what Deequ can do and what problems it solves.

### 2. What are the highlights of Deequ?

#### 2.1 Supported functions

Some of the supported functions:

- Data analysis: Approximate Quantile, Approximate Quantiles, Completeness, Compliance, Correlation, CountDistinct, DataType, Distance, Distinctness, Histogram, Maximum, Maximum Length, Mean, Minimum, Minimum Length, Mutual Information, Pattern Match, Size, Standard Deviation, Sum, Unique Value Ratio, Uniqueness.
- Data validation: hasSize, isComplete, hasCompleteness, isUnique, isPrimaryKey, hasUniqueness, hasDistinctness, hasUniqueValueRatio, hasNumberOfDistinctValues, hasHistogramValues, hasEntropy, hasMutualInformation, hasApproxQuantile, hasMinLength, hasMaxLength, hasMin, hasMax, hasMean, hasSum, hasStandardDeviation, hasApproxCountDistinct, hasCorrelation, has, containsEmail, containsURL, containsSocialSecurityNumber, hasDataType, isNonNegative, isPositive, isLessThan, isLessThanOrEqualTo, isGreaterThan, isGreaterThanOrEqualTo, isContainedIn.
- Verification of abnormal fields, output of abnormal indicators, etc.

#### 2.2 What problems can Deequ solve?

- Missing values can cause failures in production systems that require non-null values (null pointer exceptions).
- Changes in the distribution of data can lead to unexpected outputs from machine learning models.
- Aggregation of incorrect data, or import/export, may cause errors within the framework.
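As a rough illustration of what validation checks such as hasSize, isComplete and isUnique compute, here is a minimal plain-Python sketch. It does not use Deequ or its API: the helper names (`completeness`, `is_unique`, `run_checks`), the sample rows and the null handling are assumptions made for this example only, and a real integration would run on Spark.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not None (1.0 for no rows)."""
    if not rows:
        return 1.0
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-null value in `column` occurs more than once."""
    values = [r.get(column) for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Evaluate every named check and return a name -> passed report."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data: one duplicate id and one missing name.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 2, "name": "c"},
]

# Hypothetical checks, loosely modelled on the size/completeness/uniqueness ideas.
checks = {
    "size >= 3": lambda rs: len(rs) >= 3,
    "id is complete": lambda rs: completeness(rs, "id") == 1.0,
    "id is unique": lambda rs: is_unique(rs, "id"),
    "name is complete": lambda rs: completeness(rs, "name") == 1.0,
}

report = run_checks(rows, checks)
for name, passed in report.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

In this sample the duplicate `id` and the missing `name` fail their checks. Failures like these are the kind of problem described in 2.2, which a validation step would surface before the data reaches downstream systems.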
#### 2.3 Features of Deequ

- You can create a pipeline to verify data integrity and detect missing data
- Multi-module and multi-mode anomaly detection
- It can repair data dynamically and export exception data reports
- Implemented on Spark, out-of-the-box (Flink can be supported in the future)

#### 2.4 Deequ analytical capabilities for data

1. Data Analysis
2. Data Validation
3. Anomaly detection
4. Automatic Constraint Recommendations

### 3. Suggestions for Improvement

- If SeaTunnel integrates this capability, will more users choose SeaTunnel? I think so.
- Directly integrate Deequ, or directly support Deequ's capabilities. Our Transformation capabilities would be greatly strengthened and improved.
- Remarks
  - I have some cases from using Deequ. If you are interested in this feature, please @ me and I will post the code for the analysis capabilities listed in section 2.4.

### 4. Other

- [View deequ items ]()

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
