2000liux opened a new issue, #3972:
URL: https://github.com/apache/incubator-seatunnel/issues/3972

   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Hello everyone,
   ### 1. Existing problem description
   - SeaTunnel currently supports only a few very simple Transformation 
functions, which cannot meet the needs of complex data integration. This 
lowers SeaTunnel's priority when users choose a tool (flink-cdc and 
flink-connectors rank higher).
   
   - Below I describe what Deequ can do and which problems it solves.
   
   ### 2. What are the highlights of Deequ?
   #### 2.1 Some of the supported functions
     - Data analysis metrics: Approximate Quantile, Completeness, Compliance, 
Correlation, Count Distinct, Data Type, Distance, Dissimilarity, Histogram, 
Maximum, Maximum Length, Mean, Minimum, Minimum Length, Mutual Information, 
Pattern Match, Size, Standard Deviation, Sum, Unique Value Ratio, Uniqueness;
    
     - Data validation checks: hasSize, isComplete, hasCompleteness, isUnique, 
isPrimaryKey, hasUniqueness, hasDistinctness, hasUniqueValueRatio, 
hasNumberOfDistinctValues, hasHistogramValues, hasEntropy, 
hasMutualInformation, hasApproxQuantile, hasMinLength, hasMaxLength, hasMin, 
hasMax, hasMean, hasSum, hasStandardDeviation, hasApproxCountDistinct, 
hasCorrelation, has, containsEmail, containsURL, containsSocialSecurityNumber, 
hasDataType, isNonNegative, isPositive, isLessThan, isLessThanOrEqualTo, 
isGreaterThan, isGreaterThanOrEqualTo, isContainedIn;
   
     - Other functions, such as verifying abnormal fields and outputting 
anomaly metrics.
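
To make the split between metrics and checks concrete, here is a minimal plain-Python sketch of the idea. It does not use Deequ or Spark; the sample rows and the helper functions `completeness`, `uniqueness`, and `is_non_negative` are hypothetical illustrations, and the metric definitions are my reading of what the correspondingly named Deequ metrics compute.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"id": 1, "name": "a", "price": 10.0},
    {"id": 2, "name": None, "price": 12.5},
    {"id": 3, "name": "c", "price": -1.0},
    {"id": 3, "name": "d", "price": 7.0},
]


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Fraction of rows whose value in `column` occurs exactly once."""
    counts = Counter(r[column] for r in rows)
    return sum(1 for r in rows if counts[r[column]] == 1) / len(rows)


def is_non_negative(rows, column):
    """True if every non-missing value in `column` is >= 0."""
    return all(r[column] is None or r[column] >= 0 for r in rows)


# A check pairs a computed metric with an assertion on that metric.
# The keys only mimic the check names listed above; they are labels, not API calls.
checks = {
    "isComplete(name)": completeness(rows, "name") == 1.0,
    "isUnique(id)": uniqueness(rows, "id") == 1.0,
    "isNonNegative(price)": is_non_negative(rows, "price"),
    "hasSize(== 4)": len(rows) == 4,
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)
```

In a real integration the metrics would be computed by the engine over the full dataset, and only the failed checks would be reported.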
     
   #### 2.2 What problems can Deequ solve?
   - Missing values can cause failures in production systems that require 
non-null values (null pointer exceptions).
   - Changes in the distribution of data can lead to unexpected outputs from 
machine learning models.
   - Aggregating incorrect data, or importing/exporting it, may cause errors 
within the framework.
   
   #### 2.3 Features of Deequ
   - You can create a pipeline to verify data integrity or detect missing data
   - Multi-module and multi-mode anomaly detection
   - It can repair data dynamically and export reports of abnormal data
   - Implemented on Spark and works out of the box (Flink could be supported 
in the future)
   
   #### 2.4 Deequ analytical capabilities for data
   - 1. Data Analysis
   - 2. Data Validation
   - 3. Anomaly Detection
   - 4. Automatic Constraint Recommendations
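
As an illustration of the anomaly detection item, the sketch below flags a run whose metric value changes too sharply relative to the previous run. It is a plain-Python sketch of a relative-rate-of-change rule, not Deequ's API; the function name, the thresholds, and the row-count history are all hypothetical.

```python
def relative_change_anomalies(history, max_increase=2.0, max_decrease=0.5):
    """Return (index, value, ratio) for each value whose ratio to the
    previous value falls outside [max_decrease, max_increase]."""
    anomalies = []
    for i in range(1, len(history)):
        prev, curr = history[i - 1], history[i]
        if prev == 0:
            # The ratio is undefined when the previous value is zero; skip it.
            continue
        ratio = curr / prev
        if ratio > max_increase or ratio < max_decrease:
            anomalies.append((i, curr, ratio))
    return anomalies


# Hypothetical row counts recorded by five consecutive pipeline runs.
row_counts = [1000, 1040, 990, 2500, 2550]
anomalies = relative_change_anomalies(row_counts)
print(anomalies)
```

Only the fourth run is flagged here, because its row count is more than twice that of the run before it; the fifth run is compared against the fourth and passes.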
   
   ### 3. Suggestions for Improvement
   - If SeaTunnel integrated this capability, would more users choose 
SeaTunnel? I think so.
   - We could either integrate Deequ directly or natively support Deequ's 
capabilities. Either way, our Transformation capabilities would be greatly 
strengthened.
   
   - Remarks
   - I have some cases from using Deequ. If you are interested in this 
feature, please @ me and I will post code for the analysis capabilities 
listed in section 2.4.
   
   ### 4. Other
   - ![View deequ items](https://user-images.githubusercontent.com/42398474/212920214-248212d2-484f-4519-aa6f-ddc292ac1a61.png)
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
