Hi,

Yes, it sounds like a very good idea. I am very interested in the topic.
Is there an ongoing discussion that I can start to look at?

Thanks,

Enrico

On 5/13/18, 2:58 AM, "William Guo" <gu...@apache.org> wrote:

    Hi Enrico,
    
    Yes. Since we released 0.2.0 recently, our next plan includes enhancing
    measures, including support for anomaly detection.
    
    
    Would you like to work on this feature together?
    
    
    Thanks,
    William
    
    On Sat, May 12, 2018 at 12:22 AM, Enrico D'Urso (JIRA) <j...@apache.org>
    wrote:
    
    >
    >     [ https://issues.apache.org/jira/browse/GRIFFIN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472199#comment-16472199 ]
    >
    > Enrico D'Urso commented on GRIFFIN-160:
    > ---------------------------------------
    >
    > Hi,
    >
    > There are several ways to implement anomaly detection.
    >
    > The key requirement is numerical data. If you want to apply anomaly
    > detection to non-numerical data, you have to map strings to numbers somehow.
    >
    > However, as Griffin uses Spark as the engine, I think K-Means can be an
    > option.
    >
    > Basically, you have your data: you normalise it, decide the number of
    > clusters, apply K-means, and finally check each sample's distance from the
    > final centroids to search for anomalies. MLlib fully supports it.
    >
    > Otherwise, just compute the mean and standard deviation and search for
    > samples that are more than 3 standard deviations from the mean.
    >
    > More sophisticated detection can be done using a covariance matrix and a
    > Gaussian distribution; more info here:
    > [https://www.coursera.org/learn/machine-learning/lecture/C8IJp/helpUrl]
    >
    > but I am not sure whether it is doable in a distributed environment.
    >
    >
    >
    > Thanks,
    >
    > Enrico
    >
    >
    >
    > > Anomaly detection for thousands of tables
    > > -----------------------------------------
    > >
    > >                 Key: GRIFFIN-160
    > >                 URL: https://issues.apache.org/jira/browse/GRIFFIN-160
    > >             Project: Griffin (Incubating)
    > >          Issue Type: New Feature
    > >            Reporter: William Guo
    > >            Assignee: William Guo
    > >            Priority: Major
    > >
    > > Hi team,
    > >
    > > I am trying to find the Griffin road map, and here it is:
    > > [https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap]. Is this
    > > the latest version?
    > >
    > > We have thousands of tables that need data quality validation. Is there
    > > any simple machine learning algorithm that can be applied to detect data
    > > quality issues instead of building a lot of measures? Will this be added
    > > to the Griffin road map if possible?
    > >
    > > Thanks, Randy
    > >
    >
    >
    >
    >
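
For illustration, the K-means procedure described in the quoted JIRA comment
(normalise, pick the number of clusters, run K-means, check the distance to
the final centroids) could be sketched as below. This is plain single-machine
Python, not Spark MLlib; the function names, the distance threshold, and the
"first k points" initialisation are assumptions made only for this sketch.

```python
import math


def normalise(rows):
    """Scale each column to zero mean and unit variance (z-score)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns the final centroids."""
    # Assumption for the sketch: deterministic init from the first k points.
    # Real implementations use smarter initialisation.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def find_anomalies(rows, k, threshold):
    """Return indices of rows farther than `threshold` from every centroid."""
    points = normalise(rows)
    centroids = kmeans(points, k)
    return [i for i, p in enumerate(points)
            if min(math.dist(p, c) for c in centroids) > threshold]
```

Note that the result depends on the initial centroids: if an outlier happens
to be picked as a starting centroid it can form its own cluster and go
unflagged, so the threshold and initialisation need care in practice.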
    

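The simpler mean and standard deviation rule mentioned in the quoted JIRA
comment could look roughly like this. It is a minimal sketch for a single
numeric column held in memory; the function name and the `n_sigmas` parameter
are hypothetical, and the population standard deviation is an assumption.

```python
import statistics


def three_sigma_outliers(values, n_sigmas=3.0):
    """Return the values lying more than n_sigmas std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) > n_sigmas * std]
```

With the population standard deviation, a single sample can be at most
sqrt(n - 1) standard deviations from the mean, so this rule cannot flag
anything in a column with 10 or fewer samples.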