Hi Enrico,

https://issues.apache.org/jira/browse/GRIFFIN-165
A ticket has been created to track this issue.

Thanks,
William

On Tue, May 15, 2018 at 6:58 AM, William Guo <gu...@apache.org> wrote:

> yes,
>
> The Griffin team will write a doc about the contribution points
> (interfaces) for measures.
>
> I will let you know when it is ready.
>
> Thanks,
> William
>
> On Mon, May 14, 2018 at 5:13 PM, Enrico D'Urso <a-edu...@hotels.com>
> wrote:
>
>> Hi,
>>
>> Yes, it sounds like a very good idea. I am pretty interested in the
>> topic. Is there an ongoing discussion that I can start to look at?
>>
>> Thanks,
>>
>> Enrico
>>
>> On 5/13/18, 2:58 AM, "William Guo" <gu...@apache.org> wrote:
>>
>> hi Enrico,
>>
>> Yes. Since we released 0.2.0 recently, our next plan will include
>> enhanced measures, including support for anomaly detection.
>>
>> Would you like to contribute to this feature together?
>>
>> Thanks,
>> William
>>
>> On Sat, May 12, 2018 at 12:22 AM, Enrico D'Urso (JIRA)
>> <j...@apache.org> wrote:
>>
>> >
>> > [ https://issues.apache.org/jira/browse/GRIFFIN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472199#comment-16472199 ]
>> >
>> > Enrico D'Urso commented on GRIFFIN-160:
>> > ---------------------------------------
>> >
>> > Hi,
>> >
>> > There are several ways to implement anomaly detection.
>> >
>> > The key requirement is numerical data. If you want to apply AD to
>> > non-numerical data, you have to map strings to numbers somehow.
>> >
>> > However, as Griffin uses Spark as the engine, I think K-Means can
>> > be an option.
>> >
>> > Basically, you have your data: you normalise it, decide the number
>> > of clusters, apply K-means, and finally check each sample's
>> > distance from the final centroids to search for anomalies. MLlib
>> > fully supports it.
>> >
>> > Otherwise, just compute the mean and standard deviation and search
>> > for samples that are more than 3 standard deviations from the mean.
>> >
>> > More complicated stuff can be done using a covariance matrix and a
>> > Gaussian distribution (more info here:
>> > [https://www.coursera.org/learn/machine-learning/lecture/C8IJp/helpUrl]),
>> > but I am not sure whether it is doable in a distributed environment.
>> >
>> > Thanks,
>> >
>> > Enrico
>> >
>> > > Anomaly detection for thousands of tables
>> > > -----------------------------------------
>> > >
>> > >                 Key: GRIFFIN-160
>> > >                 URL: https://issues.apache.org/jira/browse/GRIFFIN-160
>> > >             Project: Griffin (Incubating)
>> > >          Issue Type: New Feature
>> > >            Reporter: William Guo
>> > >            Assignee: William Guo
>> > >            Priority: Major
>> > >
>> > > Hi team,
>> > >
>> > > I am trying to find the Griffin road map, and here it is
>> > > [https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap].
>> > > Is this the latest version?
>> > >
>> > > We have thousands of tables that need data quality validation.
>> > > Is there any simple machine learning algorithm that can be
>> > > applied to detect data quality issues instead of building a lot
>> > > of measures? Could this be added to the Griffin road map if
>> > > possible?
>> > >
>> > > Thanks, Randy
>> > >
>> >
>> > --
>> > This message was sent by Atlassian JIRA
>> > (v7.6.3#76005)
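
For reference, the two simpler approaches from the quoted JIRA comment (the 3-standard-deviation rule, and K-means followed by a distance-to-centroid check) can be sketched in a few lines. This is only a minimal single-machine illustration in plain Python. It is not Griffin code and does not use Spark MLlib. The function names, the naive "first k points" centroid initialisation, and the distance threshold are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) > k * sigma]


def normalise(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points (assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(mean(dim) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def kmeans_outliers(rows, k, threshold):
    """Return rows whose normalised distance to the nearest centroid exceeds threshold."""
    points = normalise(rows)
    centroids = kmeans(points, k)
    return [
        row
        for row, p in zip(rows, points)
        if min(math.dist(p, c) for c in centroids) > threshold
    ]
```

Two caveats on the sketch. With the population standard deviation, a single value can never be more than sqrt(n - 1) standard deviations from the mean of n samples, so the 3-standard-deviation rule needs at least 11 samples before it can flag anything. For the K-means variant, an anomaly that ends up as its own cluster has distance zero to its centroid and is missed, so the number of clusters and the threshold both need tuning against real data.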