Hi All,

I've created a JIRA ticket and moved the discussion for the runtime performance framework there: https://issues.apache.org/jira/browse/MAHOUT-1869

@AndrewP & Trevor, I would like to integrate Zeppelin into the runtime performance measurement framework to output some measurement-related data for some of the algorithms. Should I wait until the Zeppelin integration is completely working before I incorporate this piece? I would also really appreciate some feedback, either on the JIRA ticket or in response to this thread.

Regards
> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> Date: Thu, 19 May 2016 21:31:05 -0700
>
> This proposal outlines a runtime performance module for measuring the performance of various algorithms in mahout in three major areas: clustering, regression, and classification. The module will be a spray/scala/akka application which can be run against any current or new algorithm in mahout, and will produce a csv file and a set of zeppelin plots outlining the various criteria for performance. The goal for any new mahout release will be to run a set of tests for each of the algorithms, to compare and contrast benchmarks from one release to another.
>
> Architecture
>
> The run time performance application will run on top of spray/scala and akka and will make async API calls into the various mahout algorithms to generate a csv file containing the run time performance measurements for each algorithm of interest, as well as a set of zeppelin plots for displaying some of these results. The spray/scala architecture will leverage the zeppelin server to create the visualizations. The discussion below centers on two types of algorithms to be addressed by the application.
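To make the proposed csv output concrete, here is a minimal sketch of what one benchmark row could look like. The class and column names are hypothetical illustrations, not something already in mahout:

```java
import java.util.StringJoiner;

// Hypothetical sketch of one csv row the perf module could emit per run.
// Column names are assumptions for illustration only.
class BenchmarkRow {
    final String algorithm;   // e.g. "kmeans"
    final String dataset;     // input data set file name
    final int numClusters;    // k used for this run
    final double runtimeSec;  // wall-clock time for the run
    final double accuracy;    // correctly clustered points / total points

    BenchmarkRow(String algorithm, String dataset, int numClusters,
                 double runtimeSec, double accuracy) {
        this.algorithm = algorithm;
        this.dataset = dataset;
        this.numClusters = numClusters;
        this.runtimeSec = runtimeSec;
        this.accuracy = accuracy;
    }

    static String header() {
        return "algorithm,dataset,numClusters,runtimeSec,accuracy";
    }

    // Serialize this row in the same column order as header().
    String toCsv() {
        StringJoiner j = new StringJoiner(",");
        j.add(algorithm).add(dataset)
         .add(Integer.toString(numClusters))
         .add(Double.toString(runtimeSec))
         .add(Double.toString(accuracy));
        return j.toString();
    }
}
```

One row per (algorithm, dataset, k) combination would make the release-to-release comparison a simple diff or join on the first three columns.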
> Clustering
>
> The application will consist of a set of REST APIs to do the following:
>
> a) A method to load and execute the run time perf module, which takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing various sizes of data sets, and finally a set of values for the number of clusters to use for each of the different dataset sizes:
>
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
>
> The above API call will return a runId which the client program can then use to monitor the module.
>
> b) A method to monitor the application to ensure that it's making progress towards generating the zeppelin plots:
>
> /monitor/runId=456
>
> The above method will execute asynchronously by calling into the mahout kmeans (fuzzy kmeans) clustering implementations and will generate zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The spray/scala/akka framework will allow the client application to receive a callback when the run time performance calculations are completed. For now the run time performance measurements will contain: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate zeppelin plots.
>
> Regression
>
> a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/akka stack.
>
> b) The run time performance module will record the following metrics for every algorithm: 1) cpu usage, 2) memory usage, 3) time taken for the algorithm to converge and run to completion.
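To pin down the request shapes, here is a small sketch that builds the two paths from the examples above. The helper class is hypothetical; only the path layouts come from the proposal:

```java
// Hypothetical helpers that build the request paths shown in the proposal.
// The path shapes mirror the two examples above; the class itself is an
// illustration, not an existing API.
class PerfPaths {
    // e.g. /algorithm=clustering/fileLocation=/data/sets/clusters=12,20,30,40
    static String submitPath(String algorithm, String fileLocation, int[] clusters) {
        StringBuilder sb = new StringBuilder();
        sb.append("/algorithm=").append(algorithm)
          .append("/fileLocation=").append(fileLocation)
          .append("/clusters=");
        for (int i = 0; i < clusters.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(clusters[i]);
        }
        return sb.toString();
    }

    // e.g. /monitor/runId=456, using the runId returned by the submit call
    static String monitorPath(int runId) {
        return "/monitor/runId=" + runId;
    }
}
```

Keeping the path construction in one place would make it easy to change the URL scheme later (for instance, to query parameters) without touching the client code.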
> These metrics will be reported on top of the zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
>
> How does the application get run
>
> The run time performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook this into some sort of integration test suite to certify the different mahout releases.
>
> I will add more thoughts around this and create a JIRA ticket only once there's enough consensus among the committers that this is headed in the right direction. I will also add some more thoughts on measuring the run time performance of some of the other algorithms after further research. I would love feedback or additional things to consider that I might have missed. If it's more appropriate I can move the discussion to a jira ticket as well, so please let me know.
>
> Thanks in advance.
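The two clustering measurements named in the proposal (the accuracy ratio and the time plotted per cluster count) can be sketched as below. Note that "normalized time" is not defined in the proposal; dividing each run's time by the slowest run is my assumption, chosen so the y axis stays in [0, 1]:

```java
// Hypothetical helpers for the two clustering measurements in the proposal.
class ClusteringPerf {
    // Ratio of points clustered correctly to the total number of points.
    static double accuracy(int correctlyClustered, int totalPoints) {
        if (totalPoints == 0) throw new IllegalArgumentException("no points");
        return (double) correctlyClustered / totalPoints;
    }

    // ASSUMED definition of "normalized time": each run's elapsed time
    // divided by the slowest run, so plotted values fall in [0, 1].
    // Inputs are assumed positive (at least one run took nonzero time).
    static double[] normalizeTimes(double[] elapsedSec) {
        double max = 0.0;
        for (double t : elapsedSec) max = Math.max(max, t);
        double[] out = new double[elapsedSec.length];
        for (int i = 0; i < elapsedSec.length; i++) {
            out[i] = elapsedSec[i] / max;
        }
        return out;
    }
}
```

If a different normalization is intended (say, time per point, to compare dataset sizes), that would be worth stating explicitly on the JIRA ticket so the plots are comparable across releases.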