This proposal outlines a runtime performance module for measuring the 
performance of various Mahout algorithms in the three major areas: 
clustering, regression, and classification.  The module will be a 
spray/scala/akka application that can be run against any current or new 
algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots 
covering the various performance criteria.  The goal for any new Mahout 
release will be to run this set of tests for each of the algorithms so that 
benchmarks can be compared and contrasted from one release to the next.


Architecture
The runtime performance application will run on top of spray/scala and akka 
and will make async API calls into the various Mahout algorithms to generate a 
CSV file containing the runtime performance measurements for each algorithm of 
interest, as well as a set of Zeppelin plots for displaying some of these 
results.  The spray/scala architecture will leverage the Zeppelin server to 
create the visualizations.  The discussion below centers around two types of 
algorithms to be addressed by the application.
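To make the wiring concrete, here is a minimal sketch of what the service skeleton might look like; it assumes the spray 1.x routing DSL (SimpleRoutingApp) and the object, actor system, and route names are hypothetical placeholders rather than an agreed-upon design:

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

// Minimal sketch of the perf-measurement service skeleton (names are hypothetical).
object PerfModuleService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("mahout-perf")

  startServer(interface = "localhost", port = 8080) {
    // Simple health-check route; the actual algorithm routes are sketched further below.
    path("ping") {
      get {
        complete("perf module up")
      }
    }
  }
}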


Clustering
The application will consist of a set of REST APIs to do the following:


a) A method to load and execute the runtime perf module.  It takes as inputs 
the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of 
files containing data sets of various sizes, and finally a set of values for 
the number of clusters to use for each of the different data set sizes, e.g.:


/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40


The above API call will return a runId which the client program can then use to 
monitor the run (a rough route sketch follows).
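A minimal sketch of the start-run route in spray's routing DSL.  The parameter names (algorithm, fileLocation, clusters) follow the example URL above, although this sketch passes them as query parameters rather than path segments, and the runId bookkeeping is only a placeholder:

import java.util.UUID
import spray.routing.HttpService

// Hypothetical start-run route written against spray's HttpService directives.
trait StartRunRoute extends HttpService {
  val startRunRoute =
    path("run") {
      get {
        parameters('algorithm, 'fileLocation, 'clusters) { (algorithm, fileLocation, clusters) =>
          val runId = UUID.randomUUID().toString
          // The real module would schedule the async run here and remember runId -> state
          // so that the monitor endpoint can report progress.
          complete(s"runId=$runId")
        }
      }
    }
}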




b) A method to monitor the application to ensure that it's making progress 
towards generating the Zeppelin plots, e.g.:
/monitor/runId=456




The run kicked off above will execute asynchronously by calling into the 
Mahout kmeans (fuzzy kmeans) clustering implementations and will generate 
Zeppelin plots showing normalized time on the y axis and the number of 
clusters on the x axis.  The spray/scala/akka framework will allow the client 
application to receive a callback when the runtime performance calculations 
have actually completed.  For now the runtime performance measurements will 
include: a) the ratio of the number of points clustered correctly to the total 
number of points, and b) the total time taken for the algorithm to run.  These 
items will be represented in separate Zeppelin plots.
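A rough sketch of how the async execution and completion callback could look with plain Scala Futures; runKMeans and countCorrectlyClustered are hypothetical stand-ins for the actual Mahout calls, not real Mahout APIs:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical result record for one clustering run.
case class ClusteringRunResult(runId: String,
                               numClusters: Int,
                               correctlyClusteredRatio: Double,
                               totalTimeMs: Long)

// Hypothetical stand-ins for the Mahout clustering call and the accuracy check.
def runKMeans(dataPath: String, numClusters: Int): Seq[Int] = ???
def countCorrectlyClustered(assignments: Seq[Int]): Int = ???

def executeRunAsync(runId: String, dataPath: String, numClusters: Int,
                    totalPoints: Int): Future[ClusteringRunResult] = Future {
  val start = System.currentTimeMillis()
  val assignments = runKMeans(dataPath, numClusters)
  val elapsed = System.currentTimeMillis() - start
  ClusteringRunResult(
    runId,
    numClusters,
    countCorrectlyClustered(assignments).toDouble / totalPoints,
    elapsed)
}

// The client receives a callback once the measurements are complete, e.g.:
// executeRunAsync("456", "/path/to/dataset", 12, totalPoints = 100000)
//   .onComplete(result => println(s"run finished: $result"))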




Regression
a) The runtime performance module will run the likelihood ratio test with a 
different set of features in every run.  We will introduce a REST API to run 
the likelihood ratio test and return the results; this will once again be an 
async call through the spray/akka stack.
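For reference, the likelihood ratio statistic itself is cheap to compute once the two models have been fit; a tiny sketch, where the log-likelihood values are assumed to come from the Mahout regression fits:

// Likelihood ratio test statistic: D = -2 * (logL_null - logL_alt),
// where logL_null is the log-likelihood of the model with fewer features
// and logL_alt is the log-likelihood of the model with the extra features.
def likelihoodRatioStatistic(logLikelihoodNull: Double,
                             logLikelihoodAlt: Double): Double =
  -2.0 * (logLikelihoodNull - logLikelihoodAlt)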






b) The runtime performance module will record the following metrics for every 
algorithm: 1) CPU usage, 2) memory usage, 3) time taken for the algorithm to 
converge and run to completion.  These metrics will be reported on top of the 
Zeppelin graphs for both the regression and the different clustering algorithms 
mentioned above.
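One way these three metrics could be sampled is through the standard JVM management APIs; this is only a sketch, and whether the module samples the JVM or the whole machine is an open question:

import java.lang.management.ManagementFactory

// Snapshot of the metrics attached to each run (field names are hypothetical).
case class RunMetrics(systemLoadAverage: Double,
                      usedHeapBytes: Long,
                      wallClockMs: Long)

// Runs the given work to completion and returns its result plus the sampled metrics.
def measure[A](work: => A): (A, RunMetrics) = {
  val start = System.currentTimeMillis()
  val result = work
  val elapsed = System.currentTimeMillis() - start
  val runtime = Runtime.getRuntime
  val metrics = RunMetrics(
    systemLoadAverage = ManagementFactory.getOperatingSystemMXBean.getSystemLoadAverage,
    usedHeapBytes = runtime.totalMemory() - runtime.freeMemory(),
    wallClockMs = elapsed)
  (result, metrics)
}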

How does the application get run
The runtime performance measuring application will get invoked from the command 
line; eventually it would be worthwhile to hook this into some sort of 
integration test suite to certify the different Mahout releases.
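For example, the entry point could be a plain main class; the class name and argument layout below are hypothetical and simply mirror the REST call described earlier:

// Hypothetical command-line entry point for the perf module.
object PerfModuleMain extends App {
  // e.g. java -cp mahout-perf.jar PerfModuleMain clustering /path/to/datasets 12,20,30,40
  require(args.length == 3, "usage: PerfModuleMain <algorithm> <fileLocation> <clusters>")
  val Array(algorithm, fileLocation, clusters) = args
  println(s"starting $algorithm perf run over $fileLocation with clusters=$clusters")
  // ... start the spray service / schedule the runs described above ...
}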


I will add more thoughts around this and create a JIRA ticket only once there's 
enough consensus among the committers that this is headed in the right 
direction.  I will also add some more thoughts on measuring runtime 
performance of some of the other algorithms after further research.
Would love feedback or additional things to consider that I might have missed.  
If it's more appropriate I can move the discussion to a JIRA ticket as well, so 
please let me know.  Thanks in advance.
