This proposal outlines a runtime performance module for measuring the
performance of various algorithms in Mahout in three major areas:
clustering, regression and classification. The module will be a
spray/scala/akka application that can be run against any current or new
algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots
covering the various performance criteria. The goal is that, before any new
Mahout build is released, a set of tests is run for each of the algorithms so
that benchmarks can be compared from one release to the next.
Architecture
The runtime performance application will run on top of spray/scala and akka
and will make async API calls into the various Mahout algorithms to generate a
CSV file containing the runtime performance measurements for each algorithm of
interest, as well as a set of Zeppelin plots for displaying some of these
results. The spray/scala architecture will leverage the Zeppelin server to
create the visualizations. The discussion below centers on the first two types
of algorithms the application will address: clustering and regression.
Clustering
The application will consist of a set of REST APIs to do the following:
a) A method to load and execute the runtime perf module. It takes as inputs
the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of
files containing data sets of various sizes, and finally a set of values for
the number of clusters to use for each of the different data set sizes:
/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
The above API call will return a runId which the client program can then use to
monitor the module.
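
As a rough illustration of how the path above could be turned into a typed
request, here is a minimal Scala sketch. The names PerfRunRequest and parse
are hypothetical (they are not part of Mahout or spray), and the sketch assumes
the path has exactly the three segments shown, in that order.

```scala
// Hypothetical request model; field names mirror the path segments above.
final case class PerfRunRequest(
  algorithm: String,
  fileLocation: String,
  clusters: List[Int]
)

object PerfRunRequest {
  // fileLocation may itself contain '/', so it is matched greedily up to the
  // last "/clusters=" segment. Cluster counts are comma-separated digits.
  private val Pattern =
    """^/algorithm=([^/]+)/fileLocation=(.+)/clusters=(\d+(?:,\d+)*)$""".r

  // Returns None for paths that do not have the assumed shape.
  // Note: values too large for an Int are not guarded against in this sketch.
  def parse(path: String): Option[PerfRunRequest] = path match {
    case Pattern(algo, location, ks) =>
      Some(PerfRunRequest(algo, location, ks.split(',').toList.map(_.toInt)))
    case _ => None
  }
}
```

In a real spray route the segments would more likely be query parameters, but
the idea of validating the inputs before a run starts would be the same.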
b) A method to monitor the application to ensure that it is making progress
towards generating the Zeppelin plots:
/monitor/runId=456
The run started in a) will execute asynchronously by calling into the Mahout
kmeans (fuzzy kmeans) clustering implementations and will generate Zeppelin
plots showing the normalized time on the y axis and the number of clusters on
the x axis. The spray/scala/akka framework will allow the client application
to receive a callback when the runtime performance calculations are actually
completed. For now the calculations for measuring runtime performance will
contain: 1) the ratio of the number of points clustered correctly to the total
number of points 2) the total time taken for the algorithm to run. These
items will be represented in separate Zeppelin plots.
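
To make the runId / monitor / callback flow a bit more concrete, here is a
minimal sketch using only the Scala standard library (no spray or akka). The
RunRegistry object, its Status types and the idea that a finished run yields a
CSV path are all assumptions for illustration, not an existing API.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Hypothetical in-memory registry of performance runs.
object RunRegistry {
  sealed trait Status
  case object Running extends Status
  final case class Completed(csvPath: String) extends Status
  final case class Failed(reason: String) extends Status

  private val nextId = new AtomicLong(0L)
  private val runs = TrieMap.empty[Long, Status]

  // Starts `job` in the background and returns the runId immediately,
  // together with a future the caller can attach a completion callback to.
  def submit(job: () => String)(implicit ec: ExecutionContext): (Long, Future[String]) = {
    val runId = nextId.incrementAndGet()
    runs.put(runId, Running)
    // andThen records the final status before the returned future completes.
    val tracked = Future(job()).andThen {
      case Success(path) => runs.put(runId, Completed(path))
      case Failure(e)    => runs.put(runId, Failed(e.toString))
    }
    (runId, tracked)
  }

  // What a /monitor/runId=... call could look up.
  def status(runId: Long): Option[Status] = runs.get(runId)
}
```

A client would call submit once, keep the runId for monitoring, and register
onComplete on the returned future to be notified when the run finishes.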
Regression
a) The runtime performance module will run the likelihood ratio test with a
different set of features in every run. We will introduce a REST API to run
the likelihood ratio test and return the results. This will once again be an
async call through the spray/akka stack.
b) The runtime performance module will report the following metrics for every
algorithm: 1) CPU usage 2) memory usage 3) time taken for the algorithm to
converge and run to completion. These metrics will be reported on top of the
Zeppelin graphs for both the regression and the different clustering algorithms
mentioned above.
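
For the three metrics in b), one possible way to collect them on the JVM is
sketched below. PerfProbe, Sample and toCsvRow are hypothetical names, and the
numbers are only approximations: CPU time is measured for the calling thread
only, and the heap delta is noisy because garbage collection can run during
the measurement.

```scala
import java.lang.management.ManagementFactory

// Hypothetical probe that wraps one algorithm run.
object PerfProbe {
  // cpuNanos is -1 when the JVM does not support or enable thread CPU timing.
  final case class Sample(wallNanos: Long, cpuNanos: Long, heapDeltaBytes: Long)

  def measure[A](body: => A): (A, Sample) = {
    val threads = ManagementFactory.getThreadMXBean
    val runtime = Runtime.getRuntime
    def usedHeap: Long = runtime.totalMemory() - runtime.freeMemory()
    def cpuNow: Long =
      if (threads.isCurrentThreadCpuTimeSupported) threads.getCurrentThreadCpuTime
      else -1L

    val heapBefore = usedHeap
    val cpuBefore = cpuNow
    val wallBefore = System.nanoTime()
    val result = body // run the algorithm under test
    val wall = System.nanoTime() - wallBefore
    val cpuAfter = cpuNow
    val cpu = if (cpuBefore >= 0 && cpuAfter >= 0) cpuAfter - cpuBefore else -1L
    (result, Sample(wall, cpu, usedHeap - heapBefore))
  }

  // One CSV line per run: algorithm, wall time, CPU time, heap delta.
  def toCsvRow(algorithm: String, s: Sample): String =
    s"$algorithm,${s.wallNanos},${s.cpuNanos},${s.heapDeltaBytes}"
}
```

An algorithm that spreads work across several threads or a cluster would need
a different CPU and memory source; this sketch only covers the single-JVM,
single-thread case.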
How does the application get run
The runtime performance measuring application will be invoked from the command
line. Eventually it would be worthwhile to hook it into some sort of
integration test suite to certify the different Mahout releases.
I will add more thoughts around this and create a JIRA ticket only once there's
enough consensus among the committers that this is headed in the right
direction. I will also add some more thoughts on measuring the runtime
performance of some of the other algorithms after some more research.
Would love feedback or additional things to consider that I might have missed.
If it's more appropriate I can move the discussion to a JIRA ticket as well, so
please let me know. Thanks in advance.