[ https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323856#comment-15323856 ]
Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Ok, so after a bit of toiling I have some code compiling, at least for the newly minted perf module. I will now add some instrumentation for a simple naive Bayes implementation, along with some timers to time the overall run.

> Create a runtime performance measuring framework for mahout
> -----------------------------------------------------------
>
>                 Key: MAHOUT-1869
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1869
>             Project: Mahout
>          Issue Type: Story
>          Components: build, Classification, Collaborative Filtering, Math
>    Affects Versions: 1.0.0
>            Reporter: Saikat Kanjilal
>              Labels: build
>             Fix For: 1.0.0
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> This proposal outlines a runtime performance module used to measure the performance of various Mahout algorithms in three major areas: clustering, regression, and classification. The module will be a spray/scala/akka application that can be run against any current or new algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots outlining the various performance criteria. The goal for any new Mahout release will be to run a set of tests for each of the algorithms to compare and contrast benchmarks from one release to another.
>
> The github repo is here: https://github.com/skanjila/mahout; I will send a pull request when I have one algorithm operational.
>
> Architecture
>
> The runtime performance application will run on top of spray/scala and akka and will make async API calls into the various Mahout algorithms to generate a CSV file containing data representing the runtime performance measurements for each algorithm of interest, as well as a set of Zeppelin plots for displaying some of these results. The spray/scala architecture will leverage the Zeppelin server to create the visualizations.
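The comment above mentions adding timers to time the overall run. A minimal sketch of such a timing helper is shown below; the names (`PerfTimer`, `Timed`, `time`) are hypothetical and not part of the Mahout codebase.

```java
import java.util.function.Supplier;

// Hypothetical timing helper for the perf module: wraps a block of work
// (e.g. a full naive Bayes training run) and records wall-clock time.
class PerfTimer {

    /** Holds the result of a timed run together with its elapsed wall-clock time. */
    static final class Timed<T> {
        final T result;
        final long elapsedMillis;
        Timed(T result, long elapsedMillis) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the supplied work once and measures how long it took. */
    static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T result = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timed<>(result, elapsedMillis);
    }
}
```

A per-run helper like this keeps the instrumentation out of the algorithm code itself, so the same wrapper can time any of the algorithms the module targets.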
> The discussion below centers around two types of algorithms to be addressed by the application.
>
> Clustering
>
> The application will consist of a set of REST APIs that do the following:
>
> a) A method to load and execute the runtime perf module. It takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing various sizes of data sets, and finally a set of values for the number of clusters to use for each of the different dataset sizes:
>
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
>
> This API call will return a runId which the client program can then use to monitor the module.
>
> b) A method to monitor the application to ensure that it's making progress towards generating the Zeppelin plots:
>
> /monitor/runId=456
>
> This method will execute asynchronously by calling into the Mahout kmeans (fuzzy kmeans) clustering implementations and will generate Zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The spray/scala/akka framework will allow the client application to receive a callback when the runtime performance calculations are actually completed. For now the runtime performance measurements will contain: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate Zeppelin plots.
>
> Regression
>
> a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/akka stack.
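The clustering run-request path described above bundles the algorithm name, the data-set location, and the cluster counts into one path. A minimal sketch of parsing such a path into a typed request is below; the class and method names are hypothetical (a real spray implementation would use routing directives instead of a regex), and the example path is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of parsing the proposed run-request path,
// e.g. /algorithm=clustering/fileLocation=/data/sets/clusters=12,20,30
class RunRequestParser {

    /** Typed representation of one clustering perf run request. */
    static final class ClusteringRunRequest {
        final String algorithm;
        final String fileLocation;
        final List<Integer> clusters;
        ClusteringRunRequest(String algorithm, String fileLocation, List<Integer> clusters) {
            this.algorithm = algorithm;
            this.fileLocation = fileLocation;
            this.clusters = clusters;
        }
    }

    // Assumes the three segments always appear in this fixed order.
    private static final Pattern PATH =
        Pattern.compile("/algorithm=([^/]+)/fileLocation=(.+)/clusters=([\\d,]+)");

    static Optional<ClusteringRunRequest> parse(String path) {
        Matcher m = PATH.matcher(path);
        if (!m.matches()) {
            return Optional.empty();
        }
        List<Integer> clusters = new ArrayList<>();
        for (String c : m.group(3).split(",")) {
            clusters.add(Integer.parseInt(c));
        }
        return Optional.of(new ClusteringRunRequest(m.group(1), m.group(2), clusters));
    }
}
```

Parsing into a typed request up front lets the service validate inputs (unknown algorithm, empty cluster list) before kicking off an async run and returning the runId.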
> b) The runtime performance module will record the following metrics for every algorithm: 1) CPU usage, 2) memory usage, and 3) time taken for the algorithm to converge and run to completion. These metrics will be reported on top of the Zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
>
> How does the application get run? The runtime performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook it into some sort of integration test suite to certify the different Mahout releases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
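The three per-run metrics listed above (CPU usage, memory usage, run time) can all be sampled from the standard JVM management beans. The sketch below is a hypothetical illustration of that, not code from the perf module; note that sampling heap usage after the run is a rough proxy for memory consumed, since GC can run at any time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch of collecting the per-run metrics described above
// via the JDK's built-in management beans.
class MetricsCollector {

    /** One snapshot of the three proposed metrics for a single run. */
    static final class RunMetrics {
        final long heapUsedBytes;   // heap in use right after the run (rough memory proxy)
        final long cpuTimeNanos;    // CPU time consumed by the calling thread during the run
        final long wallClockMillis; // time for the algorithm to run to completion
        RunMetrics(long heapUsedBytes, long cpuTimeNanos, long wallClockMillis) {
            this.heapUsedBytes = heapUsedBytes;
            this.cpuTimeNanos = cpuTimeNanos;
            this.wallClockMillis = wallClockMillis;
        }
    }

    /** Runs the work on the calling thread and samples the metrics around it. */
    static RunMetrics measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();

        work.run();

        long wallClockMillis = (System.nanoTime() - wallStart) / 1_000_000L;
        long cpuTimeNanos = cpuSupported
            ? threads.getCurrentThreadCpuTime() - cpuStart : 0L;
        long heapUsedBytes =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        return new RunMetrics(heapUsedBytes, cpuTimeNanos, wallClockMillis);
    }
}
```

A collector like this could emit one row per run into the proposed CSV file, which the Zeppelin plots would then read for the release-to-release comparisons.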