[ 
https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316019#comment-15316019
 ] 

Saikat Kanjilal commented on MAHOUT-1869:
-----------------------------------------

Moved all the code to a new branch, mahout-1869, and renamed all of the spray 
sample code. Added dependencies for spray/akka/org.json4s. The next step is to 
get the code to compile.

> Create a runtime performance measuring framework for mahout
> -----------------------------------------------------------
>
>                 Key: MAHOUT-1869
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1869
>             Project: Mahout
>          Issue Type: Story
>          Components: build, Classification, Collaborative Filtering, Math
>    Affects Versions: 1.0.0
>            Reporter: Saikat Kanjilal
>              Labels: build
>             Fix For: 1.0.0
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> This proposal outlines a runtime performance module used to measure the 
> performance of various algorithms in mahout in three major areas: 
> clustering, regression and classification. The module will be a 
> spray/scala/akka application which can be run against any current or new 
> algorithm in mahout and will produce a csv file and a set of zeppelin plots 
> outlining the various criteria for performance. The goal is that every new 
> build of mahout runs a set of tests for each of the algorithms, so that 
> benchmarks can be compared from one release to another.
> The github repo is here: https://github.com/skanjila/mahout 
> I will send a pull request when I have one algorithm operational.
> Architecture
> The run time performance application will run on top of spray/scala and akka 
> and will make async api calls into the various mahout algorithms to generate 
> a csv file containing the run time performance measurements for each 
> algorithm of interest, as well as a set of zeppelin plots for displaying 
> some of these results. The spray scala architecture will leverage the 
> zeppelin server to create the visualizations. The discussion below centers 
> around two types of algorithms to be addressed by the application.
> Clustering
> The application will consist of a set of rest APIs to do the following:
> a) A method to load and execute the run time perf module. It takes as inputs 
> the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of 
> files containing data sets of various sizes, and finally a set of values for 
> the number of clusters to use for each of the different sizes of the 
> datasets:
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> The above API call will return a runId which the client program can then use 
> to monitor the module.
> b) A method to monitor the application to ensure that it is making progress 
> towards generating the zeppelin plots:
> /monitor/runId=456
> The run itself will execute asynchronously by calling into the mahout 
> kmeans (fuzzy kmeans) clustering implementations and will generate zeppelin 
> plots showing the normalized time on the y axis and the number of clusters on 
> the x axis. The spray/scala akka framework will allow the client application 
> to receive a callback when the run time performance calculations are actually 
> completed. For now the calculations for measuring run time performance will 
> contain: a) the ratio of the number of points clustered correctly to the 
> total number of points b) the total time taken for the algorithm to run. 
> These items will be represented in separate zeppelin plots.
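
The submit/monitor flow described above can be sketched roughly as follows. This is only an illustration in Python, not the proposed spray/akka service: the RunRegistry class, its method names, and the stand-in algorithm are all hypothetical, and a real run would call into the mahout kmeans implementations instead.

```python
import itertools
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class RunRegistry:
    """Tracks asynchronous benchmark runs by run id (hypothetical helper)."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._runs: Dict[int, Future] = {}
        self._ids = itertools.count(1)

    def submit(self, algorithm: Callable[[int], float], clusters: List[int]) -> int:
        """Start a run over every cluster count and return its run id at once."""
        run_id = next(self._ids)
        self._runs[run_id] = self._pool.submit(self._benchmark, algorithm, clusters)
        return run_id

    @staticmethod
    def _benchmark(algorithm: Callable[[int], float], clusters: List[int]) -> List[dict]:
        rows = []
        for k in clusters:
            start = time.perf_counter()
            # Assumption: the algorithm returns the fraction of points
            # clustered correctly for k clusters.
            accuracy = algorithm(k)
            rows.append({
                "clusters": k,
                "accuracy": accuracy,
                "seconds": time.perf_counter() - start,
            })
        return rows

    def monitor(self, run_id: int) -> str:
        """Report progress for a run: running, completed, or failed."""
        future = self._runs[run_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() else "completed"

    def result(self, run_id: int, timeout: float = None) -> List[dict]:
        """Block until the run finishes and return one row per cluster count."""
        return self._runs[run_id].result(timeout=timeout)


if __name__ == "__main__":
    registry = RunRegistry()
    # Placeholder algorithm, not a real clustering implementation.
    run_id = registry.submit(lambda k: min(1.0, k / 40), [12, 20, 30, 40])
    print(run_id, registry.monitor(run_id))
    print(registry.result(run_id, timeout=10))
```

Each returned row carries the two quantities listed above (the accuracy ratio and the elapsed time), so they could be written to the csv file and plotted separately.
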
> Regression
> a) The runtime performance module will run the likelihood ratio test with a 
> different set of features in every run. We will introduce a rest API to run 
> the likelihood ratio test and return the results. This will once again be an 
> async call through the spray/akka stack.
> b) The run time performance module will contain the following metrics for 
> every algorithm: 1) cpu usage 2) memory usage 3) time taken for algorithm to 
> converge and run to completion. These metrics will be reported on top of the 
> zeppelin graphs for both the regression and the different clustering 
> algorithms mentioned above.
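
As a rough illustration of the three metrics listed above, the sketch below records CPU time, peak memory, and wall-clock time for a single run and writes them as a csv row. It is written in Python using only the standard library; the function names and column names are hypothetical, and the memory figure only covers Python allocations, so a JVM-based implementation for mahout would need its own measurement hooks.

```python
import csv
import io
import time
import tracemalloc
from typing import Any, Callable, Dict, Iterable, TextIO

FIELDS = ["algorithm", "cpu_seconds", "peak_memory_bytes", "wall_seconds"]


def measure(name: str, run: Callable[[], Any]) -> Dict[str, Any]:
    """Run `run` once and record CPU time, peak traced memory, and wall time."""
    tracemalloc.start()
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    try:
        run()
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {
        "algorithm": name,
        "cpu_seconds": cpu,
        "peak_memory_bytes": peak,
        "wall_seconds": wall,
    }


def write_csv(rows: Iterable[Dict[str, Any]], out: TextIO) -> None:
    """Write one csv line per measured run, with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder workload standing in for an algorithm run to completion.
    row = measure("stub", lambda: sum(i * i for i in range(100_000)))
    buffer = io.StringIO()
    write_csv([row], buffer)
    print(buffer.getvalue())
```
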
> How does the application get run? The run time performance measuring 
> application will get invoked from the command line. Eventually it would be 
> worthwhile to hook this into some sort of integration test suite to certify 
> the different mahout releases.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
