Hi All,

I've created a JIRA ticket and moved the discussion for the runtime performance framework there: https://issues.apache.org/jira/browse/MAHOUT-1869

@AndrewP & Trevor, I would like to integrate Zeppelin into the runtime performance measurement framework to output some measurement-related data for some of the algorithms. Should I wait until the Zeppelin integration is completely working before I incorporate this piece? I would also really appreciate some feedback, either on the JIRA ticket or in response to this thread.

Regards
> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> Date: Thu, 19 May 2016 21:31:05 -0700
>
> This proposal outlines a runtime performance module for measuring the performance of various algorithms in mahout in three major areas: clustering, regression, and classification. The module will be a spray/scala/akka application which can be run against any current or new algorithm in mahout, and will produce a csv file and a set of zeppelin plots outlining the various criteria for performance. The goal for any new mahout release will be to run a set of tests for each of the algorithms, to compare and contrast benchmarks from one release to another.
>
> Architecture
>
> The run time performance application will run on top of spray/scala and akka and will make async API calls into the various mahout algorithms to generate a csv file containing the run time performance measurements for each algorithm of interest, as well as a set of zeppelin plots for displaying some of these results. The spray/scala architecture will leverage the zeppelin server to create the visualizations. The discussion below centers on two types of algorithms to be addressed by the application.
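To make the proposed csv output concrete, here is a minimal sketch of what one benchmark row could look like. The class and column names are hypothetical illustrations, not something already in mahout:

```java
import java.util.StringJoiner;

// Hypothetical sketch of one csv row the perf module could emit per run.
// Column names are assumptions for illustration only.
class BenchmarkRow {
    final String algorithm;   // e.g. "kmeans"
    final String dataset;     // input data set file name
    final int numClusters;    // k used for this run
    final double runtimeSec;  // wall-clock time for the run
    final double accuracy;    // correctly clustered points / total points

    BenchmarkRow(String algorithm, String dataset, int numClusters,
                 double runtimeSec, double accuracy) {
        this.algorithm = algorithm;
        this.dataset = dataset;
        this.numClusters = numClusters;
        this.runtimeSec = runtimeSec;
        this.accuracy = accuracy;
    }

    static String header() {
        return "algorithm,dataset,numClusters,runtimeSec,accuracy";
    }

    // Serialize this row in the same column order as header().
    String toCsv() {
        StringJoiner j = new StringJoiner(",");
        j.add(algorithm).add(dataset)
         .add(Integer.toString(numClusters))
         .add(Double.toString(runtimeSec))
         .add(Double.toString(accuracy));
        return j.toString();
    }
}
```

One row per (algorithm, dataset, k) combination would make the release-to-release comparison a simple diff or join on the first three columns.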
> Clustering
>
> The application will consist of a set of REST APIs to do the following:
>
> a) A method to load and execute the run time perf module, which takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing various sizes of data sets, and finally a set of values for the number of clusters to use for each of the different dataset sizes:
>
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
>
> The above API call will return a runId which the client program can then use to monitor the module.
>
> b) A method to monitor the application to ensure that it's making progress towards generating the zeppelin plots:
>
> /monitor/runId=456
>
> The above method will execute asynchronously by calling into the mahout kmeans (fuzzy kmeans) clustering implementations and will generate zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The spray/scala/akka framework will allow the client application to receive a callback when the run time performance calculations are completed. For now the run time performance measurements will contain: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate zeppelin plots.
>
> Regression
>
> a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/akka stack.
>
> b) The run time performance module will record the following metrics for every algorithm: 1) cpu usage, 2) memory usage, 3) time taken for the algorithm to converge and run to completion.
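To pin down the request shapes, here is a small sketch that builds the two paths from the examples above. The helper class is hypothetical; only the path layouts come from the proposal:

```java
// Hypothetical helpers that build the request paths shown in the proposal.
// The path shapes mirror the two examples above; the class itself is an
// illustration, not an existing API.
class PerfPaths {
    // e.g. /algorithm=clustering/fileLocation=/data/sets/clusters=12,20,30,40
    static String submitPath(String algorithm, String fileLocation, int[] clusters) {
        StringBuilder sb = new StringBuilder();
        sb.append("/algorithm=").append(algorithm)
          .append("/fileLocation=").append(fileLocation)
          .append("/clusters=");
        for (int i = 0; i < clusters.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(clusters[i]);
        }
        return sb.toString();
    }

    // e.g. /monitor/runId=456, using the runId returned by the submit call
    static String monitorPath(int runId) {
        return "/monitor/runId=" + runId;
    }
}
```

Keeping the path construction in one place would make it easy to change the URL scheme later (for instance, to query parameters) without touching the client code.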
> These metrics will be reported on top of the zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
>
> How does the application get run
>
> The run time performance measuring application will be invoked from the command line; eventually it would be worthwhile to hook this into some sort of integration test suite to certify the different mahout releases.
>
> I will add more thoughts around this and create a JIRA ticket only once there's enough consensus among the committers that this is headed in the right direction. I will also add some more thoughts on measuring the run time performance of some of the other algorithms after further research. I would love feedback or additional things to consider that I might have missed. If it's more appropriate I can move the discussion to a jira ticket as well, so please let me know.
>
> Thanks in advance.
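The two clustering measurements named in the proposal (the accuracy ratio and the time plotted per cluster count) can be sketched as below. Note that "normalized time" is not defined in the proposal; dividing each run's time by the slowest run is my assumption, chosen so the y axis stays in [0, 1]:

```java
// Hypothetical helpers for the two clustering measurements in the proposal.
class ClusteringPerf {
    // Ratio of points clustered correctly to the total number of points.
    static double accuracy(int correctlyClustered, int totalPoints) {
        if (totalPoints == 0) throw new IllegalArgumentException("no points");
        return (double) correctlyClustered / totalPoints;
    }

    // ASSUMED definition of "normalized time": each run's elapsed time
    // divided by the slowest run, so plotted values fall in [0, 1].
    // Inputs are assumed positive (at least one run took nonzero time).
    static double[] normalizeTimes(double[] elapsedSec) {
        double max = 0.0;
        for (double t : elapsedSec) max = Math.max(max, t);
        double[] out = new double[elapsedSec.length];
        for (int i = 0; i < elapsedSec.length; i++) {
            out[i] = elapsedSec[i] / max;
        }
        return out;
    }
}
```

If a different normalization is intended (say, time per point, to compare dataset sizes), that would be worth stating explicitly on the JIRA ticket so the plots are comparable across releases.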