RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]

Saikat Kanjilal Tue, 21 Jun 2016 21:22:43 -0700

Ok, so for now I am able to get around the issues below by working on code
that measures performance times without requiring a DistributedContext to
get up and running. I have two methods that I am measuring performance
times for, ssvd and spca. The GitHub repo is here:
https://github.com/skanjila/mahout/tree/mahout-1869
Please provide feedback, as I will now restructure/reorganize the code to add
more methods and start work on a perf harness that spits out a report in CSV,
and then eventually tie this to Zeppelin.
I've kept JIRA up to date as well.
Thanks in advance.
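
For anyone following along, here is a minimal sketch of the kind of harness described above. It is not the code in the repo. It times each method a few times and writes one CSV row per method. Python is used only to keep the example short; the real module is Scala on top of Mahout. `fake_ssvd` and `fake_spca` are hypothetical placeholders for the actual ssvd and spca calls, and the column names and the number of runs are assumptions.

```python
import csv
import io
import statistics
import time

FIELDS = ["method", "runs", "min_s", "median_s", "max_s"]


def time_method(name, fn, runs=5):
    """Run fn() `runs` times and return one report row of wall-clock timings."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "method": name,
        "runs": runs,
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


def write_report(rows, out):
    """Write the timing rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical stand-ins: a real harness would call ssvd and spca here.
def fake_ssvd():
    sum(i * i for i in range(10_000))


def fake_spca():
    sorted(range(10_000, 0, -1))


def build_report():
    """Time both placeholder methods and return the CSV report as a string."""
    rows = [time_method("ssvd", fake_ssvd), time_method("spca", fake_spca)]
    buf = io.StringIO()
    write_report(rows, buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(build_report())
```

Reporting min/median/max rather than a single number is a design choice to make JVM warm-up and other noise visible when comparing releases.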


> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: [Discuss--A proposal for building an application in mahout to 
> measure runtime performance of algorithms in mahout]
> Date: Mon, 20 Jun 2016 20:37:31 -0700
> 
> AndrewP et al, any chance I can get some pointers on the items below? I
> would love some direction on this. Thanks
> 
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout to 
> > measure runtime performance of algorithms in mahout]
> > Date: Sun, 12 Jun 2016 12:40:26 -0700
> > 
> > Hi Folks, I need some input/help here to get me unblocked and moving:
> > 1) I need to reuse/extend the DistributedContext inside the runtime perf
> > measurement module, as all algorithms inside math-scala need this. I was
> > trying to mimic some of the H2O code and saw that they had their own
> > engine. I am wondering what the best way is to extend DistributedContext
> > and get the benefit of an already existing engine without needing to tie
> > into H2O or Flink, or is the only way to add an engine to point to one of
> > those back ends? Ideally I want to build the runtime perf module in a
> > backend-agnostic way, and currently I don't see a way around this.
> > Thoughts?
> > 2) I also tried to reuse some of the logic inside math-scala, but in
> > digging into this code it seems that it is strongly tied to the Scala
> > test utilities.
> > 
> > Net-Net: I just need access to the DistributedContext without linking in 
> > any test utilities or backends.
> > Would love some advice on ways to move forward to maximize reuse. Thanks in
> > advance.
> > 
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in mahout 
> > > to measure runtime performance of algorithms in mahout]
> > > Date: Thu, 9 Jun 2016 21:45:13 -0700
> > > 
> > > Andrew et al, over the past few days I've finally been able to get a
> > > self-contained module compiling that leverages the DistributedContext.
> > > For starters I copied the NaiveBayes test code, ripped out the test
> > > infrastructure code around it, and then added some timers. Next steps
> > > will be to dump to CSV and eventually to Zeppelin. Some questions before
> > > I get too far ahead:
> > > 1) I made the design decision to create my own trait and encapsulate the
> > > context within that. I am wondering if I should instead leverage the
> > > context that is already defined in math-scala. This, however, brings its
> > > own complications in that it brings in the MahoutSuite, which I'm not
> > > sure I really need. Thoughts on this?
> > > 2) I need some infrastructure to run the perf framework. I can use an
> > > Azure Ubuntu VM for now, but is there an AWS instance or some other VM I
> > > can eventually use? I would really like to avoid using my Mac laptop as a
> > > runtime perf testing environment.
> > > 
> > > Thanks, I'll update JIRA as I make more headway.
> > > 
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: RE: [Discuss--A proposal for building an application in mahout 
> > > > to measure runtime performance of algorithms in mahout]
> > > > Date: Mon, 6 Jun 2016 08:58:49 -0700
> > > > 
> > > > Andrew, thanks for the input. I will shift gears a bit and just get some
> > > > lightweight code going that calls into mahout algorithms and does a CSV
> > > > dump. Note that I think akka could be a good fit for this, as you
> > > > could make an async call and get back a notification when the CSV dump
> > > > is finished. Also, I am indeed not focusing on mapreduce algorithms and
> > > > will be tackling the algorithms in the math-scala library. What do you
> > > > think of making this a lightweight web-based workbench using spray that
> > > > committers can run outside of mahout through curl or something? This
> > > > was my initial vision in using spray, and it's good that I'm getting
> > > > early feedback.
> > > >
> > > > On zeppelin, do you think it's worthwhile that I incorporate Trevor's
> > > > efforts to take that CSV and turn it into one or two visualizations?
> > > > I'm trying to understand how that effort may (or may not) intersect with
> > > > what I'm trying to accomplish.
> > > > Also, point taken on the small data sets.
> > > > Thanks
> > > > 
> > > > > From: ap....@outlook.com
> > > > > To: dev@mahout.apache.org
> > > > > Subject: Re: [Discuss--A proposal for building an application in 
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > > > > 
> > > > > Saikat,
> > > > > 
> > > > > If you're going to pursue this, there are a few things that I would
> > > > > suggest.  First, keep it lightweight.  We don't want to bring in a
> > > > > lot of extra dependencies or data into the distribution.  I'm not
> > > > > sure what this means as far as spray/akka, but those seem like
> > > > > overkill in my opinion. This should be able to be kept down to a
> > > > > simple CSV dump, I think.
> > > > > 
> > > > > Second, use data that can be either randomly generated with a seeded
> > > > > RNG, produced by a function like Mackey-Glass, or downloaded (probably
> > > > > best), and only use a very small sample in the tests, since they're
> > > > > pretty long currently. The main point being that we don't want to
> > > > > ship any large test datasets with the distro.
> > > > > 
> > > > > Third, we're not using MapReduce anymore, so focus on algorithms in
> > > > > the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as
> > > > > matrix algebra operations.  That is where I see this being useful, so
> > > > > that we may compare changes and optimizations going forward.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Andy
> > > > > 
> > > > > ________________________________________
> > > > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > > > To: dev@mahout.apache.org
> > > > > Subject: RE: [Discuss--A proposal for building an application in 
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > > 
> > > > > Hi All, I created a JIRA ticket and have moved the discussion for the
> > > > > runtime performance framework there:
> > > > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > > > @AndrewP & Trevor, I would like to integrate zeppelin into the runtime
> > > > > performance measurement framework to output some measurement-related
> > > > > data for some of the algorithms.
> > > > > Should I wait till the zeppelin integration is completely working
> > > > > before I incorporate this piece?
> > > > > Also, I would really like some feedback, either on the JIRA ticket or
> > > > > in response to this thread. Regards
> > > > > 
> > > > > > From: sxk1...@hotmail.com
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: [Discuss--A proposal for building an application in mahout 
> > > > > > to measure runtime performance of algorithms in mahout]
> > > > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > This proposal will outline a runtime performance module used to
> > > > > > > measure the performance of various algorithms in mahout in the
> > > > > > > three major areas: clustering, regression and classification.  The
> > > > > > > module will be a spray/scala/akka application which can be run
> > > > > > > against any current or new algorithm in mahout and will produce a
> > > > > > > CSV file and a set of zeppelin plots outlining the various criteria
> > > > > > > for performance.  The goal when releasing any new build of mahout
> > > > > > > will be to run a set of tests for each of the algorithms to compare
> > > > > > > and contrast some benchmarks from one release to another.
> > > > > >
> > > > > >
> > > > > > Architecture
> > > > > > The run time performance application will run on top of spray/scala 
> > > > > > and akka and will make async api calls into the various mahout 
> > > > > > > algorithms to generate a CSV file containing data representing the
> > > > > > run time performance measurement calculations for each algorithm of 
> > > > > > interest as well as a set of zeppelin plots for displaying some of 
> > > > > > these results.  The spray scala architecture will leverage the 
> > > > > > zeppelin server to create the visualizations.  The discussion below 
> > > > > > centers around two types of algorithms to be addressed by the 
> > > > > > application.
> > > > > >
> > > > > >
> > > > > > Clustering
> > > > > > The application will consist of a set of rest APIs to do the 
> > > > > > following:
> > > > > >
> > > > > >
> > > > > > > a) A method to load and execute the run time perf module. It takes
> > > > > > > as inputs the name of the algorithm (kmeans, fuzzy kmeans), the
> > > > > > > location of a set of files containing data sets of various sizes,
> > > > > > > and finally a set of values for the number of clusters to use for
> > > > > > > each of the different sizes of the datasets:
> > > > > > >
> > > > > > >
> > > > > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > > > >
> > > > > >
> > > > > > The above API call will return a runId which the client program can 
> > > > > > then use to monitor the module
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > b) A method to monitor the application to ensure that it's making
> > > > > > > progress towards generating the zeppelin plots
> > > > > > /monitor/runId=456
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > The above method will execute asynchronously by calling into the
> > > > > > > mahout kmeans (fuzzy kmeans) clustering implementations and will
> > > > > > > generate zeppelin plots showing the normalized time on the y axis
> > > > > > > and the number of clusters on the x axis.  The spray/scala akka
> > > > > > > framework will allow the client application to receive a callback
> > > > > > > when the run time performance calculations are actually completed.
> > > > > > > For now the calculations for measuring run time performance will
> > > > > > > contain: a) the ratio of the number of points clustered correctly
> > > > > > > to the total number of points b) the total time taken for the
> > > > > > > algorithm to run.  These items will be represented in separate
> > > > > > > zeppelin plots.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regression
> > > > > > > a) The runtime performance module will run the likelihood ratio
> > > > > > > test with a different set of features in every run.  We will
> > > > > > > introduce a REST API to run the likelihood ratio test and return
> > > > > > > the results; this will once again be an async call through the
> > > > > > > spray/akka stack.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > b) The run time performance module will contain the following 
> > > > > > metrics for every algorithm: 1) cpu usage 2) memory usage 3) time 
> > > > > > taken for algorithm to converge and run to completion.  These 
> > > > > > metrics will be reported on top of the zeppelin graphs for both the 
> > > > > > regression and the different clustering algorithms mentioned above.
> > > > > >
> > > > > > > How does the application get run
> > > > > > > The run time performance measuring application will get invoked
> > > > > > > from the command line. Eventually it would be worthwhile to hook
> > > > > > > this into some sort of integration test suite to certify the
> > > > > > > different mahout releases.
> > > > > >
> > > > > >
> > > > > > I will add more thoughts around this and create a JIRA ticket only 
> > > > > > once there's enough consensus between the committers that this is 
> > > > > > headed in the right direction.  I will also add some more thoughts 
> > > > > > on measuring run time performance of some of the other algorithms 
> > > > > > after some more research.
> > > > > > > Would love feedback or additional things to consider that I might
> > > > > > > have missed.  If it's more appropriate I can move the discussion to
> > > > > > > a jira ticket as well, so please let me know. Thanks in advance.
> > > >                                           
> > >                                     
> >                                       
>
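
On the test-data point raised earlier in the thread (a seeded RNG or a function like Mackey-Glass), here is a rough sketch of both options, assuming small in-memory inputs are enough for the perf tests. The function names are hypothetical, Python is used only for brevity, and the Mackey-Glass parameters (beta=0.2, gamma=0.1, exponent 10, delay 17) are commonly used defaults rather than values taken from this thread. The series is advanced with a simple unit-step Euler update.

```python
import random


def seeded_matrix(rows, cols, seed=1234):
    """Return a rows x cols list of floats; the same seed always gives the same data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mackey_glass(n, beta=0.2, gamma=0.1, power=10, tau=17, x0=1.2):
    """Generate n points of the Mackey-Glass series with a unit-step Euler update.

    dx/dt = beta * x(t - tau) / (1 + x(t - tau) ** power) - gamma * x(t)
    """
    history = [x0] * (tau + 1)  # constant history stands in for t <= 0
    for _ in range(n):
        delayed = history[-(tau + 1)]  # x(t - tau)
        current = history[-1]          # x(t)
        step = beta * delayed / (1.0 + delayed ** power) - gamma * current
        history.append(current + step)
    return history[tau + 1:]


if __name__ == "__main__":
    print(seeded_matrix(2, 3, seed=42))
    print(mackey_glass(5))
```

Because both generators are deterministic, nothing needs to ship with the distro and every run of the perf tests sees identical input.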