Excellent, thanks Saikat; I'll be able to take a look over the weekend.

On Wed, Jul 6, 2016 at 9:37 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> Ok folks, I've created a pull request here for a barebones runtime
> performance measurement framework that:
>
> 1) measures two simple timings from ssvd and spca
> 2) dumps these timings into a csv file
>
> https://github.com/apache/mahout/pull/245
> Mahout 1869 by skanjila · Pull Request #245 · apache/mahout
>
> I'd greatly appreciate some early feedback on the design before I move
> forward and make too many more changes without getting anything included.
> I will move ahead with the zeppelin integration in a few days and
> reorganize the code a bit to collect all the perf-related pieces into one
> class or trait.
>
> Thanks in advance for your help.
>
> ________________________________
> From: Saikat Kanjilal <sxk1...@hotmail.com>
> Sent: Tuesday, June 21, 2016 9:21 PM
> To: dev@mahout.apache.org
> Subject: RE: [Discuss--A proposal for building an application in mahout to
> measure runtime performance of algorithms in mahout]
>
> Ok, so for now I am able to get around the issues below by working on code
> to measure performance times that does not require the notion of a
> DistributedContext to get this up and running. I have two methods that I
> am measuring performance times for: ssvd and spca.
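[Editor's note: for concreteness, here is a minimal sketch of the kind of timing-plus-csv harness described above. This is not the code in PR #245; the object name, csv columns, and the idea of treating each algorithm run as an opaque block are assumptions for illustration.]

```scala
import java.io.PrintWriter

// Minimal timing harness sketch: run a block of work, record elapsed
// wall-clock milliseconds, and append one labeled row per run to a csv.
object PerfTimer {
  def time[T](block: => T): (T, Long) = {
    val start = System.currentTimeMillis()
    val result = block
    (result, System.currentTimeMillis() - start)
  }

  def dumpCsv(path: String, rows: Seq[(String, Long)]): Unit = {
    val out = new PrintWriter(path)
    try {
      out.println("algorithm,elapsedMillis")
      rows.foreach { case (label, ms) => out.println(s"$label,$ms") }
    } finally out.close()
  }
}
```

A call site would wrap the ssvd and spca invocations in `PerfTimer.time { ... }` and pass the two labeled results to `dumpCsv`.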
> Github repo is here:
> https://github.com/skanjila/mahout/tree/mahout-1869
>
> Please provide feedback, as I will now restructure/reorganize the code to
> add more methods and start work on a perf harness that spits out a report
> in csv and then eventually ties this to zeppelin. I've kept the JIRA up to
> date as well. Thanks in advance.
>
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout
> > to measure runtime performance of algorithms in mahout]
> > Date: Mon, 20 Jun 2016 20:37:31 -0700
> >
> > AndrewP et al, any chance I can get some pointers on the items below? I
> > would love some direction on this. Thanks
> >
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in
> > > mahout to measure runtime performance of algorithms in mahout]
> > > Date: Sun, 12 Jun 2016 12:40:26 -0700
> > >
> > > Hi Folks, I need some input/help here to get me unblocked and moving:
> > > 1) I need to reuse/extend the DistributedContext inside the runtime
> > > perf measurement module, as all algorithms inside math-scala need it.
> > > I was trying to mimic some of the H2O code and saw that they had
> > > their own engine. I am wondering what the best way is to extend
> > > DistributedContext and get the benefit of an already existing engine
> > > without needing to tie into h2o or flink, or is the only way to add
> > > an engine to point to one of those back ends? Ideally I want to build
> > > the runtime perf module in a backend-agnostic way, and currently I
> > > don't see a way around this. Thoughts?
> > > 2) I also tried to reuse some of the logic inside math-scala, but in
> > > digging into this code it seems that it is strongly tied to
> > > scala test utilities.
> > >
> > > Net-net: I just need access to the DistributedContext without linking
> > > in any test utilities or backends. Would love some advice on ways to
> > > move forward that maximize reuse. Thanks in advance.
> > >
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: RE: [Discuss--A proposal for building an application in
> > > > mahout to measure runtime performance of algorithms in mahout]
> > > > Date: Thu, 9 Jun 2016 21:45:13 -0700
> > > >
> > > > Andrew et al, over the past few days I've finally gotten a
> > > > self-contained module compiling that leverages the
> > > > DistributedContext. For starters I copied the NaiveBayes test code,
> > > > ripped out the test infrastructure code around it, and then added
> > > > some timers. Next steps will be to dump to csv and eventually to
> > > > zeppelin. Some questions before I get too far ahead:
> > > > 1) I made the design decision to create my own trait and
> > > > encapsulate the context within it. I am wondering if I should
> > > > instead leverage the context that is already defined in math-scala;
> > > > this, however, brings its own complications in that it pulls in the
> > > > MahoutSuite, which I'm not sure I really need. Thoughts on this?
> > > > 2) I need some infrastructure to run the perf framework. I can use
> > > > an azure ubuntu vm for now, but is there an AWS instance or some
> > > > other vm I can eventually use? I would really like to avoid using
> > > > my mac laptop as a runtime perf testing environment.
> > > >
> > > > Thanks, I'll update JIRA as I make more headway.
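[Editor's note: the trait-based design from point 1 above might look something like this sketch. `TimedBenchmark` and its members are hypothetical names; the context type is left abstract precisely because the backend question in the thread is unresolved.]

```scala
// Hypothetical sketch of the trait design: the benchmark trait owns a
// context handle (abstract, so any backend can be plugged in later) and
// accumulates labeled timings for the eventual csv report.
trait TimedBenchmark {
  // Stand-in for however the DistributedContext ends up being supplied;
  // kept abstract here to stay backend-agnostic.
  type Context
  def context: Context

  private var timings = Vector.empty[(String, Long)]

  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    timings = timings :+ (label -> ((System.nanoTime() - start) / 1000000L))
    result
  }

  def report: Seq[(String, Long)] = timings
}
```

A concrete benchmark would mix this in, bind `Context` to the real context type, and call `timed("dssvd") { ... }` around each algorithm invocation.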
> > > > > From: sxk1...@hotmail.com
> > > > > To: dev@mahout.apache.org
> > > > > Subject: RE: [Discuss--A proposal for building an application in
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > > Date: Mon, 6 Jun 2016 08:58:49 -0700
> > > > >
> > > > > Andrew, thanks for the input. I will shift gears a bit and just
> > > > > get some lightweight code going that calls into mahout algorithms
> > > > > and does a csv dump. Note that I think akka could be a good fit
> > > > > for this, as you could make an async call and get back a
> > > > > notification when the csv dump is finished. Also, I am indeed not
> > > > > focusing on mapreduce algorithms and will be tackling the
> > > > > algorithms in the math-scala library. What do you think of making
> > > > > this a lightweight web-based workbench using spray that
> > > > > committers can run outside of mahout through curl or something?
> > > > > This was my initial vision in using spray, and it's good that I'm
> > > > > getting early feedback.
> > > > >
> > > > > On zeppelin, do you think it's worthwhile that I incorporate
> > > > > Trevor's efforts to take that csv and turn it into one or two
> > > > > visualizations? I'm trying to understand how that effort may (or
> > > > > may not) intersect with what I'm trying to accomplish.
> > > > > Also, point taken on the small data sets.
> > > > > Thanks
> > > > >
> > > > > > From: ap....@outlook.com
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: Re: [Discuss--A proposal for building an application
> > > > > > in mahout to measure runtime performance of algorithms in
> > > > > > mahout]
> > > > > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > > > > >
> > > > > > Saikat,
> > > > > >
> > > > > > If you're going to pursue this, there are a few things that I
> > > > > > would suggest. First, keep it lightweight. We don't want to
> > > > > > bring a lot of extra dependencies or data into the
> > > > > > distribution. I'm not sure what this means as far as
> > > > > > spray/akka, but those seem like overkill in my opinion. This
> > > > > > should be able to be kept down to a simple csv dump, I think.
> > > > > > Second, use data that can be either randomly generated with a
> > > > > > seeded RNG, generated by a function like Mackey-Glass, or
> > > > > > downloaded (probably best), and only use a very small sample in
> > > > > > the tests, since they're pretty long currently. The main point
> > > > > > being that we don't want to ship any large test datasets with
> > > > > > the distro.
> > > > > >
> > > > > > Third, we're not using MapReduce anymore, so focus on
> > > > > > algorithms in the math-scala library (e.g. dssvd, thinqr, dals,
> > > > > > etc.) as well as matrix algebra operations. That is where I see
> > > > > > this being useful, so that we may compare changes and
> > > > > > optimizations going forward.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Andy
> > > > > >
> > > > > > ________________________________________
> > > > > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > > > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: RE: [Discuss--A proposal for building an application
> > > > > > in mahout to measure runtime performance of algorithms in
> > > > > > mahout]
> > > > > >
> > > > > > Hi All, I've created a JIRA ticket and have moved the
> > > > > > discussion for the runtime performance framework there:
> > > > > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > > > > @AndrewP & Trevor, I would like to integrate zeppelin into the
> > > > > > runtime performance measurement framework to output some
> > > > > > measurement-related data for some of the algorithms. Should I
> > > > > > wait till the zeppelin integration is completely working before
> > > > > > I incorporate this piece?
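[Editor's note: the seeded-RNG suggestion above can be sketched as a tiny generator. The dimensions and the plain in-memory `Array` representation are placeholders for whatever DRM construction the perf module ends up using; only the reproducibility idea is from the thread.]

```scala
import scala.util.Random

// Reproducible test data per the seeded-RNG suggestion: the same seed
// always yields the same small matrix, so no large test datasets need
// to ship with the distro.
object TestData {
  def randomMatrix(rows: Int, cols: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(rows, cols)(rng.nextGaussian())
  }
}
```

Because the generator is deterministic, two releases benchmarked with the same seed see identical inputs, which is what makes release-to-release comparisons meaningful.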
> > > > > > Also, I would really appreciate some feedback, either on the
> > > > > > JIRA ticket or in response to this thread. Regards
> > > > > >
> > > > > > > From: sxk1...@hotmail.com
> > > > > > > To: dev@mahout.apache.org
> > > > > > > Subject: [Discuss--A proposal for building an application in
> > > > > > > mahout to measure runtime performance of algorithms in
> > > > > > > mahout]
> > > > > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > > > > >
> > > > > > > This proposal outlines a runtime performance module used to
> > > > > > > measure the performance of various algorithms in mahout in
> > > > > > > three major areas: clustering, regression and classification.
> > > > > > > The module will be a spray/scala/akka application which can
> > > > > > > be run against any current or new algorithm in mahout and
> > > > > > > will produce a csv file and a set of zeppelin plots outlining
> > > > > > > the various criteria for performance. The goal for any new
> > > > > > > mahout release will be to run a set of tests for each of the
> > > > > > > algorithms to compare and contrast benchmarks from one
> > > > > > > release to another.
> > > > > > >
> > > > > > > Architecture
> > > > > > > The run time performance application will run on top of
> > > > > > > spray/scala and akka and will make async api calls into the
> > > > > > > various mahout algorithms to generate a csv file containing
> > > > > > > the run time performance measurements for each algorithm of
> > > > > > > interest, as well as a set of zeppelin plots for displaying
> > > > > > > some of these results. The spray/scala architecture will
> > > > > > > leverage the zeppelin server to create the visualizations.
> > > > > > > The discussion below centers around two types of algorithms
> > > > > > > to be addressed by the application.
> > > > > > > Clustering
> > > > > > > The application will consist of a set of rest APIs to do the
> > > > > > > following:
> > > > > > >
> > > > > > > a) A method to load and execute the run time perf module; it
> > > > > > > takes as inputs the name of the algorithm (kmeans, fuzzy
> > > > > > > kmeans), the location of a set of files containing various
> > > > > > > sizes of data sets, and finally a set of values for the
> > > > > > > number of clusters to use for each of the different dataset
> > > > > > > sizes:
> > > > > > >
> > > > > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > > > > >
> > > > > > > The above API call will return a runId which the client
> > > > > > > program can then use to monitor the module.
> > > > > > >
> > > > > > > b) A method to monitor the application to ensure that it's
> > > > > > > making progress towards generating the zeppelin plots:
> > > > > > >
> > > > > > > /monitor/runId=456
> > > > > > >
> > > > > > > The above method will execute asynchronously by calling into
> > > > > > > the mahout kmeans (fuzzy kmeans) clustering implementations
> > > > > > > and will generate zeppelin plots showing the normalized time
> > > > > > > on the y axis and the number of clusters on the x axis. The
> > > > > > > spray/scala/akka framework will allow the client application
> > > > > > > to receive a callback when the run time performance
> > > > > > > calculations are actually completed. For now the run time
> > > > > > > performance measurements will contain: a) the ratio of the
> > > > > > > number of points clustered correctly to the total number of
> > > > > > > points, and b) the total time taken for the algorithm to run.
> > > > > > > These items will be represented in separate zeppelin plots.
> > > > > > >
> > > > > > > Regression
> > > > > > > a) The runtime performance module will run the likelihood
> > > > > > > ratio test with a different set of features in every run.
> > > > > > > We will introduce a rest API to run the likelihood ratio
> > > > > > > test and return the results; this will once again be an
> > > > > > > async call through the spray/akka stack.
> > > > > > >
> > > > > > > b) The run time performance module will report the following
> > > > > > > metrics for every algorithm: 1) cpu usage 2) memory usage
> > > > > > > 3) time taken for the algorithm to converge and run to
> > > > > > > completion. These metrics will be reported on top of the
> > > > > > > zeppelin graphs for both the regression and the different
> > > > > > > clustering algorithms mentioned above.
> > > > > > >
> > > > > > > How does the application get run
> > > > > > > The run time performance measuring application will be
> > > > > > > invoked from the command line; eventually it would be
> > > > > > > worthwhile to hook it into some sort of integration test
> > > > > > > suite to certify the different mahout releases.
> > > > > > >
> > > > > > > I will add more thoughts around this and create a JIRA
> > > > > > > ticket only once there's enough consensus among the
> > > > > > > committers that this is headed in the right direction. I
> > > > > > > will also add some more thoughts on measuring run time
> > > > > > > performance of some of the other algorithms after more
> > > > > > > research. Would love feedback or additional things to
> > > > > > > consider that I might have missed. If it's more appropriate
> > > > > > > I can move the discussion to a jira ticket as well, so
> > > > > > > please let me know. Thanks in advance.
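[Editor's note: the three per-run metrics named in the proposal could be collected, at least as JVM-level approximations, with a sketch like the following. These are single-process measurements of the measuring thread, not cluster-wide figures; all names are illustrative.]

```scala
import java.lang.management.ManagementFactory

// Per-run metrics sketch: wall-clock time, JVM heap-usage delta, and
// CPU time of the current thread. CPU time needs a JVM where
// ThreadMXBean thread CPU timing is supported (most HotSpot JVMs).
case class RunMetrics(wallMillis: Long, memoryDeltaBytes: Long, cpuMillis: Long)

object MetricsCollector {
  def measure[T](block: => T): (T, RunMetrics) = {
    val rt = Runtime.getRuntime
    val threads = ManagementFactory.getThreadMXBean
    val memBefore = rt.totalMemory() - rt.freeMemory()
    val cpuBefore = threads.getCurrentThreadCpuTime
    val start = System.currentTimeMillis()
    val result = block
    val metrics = RunMetrics(
      wallMillis = System.currentTimeMillis() - start,
      memoryDeltaBytes = (rt.totalMemory() - rt.freeMemory()) - memBefore,
      cpuMillis = (threads.getCurrentThreadCpuTime - cpuBefore) / 1000000L)
    (result, metrics)
  }
}
```

Each `RunMetrics` row could then feed the same csv dump described earlier, one row per algorithm per run.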