OK, for now I have been able to get around the issues below by working on timing code that does not require a DistributedContext to get up and running. I am currently measuring performance times for two methods, ssvd and spca.

GitHub repo is here: https://github.com/skanjila/mahout/tree/mahout-1869

Please provide feedback, as I will now restructure/reorganize the code to add more methods and start work on a perf harness that writes a report in CSV, and then eventually tie this to Zeppelin. I've kept JIRA up to date as well.

Thanks in advance.
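To make the CSV perf harness idea concrete, here is a minimal sketch of what such a report writer could look like. It is written in Python purely for brevity (the real module would live in Scala next to math-scala), and every name in it (`time_call`, `build_rows`, `write_report`, the column names and the placeholder workloads) is an assumption made for illustration, not code from the repo above.

```python
import csv
import io
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def build_rows(workloads, repeats=5):
    """Time each named workload and summarise it as one report row."""
    rows = []
    for name, fn in workloads.items():
        d = time_call(fn, repeats)
        rows.append({
            "method": name,
            "runs": repeats,
            "min_s": min(d),
            "median_s": statistics.median(d),
            "max_s": max(d),
        })
    return rows


def write_report(rows, out):
    """Write the summary rows as CSV to any file-like object."""
    fields = ["method", "runs", "min_s", "median_s", "max_s"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical stand-in workloads; a real harness would call ssvd and spca here.
workloads = {
    "ssvd_placeholder": lambda: sum(i * i for i in range(20000)),
    "spca_placeholder": lambda: sorted(range(20000), reverse=True),
}

rows = build_rows(workloads, repeats=3)
report = io.StringIO()
write_report(rows, report)
print(report.getvalue())
```

Reporting min/median/max over a few repeats, rather than a single timing, is one way to keep warm-up effects from dominating the comparison between releases; the same row layout would be straightforward to load into a Zeppelin notebook later.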
> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: [Discuss--A proposal for building an application in mahout to
> measure runtime performance of algorithms in mahout]
> Date: Mon, 20 Jun 2016 20:37:31 -0700
>
> AndrewP et al,
> Any chance I can get some pointers on the items below? I would love some
> direction on this.
> Thanks
>
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout to
> > measure runtime performance of algorithms in mahout]
> > Date: Sun, 12 Jun 2016 12:40:26 -0700
> >
> > Hi Folks,
> > I need some input/help here to get me unblocked and moving:
> > 1) I need to reuse/extend the DistributedContext inside the runtime perf
> > measurement module, as all algorithms inside math-scala need it. I was
> > trying to mimic some of the H2O code and saw that they had their own
> > engine. What is the best way to extend DistributedContext and get the
> > benefit of an already existing engine without needing to tie into H2O or
> > Flink? Or is the only way to add an engine to point to one of those back
> > ends? Ideally I want to build the runtime perf module in a
> > backend-agnostic way, and currently I don't see a way around this.
> > Thoughts?
> > 2) I also tried to reuse some of the logic inside math-scala, but in
> > digging into this code it seems that it is strongly tied to the Scala
> > test utilities.
> >
> > Net-net: I just need access to the DistributedContext without linking in
> > any test utilities or backends.
> > Would love some advice on ways to move forward to maximize reuse.
> > Thanks in advance.
> >
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in mahout
> > > to measure runtime performance of algorithms in mahout]
> > > Date: Thu, 9 Jun 2016 21:45:13 -0700
> > >
> > > Andrew et al,
> > > Over the past few days I have finally got a self-contained module
> > > compiling that leverages the DistributedContext. For starters I copied
> > > the NaiveBayes test code, ripped out the test infrastructure code around
> > > it, and then added some timers. Next steps will be to dump to CSV and
> > > eventually to Zeppelin. Some questions before I get too far ahead:
> > > 1) I made the design decision to create my own trait and encapsulate the
> > > context within that. I am wondering if I should instead leverage the
> > > context that is already defined in math-scala. This, however, brings its
> > > own complications in that it brings in the MahoutSuite, which I'm not
> > > sure I really need. Thoughts on this?
> > > 2) I need some infrastructure to run the perf framework. I can use an
> > > Azure Ubuntu VM for now, but is there an AWS instance or some other VM I
> > > can eventually use? I would really like to avoid using my Mac laptop as
> > > a runtime perf testing environment.
> > >
> > > Thanks, I'll update JIRA as I make more headway.
> > >
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: RE: [Discuss--A proposal for building an application in mahout
> > > > to measure runtime performance of algorithms in mahout]
> > > > Date: Mon, 6 Jun 2016 08:58:49 -0700
> > > >
> > > > Andrew,
> > > > Thanks for the input. I will shift gears a bit and just get some
> > > > lightweight code going that calls into Mahout algorithms and does a CSV
> > > > dump. Note that I think Akka could be a good fit for this, as you
> > > > could make an async call and get back a notification when the CSV dump
> > > > is finished.
> > > > Also, I am indeed not focusing on MapReduce algorithms and
> > > > will be tackling the algorithms in the math-scala library. What do you
> > > > think of making this a lightweight web-based workbench using Spray that
> > > > committers can run outside of Mahout through curl or something similar?
> > > > This was my initial vision in using Spray, and it's good that I'm
> > > > getting early feedback.
> > > >
> > > > On Zeppelin, do you think it's worthwhile for me to incorporate Trevor's
> > > > efforts to take that CSV and turn it into one or two visualizations?
> > > > I'm trying to understand how that effort may (or may not) intersect with
> > > > what I'm trying to accomplish.
> > > > Also, point taken on the small data sets.
> > > > Thanks
> > > >
> > > > > From: ap....@outlook.com
> > > > > To: dev@mahout.apache.org
> > > > > Subject: Re: [Discuss--A proposal for building an application in
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > > > >
> > > > > Saikat,
> > > > >
> > > > > If you're going to pursue this there are a few things that I would
> > > > > suggest. First, keep it lightweight. We don't want to bring a
> > > > > lot of extra dependencies or data into the distribution. I'm not
> > > > > sure what this means as far as Spray/Akka, but those seem like
> > > > > overkill in my opinion. This should be able to be kept down to a
> > > > > simple CSV dump, I think.
> > > > >
> > > > > Second, use data that can be either randomly generated with a seeded
> > > > > RNG, or a function like Mackey-Glass, or downloaded (probably best),
> > > > > and only use a very small sample in the tests, since they're
> > > > > pretty long currently. The main point being that we don't want to
> > > > > ship any large test datasets with the distro.
> > > > >
> > > > > Third, we're not using MapReduce anymore, so focus on algorithms in
> > > > > the math-scala library (e.g. dssvd, thinqr, dals, etc.)
> > > > > as well as matrix algebra operations. That is where I see this being
> > > > > useful, so that we may compare changes and optimizations going
> > > > > forward.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy
> > > > >
> > > > > ________________________________________
> > > > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > > > To: dev@mahout.apache.org
> > > > > Subject: RE: [Discuss--A proposal for building an application in
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > >
> > > > > Hi All,
> > > > > I created a JIRA ticket and have moved the discussion for the
> > > > > runtime performance framework there:
> > > > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > > > @AndrewP & Trevor, I would like to integrate Zeppelin into the runtime
> > > > > performance measurement framework to output some measurement-related
> > > > > data for some of the algorithms.
> > > > > Should I wait until the Zeppelin integration is completely working
> > > > > before I incorporate this piece?
> > > > > Also, I would really like some feedback, either on the JIRA ticket or
> > > > > in response to this thread.
> > > > > Regards
> > > > >
> > > > > > From: sxk1...@hotmail.com
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: [Discuss--A proposal for building an application in mahout
> > > > > > to measure runtime performance of algorithms in mahout]
> > > > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > > > >
> > > > > > This proposal will outline a runtime performance module used to
> > > > > > measure the performance of various algorithms in Mahout in the
> > > > > > three major areas: clustering, regression and classification. The
The > > > > > > module will be a spray/scala/akka application which will be run by > > > > > > any current or new algorithm in mahout and will display a csv file > > > > > > and a set of zeppelin plots outlining the various criteria for > > > > > > performance. The goal of releasing any new build in mahout will > > > > > > be to run a set of tests for each of the algorithms to compare and > > > > > > contrast some benchmarks from one release to another. > > > > > > > > > > > > > > > > > > Architecture > > > > > > The run time performance application will run on top of spray/scala > > > > > > and akka and will make async api calls into the various mahout > > > > > > algorithms to generate a cvs file containing data representing the > > > > > > run time performance measurement calculations for each algorithm of > > > > > > interest as well as a set of zeppelin plots for displaying some of > > > > > > these results. The spray scala architecture will leverage the > > > > > > zeppelin server to create the visualizations. The discussion below > > > > > > centers around two types of algorithms to be addressed by the > > > > > > application. 
> > > > > >
> > > > > > Clustering
> > > > > > The application will consist of a set of REST APIs to do the
> > > > > > following:
> > > > > >
> > > > > > a) A method to load and execute the runtime perf module. It takes
> > > > > > as inputs the name of the algorithm (kmeans, fuzzy kmeans), the
> > > > > > location of a set of files containing data sets of various sizes,
> > > > > > and finally a set of values for the number of clusters to use for
> > > > > > each of the different sizes of the datasets:
> > > > > >
> > > > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > > > >
> > > > > > The above API call will return a runId which the client program can
> > > > > > then use to monitor the module.
> > > > > >
> > > > > > b) A method to monitor the application to ensure that it's making
> > > > > > progress towards generating the Zeppelin plots:
> > > > > >
> > > > > > /monitor/runId=456
> > > > > >
> > > > > > The run started in a) will execute asynchronously by calling into
> > > > > > the Mahout kmeans (fuzzy kmeans) clustering implementations and will
> > > > > > generate Zeppelin plots showing the normalized time on the y axis
> > > > > > and the number of clusters on the x axis. The Spray/Scala/Akka
> > > > > > framework will allow the client application to receive a callback
> > > > > > when the runtime performance calculations are actually completed.
> > > > > > For now the calculations for measuring runtime performance will
> > > > > > contain: a) the ratio of the number of points clustered correctly
> > > > > > to the total number of points, and b) the total time taken for the
> > > > > > algorithm to run. These items will be represented in separate
> > > > > > Zeppelin plots.
> > > > > >
> > > > > > Regression
> > > > > > a) The runtime performance module will run the likelihood ratio
> > > > > > test with a different set of features in every run. We will
> > > > > > introduce a REST API to run the likelihood ratio test and return
> > > > > > the results; this will once again be an async call through the
> > > > > > Spray/Akka stack.
> > > > > >
> > > > > > b) The runtime performance module will report the following
> > > > > > metrics for every algorithm: 1) CPU usage, 2) memory usage, 3) time
> > > > > > taken for the algorithm to converge and run to completion. These
> > > > > > metrics will be reported on top of the Zeppelin graphs for both the
> > > > > > regression and the different clustering algorithms mentioned above.
> > > > > >
> > > > > > How does the application get run
> > > > > > The runtime performance measuring application will get invoked from
> > > > > > the command line. Eventually it would be worthwhile to hook this
> > > > > > into some sort of integration test suite to certify the different
> > > > > > Mahout releases.
> > > > > >
> > > > > > I will add more thoughts around this and create a JIRA ticket only
> > > > > > once there's enough consensus among the committers that this is
> > > > > > headed in the right direction. I will also add some more thoughts
> > > > > > on measuring runtime performance of some of the other algorithms
> > > > > > after some more research.
> > > > > > Would love feedback or additional things to consider that I might
> > > > > > have missed. If it's more appropriate I can move the discussion to
> > > > > > a JIRA ticket as well, so please let me know.
> > > > > > Thanks in advance.
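
For the three per-algorithm metrics listed in the original proposal (CPU usage, memory usage, time to completion), a rough sketch of collecting all three around a single call might look like the following. It is in Python only to keep it short; `measure` and `workload` are hypothetical names, and the workload is a stand-in rather than a Mahout algorithm. On the JVM the `java.lang.management` beans would presumably play the role that `tracemalloc` and `process_time` play here.

```python
import time
import tracemalloc


def measure(fn):
    """Return wall-clock time, CPU time and peak traced memory for one call of fn()."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = fn()
    finally:
        # Stop the clocks first, then read the peak and stop tracing.
        cpu_s = time.process_time() - cpu_start
        wall_s = time.perf_counter() - wall_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {"result": result, "wall_s": wall_s, "cpu_s": cpu_s, "peak_bytes": peak_bytes}


def workload():
    # Hypothetical stand-in for an algorithm run: allocate a list, then reduce it.
    squares = [i * i for i in range(100_000)]
    return sum(squares)


metrics = measure(workload)
print(metrics)
```

Note that memory tracing slows the call down, so in practice it may be better to take the timing and the memory measurement in separate runs.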