Andrew, ping on this; let me know your thoughts on my pull request. Thanks
Sent from my iPad

> On Jul 7, 2016, at 8:15 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
> Excellent, thanks Saikat; I'll be able to take a look over the weekend.
>
>> On Wed, Jul 6, 2016 at 9:37 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>
>> Ok folks, I've created a pull request here for a barebones runtime
>> performance measurement framework that:
>>
>> 1) measures two simple timings from ssvd and spca
>> 2) dumps these timings into a csv file
>>
>> https://github.com/apache/mahout/pull/245
>> Mahout 1869 by skanjila · Pull Request #245 · apache/mahout
>> "Added the ability to dump output to csv file"
>>
>> I'd greatly appreciate some early feedback on the design before I move
>> forward, make too many more changes, and end up not getting anything
>> included. I will move ahead with the Zeppelin integration in a few days
>> and reorganize the code a bit to pull all the perf-related pieces into
>> one class or trait.
>>
>> Thanks in advance for your help.
>>
>> ________________________________
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Tuesday, June 21, 2016 9:21 PM
>> To: dev@mahout.apache.org
>> Subject: RE: [Discuss--A proposal for building an application in mahout to
>> measure runtime performance of algorithms in mahout]
>>
>> Ok, so for now I am able to get around the issues below by working on
>> code to measure performance times that does not require the notion of a
>> DistributedContext, in order to get this up and running. I have two
>> methods that I am measuring performance times for: ssvd and spca.
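[Editor's note: a minimal sketch of the "time two methods, dump to csv" idea in the PR. All names here are hypothetical, and the two workloads are cheap stand-ins for the real ssvd/spca invocations, which are not reproduced here.]

```scala
import java.io.{FileWriter, PrintWriter}

// Hypothetical sketch: time named operations and append rows to a csv file.
object PerfCsvSketch {
  // Run `body`, returning its result and the elapsed wall-clock millis.
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Append one "algorithm,elapsedMs" row per measurement.
  def dumpTimings(path: String, timings: Seq[(String, Long)]): Unit = {
    val out = new PrintWriter(new FileWriter(path, true))
    try timings.foreach { case (name, ms) => out.println(s"$name,$ms") }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workloads; the real harness would call into ssvd/spca here.
    val (_, ssvdMs) = timed { (1 to 100000).map(math.sqrt(_)).sum }
    val (_, spcaMs) = timed { (1 to 100000).map(i => i * i.toLong).sum }
    dumpTimings("timings.csv", Seq("ssvd" -> ssvdMs, "spca" -> spcaMs))
  }
}
```

Appending (rather than overwriting) lets successive runs accumulate rows that a Zeppelin notebook could later plot across releases.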
>> Github repo is here:
>> https://github.com/skanjila/mahout/tree/mahout-1869
>>
>> Please provide feedback, as I will now restructure/reorganize the code to
>> add more methods and start work on a perf harness that spits out a report
>> in csv, and then eventually tie this to Zeppelin.
>> I've kept the JIRA up to date as well.
>> Thanks in advance.
>>
>>> From: sxk1...@hotmail.com
>>> To: dev@mahout.apache.org
>>> Subject: RE: [Discuss--A proposal for building an application in mahout
>>> to measure runtime performance of algorithms in mahout]
>>> Date: Mon, 20 Jun 2016 20:37:31 -0700
>>>
>>> AndrewP et al, any chance I can get some pointers on the items below?
>>> Would love some direction on this. Thanks
>>>
>>>> From: sxk1...@hotmail.com
>>>> To: dev@mahout.apache.org
>>>> Subject: RE: [Discuss--A proposal for building an application in mahout
>>>> to measure runtime performance of algorithms in mahout]
>>>> Date: Sun, 12 Jun 2016 12:40:26 -0700
>>>>
>>>> Hi folks, I need some input/help here to get me unblocked and moving:
>>>>
>>>> 1) I need to reuse/extend the DistributedContext inside the runtime
>>>> perf measurement module, as all algorithms inside math-scala need it. I
>>>> was trying to mimic some of the H2O code and saw that they have their
>>>> own engine. I am wondering what the best way is to extend
>>>> DistributedContext and get the benefit of an already existing engine
>>>> without needing to tie into h2o or flink, or is the only way to add an
>>>> engine to point to one of those back ends? Ideally I want to build the
>>>> runtime perf module in a backend-agnostic way, and currently I don't
>>>> see a way around this. Thoughts?
>>>>
>>>> 2) I also tried to reuse some of the logic inside math-scala, but in
>>>> digging into this code it seems that it is strongly tied to scala test
>>>> utilities.
>>>>
>>>> Net-net: I just need access to the DistributedContext without linking
>>>> in any test utilities or backends.
>>>> Would love some advice on ways to move forward to maximize reuse.
>>>> Thanks in advance.
>>>>
>>>>> From: sxk1...@hotmail.com
>>>>> To: dev@mahout.apache.org
>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>> Date: Thu, 9 Jun 2016 21:45:13 -0700
>>>>>
>>>>> Andrew et al, over the past few days I've finally gotten a
>>>>> self-contained module compiling that leverages the DistributedContext.
>>>>> For starters I copied the NaiveBayes test code, ripped out the test
>>>>> infrastructure code around it, and then added some timers; next steps
>>>>> will be to dump to csv and eventually to Zeppelin. Some questions
>>>>> before I get too far ahead:
>>>>>
>>>>> 1) I made the design decision to create my own trait and encapsulate
>>>>> the context within it. I am wondering if I should instead leverage the
>>>>> context that is already defined in math-scala; this, however, brings
>>>>> its own complications in that it pulls in MahoutSuite, which I'm not
>>>>> sure I really need. Thoughts on this?
>>>>>
>>>>> 2) I need some infrastructure to run the perf framework. I can use an
>>>>> Azure ubuntu vm for now, but is there an AWS instance or some other vm
>>>>> I can eventually use? I would really like to avoid using my mac laptop
>>>>> as a runtime perf testing environment.
>>>>>
>>>>> Thanks, I'll update the JIRA as I make more headway.
>>>>>
>>>>>> From: sxk1...@hotmail.com
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>> Date: Mon, 6 Jun 2016 08:58:49 -0700
>>>>>>
>>>>>> Andrew, thanks for the input. I will shift gears a bit and just get
>>>>>> some lightweight code going that calls into mahout algorithms and
>>>>>> does a csv dump. Note that I think akka could be a good fit for this,
>>>>>> as you could make an async call and get back a notification when the
>>>>>> csv dump is finished. Also, I am indeed not focusing on mapreduce
>>>>>> algorithms and will be tackling the algorithms in the math-scala
>>>>>> library. What do you think of making this a lightweight web-based
>>>>>> workbench using spray that committers can run outside of mahout
>>>>>> through curl or something? This was my initial vision in using spray,
>>>>>> and it's good that I'm getting early feedback.
>>>>>>
>>>>>> On Zeppelin: do you think it's worthwhile for me to incorporate
>>>>>> Trevor's efforts to take that csv and turn it into one or two
>>>>>> visualizations? I'm trying to understand how that effort may (or may
>>>>>> not) intersect with what I'm trying to accomplish.
>>>>>> Also, point taken on the small data sets.
>>>>>> Thanks
>>>>>>
>>>>>>> From: ap....@outlook.com
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: [Discuss--A proposal for building an application in
>>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>>> Date: Mon, 6 Jun 2016 15:50:16 +0000
>>>>>>>
>>>>>>> Saikat,
>>>>>>>
>>>>>>> If you're going to pursue this, there are a few things I would
>>>>>>> suggest. First, keep it lightweight. We don't want to bring a lot
>>>>>>> of extra dependencies or data into the distribution. I'm not sure
>>>>>>> what this means as far as spray/akka, but those seem like overkill
>>>>>>> in my opinion. This should be able to be kept down to a simple csv
>>>>>>> dump, I think.
>>>>>>>
>>>>>>> Second, use data that can be either randomly generated with a
>>>>>>> seeded RNG, or a function like Mackey-Glass, or downloaded (probably
>>>>>>> best), and only use a very small sample in the tests, since they're
>>>>>>> pretty long currently. The main point being that we don't want to
>>>>>>> ship any large test datasets with the distro.
>>>>>>>
>>>>>>> Third, we're not using MapReduce anymore, so focus on algorithms in
>>>>>>> the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as
>>>>>>> matrix algebra operations. That is where I see this being useful, so
>>>>>>> that we may compare changes and optimizations going forward.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>> Sent: Friday, June 3, 2016 12:35:54 AM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>>>
>>>>>>> Hi all, I've created a JIRA ticket and moved the discussion for the
>>>>>>> runtime performance framework there:
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1869
>>>>>>> @AndrewP & Trevor: I would like to integrate Zeppelin into the
>>>>>>> runtime performance measurement framework to output some
>>>>>>> measurement-related data for some of the algorithms.
>>>>>>> Should I wait until the Zeppelin integration is completely working
>>>>>>> before I incorporate this piece?
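[Editor's note: Andy's data suggestion above, seeded RNG or a Mackey-Glass series so no large datasets ship with the distro, could be sketched as below. The object and parameter defaults are hypothetical, the Mackey-Glass constants being the commonly used values, and the discretization is a simple Euler step.]

```scala
import scala.util.Random

object TestData {
  // Reproducible random matrix: the same seed always yields the same data,
  // so nothing needs to be checked into the repo.
  def randomMatrix(rows: Int, cols: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(rows, cols)(rng.nextDouble())
  }

  // Mackey-Glass time series x' = beta*x(t-tau)/(1+x(t-tau)^p) - gamma*x(t),
  // discretized with unit-step Euler integration.
  def mackeyGlass(n: Int, tau: Int = 17, beta: Double = 0.2,
                  gamma: Double = 0.1, power: Double = 10.0): Array[Double] = {
    val x = Array.fill(n)(0.0)
    x(0) = 1.2 // conventional initial condition
    for (t <- 0 until n - 1) {
      val lagged = if (t >= tau) x(t - tau) else 0.0
      x(t + 1) = x(t) + beta * lagged / (1.0 + math.pow(lagged, power)) - gamma * x(t)
    }
    x
  }
}
```

Either generator keeps the test inputs tiny and deterministic, which also makes run-to-run timing comparisons meaningful.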
>>>>>>> Also, I would really appreciate some feedback, either on the JIRA
>>>>>>> ticket or in response to this thread. Regards
>>>>>>>
>>>>>>>> From: sxk1...@hotmail.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: [Discuss--A proposal for building an application in mahout
>>>>>>>> to measure runtime performance of algorithms in mahout]
>>>>>>>> Date: Thu, 19 May 2016 21:31:05 -0700
>>>>>>>>
>>>>>>>> This proposal outlines a runtime performance module used to measure
>>>>>>>> the performance of various algorithms in mahout in the three major
>>>>>>>> areas: clustering, regression, and classification. The module will
>>>>>>>> be a spray/scala/akka application which can be run against any
>>>>>>>> current or new algorithm in mahout and will produce a csv file and
>>>>>>>> a set of zeppelin plots outlining the various criteria for
>>>>>>>> performance. The goal for releasing any new build of mahout will be
>>>>>>>> to run a set of tests for each of the algorithms to compare and
>>>>>>>> contrast benchmarks from one release to another.
>>>>>>>>
>>>>>>>> Architecture
>>>>>>>> The run time performance application will run on top of spray/scala
>>>>>>>> and akka and will make async api calls into the various mahout
>>>>>>>> algorithms to generate a csv file containing the run time
>>>>>>>> performance measurements for each algorithm of interest, as well as
>>>>>>>> a set of zeppelin plots for displaying some of these results. The
>>>>>>>> spray/scala architecture will leverage the zeppelin server to
>>>>>>>> create the visualizations. The discussion below centers around two
>>>>>>>> types of algorithms to be addressed by the application.
>>>>>>>>
>>>>>>>> Clustering
>>>>>>>> The application will consist of a set of rest APIs to do the
>>>>>>>> following:
>>>>>>>>
>>>>>>>> a) A method to load and execute the run time perf module, taking as
>>>>>>>> inputs the name of the algorithm (kmeans, fuzzy kmeans), the
>>>>>>>> location of a set of files containing various sizes of data sets,
>>>>>>>> and finally a set of values for the number of clusters to use for
>>>>>>>> each of the different sizes of the datasets:
>>>>>>>>
>>>>>>>> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
>>>>>>>>
>>>>>>>> The above API call will return a runId which the client program can
>>>>>>>> then use to monitor the module.
>>>>>>>>
>>>>>>>> b) A method to monitor the application to ensure that it's making
>>>>>>>> progress towards generating the zeppelin plots:
>>>>>>>>
>>>>>>>> /monitor/runId=456
>>>>>>>>
>>>>>>>> The above method will execute asynchronously by calling into the
>>>>>>>> mahout kmeans (fuzzy kmeans) clustering implementations and will
>>>>>>>> generate zeppelin plots showing the normalized time on the y axis
>>>>>>>> and the number of clusters on the x axis. The spray/scala/akka
>>>>>>>> framework will allow the client application to receive a callback
>>>>>>>> when the run time performance calculations are actually completed.
>>>>>>>> For now the calculations for measuring run time performance will
>>>>>>>> contain: a) the ratio of the number of points clustered correctly
>>>>>>>> to the total number of points, and b) the total time taken for the
>>>>>>>> algorithm to run. These items will be represented in separate
>>>>>>>> zeppelin plots.
>>>>>>>>
>>>>>>>> Regression
>>>>>>>> a) The runtime performance module will run the likelihood ratio
>>>>>>>> test with a different set of features in every run. We will
>>>>>>>> introduce a rest API to run the likelihood ratio test and return
>>>>>>>> the results; this will once again be an async call through the
>>>>>>>> spray/akka stack.
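[Editor's note: the clustering metric (a) above, the ratio of correctly clustered points to total points, needs a convention for matching cluster ids to true labels. A simple majority-vote mapping is one common choice, assumed here for illustration; the object and method names are invented.]

```scala
object ClusterMetrics {
  // Fraction of points whose cluster's majority true label matches their
  // own label; 1.0 means every cluster is pure.
  def correctRatio(trueLabels: Seq[Int], clusterIds: Seq[Int]): Double = {
    require(trueLabels.length == clusterIds.length && trueLabels.nonEmpty)
    // Map each cluster id to the true label it mostly contains.
    val majority: Map[Int, Int] =
      clusterIds.zip(trueLabels).groupBy(_._1).map { case (c, pairs) =>
        c -> pairs.groupBy(_._2).maxBy(_._2.size)._1
      }
    val correct = clusterIds.zip(trueLabels).count {
      case (c, label) => majority(c) == label
    }
    correct.toDouble / trueLabels.length
  }
}
```

Plotting this ratio against the cluster counts from the `/algorithm=clustering/...` call would give the y values for the proposed zeppelin plot.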
>>>>>>>>
>>>>>>>> b) The run time performance module will collect the following
>>>>>>>> metrics for every algorithm: 1) cpu usage, 2) memory usage, 3) time
>>>>>>>> taken for the algorithm to converge and run to completion. These
>>>>>>>> metrics will be reported on top of the zeppelin graphs for both the
>>>>>>>> regression and the different clustering algorithms mentioned above.
>>>>>>>>
>>>>>>>> How does the application get run
>>>>>>>> The run time performance measuring application will be invoked from
>>>>>>>> the command line; eventually it would be worthwhile to hook it into
>>>>>>>> some sort of integration test suite to certify the different mahout
>>>>>>>> releases.
>>>>>>>>
>>>>>>>> I will add more thoughts around this and create a JIRA ticket only
>>>>>>>> once there's enough consensus among the committers that this is
>>>>>>>> headed in the right direction. I will also add some more thoughts
>>>>>>>> on measuring run time performance of some of the other algorithms
>>>>>>>> after some more research.
>>>>>>>> Would love feedback or additional things to consider that I might
>>>>>>>> have missed. If it's more appropriate I can move the discussion to
>>>>>>>> a jira ticket as well, so please let me know. Thanks in advance.
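[Editor's note: the three per-run metrics in (b), cpu usage, memory usage, and elapsed time, are all reachable from the standard JVM management beans. A minimal sketch, with invented names; note this samples only the calling thread's cpu time and the heap at the end of the run, which is a simplification.]

```scala
import java.lang.management.ManagementFactory

object RunMetrics {
  case class Sample(cpuTimeNs: Long, heapUsedBytes: Long, wallMs: Long)

  // Run `body` and sample cpu time, heap in use, and wall-clock time.
  def measure(body: => Unit): Sample = {
    val threads = ManagementFactory.getThreadMXBean
    val start = System.nanoTime()
    val cpuStart = threads.getCurrentThreadCpuTime
    body
    val cpuNs = threads.getCurrentThreadCpuTime - cpuStart
    val wallMs = (System.nanoTime() - start) / 1000000L
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    Sample(cpuNs, heap, wallMs)
  }
}
```

A `Sample` per algorithm run could be appended to the same csv the timing harness already writes, giving the zeppelin graphs one row per metric set.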