Perfect, thank you, that helps scope the effort a bit more accurately.
> From: trevor.d.gr...@gmail.com
> Date: Mon, 6 Jun 2016 11:33:32 -0500
> Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> To: dev@mahout.apache.org
>
> I can only chime in on the visualization part.
>
> If you output to a csv, it can be easily consumed and visualized via
> Zeppelin.
>
> Specifically, there should be an exposed function where a csv (or even
> better a tsv) string is generated, which can then be used by a 'write to
> disk' method.
>
> The tsv string can then be visualized in Zeppelin via the %table interface
> (which is angular based, but sufficient for many benchmarking applications)
> or in R/Python (ggplot2 / matplotlib, etc.).
>
> The moral of the story being that the only thing needed to integrate with
> Zeppelin would be a *.tsv file as a string.
>
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>
>
> On Mon, Jun 6, 2016 at 10:58 AM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
> > Andrew, thanks for the input. I will shift gears a bit and just get some
> > lightweight code going that calls into mahout algorithms and does a csv
> > dump. Note that I think akka could be a good fit for this, as you could
> > make an async call and get back a notification when the csv dump is
> > finished. Also, I am indeed not focusing on mapreduce algorithms and will
> > be tackling the algorithms in the math-scala library. What do you think
> > of making this a lightweight web-based workbench using spray that
> > committers can run outside of mahout through curl or something? This was
> > my initial vision in using spray, and it's good that I'm getting early
> > feedback.
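[Editor's note: Trevor's tsv-string suggestion could be sketched roughly as below. `BenchResult`, `toTsv`, and `writeTsv` are hypothetical names for illustration only, not existing Mahout or Zeppelin APIs.]

```scala
// Build the benchmark output as a single TSV string so it can either be
// written to disk or handed to Zeppelin's %table display.
case class BenchResult(algorithm: String, datasetSize: Long, seconds: Double)

def toTsv(results: Seq[BenchResult]): String = {
  val header = Seq("algorithm", "datasetSize", "seconds").mkString("\t")
  val rows   = results.map(r => Seq(r.algorithm, r.datasetSize, r.seconds).mkString("\t"))
  (header +: rows).mkString("\n")
}

// The 'write to disk' method that consumes the TSV string.
def writeTsv(path: String, tsv: String): Unit = {
  val pw = new java.io.PrintWriter(path)
  try pw.write(tsv) finally pw.close()
}
```

In a Zeppelin paragraph, printing `"%table\n" + toTsv(results)` would render the same string through the %table display, which is the integration point Trevor describes.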
> > On zeppelin, do you think it's worthwhile that I incorporate Trevor's
> > efforts to take that csv and turn it into one or two visualizations?
> > I'm trying to understand how that effort may (or may not) intersect with
> > what I'm trying to accomplish. Also, point taken on the small data sets.
> > Thanks
> >
> > > From: ap....@outlook.com
> > > To: dev@mahout.apache.org
> > > Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > >
> > > Saikat,
> > >
> > > If you're going to pursue this, there are a few things that I would
> > > suggest. First, keep it lightweight. We don't want to bring a lot of
> > > extra dependencies or data into the distribution. I'm not sure what
> > > this means as far as spray/akka, but those seem like overkill in my
> > > opinion. This should be able to be kept down to a simple csv dump, I
> > > think.
> > >
> > > Second, use data that can be either randomly generated with a seeded
> > > RNG, generated by a function like Mackey-Glass, or downloaded (probably
> > > best), and only use a very small sample in the tests, since they're
> > > pretty long currently. The main point being that we don't want to ship
> > > any large test datasets with the distro.
> > >
> > > Third, we're not using MapReduce anymore, so focus on algorithms in
> > > the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as
> > > matrix algebra operations. That is where I see this being useful, so
> > > that we may compare changes and optimizations going forward.
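[Editor's note: Andy's second point, deterministic test data from a seeded RNG or a Mackey-Glass series, could be sketched as below. The function names are hypothetical, and the Mackey-Glass series uses a simple Euler discretization of the delay equation dx/dt = beta*x(t-tau)/(1 + x(t-tau)^n) - gamma*x(t).]

```scala
// Small, reproducible test data: same seed, same data on every run.
def seededGaussians(seed: Long, n: Int): Seq[Double] = {
  val rng = new scala.util.Random(seed)
  Seq.fill(n)(rng.nextGaussian())
}

// Mackey-Glass time series via Euler steps, with a constant history of
// 1.2 before t = 0. Parameter defaults are the commonly used chaotic ones.
def mackeyGlass(n: Int, tau: Int = 17, beta: Double = 0.2,
                gamma: Double = 0.1, power: Double = 10.0): Array[Double] = {
  val x = Array.fill(n + tau)(1.2)
  for (t <- tau until n + tau - 1) {
    val delayed = x(t - tau)
    x(t + 1) = x(t) + beta * delayed / (1.0 + math.pow(delayed, power)) - gamma * x(t)
  }
  x.drop(tau) // discard the synthetic history
}
```

Either generator keeps the shipped footprint at zero, which addresses the "no large test datasets in the distro" concern.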
> > > Thanks,
> > >
> > > Andy
> > >
> > > ________________________________________
> > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > >
> > > Hi All,
> > > Created a JIRA ticket and have moved the discussion for the runtime
> > > performance framework there:
> > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > @AndrewP & Trevor, I would like to integrate zeppelin into the runtime
> > > performance measurement framework to output some measurement-related
> > > data for some of the algorithms. Should I wait till the zeppelin
> > > integration is completely working before I incorporate this piece?
> > > Also, I would really appreciate some feedback, either on the JIRA
> > > ticket or in response to this thread.
> > > Regards
> > >
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > >
> > > > This proposal will outline a runtime performance module used to
> > > > measure the performance of various algorithms in mahout in three
> > > > major areas: clustering, regression and classification. The module
> > > > will be a spray/scala/akka application which can be run against any
> > > > current or new algorithm in mahout and will produce a csv file and a
> > > > set of zeppelin plots outlining the various criteria for performance.
> > > > The goal for any new mahout release will be to run a set of tests for
> > > > each of the algorithms to compare and contrast benchmarks from one
> > > > release to another.
> > > > Architecture
> > > > The run time performance application will run on top of spray/scala
> > > > and akka and will make async api calls into the various mahout
> > > > algorithms to generate a csv file containing the run time performance
> > > > measurements for each algorithm of interest, as well as a set of
> > > > zeppelin plots displaying some of these results. The spray/scala
> > > > architecture will leverage the zeppelin server to create the
> > > > visualizations. The discussion below centers around two types of
> > > > algorithms to be addressed by the application.
> > > >
> > > > Clustering
> > > > The application will consist of a set of rest APIs to do the
> > > > following:
> > > >
> > > > a) A method to load and execute the run time perf module, which takes
> > > > as inputs the name of the algorithm (kmeans, fuzzy kmeans), the
> > > > location of a set of files containing various sizes of data sets, and
> > > > finally a set of values for the number of clusters to use for each of
> > > > the different sizes of the datasets:
> > > >
> > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > >
> > > > The above API call will return a runId which the client program can
> > > > then use to monitor the module.
> > > >
> > > > b) A method to monitor the application to ensure that it's making
> > > > progress towards generating the zeppelin plots:
> > > >
> > > > /monitor/runId=456
> > > >
> > > > The above method will execute asynchronously by calling into the
> > > > mahout kmeans (fuzzy kmeans) clustering implementations and will
> > > > generate zeppelin plots showing the normalized time on the y axis and
> > > > the number of clusters on the x axis. The spray/scala/akka framework
> > > > will allow the client application to receive a callback when the run
> > > > time performance calculations are actually completed.
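[Editor's note: the runId/monitor flow described above could be sketched with plain Scala Futures, without committing to a particular spray route syntax. `RunRegistry` and its statuses are hypothetical names; a spray (or any other HTTP) endpoint would simply wrap these two calls.]

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

object RunRegistry {
  private val ids    = new AtomicLong(0L)
  private val status = new ConcurrentHashMap[Long, String]()

  // The /algorithm=... endpoint would call this and return the runId
  // immediately while the benchmark runs asynchronously.
  def start(benchmark: () => Unit): Long = {
    val runId = ids.incrementAndGet()
    status.put(runId, "running")
    Future(benchmark()).onComplete { result =>
      // Callback fires when the run time calculations complete.
      status.put(runId, if (result.isSuccess) "complete" else "failed")
    }
    runId
  }

  // The /monitor/runId=... endpoint would call this.
  def monitor(runId: Long): String = status.getOrDefault(runId, "unknown")
}
```

The `onComplete` hook is where the csv/tsv dump and Zeppelin plot generation would be triggered, giving the client the "notification when the csv dump is finished" behavior mentioned earlier in the thread.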
> > > > For now the calculations for measuring run time performance will
> > > > contain: a) the ratio of the number of points clustered correctly to
> > > > the total number of points, and b) the total time taken for the
> > > > algorithm to run. These items will be represented in separate
> > > > zeppelin plots.
> > > >
> > > > Regression
> > > > a) The runtime performance module will run the likelihood ratio test
> > > > with a different set of features in every run. We will introduce a
> > > > rest API to run the likelihood ratio test and return the results;
> > > > this will once again be an async call through the spray/akka stack.
> > > >
> > > > b) The run time performance module will record the following metrics
> > > > for every algorithm: 1) cpu usage, 2) memory usage, 3) time taken for
> > > > the algorithm to converge and run to completion. These metrics will
> > > > be reported on top of the zeppelin graphs for both the regression and
> > > > the different clustering algorithms mentioned above.
> > > >
> > > > How does the application get run
> > > > The run time performance measuring application will get invoked from
> > > > the command line; eventually it would be worthwhile to hook this into
> > > > some sort of integration test suite to certify the different mahout
> > > > releases.
> > > >
> > > > I will add more thoughts around this and create a JIRA ticket only
> > > > once there's enough consensus among the committers that this is
> > > > headed in the right direction. I will also add some more thoughts on
> > > > measuring run time performance of some of the other algorithms after
> > > > some more research. Would love feedback or additional things to
> > > > consider that I might have missed. If it's more appropriate I can
> > > > move the discussion to a jira ticket as well, so please let me know.
> > > > Thanks in advance.
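[Editor's note: the per-run metrics listed in the proposal could be captured roughly as below. Wall-clock time and heap delta are shown; cpu usage is omitted since it requires a platform-specific management bean. `timedRun` is a hypothetical name, and the heap delta is only an approximation since GC can run mid-measurement.]

```scala
// Wrap one algorithm run, returning a single csv row of
// algorithm,seconds,heapDeltaBytes for the benchmark dump.
def timedRun(algorithm: String)(body: => Unit): String = {
  val rt        = Runtime.getRuntime
  val memBefore = rt.totalMemory() - rt.freeMemory()
  val t0        = System.nanoTime()
  body
  val seconds   = (System.nanoTime() - t0) / 1e9
  val memAfter  = rt.totalMemory() - rt.freeMemory()
  Seq(algorithm, seconds, memAfter - memBefore).mkString(",")
}
```

Accumulating one such row per algorithm per release would give exactly the release-to-release comparison csv the proposal aims for.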