I can only chime in on the visualization part: if you output to a CSV, it can be easily consumed and visualized via Zeppelin.
Specifically, there should be an exposed function that generates a CSV (or, even better, a TSV) string, which can then be used by a 'write to disk' method. The TSV string can then be visualized in Zeppelin via the %table interface (which is Angular based, but sufficient for many benchmarking applications) or handed off to R/Python (ggplot2, matplotlib, etc.). The moral of the story: the only thing needed to integrate with Zeppelin would be a *.tsv file as a string.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Mon, Jun 6, 2016 at 10:58 AM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:

> Andrew, thanks for the input. I will shift gears a bit and just get some
> lightweight code going that calls into Mahout algorithms and does a CSV
> dump. Note that I think Akka could be a good fit for this, as you could
> make an async call and get back a notification when the CSV dump is
> finished. I am indeed not focusing on MapReduce algorithms and will be
> tackling the algorithms in the math-scala library. What do you think of
> making this a lightweight web-based workbench using Spray that committers
> can run outside of Mahout through curl or something? This was my initial
> vision in using Spray, and it's good that I'm getting early feedback.
>
> On Zeppelin: do you think it's worthwhile that I incorporate Trevor's
> efforts to take that CSV and turn it into one or two visualizations? I'm
> trying to understand how that effort may (or may not) intersect with what
> I'm trying to accomplish.
>
> Also, point taken on the small data sets.
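To make the TSV idea concrete, here is a minimal Scala sketch of such an exposed function. The object and method names (`TsvExport`, `toTsv`, `toZeppelinTable`, `writeToDisk`) are hypothetical, not anything that exists in Mahout; only the idea of a TSV string that can be written to disk or prefixed with `%table` for Zeppelin comes from the thread.

```scala
// Hypothetical sketch: build a TSV string from benchmark rows so it can be
// written to disk or rendered by Zeppelin's %table display.
object TsvExport {
  // Join the header and each row with tabs, and the lines with newlines.
  def toTsv(header: Seq[String], rows: Seq[Seq[Any]]): String =
    (header.mkString("\t") +: rows.map(_.map(_.toString).mkString("\t")))
      .mkString("\n")

  // Zeppelin renders tabular output when a paragraph's result begins
  // with %table followed by tab-separated rows.
  def toZeppelinTable(header: Seq[String], rows: Seq[Seq[Any]]): String =
    "%table\n" + toTsv(header, rows)

  // The 'write to disk' method mentioned above.
  def writeToDisk(path: String, tsv: String): Unit = {
    val w = new java.io.PrintWriter(path)
    try w.write(tsv) finally w.close()
  }
}
```

The same string can be printed in a Zeppelin paragraph or loaded into R/Python, since TSV is the common denominator for all three.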
> Thanks
>
> > From: ap....@outlook.com
> > To: dev@mahout.apache.org
> > Subject: Re: [Discuss--A proposal for building an application in mahout
> > to measure runtime performance of algorithms in mahout]
> > Date: Mon, 6 Jun 2016 15:50:16 +0000
> >
> > Saikat,
> >
> > If you're going to pursue this, there are a few things I would suggest.
> > First, keep it lightweight. We don't want to bring a lot of extra
> > dependencies or data into the distribution. I'm not sure what this means
> > as far as Spray/Akka, but those seem like overkill in my opinion. This
> > should be able to be kept down to a simple CSV dump, I think.
> >
> > Second, use data that can be either randomly generated with a seeded
> > RNG, generated by a function like Mackey-Glass, or downloaded (probably
> > best), and only use a very small sample in the tests, since they're
> > pretty long currently. The main point being that we don't want to ship
> > any large test datasets with the distro.
> >
> > Third, we're not using MapReduce anymore, so focus on algorithms in the
> > math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix
> > algebra operations. That is where I see this being useful, so that we
> > may compare changes and optimizations going forward.
> >
> > Thanks,
> >
> > Andy
> >
> > ________________________________________
> > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > Sent: Friday, June 3, 2016 12:35:54 AM
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout
> > to measure runtime performance of algorithms in mahout]
> >
> > Hi all, I created a JIRA ticket and have moved the discussion for the
> > runtime performance framework there:
> > https://issues.apache.org/jira/browse/MAHOUT-1869
> > @AndrewP & Trevor: I would like to integrate Zeppelin into the runtime
> > performance measurement framework to output some measurement-related
> > data for some of the algorithms.
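The two synthetic-data options Andy suggests (a seeded RNG and a Mackey-Glass series) can be sketched in a few lines of Scala. The object name `SyntheticData` and the specific parameter defaults are assumptions for illustration; the Mackey-Glass defaults below are the commonly used chaotic-regime settings, not values specified in the thread, and a simple Euler discretization stands in for a proper delay-differential-equation solver.

```scala
import scala.util.Random

// Hypothetical sketch of reproducible benchmark data generation.
object SyntheticData {
  // Seeded random dense matrix: the same seed always yields the same data,
  // so benchmark runs stay comparable from one release to the next.
  def seededMatrix(rows: Int, cols: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(rows, cols)(rng.nextDouble())
  }

  // Euler discretization of the Mackey-Glass delay differential equation:
  //   dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t)
  def mackeyGlass(length: Int, tau: Int = 17, beta: Double = 0.2,
                  gamma: Double = 0.1, n: Double = 10.0,
                  x0: Double = 1.2): Array[Double] = {
    val x = Array.fill(length + tau)(x0)
    for (t <- tau until length + tau - 1) {
      val xTau = x(t - tau)
      x(t + 1) = x(t) + beta * xTau / (1.0 + math.pow(xTau, n)) - gamma * x(t)
    }
    x.drop(tau)
  }
}
```

Either generator keeps the distribution free of shipped test datasets: the data is recreated on demand from a seed or a formula.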
> > Should I wait till the Zeppelin integration is completely working before
> > I incorporate this piece? I would also really appreciate some feedback,
> > either on the JIRA ticket or in response to this thread.
> >
> > Regards
> >
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: [Discuss--A proposal for building an application in mahout to
> > > measure runtime performance of algorithms in mahout]
> > > Date: Thu, 19 May 2016 21:31:05 -0700
> > >
> > > This proposal outlines a runtime performance module used to measure
> > > the performance of various Mahout algorithms in three major areas:
> > > clustering, regression, and classification. The module will be a
> > > Spray/Scala/Akka application that can be run against any current or
> > > new algorithm in Mahout, and will produce a CSV file and a set of
> > > Zeppelin plots outlining the various performance criteria. The goal
> > > for any new Mahout release will be to run a set of tests for each of
> > > the algorithms to compare and contrast benchmarks from one release to
> > > the next.
> > >
> > > Architecture
> > > The runtime performance application will run on top of Spray/Scala and
> > > Akka and will make async API calls into the various Mahout algorithms
> > > to generate a CSV file containing the runtime performance measurements
> > > for each algorithm of interest, as well as a set of Zeppelin plots
> > > displaying some of these results. The Spray/Scala architecture will
> > > leverage the Zeppelin server to create the visualizations. The
> > > discussion below centers on two types of algorithms to be addressed by
> > > the application.
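The async flow in the architecture above can be sketched with plain Scala Futures, leaving out the Spray/Akka HTTP layer entirely. Everything here is hypothetical (`RunManager`, `submit`, `isDone`, `csvResult` are invented names); the sketch only shows the shape of "kick off a run, get a runId immediately, collect the CSV when the run completes."

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical sketch of the async run lifecycle, minus the HTTP layer.
object RunManager {
  private val nextId = new AtomicInteger(0)
  private val runs = TrieMap.empty[Int, Future[String]]

  // Kick off a benchmark run; the caller immediately gets back a runId.
  def submit(benchmark: () => String): Int = {
    val id = nextId.incrementAndGet()
    runs.put(id, Future(benchmark()))
    id
  }

  // Roughly what a /monitor/runId=... endpoint would poll.
  def isDone(id: Int): Boolean = runs.get(id).exists(_.isCompleted)

  // Fetch the finished CSV (blocking here only for demonstration; the
  // real application would use a callback or completion notification).
  def csvResult(id: Int, timeout: Duration = 10.seconds): String =
    Await.result(runs(id), timeout)
}
```

In the proposed application the same pattern would sit behind Spray routes, with Akka delivering the completion callback instead of the blocking `Await`.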
> > > Clustering
> > > The application will consist of a set of REST APIs to do the
> > > following:
> > >
> > > a) A method to load and execute the runtime perf module, taking as
> > > inputs the name of the algorithm (kmeans, fuzzy kmeans), the location
> > > of a set of files containing various sizes of data sets, and a set of
> > > values for the number of clusters to use with each of the different
> > > dataset sizes:
> > >
> > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > >
> > > The above API call will return a runId which the client program can
> > > then use to monitor the module.
> > >
> > > b) A method to monitor the application to ensure that it's making
> > > progress towards generating the Zeppelin plots:
> > >
> > > /monitor/runId=456
> > >
> > > The above method will execute asynchronously by calling into the
> > > Mahout kmeans (fuzzy kmeans) clustering implementations and will
> > > generate Zeppelin plots showing the normalized time on the y axis and
> > > the number of clusters on the x axis. The Spray/Scala/Akka framework
> > > will allow the client application to receive a callback when the
> > > runtime performance calculations are completed. For now the runtime
> > > performance measurements will contain: a) the ratio of the number of
> > > points clustered correctly to the total number of points, and b) the
> > > total time taken for the algorithm to run. These items will be
> > > represented in separate Zeppelin plots.
> > >
> > > Regression
> > > a) The runtime performance module will run the likelihood ratio test
> > > with a different set of features in every run. We will introduce a
> > > REST API to run the likelihood ratio test and return the results; this
> > > will once again be an async call through the Spray/Akka stack.
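Metric (a) above — the ratio of correctly clustered points to total points — could look like the following sketch. The function name is hypothetical, and this naive version assumes the predicted cluster ids are already aligned with the ground-truth labels; a real implementation would first match clusters to labels (e.g. via a best-permutation or Hungarian matching), since cluster numbering is arbitrary.

```scala
// Hypothetical sketch of the "points clustered correctly / total points"
// metric, assuming predicted ids are already aligned with truth labels.
def clusteringAccuracy(predicted: Seq[Int], truth: Seq[Int]): Double = {
  require(predicted.length == truth.length && truth.nonEmpty,
    "predicted and truth must be the same non-zero length")
  val correct = predicted.zip(truth).count { case (p, t) => p == t }
  correct.toDouble / truth.length
}
```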
> > > b) The runtime performance module will record the following metrics
> > > for every algorithm: 1) CPU usage, 2) memory usage, and 3) time taken
> > > for the algorithm to converge and run to completion. These metrics
> > > will be reported on top of the Zeppelin graphs for both the regression
> > > and the different clustering algorithms mentioned above.
> > >
> > > How does the application get run
> > > The runtime performance measuring application will be invoked from the
> > > command line; eventually it would be worthwhile to hook it into some
> > > sort of integration test suite to certify the different Mahout
> > > releases.
> > >
> > > I will add more thoughts around this and create a JIRA ticket only
> > > once there's enough consensus among the committers that this is headed
> > > in the right direction. I will also add some more thoughts on
> > > measuring runtime performance of some of the other algorithms after
> > > some more research. I would love feedback or additional things to
> > > consider that I might have missed. If it's more appropriate I can move
> > > the discussion to a JIRA ticket as well, so please let me know. Thanks
> > > in advance.
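Metrics 2) and 3) from the proposal can be collected around a single algorithm invocation with nothing beyond the JVM's own runtime hooks, as in this hypothetical sketch (the `measure` name is invented). Per-process CPU usage, metric 1), would need a platform bean such as `com.sun.management.OperatingSystemMXBean` and is omitted here; heap deltas are also only approximate because GC can run mid-measurement.

```scala
// Hypothetical sketch: wrap an algorithm run and report wall-clock time
// and approximate JVM heap growth alongside the result.
def measure[T](run: () => T): (T, Long, Long) = {
  val rt = Runtime.getRuntime
  val heapBefore = rt.totalMemory() - rt.freeMemory()
  val start = System.nanoTime()
  val result = run()                                      // the algorithm
  val elapsedMillis = (System.nanoTime() - start) / 1000000L
  val heapAfter = rt.totalMemory() - rt.freeMemory()
  (result, elapsedMillis, heapAfter - heapBefore)
}
```

The returned triple maps directly onto the CSV columns the module would dump for each algorithm and dataset size.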