Perfect, thank you, that helps scope the effort a bit more accurately.
> From: trevor.d.gr...@gmail.com
> Date: Mon, 6 Jun 2016 11:33:32 -0500
> Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> To: dev@mahout.apache.org
>
> I can only chime in on the visualization part.
>
> If you output to a csv, it can be easily consumed and visualized via
> Zeppelin.
>
> Specifically, there should be an exposed function where a csv (or even
> better a tsv) string is generated, which can then be used by a 'write to
> disk' method.
>
> The tsv string can then be visualized in Zeppelin via the %table interface
> (which is angular based, but sufficient for many benchmarking applications)
> or in R/Python (ggplot2 / matplotlib, etc.).
>
> The moral of the story being that the only thing needed to integrate with
> Zeppelin would be a *.tsv file as a string.
>
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>
>
> On Mon, Jun 6, 2016 at 10:58 AM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
> > Andrew, thanks for the input. I will shift gears a bit and just get some
> > lightweight code going that calls into mahout algorithms and does a csv
> > dump. Note that I think akka could be a good fit for this, as you could
> > make an async call and get back a notification when the csv dump is
> > finished. Also, I am indeed not focusing on mapreduce algorithms and will
> > be tackling the algorithms in the math-scala library. What do you think
> > of making this a lightweight web-based workbench using spray that
> > committers can run outside of mahout through curl or something? This was
> > my initial vision in using spray, and it's good that I'm getting early
> > feedback.
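[Editor's note: Trevor's tsv-string suggestion could be sketched roughly as below. `BenchResult`, `toTsv`, and `writeTsv` are hypothetical names for illustration only, not existing Mahout or Zeppelin APIs.]

```scala
// Build the benchmark output as a single TSV string so it can either be
// written to disk or handed to Zeppelin's %table display.
case class BenchResult(algorithm: String, datasetSize: Long, seconds: Double)

def toTsv(results: Seq[BenchResult]): String = {
  val header = Seq("algorithm", "datasetSize", "seconds").mkString("\t")
  val rows   = results.map(r => Seq(r.algorithm, r.datasetSize, r.seconds).mkString("\t"))
  (header +: rows).mkString("\n")
}

// The 'write to disk' method that consumes the TSV string.
def writeTsv(path: String, tsv: String): Unit = {
  val pw = new java.io.PrintWriter(path)
  try pw.write(tsv) finally pw.close()
}
```

In a Zeppelin paragraph, printing `"%table\n" + toTsv(results)` would render the same string through the %table display, which is the integration point Trevor describes.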
> > On zeppelin, do you think it's worthwhile that I incorporate Trevor's
> > efforts to take that csv and turn it into one or two visualizations?
> > I'm trying to understand how that effort may (or may not) intersect with
> > what I'm trying to accomplish. Also, point taken on the small data sets.
> > Thanks
> >
> > > From: ap....@outlook.com
> > > To: dev@mahout.apache.org
> > > Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > >
> > > Saikat,
> > >
> > > If you're going to pursue this, there are a few things that I would
> > > suggest. First, keep it lightweight. We don't want to bring a lot of
> > > extra dependencies or data into the distribution. I'm not sure what
> > > this means as far as spray/akka, but those seem like overkill in my
> > > opinion. This should be able to be kept down to a simple csv dump, I
> > > think.
> > >
> > > Second, use data that can be either randomly generated with a seeded
> > > RNG, generated by a function like Mackey-Glass, or downloaded (probably
> > > best), and only use a very small sample in the tests, since they're
> > > pretty long currently. The main point being that we don't want to ship
> > > any large test datasets with the distro.
> > >
> > > Third, we're not using MapReduce anymore, so focus on algorithms in
> > > the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as
> > > matrix algebra operations. That is where I see this being useful, so
> > > that we may compare changes and optimizations going forward.
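[Editor's note: Andy's second point, deterministic test data from a seeded RNG or a Mackey-Glass series, could be sketched as below. The function names are hypothetical, and the Mackey-Glass series uses a simple Euler discretization of the delay equation dx/dt = beta*x(t-tau)/(1 + x(t-tau)^n) - gamma*x(t).]

```scala
// Small, reproducible test data: same seed, same data on every run.
def seededGaussians(seed: Long, n: Int): Seq[Double] = {
  val rng = new scala.util.Random(seed)
  Seq.fill(n)(rng.nextGaussian())
}

// Mackey-Glass time series via Euler steps, with a constant history of
// 1.2 before t = 0. Parameter defaults are the commonly used chaotic ones.
def mackeyGlass(n: Int, tau: Int = 17, beta: Double = 0.2,
                gamma: Double = 0.1, power: Double = 10.0): Array[Double] = {
  val x = Array.fill(n + tau)(1.2)
  for (t <- tau until n + tau - 1) {
    val delayed = x(t - tau)
    x(t + 1) = x(t) + beta * delayed / (1.0 + math.pow(delayed, power)) - gamma * x(t)
  }
  x.drop(tau) // discard the synthetic history
}
```

Either generator keeps the shipped footprint at zero, which addresses the "no large test datasets in the distro" concern.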
> > > Thanks,
> > >
> > > Andy
> > >
> > > ________________________________________
> > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > >
> > > Hi All,
> > > Created a JIRA ticket and have moved the discussion for the runtime
> > > performance framework there:
> > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > @AndrewP & Trevor, I would like to integrate zeppelin into the runtime
> > > performance measurement framework to output some measurement-related
> > > data for some of the algorithms. Should I wait till the zeppelin
> > > integration is completely working before I incorporate this piece?
> > > Also, I would really appreciate some feedback, either on the JIRA
> > > ticket or in response to this thread.
> > > Regards
> > >
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > >
> > > > This proposal will outline a runtime performance module used to
> > > > measure the performance of various algorithms in mahout in three
> > > > major areas: clustering, regression and classification. The module
> > > > will be a spray/scala/akka application which can be run against any
> > > > current or new algorithm in mahout and will produce a csv file and a
> > > > set of zeppelin plots outlining the various criteria for performance.
> > > > The goal for any new mahout release will be to run a set of tests for
> > > > each of the algorithms to compare and contrast benchmarks from one
> > > > release to another.
> > > > Architecture
> > > > The run time performance application will run on top of spray/scala
> > > > and akka and will make async api calls into the various mahout
> > > > algorithms to generate a csv file containing the run time performance
> > > > measurements for each algorithm of interest, as well as a set of
> > > > zeppelin plots displaying some of these results. The spray/scala
> > > > architecture will leverage the zeppelin server to create the
> > > > visualizations. The discussion below centers around two types of
> > > > algorithms to be addressed by the application.
> > > >
> > > > Clustering
> > > > The application will consist of a set of rest APIs to do the
> > > > following:
> > > >
> > > > a) A method to load and execute the run time perf module, which takes
> > > > as inputs the name of the algorithm (kmeans, fuzzy kmeans), the
> > > > location of a set of files containing various sizes of data sets, and
> > > > finally a set of values for the number of clusters to use for each of
> > > > the different sizes of the datasets:
> > > >
> > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > >
> > > > The above API call will return a runId which the client program can
> > > > then use to monitor the module.
> > > >
> > > > b) A method to monitor the application to ensure that it's making
> > > > progress towards generating the zeppelin plots:
> > > >
> > > > /monitor/runId=456
> > > >
> > > > The above method will execute asynchronously by calling into the
> > > > mahout kmeans (fuzzy kmeans) clustering implementations and will
> > > > generate zeppelin plots showing the normalized time on the y axis and
> > > > the number of clusters on the x axis. The spray/scala/akka framework
> > > > will allow the client application to receive a callback when the run
> > > > time performance calculations are actually completed.
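[Editor's note: the runId/monitor flow described above could be sketched with plain Scala Futures, without committing to a particular spray route syntax. `RunRegistry` and its statuses are hypothetical names; a spray (or any other HTTP) endpoint would simply wrap these two calls.]

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

object RunRegistry {
  private val ids    = new AtomicLong(0L)
  private val status = new ConcurrentHashMap[Long, String]()

  // The /algorithm=... endpoint would call this and return the runId
  // immediately while the benchmark runs asynchronously.
  def start(benchmark: () => Unit): Long = {
    val runId = ids.incrementAndGet()
    status.put(runId, "running")
    Future(benchmark()).onComplete { result =>
      // Callback fires when the run time calculations complete.
      status.put(runId, if (result.isSuccess) "complete" else "failed")
    }
    runId
  }

  // The /monitor/runId=... endpoint would call this.
  def monitor(runId: Long): String = status.getOrDefault(runId, "unknown")
}
```

The `onComplete` hook is where the csv/tsv dump and Zeppelin plot generation would be triggered, giving the client the "notification when the csv dump is finished" behavior mentioned earlier in the thread.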
> > > > For now the calculations for measuring run time performance will
> > > > contain: a) the ratio of the number of points clustered correctly to
> > > > the total number of points, and b) the total time taken for the
> > > > algorithm to run. These items will be represented in separate
> > > > zeppelin plots.
> > > >
> > > > Regression
> > > > a) The runtime performance module will run the likelihood ratio test
> > > > with a different set of features in every run. We will introduce a
> > > > rest API to run the likelihood ratio test and return the results;
> > > > this will once again be an async call through the spray/akka stack.
> > > >
> > > > b) The run time performance module will record the following metrics
> > > > for every algorithm: 1) cpu usage, 2) memory usage, 3) time taken for
> > > > the algorithm to converge and run to completion. These metrics will
> > > > be reported on top of the zeppelin graphs for both the regression and
> > > > the different clustering algorithms mentioned above.
> > > >
> > > > How does the application get run
> > > > The run time performance measuring application will get invoked from
> > > > the command line; eventually it would be worthwhile to hook this into
> > > > some sort of integration test suite to certify the different mahout
> > > > releases.
> > > >
> > > > I will add more thoughts around this and create a JIRA ticket only
> > > > once there's enough consensus among the committers that this is
> > > > headed in the right direction. I will also add some more thoughts on
> > > > measuring run time performance of some of the other algorithms after
> > > > some more research. Would love feedback or additional things to
> > > > consider that I might have missed. If it's more appropriate I can
> > > > move the discussion to a jira ticket as well, so please let me know.
> > > > Thanks in advance.
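[Editor's note: the per-run metrics listed in the proposal could be captured roughly as below. Wall-clock time and heap delta are shown; cpu usage is omitted since it requires a platform-specific management bean. `timedRun` is a hypothetical name, and the heap delta is only an approximation since GC can run mid-measurement.]

```scala
// Wrap one algorithm run, returning a single csv row of
// algorithm,seconds,heapDeltaBytes for the benchmark dump.
def timedRun(algorithm: String)(body: => Unit): String = {
  val rt        = Runtime.getRuntime
  val memBefore = rt.totalMemory() - rt.freeMemory()
  val t0        = System.nanoTime()
  body
  val seconds   = (System.nanoTime() - t0) / 1e9
  val memAfter  = rt.totalMemory() - rt.freeMemory()
  Seq(algorithm, seconds, memAfter - memBefore).mkString(",")
}
```

Accumulating one such row per algorithm per release would give exactly the release-to-release comparison csv the proposal aims for.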