I can only chime in on the visualization part.

If you output to a csv, it can be easily consumed and visualized via
Zeppelin.

Specifically, there should be an exposed function that generates a csv (or,
even better, a tsv) string, which can then be handed to a 'write to disk'
method.

The tsv string can then be visualized in Zeppelin via the %table interface
(which is Angular based, but sufficient for many benchmarking applications),
or passed to R/Python (ggplot2, matplotlib, etc.)

The moral of the story: the only thing needed to integrate with Zeppelin
is a *.tsv file as a string.
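
For instance, something this small would do it (just a sketch; the result
tuple and helper names are made up):

  // turn benchmark results, e.g. (algorithm, datasetSize, seconds) rows, into a tsv string
  def toTsv(rows: Seq[(String, Long, Double)]): String =
    ("algorithm\tdatasetSize\tseconds" +:
      rows.map { case (alg, n, secs) => s"$alg\t$n\t$secs" }).mkString("\n")

  // the 'write to disk' method
  def writeToDisk(tsv: String, path: String): Unit = {
    val out = new java.io.PrintWriter(path)
    try out.write(tsv) finally out.close()
  }

  // in a Zeppelin note, prepending the %table directive to the same string
  // renders it as a table:
  // println("%table\n" + toTsv(results))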

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, Jun 6, 2016 at 10:58 AM, Saikat Kanjilal <sxk1...@hotmail.com>
wrote:

> Andrew, thanks for the input.  I will shift gears a bit and just get some
> lightweight code going that calls into mahout algorithms and does a csv
> dump.  Note that I think akka could be a good fit for this, as you could
> make an async call and get back a notification when the csv dump is
> finished.  Also, I am indeed not focusing on mapreduce algorithms and will
> be tackling the algorithms in the math-scala library.  What do you think of
> making this a lightweight web-based workbench using spray that committers
> can run outside of mahout through curl or something?  This was my initial
> vision in using spray, and it's good that I'm getting early feedback.
>
> On zeppelin, do you think it's worthwhile for me to incorporate Trevor's
> efforts to take that csv and turn it into one or two visualizations?  I'm
> trying to understand how that effort may (or may not) intersect with what
> I'm trying to accomplish.
> Also point taken on the small data sets.
> Thanks
>
> > From: ap....@outlook.com
> > To: dev@mahout.apache.org
> > Subject: Re: [Discuss--A proposal for building an application in mahout
> to measure runtime performance of algorithms in mahout]
> > Date: Mon, 6 Jun 2016 15:50:16 +0000
> >
> > Saikat,
> >
> > If you're going to pursue this, there are a few things that I would
> suggest.  First, keep it lightweight.  We don't want to bring a lot of
> extra dependencies or data into the distribution.  I'm not sure what this
> means as far as spray/akka go, but those seem like overkill in my opinion.
> I think this can be kept down to a simple csv dump.
> >
> > Second, use data that can be either randomly generated with a seeded
> RNG, generated by a function like Mackey-Glass, or downloaded (probably
> best), and only use a very small sample in the tests, since they're pretty
> long currently.  The main point is that we don't want to ship any large
> test datasets with the distro.
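> >
> > Just to make that concrete, a sketch of what I mean (sizes, seed and the
> > Mackey-Glass constants here are arbitrary):
> >
> >   import scala.util.Random
> >
> >   // seeded RNG: every run of the benchmark sees identical "random" data
> >   def randomRows(n: Int, dim: Int, seed: Long = 42L): Seq[Array[Double]] = {
> >     val rng = new Random(seed)
> >     Seq.fill(n)(Array.fill(dim)(rng.nextDouble()))
> >   }
> >
> >   // Mackey-Glass series via a simple Euler step (tau measured in samples)
> >   def mackeyGlass(n: Int, tau: Int = 17): Array[Double] = {
> >     val x = Array.fill(n + tau)(1.2)
> >     for (t <- tau until n + tau - 1) {
> >       val xTau = x(t - tau)
> >       x(t + 1) = x(t) + 0.2 * xTau / (1.0 + math.pow(xTau, 10)) - 0.1 * x(t)
> >     }
> >     x.drop(tau)
> >   }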
> >
> > Third, we're not using MapReduce anymore, so focus on algorithms in the
> math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix
> algebra operations.  That is where I see this being useful, so that we may
> compare changes and optimizations going forward.
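> >
> > A plain timing wrapper is really all the harness needs for that (again a
> > sketch; the dssvd call in the comment is just an example of where an
> > algorithm invocation would go):
> >
> >   // time a single run of any block, returning (result, wall-clock seconds)
> >   def timed[A](block: => A): (A, Double) = {
> >     val t0 = System.nanoTime()
> >     val result = block
> >     (result, (System.nanoTime() - t0) / 1e9)
> >   }
> >
> >   // e.g. val (_, secs) = timed { dssvd(drmA, k = 10) }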
> >
> > Thanks,
> >
> > Andy
> >
> > ________________________________________
> > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > Sent: Friday, June 3, 2016 12:35:54 AM
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout
> to measure runtime performance of algorithms in mahout]
> >
> > Hi All, I created a JIRA ticket and have moved the discussion for the
> runtime performance framework there:
> > https://issues.apache.org/jira/browse/MAHOUT-1869
> > @AndrewP & Trevor, I would like to integrate zeppelin into the runtime
> performance measurement framework to output some measurement-related data
> for some of the algorithms.
> > Should I wait till the zeppelin integration is completely working before
> I incorporate this piece?
> > Also, I would really appreciate some feedback, either on the JIRA ticket
> or in response to this thread.
> > Regards
> >
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: [Discuss--A proposal for building an application in mahout to
> measure runtime performance of algorithms in mahout]
> > > Date: Thu, 19 May 2016 21:31:05 -0700
> > >
> > >
> > > This proposal outlines a runtime performance module used to measure
> the performance of various algorithms in mahout in three major areas:
> clustering, regression and classification.  The module will be a
> spray/scala/akka application that can be run against any current or new
> algorithm in mahout and will produce a csv file and a set of zeppelin
> plots outlining the various performance criteria.  The goal is that, for
> any new mahout build, a set of tests can be run for each of the algorithms
> to compare and contrast benchmarks from one release to another.
> > >
> > >
> > > Architecture
> > > The run time performance application will run on top of spray/scala
> and akka and will make async API calls into the various mahout algorithms
> to generate a csv file containing the run time performance measurements
> for each algorithm of interest, as well as a set of zeppelin plots for
> displaying some of these results.  The spray/scala architecture will
> leverage the zeppelin server to create the visualizations.  The discussion
> below centers on the two types of algorithms to be addressed by the
> application.
> > >
> > >
> > > Clustering
> > > The application will consist of a set of rest APIs to do the following:
> > >
> > >
> > > a) A method to load and execute the run time perf module.  It takes as
> inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a
> set of files containing data sets of various sizes, and finally a set of
> values for the number of clusters to use for each of the data set sizes,
> e.g.
> > >
> > >
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > >
> > >
> > > The above API call will return a runId, which the client program can
> then use to monitor the run.
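> > >
> > > A rough sketch of what that spray route could look like (the path,
> > > parameter names and runId handling are placeholders, not a final
> > > design; I've put fileLocation and clusters into query parameters here
> > > since that is easier to express with spray's routing directives):
> > >
> > >   import akka.actor.ActorSystem
> > >   import spray.routing.SimpleRoutingApp
> > >
> > >   object PerfApp extends App with SimpleRoutingApp {
> > >     implicit val system = ActorSystem("mahout-perf")
> > >
> > >     startServer(interface = "localhost", port = 8080) {
> > >       path("algorithm" / Segment) { algorithm =>
> > >         parameters('fileLocation, 'clusters) { (fileLocation, clusters) =>
> > >           complete {
> > >             val runId = java.util.UUID.randomUUID().toString
> > >             // kick off the async run here, keyed by runId
> > >             runId
> > >           }
> > >         }
> > >       }
> > >     }
> > >   }
> > >
> > > which a committer could then hit with something like
> > > curl "http://localhost:8080/algorithm/clustering?fileLocation=/path/to/files/of/different/datasets&clusters=12,20,30,40"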
> > >
> > >
> > >
> > >
> > > b) A method to monitor the application to ensure that it's making
> progress towards generating the zeppelin plots:
> > > /monitor/runId=456
> > >
> > >
> > >
> > >
> > > The above method will execute asynchronously by calling into the
> mahout kmeans (fuzzy kmeans) clustering implementations and will generate
> zeppelin plots showing the normalized time on the y axis and the number of
> clusters on the x axis.  The spray/scala/akka framework will allow the
> client application to receive a callback when the run time performance
> calculations are actually completed.  For now the calculations for
> measuring run time performance will include: a) the ratio of the number of
> points clustered correctly to the total number of points, and b) the total
> time taken for the algorithm to run.  These items will be represented in
> separate zeppelin plots.
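> > >
> > > The async piece could be as simple as a Future plus a callback (a
> > > sketch; the kmeans invocation is passed in as a stand-in for the real
> > > mahout call, and the numbers below are dummies):
> > >
> > >   import scala.concurrent.Future
> > >   import scala.concurrent.ExecutionContext.Implicits.global
> > >   import scala.util.{Failure, Success}
> > >
> > >   case class ClusteringRun(correct: Long, total: Long, seconds: Double)
> > >
> > >   // run the clustering off the calling thread and time it; the caller
> > >   // is notified via onComplete once the measurements are done
> > >   def runAsync(k: Int)(kmeans: Int => (Long, Long)): Future[ClusteringRun] = Future {
> > >     val t0 = System.nanoTime()
> > >     val (correct, total) = kmeans(k)
> > >     ClusteringRun(correct, total, (System.nanoTime() - t0) / 1e9)
> > >   }
> > >
> > >   runAsync(k = 12)(k => (950L, 1000L) /* stand-in for the mahout call */).onComplete {
> > >     case Success(r) => println(s"accuracy=${r.correct.toDouble / r.total} seconds=${r.seconds}")
> > >     case Failure(e) => println(s"run failed: ${e.getMessage}")
> > >   }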
> > >
> > >
> > >
> > >
> > > Regression
> > a) The runtime performance module will run the likelihood ratio test
> with a different set of features in every run.  We will introduce a rest
> API to run the likelihood ratio test and return the results; this will once
> again be an async call through the spray/akka stack.
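> > >
> > > For reference, once the full and reduced model fits are in hand the
> > > statistic itself is a one-liner (sketch):
> > >
> > >   // likelihood ratio statistic: 2 * (logLik(full) - logLik(reduced)),
> > >   // compared against a chi-square with (number of extra features)
> > >   // degrees of freedom
> > >   def likelihoodRatio(logLikFull: Double, logLikReduced: Double): Double =
> > >     2.0 * (logLikFull - logLikReduced)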
> > >
> > >
> > >
> > >
> > >
> > >
> > b) The run time performance module will collect the following metrics
> for every algorithm: 1) cpu usage, 2) memory usage, and 3) time taken for
> the algorithm to converge and run to completion.  These metrics will be
> reported on top of the zeppelin graphs for both the regression and the
> different clustering algorithms mentioned above.
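> > >
> > > Heap and cpu time can be read straight off the JVM's management beans
> > > (sketch; process-wide cpu load would need the com.sun variant of
> > > OperatingSystemMXBean, which is hotspot-specific):
> > >
> > >   import java.lang.management.ManagementFactory
> > >
> > >   def heapUsedBytes: Long =
> > >     ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
> > >
> > >   def cpuTimeNanos: Long =
> > >     ManagementFactory.getThreadMXBean.getCurrentThreadCpuTime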
> > >
> > > How does the application get run
> > > The run time performance measuring application will get invoked from
> the command line; eventually it would be worthwhile to hook this into some
> sort of integration test suite to certify the different mahout releases.
> > >
> > >
> > > I will add more thoughts around this and create a JIRA ticket only
> once there's enough consensus among the committers that this is headed in
> the right direction.  I will also add some more thoughts on measuring run
> time performance of some of the other algorithms after some more research.
> > > Would love feedback or additional things to consider that I might have
> missed.  If it's more appropriate I can move the discussion to a jira
> ticket as well, so please let me know.  Thanks in advance.
>
