Excellent, thanks Saikat; I'll be able to take a look over the weekend.

On Wed, Jul 6, 2016 at 9:37 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> Ok folks, I've created a pull request here for a barebones runtime
> performance measurement framework that:
>
> 1) measures two simple timings from ssvd and spca
> 2) dumps these timings into a csv file
>
> https://github.com/apache/mahout/pull/245
> Mahout 1869 by skanjila · Pull Request #245 · apache/mahout
>
> I'd greatly appreciate some early feedback on the design before I move
> forward and make too many more changes without getting anything included.
> I will move ahead with the zeppelin integration in a few days and
> reorganize the code a bit to collect all the perf-related pieces into one
> class or trait.
>
> Thanks in advance for your help.
>
> ________________________________
> From: Saikat Kanjilal <sxk1...@hotmail.com>
> Sent: Tuesday, June 21, 2016 9:21 PM
> To: dev@mahout.apache.org
> Subject: RE: [Discuss--A proposal for building an application in mahout to
> measure runtime performance of algorithms in mahout]
>
> Ok, so for now I am able to get around the issues below by working on code
> to measure performance times that does not require the notion of a
> DistributedContext to get this up and running. I have two methods that I
> am measuring performance times for: ssvd and spca.
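[Editor's note: for concreteness, here is a minimal sketch of the kind of timing-plus-csv harness described above. This is not the code in PR #245; the object name, csv columns, and the idea of treating each algorithm run as an opaque block are assumptions for illustration.]

```scala
import java.io.PrintWriter

// Minimal timing harness sketch: run a block of work, record elapsed
// wall-clock milliseconds, and append one labeled row per run to a csv.
object PerfTimer {
  def time[T](block: => T): (T, Long) = {
    val start = System.currentTimeMillis()
    val result = block
    (result, System.currentTimeMillis() - start)
  }

  def dumpCsv(path: String, rows: Seq[(String, Long)]): Unit = {
    val out = new PrintWriter(path)
    try {
      out.println("algorithm,elapsedMillis")
      rows.foreach { case (label, ms) => out.println(s"$label,$ms") }
    } finally out.close()
  }
}
```

A call site would wrap the ssvd and spca invocations in `PerfTimer.time { ... }` and pass the two labeled results to `dumpCsv`.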
> Github repo is here:
> https://github.com/skanjila/mahout/tree/mahout-1869
>
> Please provide feedback, as I will now restructure/reorganize the code to
> add more methods and start work on a perf harness that spits out a report
> in csv and then eventually ties this to zeppelin. I've kept the JIRA up to
> date as well. Thanks in advance.
>
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout
> > to measure runtime performance of algorithms in mahout]
> > Date: Mon, 20 Jun 2016 20:37:31 -0700
> >
> > AndrewP et al, any chance I can get some pointers on the items below? I
> > would love some direction on this. Thanks
> >
> > > From: sxk1...@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in
> > > mahout to measure runtime performance of algorithms in mahout]
> > > Date: Sun, 12 Jun 2016 12:40:26 -0700
> > >
> > > Hi Folks, I need some input/help here to get me unblocked and moving:
> > > 1) I need to reuse/extend the DistributedContext inside the runtime
> > > perf measurement module, as all algorithms inside math-scala need it.
> > > I was trying to mimic some of the H2O code and saw that they had
> > > their own engine. I am wondering what the best way is to extend
> > > DistributedContext and get the benefit of an already existing engine
> > > without needing to tie into h2o or flink, or is the only way to add
> > > an engine to point to one of those back ends? Ideally I want to build
> > > the runtime perf module in a backend-agnostic way, and currently I
> > > don't see a way around this. Thoughts?
> > > 2) I also tried to reuse some of the logic inside math-scala, but in
> > > digging into this code it seems that it is strongly tied to
> > > scala test utilities.
> > >
> > > Net-net: I just need access to the DistributedContext without linking
> > > in any test utilities or backends. Would love some advice on ways to
> > > move forward that maximize reuse. Thanks in advance.
> > >
> > > > From: sxk1...@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: RE: [Discuss--A proposal for building an application in
> > > > mahout to measure runtime performance of algorithms in mahout]
> > > > Date: Thu, 9 Jun 2016 21:45:13 -0700
> > > >
> > > > Andrew et al, over the past few days I've finally gotten a
> > > > self-contained module compiling that leverages the
> > > > DistributedContext. For starters I copied the NaiveBayes test code,
> > > > ripped out the test infrastructure code around it, and then added
> > > > some timers. Next steps will be to dump to csv and eventually to
> > > > zeppelin. Some questions before I get too far ahead:
> > > > 1) I made the design decision to create my own trait and
> > > > encapsulate the context within it. I am wondering if I should
> > > > instead leverage the context that is already defined in math-scala;
> > > > this, however, brings its own complications in that it pulls in the
> > > > MahoutSuite, which I'm not sure I really need. Thoughts on this?
> > > > 2) I need some infrastructure to run the perf framework. I can use
> > > > an azure ubuntu vm for now, but is there an AWS instance or some
> > > > other vm I can eventually use? I would really like to avoid using
> > > > my mac laptop as a runtime perf testing environment.
> > > >
> > > > Thanks, I'll update JIRA as I make more headway.
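[Editor's note: the trait-based design from point 1 above might look something like this sketch. `TimedBenchmark` and its members are hypothetical names; the context type is left abstract precisely because the backend question in the thread is unresolved.]

```scala
// Hypothetical sketch of the trait design: the benchmark trait owns a
// context handle (abstract, so any backend can be plugged in later) and
// accumulates labeled timings for the eventual csv report.
trait TimedBenchmark {
  // Stand-in for however the DistributedContext ends up being supplied;
  // kept abstract here to stay backend-agnostic.
  type Context
  def context: Context

  private var timings = Vector.empty[(String, Long)]

  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    timings = timings :+ (label -> ((System.nanoTime() - start) / 1000000L))
    result
  }

  def report: Seq[(String, Long)] = timings
}
```

A concrete benchmark would mix this in, bind `Context` to the real context type, and call `timed("dssvd") { ... }` around each algorithm invocation.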
> > > > > From: sxk1...@hotmail.com
> > > > > To: dev@mahout.apache.org
> > > > > Subject: RE: [Discuss--A proposal for building an application in
> > > > > mahout to measure runtime performance of algorithms in mahout]
> > > > > Date: Mon, 6 Jun 2016 08:58:49 -0700
> > > > >
> > > > > Andrew, thanks for the input. I will shift gears a bit and just
> > > > > get some lightweight code going that calls into mahout algorithms
> > > > > and does a csv dump. Note that I think akka could be a good fit
> > > > > for this, as you could make an async call and get back a
> > > > > notification when the csv dump is finished. Also, I am indeed not
> > > > > focusing on mapreduce algorithms and will be tackling the
> > > > > algorithms in the math-scala library. What do you think of making
> > > > > this a lightweight web-based workbench using spray that
> > > > > committers can run outside of mahout through curl or something?
> > > > > This was my initial vision in using spray, and it's good that I'm
> > > > > getting early feedback.
> > > > >
> > > > > On zeppelin, do you think it's worthwhile that I incorporate
> > > > > Trevor's efforts to take that csv and turn it into one or two
> > > > > visualizations? I'm trying to understand how that effort may (or
> > > > > may not) intersect with what I'm trying to accomplish.
> > > > > Also, point taken on the small data sets.
> > > > > Thanks
> > > > >
> > > > > > From: ap....@outlook.com
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: Re: [Discuss--A proposal for building an application
> > > > > > in mahout to measure runtime performance of algorithms in
> > > > > > mahout]
> > > > > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > > > > >
> > > > > > Saikat,
> > > > > >
> > > > > > If you're going to pursue this, there are a few things that I
> > > > > > would suggest. First, keep it lightweight. We don't want to
> > > > > > bring a lot of extra dependencies or data into the
> > > > > > distribution. I'm not sure what this means as far as
> > > > > > spray/akka, but those seem like overkill in my opinion. This
> > > > > > should be able to be kept down to a simple csv dump, I think.
> > > > > > Second, use data that can be either randomly generated with a
> > > > > > seeded RNG, generated by a function like Mackey-Glass, or
> > > > > > downloaded (probably best), and only use a very small sample in
> > > > > > the tests, since they're pretty long currently. The main point
> > > > > > being that we don't want to ship any large test datasets with
> > > > > > the distro.
> > > > > >
> > > > > > Third, we're not using MapReduce anymore, so focus on
> > > > > > algorithms in the math-scala library (e.g. dssvd, thinqr, dals,
> > > > > > etc.) as well as matrix algebra operations. That is where I see
> > > > > > this being useful, so that we may compare changes and
> > > > > > optimizations going forward.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Andy
> > > > > >
> > > > > > ________________________________________
> > > > > > From: Saikat Kanjilal <sxk1...@hotmail.com>
> > > > > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: RE: [Discuss--A proposal for building an application
> > > > > > in mahout to measure runtime performance of algorithms in
> > > > > > mahout]
> > > > > >
> > > > > > Hi All, I've created a JIRA ticket and have moved the
> > > > > > discussion for the runtime performance framework there:
> > > > > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > > > > @AndrewP & Trevor, I would like to integrate zeppelin into the
> > > > > > runtime performance measurement framework to output some
> > > > > > measurement-related data for some of the algorithms. Should I
> > > > > > wait till the zeppelin integration is completely working before
> > > > > > I incorporate this piece?
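[Editor's note: the seeded-RNG suggestion above can be sketched as a tiny generator. The dimensions and the plain in-memory `Array` representation are placeholders for whatever DRM construction the perf module ends up using; only the reproducibility idea is from the thread.]

```scala
import scala.util.Random

// Reproducible test data per the seeded-RNG suggestion: the same seed
// always yields the same small matrix, so no large test datasets need
// to ship with the distro.
object TestData {
  def randomMatrix(rows: Int, cols: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(rows, cols)(rng.nextGaussian())
  }
}
```

Because the generator is deterministic, two releases benchmarked with the same seed see identical inputs, which is what makes release-to-release comparisons meaningful.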
> > > > > > Also, I would really appreciate some feedback, either on the
> > > > > > JIRA ticket or in response to this thread. Regards
> > > > > >
> > > > > > > From: sxk1...@hotmail.com
> > > > > > > To: dev@mahout.apache.org
> > > > > > > Subject: [Discuss--A proposal for building an application in
> > > > > > > mahout to measure runtime performance of algorithms in
> > > > > > > mahout]
> > > > > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > > > > >
> > > > > > > This proposal outlines a runtime performance module used to
> > > > > > > measure the performance of various algorithms in mahout in
> > > > > > > three major areas: clustering, regression and classification.
> > > > > > > The module will be a spray/scala/akka application which can
> > > > > > > be run against any current or new algorithm in mahout and
> > > > > > > will produce a csv file and a set of zeppelin plots outlining
> > > > > > > the various criteria for performance. The goal for any new
> > > > > > > mahout release will be to run a set of tests for each of the
> > > > > > > algorithms to compare and contrast benchmarks from one
> > > > > > > release to another.
> > > > > > >
> > > > > > > Architecture
> > > > > > > The run time performance application will run on top of
> > > > > > > spray/scala and akka and will make async api calls into the
> > > > > > > various mahout algorithms to generate a csv file containing
> > > > > > > the run time performance measurements for each algorithm of
> > > > > > > interest, as well as a set of zeppelin plots for displaying
> > > > > > > some of these results. The spray/scala architecture will
> > > > > > > leverage the zeppelin server to create the visualizations.
> > > > > > > The discussion below centers around two types of algorithms
> > > > > > > to be addressed by the application.
> > > > > > > Clustering
> > > > > > > The application will consist of a set of rest APIs to do the
> > > > > > > following:
> > > > > > >
> > > > > > > a) A method to load and execute the run time perf module; it
> > > > > > > takes as inputs the name of the algorithm (kmeans, fuzzy
> > > > > > > kmeans), the location of a set of files containing various
> > > > > > > sizes of data sets, and finally a set of values for the
> > > > > > > number of clusters to use for each of the different dataset
> > > > > > > sizes:
> > > > > > >
> > > > > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > > > > >
> > > > > > > The above API call will return a runId which the client
> > > > > > > program can then use to monitor the module.
> > > > > > >
> > > > > > > b) A method to monitor the application to ensure that it's
> > > > > > > making progress towards generating the zeppelin plots:
> > > > > > >
> > > > > > > /monitor/runId=456
> > > > > > >
> > > > > > > The above method will execute asynchronously by calling into
> > > > > > > the mahout kmeans (fuzzy kmeans) clustering implementations
> > > > > > > and will generate zeppelin plots showing the normalized time
> > > > > > > on the y axis and the number of clusters on the x axis. The
> > > > > > > spray/scala/akka framework will allow the client application
> > > > > > > to receive a callback when the run time performance
> > > > > > > calculations are actually completed. For now the run time
> > > > > > > performance measurements will contain: a) the ratio of the
> > > > > > > number of points clustered correctly to the total number of
> > > > > > > points, and b) the total time taken for the algorithm to run.
> > > > > > > These items will be represented in separate zeppelin plots.
> > > > > > >
> > > > > > > Regression
> > > > > > > a) The runtime performance module will run the likelihood
> > > > > > > ratio test with a different set of features in every run.
> > > > > > > We will introduce a rest API to run the likelihood ratio
> > > > > > > test and return the results; this will once again be an
> > > > > > > async call through the spray/akka stack.
> > > > > > >
> > > > > > > b) The run time performance module will report the following
> > > > > > > metrics for every algorithm: 1) cpu usage 2) memory usage
> > > > > > > 3) time taken for the algorithm to converge and run to
> > > > > > > completion. These metrics will be reported on top of the
> > > > > > > zeppelin graphs for both the regression and the different
> > > > > > > clustering algorithms mentioned above.
> > > > > > >
> > > > > > > How does the application get run
> > > > > > > The run time performance measuring application will be
> > > > > > > invoked from the command line; eventually it would be
> > > > > > > worthwhile to hook it into some sort of integration test
> > > > > > > suite to certify the different mahout releases.
> > > > > > >
> > > > > > > I will add more thoughts around this and create a JIRA
> > > > > > > ticket only once there's enough consensus among the
> > > > > > > committers that this is headed in the right direction. I
> > > > > > > will also add some more thoughts on measuring run time
> > > > > > > performance of some of the other algorithms after more
> > > > > > > research. Would love feedback or additional things to
> > > > > > > consider that I might have missed. If it's more appropriate
> > > > > > > I can move the discussion to a jira ticket as well, so
> > > > > > > please let me know. Thanks in advance.
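[Editor's note: the three per-run metrics named in the proposal could be collected, at least as JVM-level approximations, with a sketch like the following. These are single-process measurements of the measuring thread, not cluster-wide figures; all names are illustrative.]

```scala
import java.lang.management.ManagementFactory

// Per-run metrics sketch: wall-clock time, JVM heap-usage delta, and
// CPU time of the current thread. CPU time needs a JVM where
// ThreadMXBean thread CPU timing is supported (most HotSpot JVMs).
case class RunMetrics(wallMillis: Long, memoryDeltaBytes: Long, cpuMillis: Long)

object MetricsCollector {
  def measure[T](block: => T): (T, RunMetrics) = {
    val rt = Runtime.getRuntime
    val threads = ManagementFactory.getThreadMXBean
    val memBefore = rt.totalMemory() - rt.freeMemory()
    val cpuBefore = threads.getCurrentThreadCpuTime
    val start = System.currentTimeMillis()
    val result = block
    val metrics = RunMetrics(
      wallMillis = System.currentTimeMillis() - start,
      memoryDeltaBytes = (rt.totalMemory() - rt.freeMemory()) - memBefore,
      cpuMillis = (threads.getCurrentThreadCpuTime - cpuBefore) / 1000000L)
    (result, metrics)
  }
}
```

Each `RunMetrics` row could then feed the same csv dump described earlier, one row per algorithm per run.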