Andrew, ping on this; let me know your thoughts on my pull request. Thanks
Sent from my iPad

> On Jul 7, 2016, at 8:15 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
> Excellent, thanks Saikat; I'll be able to take a look over the weekend.
>
>> On Wed, Jul 6, 2016 at 9:37 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>
>> Ok folks, I've created a pull request here for a barebones runtime
>> performance measurement framework that:
>>
>> 1) measures two simple timings from ssvd and spca
>> 2) dumps these timings into a csv file
>>
>> https://github.com/apache/mahout/pull/245
>> Mahout 1869 by skanjila · Pull Request #245 · apache/mahout
>> "Added the ability to dump output to csv file"
>>
>> I'd greatly appreciate some early feedback on the design before I move
>> forward, make too many more changes, and end up not getting anything
>> included. I will move ahead with the Zeppelin integration in a few days
>> and reorganize the code a bit to pull all the perf-related pieces into
>> one class or trait.
>>
>> Thanks in advance for your help.
>>
>> ________________________________
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Tuesday, June 21, 2016 9:21 PM
>> To: dev@mahout.apache.org
>> Subject: RE: [Discuss--A proposal for building an application in mahout to
>> measure runtime performance of algorithms in mahout]
>>
>> Ok, so for now I am able to get around the issues below by working on
>> code to measure performance times that does not require the notion of a
>> DistributedContext, in order to get this up and running. I have two
>> methods that I am measuring performance times for: ssvd and spca.
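[Editor's note: a minimal sketch of the "time two methods, dump to csv" idea in the PR. All names here are hypothetical, and the two workloads are cheap stand-ins for the real ssvd/spca invocations, which are not reproduced here.]

```scala
import java.io.{FileWriter, PrintWriter}

// Hypothetical sketch: time named operations and append rows to a csv file.
object PerfCsvSketch {
  // Run `body`, returning its result and the elapsed wall-clock millis.
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Append one "algorithm,elapsedMs" row per measurement.
  def dumpTimings(path: String, timings: Seq[(String, Long)]): Unit = {
    val out = new PrintWriter(new FileWriter(path, true))
    try timings.foreach { case (name, ms) => out.println(s"$name,$ms") }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workloads; the real harness would call into ssvd/spca here.
    val (_, ssvdMs) = timed { (1 to 100000).map(math.sqrt(_)).sum }
    val (_, spcaMs) = timed { (1 to 100000).map(i => i * i.toLong).sum }
    dumpTimings("timings.csv", Seq("ssvd" -> ssvdMs, "spca" -> spcaMs))
  }
}
```

Appending (rather than overwriting) lets successive runs accumulate rows that a Zeppelin notebook could later plot across releases.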
>> Github repo is here:
>> https://github.com/skanjila/mahout/tree/mahout-1869
>>
>> Please provide feedback, as I will now restructure/reorganize the code to
>> add more methods and start work on a perf harness that spits out a report
>> in csv, and then eventually tie this to Zeppelin.
>> I've kept the JIRA up to date as well.
>> Thanks in advance.
>>
>>> From: sxk1...@hotmail.com
>>> To: dev@mahout.apache.org
>>> Subject: RE: [Discuss--A proposal for building an application in mahout
>>> to measure runtime performance of algorithms in mahout]
>>> Date: Mon, 20 Jun 2016 20:37:31 -0700
>>>
>>> AndrewP et al, any chance I can get some pointers on the items below?
>>> Would love some direction on this. Thanks
>>>
>>>> From: sxk1...@hotmail.com
>>>> To: dev@mahout.apache.org
>>>> Subject: RE: [Discuss--A proposal for building an application in mahout
>>>> to measure runtime performance of algorithms in mahout]
>>>> Date: Sun, 12 Jun 2016 12:40:26 -0700
>>>>
>>>> Hi folks, I need some input/help here to get me unblocked and moving:
>>>>
>>>> 1) I need to reuse/extend the DistributedContext inside the runtime
>>>> perf measurement module, as all algorithms inside math-scala need it. I
>>>> was trying to mimic some of the H2O code and saw that they have their
>>>> own engine. I am wondering what the best way is to extend
>>>> DistributedContext and get the benefit of an already existing engine
>>>> without needing to tie into h2o or flink, or is the only way to add an
>>>> engine to point to one of those back ends? Ideally I want to build the
>>>> runtime perf module in a backend-agnostic way, and currently I don't
>>>> see a way around this. Thoughts?
>>>>
>>>> 2) I also tried to reuse some of the logic inside math-scala, but in
>>>> digging into this code it seems that it is strongly tied to scala test
>>>> utilities.
>>>>
>>>> Net-net: I just need access to the DistributedContext without linking
>>>> in any test utilities or backends.
>>>> Would love some advice on ways to move forward to maximize reuse.
>>>> Thanks in advance.
>>>>
>>>>> From: sxk1...@hotmail.com
>>>>> To: dev@mahout.apache.org
>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>> Date: Thu, 9 Jun 2016 21:45:13 -0700
>>>>>
>>>>> Andrew et al, over the past few days I've finally gotten a
>>>>> self-contained module compiling that leverages the DistributedContext.
>>>>> For starters I copied the NaiveBayes test code, ripped out the test
>>>>> infrastructure code around it, and then added some timers; next steps
>>>>> will be to dump to csv and eventually to Zeppelin. Some questions
>>>>> before I get too far ahead:
>>>>>
>>>>> 1) I made the design decision to create my own trait and encapsulate
>>>>> the context within it. I am wondering if I should instead leverage the
>>>>> context that is already defined in math-scala; this, however, brings
>>>>> its own complications in that it pulls in MahoutSuite, which I'm not
>>>>> sure I really need. Thoughts on this?
>>>>>
>>>>> 2) I need some infrastructure to run the perf framework. I can use an
>>>>> Azure ubuntu vm for now, but is there an AWS instance or some other vm
>>>>> I can eventually use? I would really like to avoid using my mac laptop
>>>>> as a runtime perf testing environment.
>>>>>
>>>>> Thanks, I'll update the JIRA as I make more headway.
>>>>>
>>>>>> From: sxk1...@hotmail.com
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>> Date: Mon, 6 Jun 2016 08:58:49 -0700
>>>>>>
>>>>>> Andrew, thanks for the input. I will shift gears a bit and just get
>>>>>> some lightweight code going that calls into mahout algorithms and
>>>>>> does a csv dump. Note that I think akka could be a good fit for this,
>>>>>> as you could make an async call and get back a notification when the
>>>>>> csv dump is finished. Also, I am indeed not focusing on mapreduce
>>>>>> algorithms and will be tackling the algorithms in the math-scala
>>>>>> library. What do you think of making this a lightweight web-based
>>>>>> workbench using spray that committers can run outside of mahout
>>>>>> through curl or something? This was my initial vision in using spray,
>>>>>> and it's good that I'm getting early feedback.
>>>>>>
>>>>>> On Zeppelin: do you think it's worthwhile for me to incorporate
>>>>>> Trevor's efforts to take that csv and turn it into one or two
>>>>>> visualizations? I'm trying to understand how that effort may (or may
>>>>>> not) intersect with what I'm trying to accomplish.
>>>>>> Also, point taken on the small data sets.
>>>>>> Thanks
>>>>>>
>>>>>>> From: ap....@outlook.com
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: [Discuss--A proposal for building an application in
>>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>>> Date: Mon, 6 Jun 2016 15:50:16 +0000
>>>>>>>
>>>>>>> Saikat,
>>>>>>>
>>>>>>> If you're going to pursue this, there are a few things I would
>>>>>>> suggest. First, keep it lightweight. We don't want to bring a lot
>>>>>>> of extra dependencies or data into the distribution. I'm not sure
>>>>>>> what this means as far as spray/akka, but those seem like overkill
>>>>>>> in my opinion. This should be able to be kept down to a simple csv
>>>>>>> dump, I think.
>>>>>>>
>>>>>>> Second, use data that can be either randomly generated with a
>>>>>>> seeded RNG, or a function like Mackey-Glass, or downloaded (probably
>>>>>>> best), and only use a very small sample in the tests, since they're
>>>>>>> pretty long currently. The main point being that we don't want to
>>>>>>> ship any large test datasets with the distro.
>>>>>>>
>>>>>>> Third, we're not using MapReduce anymore, so focus on algorithms in
>>>>>>> the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as
>>>>>>> matrix algebra operations. That is where I see this being useful, so
>>>>>>> that we may compare changes and optimizations going forward.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>> Sent: Friday, June 3, 2016 12:35:54 AM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: RE: [Discuss--A proposal for building an application in
>>>>>>> mahout to measure runtime performance of algorithms in mahout]
>>>>>>>
>>>>>>> Hi all, I've created a JIRA ticket and moved the discussion for the
>>>>>>> runtime performance framework there:
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1869
>>>>>>> @AndrewP & Trevor: I would like to integrate Zeppelin into the
>>>>>>> runtime performance measurement framework to output some
>>>>>>> measurement-related data for some of the algorithms.
>>>>>>> Should I wait until the Zeppelin integration is completely working
>>>>>>> before I incorporate this piece?
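[Editor's note: Andy's data suggestion above, seeded RNG or a Mackey-Glass series so no large datasets ship with the distro, could be sketched as below. The object and parameter defaults are hypothetical, the Mackey-Glass constants being the commonly used values, and the discretization is a simple Euler step.]

```scala
import scala.util.Random

object TestData {
  // Reproducible random matrix: the same seed always yields the same data,
  // so nothing needs to be checked into the repo.
  def randomMatrix(rows: Int, cols: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(rows, cols)(rng.nextDouble())
  }

  // Mackey-Glass time series x' = beta*x(t-tau)/(1+x(t-tau)^p) - gamma*x(t),
  // discretized with unit-step Euler integration.
  def mackeyGlass(n: Int, tau: Int = 17, beta: Double = 0.2,
                  gamma: Double = 0.1, power: Double = 10.0): Array[Double] = {
    val x = Array.fill(n)(0.0)
    x(0) = 1.2 // conventional initial condition
    for (t <- 0 until n - 1) {
      val lagged = if (t >= tau) x(t - tau) else 0.0
      x(t + 1) = x(t) + beta * lagged / (1.0 + math.pow(lagged, power)) - gamma * x(t)
    }
    x
  }
}
```

Either generator keeps the test inputs tiny and deterministic, which also makes run-to-run timing comparisons meaningful.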
>>>>>>> Also, I would really appreciate some feedback, either on the JIRA
>>>>>>> ticket or in response to this thread. Regards
>>>>>>>
>>>>>>>> From: sxk1...@hotmail.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: [Discuss--A proposal for building an application in mahout
>>>>>>>> to measure runtime performance of algorithms in mahout]
>>>>>>>> Date: Thu, 19 May 2016 21:31:05 -0700
>>>>>>>>
>>>>>>>> This proposal outlines a runtime performance module used to measure
>>>>>>>> the performance of various algorithms in mahout in the three major
>>>>>>>> areas: clustering, regression, and classification. The module will
>>>>>>>> be a spray/scala/akka application which can be run against any
>>>>>>>> current or new algorithm in mahout and will produce a csv file and
>>>>>>>> a set of zeppelin plots outlining the various criteria for
>>>>>>>> performance. The goal for releasing any new build of mahout will be
>>>>>>>> to run a set of tests for each of the algorithms to compare and
>>>>>>>> contrast benchmarks from one release to another.
>>>>>>>>
>>>>>>>> Architecture
>>>>>>>> The run time performance application will run on top of spray/scala
>>>>>>>> and akka and will make async api calls into the various mahout
>>>>>>>> algorithms to generate a csv file containing the run time
>>>>>>>> performance measurements for each algorithm of interest, as well as
>>>>>>>> a set of zeppelin plots for displaying some of these results. The
>>>>>>>> spray/scala architecture will leverage the zeppelin server to
>>>>>>>> create the visualizations. The discussion below centers around two
>>>>>>>> types of algorithms to be addressed by the application.
>>>>>>>>
>>>>>>>> Clustering
>>>>>>>> The application will consist of a set of rest APIs to do the
>>>>>>>> following:
>>>>>>>>
>>>>>>>> a) A method to load and execute the run time perf module, taking as
>>>>>>>> inputs the name of the algorithm (kmeans, fuzzy kmeans), the
>>>>>>>> location of a set of files containing various sizes of data sets,
>>>>>>>> and finally a set of values for the number of clusters to use for
>>>>>>>> each of the different sizes of the datasets:
>>>>>>>>
>>>>>>>> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
>>>>>>>>
>>>>>>>> The above API call will return a runId which the client program can
>>>>>>>> then use to monitor the module.
>>>>>>>>
>>>>>>>> b) A method to monitor the application to ensure that it's making
>>>>>>>> progress towards generating the zeppelin plots:
>>>>>>>>
>>>>>>>> /monitor/runId=456
>>>>>>>>
>>>>>>>> The above method will execute asynchronously by calling into the
>>>>>>>> mahout kmeans (fuzzy kmeans) clustering implementations and will
>>>>>>>> generate zeppelin plots showing the normalized time on the y axis
>>>>>>>> and the number of clusters on the x axis. The spray/scala/akka
>>>>>>>> framework will allow the client application to receive a callback
>>>>>>>> when the run time performance calculations are actually completed.
>>>>>>>> For now the calculations for measuring run time performance will
>>>>>>>> contain: a) the ratio of the number of points clustered correctly
>>>>>>>> to the total number of points, and b) the total time taken for the
>>>>>>>> algorithm to run. These items will be represented in separate
>>>>>>>> zeppelin plots.
>>>>>>>>
>>>>>>>> Regression
>>>>>>>> a) The runtime performance module will run the likelihood ratio
>>>>>>>> test with a different set of features in every run. We will
>>>>>>>> introduce a rest API to run the likelihood ratio test and return
>>>>>>>> the results; this will once again be an async call through the
>>>>>>>> spray/akka stack.
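[Editor's note: the clustering metric (a) above, the ratio of correctly clustered points to total points, needs a convention for matching cluster ids to true labels. A simple majority-vote mapping is one common choice, assumed here for illustration; the object and method names are invented.]

```scala
object ClusterMetrics {
  // Fraction of points whose cluster's majority true label matches their
  // own label; 1.0 means every cluster is pure.
  def correctRatio(trueLabels: Seq[Int], clusterIds: Seq[Int]): Double = {
    require(trueLabels.length == clusterIds.length && trueLabels.nonEmpty)
    // Map each cluster id to the true label it mostly contains.
    val majority: Map[Int, Int] =
      clusterIds.zip(trueLabels).groupBy(_._1).map { case (c, pairs) =>
        c -> pairs.groupBy(_._2).maxBy(_._2.size)._1
      }
    val correct = clusterIds.zip(trueLabels).count {
      case (c, label) => majority(c) == label
    }
    correct.toDouble / trueLabels.length
  }
}
```

Plotting this ratio against the cluster counts from the `/algorithm=clustering/...` call would give the y values for the proposed zeppelin plot.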
>>>>>>>>
>>>>>>>> b) The run time performance module will collect the following
>>>>>>>> metrics for every algorithm: 1) cpu usage, 2) memory usage, 3) time
>>>>>>>> taken for the algorithm to converge and run to completion. These
>>>>>>>> metrics will be reported on top of the zeppelin graphs for both the
>>>>>>>> regression and the different clustering algorithms mentioned above.
>>>>>>>>
>>>>>>>> How does the application get run
>>>>>>>> The run time performance measuring application will be invoked from
>>>>>>>> the command line; eventually it would be worthwhile to hook it into
>>>>>>>> some sort of integration test suite to certify the different mahout
>>>>>>>> releases.
>>>>>>>>
>>>>>>>> I will add more thoughts around this and create a JIRA ticket only
>>>>>>>> once there's enough consensus among the committers that this is
>>>>>>>> headed in the right direction. I will also add some more thoughts
>>>>>>>> on measuring run time performance of some of the other algorithms
>>>>>>>> after some more research.
>>>>>>>> Would love feedback or additional things to consider that I might
>>>>>>>> have missed. If it's more appropriate I can move the discussion to
>>>>>>>> a jira ticket as well, so please let me know. Thanks in advance.
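[Editor's note: the three per-run metrics in (b), cpu usage, memory usage, and elapsed time, are all reachable from the standard JVM management beans. A minimal sketch, with invented names; note this samples only the calling thread's cpu time and the heap at the end of the run, which is a simplification.]

```scala
import java.lang.management.ManagementFactory

object RunMetrics {
  case class Sample(cpuTimeNs: Long, heapUsedBytes: Long, wallMs: Long)

  // Run `body` and sample cpu time, heap in use, and wall-clock time.
  def measure(body: => Unit): Sample = {
    val threads = ManagementFactory.getThreadMXBean
    val start = System.nanoTime()
    val cpuStart = threads.getCurrentThreadCpuTime
    body
    val cpuNs = threads.getCurrentThreadCpuTime - cpuStart
    val wallMs = (System.nanoTime() - start) / 1000000L
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    Sample(cpuNs, heap, wallMs)
  }
}
```

A `Sample` per algorithm run could be appended to the same csv the timing harness already writes, giving the zeppelin graphs one row per metric set.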