Re: GSOC 2011: Benchmarking, Profiling and Documentation

Grant Ingersoll Wed, 09 Mar 2011 04:57:34 -0800

On Mar 8, 2011, at 4:45 AM, Pararth Shah wrote:

> Hi,
> I am Pararth Shah, an undergraduate student in Computer Science and
> Engineering at the Indian Institute of Technology, Bombay. I am planning to
> submit a GSOC proposal for Mahout. Considering Grant Ingersoll's reply in a
> previous thread, I would like to "focus on benchmarking, examples and
> documentation of existing capabilities." Here is a rough list of ideas that
> came to mind while I was familiarizing myself with Mahout through the
> code, documentation and wiki:
> 
> 1) Build a set of benchmarking tools tailored to Mahout, similar to the
> Lucene benchmarking contrib[1], which benchmarks Lucene using "standard,
> freely available corpora".


Awesome.  See also MAHOUT-588


> 
> 2) Build a profiling tool, based on Java Interactive Profiler[2], to find
> "hotspots" in the algorithm execution. This will help in identifying
> modifications to gain speedups. The modified algorithm can be retested using
> the above benchmarking tools to quantify the speedup obtained. I believe a
> custom-built profiler will have advantages in terms of speed, ability to
> filter packages/classes profiled, and possible interactivity with the user,
> over the standard profilers like hprof, JProbe and YourKit. What I
> understand about profilers is mostly from reading [3]. Also, I found useful
> information to start with building a simple profiler consisting of a Java
> agent interface coupled with the ASM library, on this page [4].

I don't really think it makes sense to re-invent the wheel here and I doubt the 
community has much interest in maintaining pure profiling code.  Instead, I 
think it makes sense to leverage existing profilers.


> 
> 3) Use these tools to gather detailed information about the control flow,
> data flow, processing time, and memory usage patterns of execution of every
> algorithm present in Mahout on certain standard datasets, and provide the
> information on the Mahout website/wiki for analysis (white box testing[5]).
> 
> 4) Add functionality to import databases (MySQL) into Vectors, as input for
> clustering algorithms. This will allow more datasets to be directly used
> with the clustering algorithms.
> 
> 5) Update the documentation where required. For example,
> "org.apache.mahout.classifier.bayes" and
> "org.apache.mahout.clustering.canopy" are well documented, but it took me
> some time to figure out "org.apache.mahout.clustering.minhash". The wiki
> proved to be very informative in general, and I am assuming that the pages
> that are incomplete (e.g. Hierarchical Clustering, Independent Component
> Analysis) correspond to algorithms that are still work-in-progress. Writing
> one or two more examples for each algorithm would certainly benefit
> newcomers starting out with Mahout (e.g. me).

This would be great.


> 
> 6) (I don't know if this is feasible. Please comment) Build a tool that
> tracks the progress of an algorithm in real time during its execution,
> depicting (graphically?) what part of the dataset is already analysed, what
> is being currently analysed (e.g. which part of the training set in a
> classifier is being worked on); what is the current state of the learning
> algorithm (e.g. size and number of clusters in clustering algorithms). The
> data collected by this tool can then be further analysed (e.g. movement of
> the decision boundary
> over the course of a classifier algorithm, before attaining its final
> state). I believe this would be a great tool to:
>    (a) gain insights about the data set
>    (b) gain insights about the algorithm
>    (c) introduce machine learning concepts to anyone


You might look into the tools that are out there for Hadoop for analyzing 
processes, etc.

> 
> 
> These are just ideas; I wish to know which (if any) seem interesting enough,
> and what are the possible improvements. Then I can spend the next month,
> before submitting the proposal, working on the specifics, figuring out how I
> may go about doing it. I am hoping I'll get enough pointers along the way,
> to refine and prioritize these tasks to suit the community.
> 
> My motivation is simple: I am looking forward to either pursuing graduate
> study in, or working on solving problems that require a knowledge of, the
> field of machine learning. I have a fair idea of the basic concepts and
> algorithms. Spending a summer closely scrutinising, documenting and testing
> the implementations of the many ML algorithms currently present in Mahout
> will be a great opportunity for me to gain a solid, breadth-first
> understanding of a majority of ML algorithms, plus it should be fun too :)

I think this is a good start.  Generally speaking, many people fail to get 
selected because they bite off too much.  I would encourage you to focus in on 
a few areas that you think you can do a really good job in and propose along 
those lines.

> 
> Any feedback is appreciated.
> 
> Thanks and regards,
> Pararth
> 
> References:
> [1] "Lucene Javadocs"
> http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
> [2] "Java Interactive Profiler" http://sourceforge.net/projects/jiprof/
> [3] "Profiling Tools" http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
> [4] "Build Your Own Profiler"
> http://www.ibm.com/developerworks/java/library/j-jip/
> [5] "White Box Testing" http://en.wikipedia.org/wiki/White-box_testing

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
