Yes, very cool stuff. Makes me wonder if, in addition to mining this data, it could somehow be presented via a visualization like CodeSwarm: http://vis.cs.ucdavis.edu/~ogawa/codeswarm/
There's a great visualization of the activity on the Twitter codebase here: http://engineering.twitter.com/2010/02/hello-world.html

-Drew

On Fri, Feb 26, 2010 at 4:30 AM, Robin Anil <[email protected]> wrote:
> *Moving the discussion to mahout-user
>
> Cool stuff Eirik.
>
> Here is an idea of what I have in mind.
>
> Mahout has the frequent pattern mining algorithm (FPGrowth), which mines
> patterns. FPGrowth needs a list of transactions, which is essentially what
> commits are: every commit is a transaction.
>
> Consider this:
>
> date, time robinanil
> MAHOUT-300 <http://issues.apache.org/jira/browse/MAHOUT-300> First wave of perf improvements
> M /lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
> M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseVector.java
> M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/VectorTest.java
> M /lucene/mahout/trunk/utils/src/main/java/org/apache/mahout/benchmark/VectorBenchmarks.java
>
> First, turn each transaction into a single line in, say, comma-separated
> format as follows. You will need some clever processing to convert the
> above into this:
>
> robinanil, hour of day, day of week, common, math, benchmark,
> TimingStatistics, names .. other classes, java
>
> Then you have a lot of these lines, one transaction for every commit, and
> each of the words is an entity.
>
> You run FPGrowth over this and you will get the top 50 (say) patterns for
> each entity.
>
> If you fetch the top patterns for the user robinanil, you will find
> patterns like:
>
> robinanil => {robinanil, clustering, math, java, 100}
>              {robinanil, classifier, algorithm, 80}
>              ....
> ....
> gsingers => {gsingers, clustering, matrix, parameter, 200}
>
> What this gives you is the frequently occurring patterns for each user,
> sorted (descending) by the number of times each has occurred. This is like
> saying who worked on what, and for what amount of time.
> You can also choose a folder (you have to prevent collisions between
> duplicate folder names by, say, adding the complete path instead of just
> the name).
>
> Some of the things you discussed can be done using the above:
>
> Commit clusters:
> Just find the patterns corresponding to a filename. You will find the
> other files in the top patterns, along with the number of times they were
> edited together.
>
> Developer clusters:
> For each file or folder, create the transactions from the commit history,
> with each line containing just the users who edited the file.
> Run FPGrowth to find the top K patterns.
> For each file or folder you will get the groups of people, sorted by the
> count of their occurrence.
>
> Commit labelling: this is tricky. You need to identify features in a
> commit that tell whether it's a fix, a feature, or a patch. You can get
> those by looking at the code or the commit description.
>
> Once you have created such training data, with multiple instances of
> label => features, all you need to do is run the Bayes classifier.
>
> Robin
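
For what it's worth, here is a minimal Python sketch of the transaction step Robin describes, plus a plain co-occurrence count for the "commit cluster" idea. It is not FPGrowth and does not use Mahout; it only illustrates the shape of the data. The function names, the "hour=" and "day=" item prefixes, and the choice of last directory name plus class name as items are my own assumptions, not anything from the thread.

```python
from collections import Counter
from datetime import datetime
from pathlib import PurePosixPath

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def commit_to_transaction(author, when, paths):
    """Flatten one commit into a de-duplicated list of items (hypothetical format)."""
    items = [author, f"hour={when.hour}", f"day={DAYS[when.weekday()]}"]
    for raw in paths:
        p = PurePosixPath(raw)
        # Assumption: the last directory name stands in for the package.
        # Use str(p.parent) instead to avoid collisions between
        # folders that share a name, as suggested in the thread.
        items.append(p.parent.name)
        items.append(p.stem)                # class name, e.g. DenseVector
        items.append(p.suffix.lstrip("."))  # file type, e.g. java
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(items))


def top_cooccurring(transactions, item, k=3):
    """Count which items appear most often in the same transaction as `item`.

    This is simple pair counting, a stand-in for the per-entity top
    patterns that FPGrowth would produce, not FPGrowth itself.
    """
    counts = Counter()
    for t in transactions:
        if item in t:
            counts.update(x for x in set(t) if x != item)
    return counts.most_common(k)
```

Each list returned by commit_to_transaction can be joined with commas to get one transaction line per commit. Calling top_cooccurring(transactions, "DenseVector") would then list the items most often committed together with that class.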
