Yes, very cool stuff. Makes me wonder if in addition to mining this
data it could somehow be presented via visualization like CodeSwarm:
http://vis.cs.ucdavis.edu/~ogawa/codeswarm/

There's a great visualization of the activity on the Twitter codebase here:
http://engineering.twitter.com/2010/02/hello-world.html

-Drew

On Fri, Feb 26, 2010 at 4:30 AM, Robin Anil <[email protected]> wrote:
> *Moving the discussion to mahout-user
>
> Cool stuff Eirik.
>
>
> Here's an idea of what I have in mind.
>
> Mahout has a frequent pattern mining algorithm (FPGrowth) that mines
> patterns.
> FPGrowth needs a list of transactions, and that is essentially what
> commits are: every commit is a transaction.
>
> Consider this commit:
> date, time robinanil
> MAHOUT-300 <http://issues.apache.org/jira/browse/MAHOUT-300> First wave of
> perf improvements
> M /lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
> M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
> M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseVector.java
> M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/VectorTest.java
> M /lucene/mahout/trunk/utils/src/main/java/org/apache/mahout/benchmark/VectorBenchmarks.java
>
>
> First, turn each transaction into a line in, say, comma-separated format as
> follows. You will need some clever processing to convert the above to this:
>
> robinanil, hour of day, day of week, common, math, benchmark,
> TimingStatistics, names .. other classes, java
>
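A minimal sketch of that conversion, assuming the commit has already been parsed into an author, a timestamp and a list of changed paths. The function name `commit_to_transaction`, the `hourNN` token format and the choice of "immediate parent directory + class name + extension" as items are illustrative assumptions, not something the thread specifies, and the timestamp in the usage lines is a placeholder.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def commit_to_transaction(author, when, paths):
    """Flatten one commit into a comma-separated transaction line.

    Items (an assumed scheme): author, hour of day, day of week, and for
    each changed path its parent directory, file stem and extension.
    Duplicates are dropped; first-seen order is kept.
    """
    items = [author, f"hour{when.hour:02d}", DAYS[when.weekday()]]
    for path in paths:
        parts = [p for p in path.split("/") if p]
        stem, dot, ext = parts[-1].rpartition(".")
        if not dot:  # file name without an extension
            stem, ext = parts[-1], ""
        candidates = parts[-2:-1] + [stem] + ([ext] if ext else [])
        for token in candidates:
            if token not in items:
                items.append(token)
    return ", ".join(items)


# Two of the paths from the commit above; the timestamp is a placeholder.
line = commit_to_transaction(
    "robinanil",
    datetime(2010, 2, 26, 4, 30),
    [
        "/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java",
        "/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java",
    ],
)
print(line)
```

Using the full path instead of just the parent directory name would avoid collisions between folders that share a name.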
> Then you have a lot of these lines, one transaction for every commit, and
> each of the words is an entity.
>
> Run FPGrowth over this and you will get the top 50 (say) patterns for
> each entity.
>
> If you fetch the top patterns for the user robinanil, you will find patterns
> like:
>
> robinanil => {robinanil, clustering, math, java, 100}
>                  {robinanil, classifier, algorithm, 80}
>                  ...
> gsingers => {gsingers, clustering, matrix, parameter, 200}
>
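To make the shape of that output concrete, here is a small sketch that produces "top patterns per entity" from a list of transactions. It is not FPGrowth and does not call Mahout: it counts itemsets by brute-force enumeration, which is only workable for toy inputs. The function name `top_patterns`, its parameters and the sample transactions are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def top_patterns(transactions, k=50, max_size=3, min_support=2):
    """Return {item: [(pattern, count), ...]} with the k most frequent
    itemsets containing each item, sorted by count descending.

    Brute-force stand-in for FPGrowth: every itemset of size 2..max_size
    in every transaction is enumerated and counted.
    """
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(2, max_size + 1):
            for combo in combinations(items, size):
                support[combo] += 1

    by_item = defaultdict(list)
    for pattern, count in support.items():
        if count >= min_support:
            for item in pattern:
                by_item[item].append((pattern, count))

    # Sort by count (descending), then by pattern for a stable order.
    return {
        item: sorted(pats, key=lambda pc: (-pc[1], pc[0]))[:k]
        for item, pats in by_item.items()
    }


# Toy transactions, invented for this example.
transactions = [
    ["robinanil", "math", "java"],
    ["robinanil", "math", "java"],
    ["robinanil", "classifier", "java"],
    ["gsingers", "clustering", "java"],
]
patterns = top_patterns(transactions)
for pattern, count in patterns["robinanil"]:
    print(pattern, count)
```

Looking up a file name instead of a user name in the same result gives the files most often edited together with it.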
> This gives you the frequently occurring patterns for each
> user, sorted (descending) by the number of times each has occurred. It is like
> saying who worked on what and for how much time.
> You can also choose a folder (to prevent collisions between duplicate
> folder names, add the complete path instead of just the name).
>
> Some of the things you discussed can be done using the above:
>
> Commit clusters:
> Just find the patterns corresponding to a filename; you will find the other
> files in the top patterns, along with the number of times they were edited
> together.
>
> Developer clusters:
> For each file or folder, create the transactions from the commit history,
> with each line containing just the users who edited the file.
> Run FPGrowth to find the top K patterns.
> For each file or folder you will get the groups of people, sorted by the
> count of their occurrence.
>
> Commit labelling: this is tricky. You need to identify features in a commit
> that tell whether it's a fix, a feature or a patch. You can get those by
> looking at the code or the commit description.
>
> Once you have created such training data, with multiple instances of the
> form label => features, all you need to do is run the Bayes classifier.
>
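A rough sketch of the label => features idea, using the words of the commit description as features and a tiny multinomial naive Bayes with add-one smoothing written in plain Python. This is not Mahout's Bayes classifier; the `train` and `classify` helpers and the four training examples are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build per-label word counts from (label, description) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: label => commit description.
EXAMPLES = [
    ("fix", "fix null pointer in vector iterator"),
    ("fix", "fix off by one bug in sparse vector"),
    ("feature", "add new clustering driver"),
    ("feature", "add benchmark for dense vector"),
]
model = train(EXAMPLES)
print(classify(model, "fix bug in iterator"))
print(classify(model, "add new driver"))
```

Features taken from the changed code itself could be added to the word list in the same way.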
> Robin
>
