Moving the discussion to mahout-user.

Cool stuff Eirik.


Here is an idea of what I have in mind.

Mahout has a frequent pattern mining algorithm (FPGrowth), which mines
frequently occurring patterns.
FPGrowth needs a list of transactions, and that is essentially what a
commit history is: every commit is a transaction.

Consider this commit:

date, time robinanil
MAHOUT-300 <http://issues.apache.org/jira/browse/MAHOUT-300> First wave of
perf improvements
M /lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java
M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseVector.java
M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/VectorTest.java
M /lucene/mahout/trunk/utils/src/main/java/org/apache/mahout/benchmark/VectorBenchmarks.java


First, turn each transaction into a single line in, say, comma-separated
format, as follows. You will need some clever processing to convert the
above into this:

robinanil, hour of day, day of week, common, math, benchmark,
TimingStatistics, names .. other classes, java
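
The conversion step above could be sketched roughly like this in Python. This is only an illustration, not part of Mahout: the function name commit_to_transaction, the SKIP set of path components to drop, and the hour=/day= prefixes (added so that a number cannot collide with a folder or class name) are all assumptions made for the example.

```python
import os
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Path components assumed to carry no signal (hypothetical choice).
SKIP = {"", "lucene", "mahout", "trunk", "src", "main", "test", "org", "apache"}


def commit_to_transaction(author, when, paths):
    """Turn one commit into a comma-separated transaction line."""
    items = [author, "hour=%d" % when.hour, "day=" + DAYS[when.weekday()]]
    seen = set(items)
    for path in paths:
        # Split off the extension so "TimingStatistics" and "java"
        # become separate items.
        stem, ext = os.path.splitext(path)
        for part in stem.split("/") + [ext.lstrip(".")]:
            if part not in SKIP and part not in seen:
                seen.add(part)
                items.append(part)
    return ", ".join(items)


if __name__ == "__main__":
    # The date and time here are made up for the example.
    print(commit_to_transaction(
        "robinanil",
        datetime(2010, 2, 1, 14, 30),
        ["/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/"
         "common/TimingStatistics.java"],
    ))
```

Each item appears at most once per line, since a transaction is a set of items.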

Then you have a lot of these lines, one transaction per commit, and each
of the words is an entity.

Run FPGrowth over this and you will get the top 50 (say) patterns for
each entity.

if you fetch the top patterns for the user robinanil, you will find patterns
like

robinanil => {robinanil, clustering, math, java, 100}
             {robinanil, classifier, algorithm, 80}
             ...
gsingers => {gsingers, clustering, matrix, parameter, 200}

What this gives you is the frequently occurring patterns for each user,
sorted in descending order by the number of times each has occurred. This
is like saying who worked on what, and how often.
You can also choose a folder as the entity (you have to prevent collisions
between duplicate folder names, say by using the complete path instead of
just the name).
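
To make the shape of that output concrete, here is a small Python sketch. Note that it is NOT FPGrowth: it counts itemsets by brute force, which only works for tiny inputs, and it is here just to show what "top patterns for an entity" means. The function name top_patterns and the max_size cap are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def top_patterns(transactions, entity, k=50, max_size=3):
    """Return the k most frequent itemsets that contain `entity`.

    Brute-force stand-in for FPGrowth: enumerates every itemset of up
    to `max_size` items (entity included) in each transaction.
    """
    counts = Counter()
    for items in transactions:
        if entity not in items:
            continue
        others = sorted(set(items) - {entity})
        for size in range(1, max_size):
            for combo in combinations(others, size):
                counts[(entity,) + combo] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    # Made-up transactions, one per commit.
    transactions = [
        ["robinanil", "clustering", "math", "java"],
        ["robinanil", "clustering", "math"],
        ["robinanil", "clustering"],
        ["gsingers", "clustering", "matrix"],
    ]
    for pattern, count in top_patterns(transactions, "robinanil", k=3):
        print(pattern, count)
```

The real algorithm gets the same kind of answer without enumerating every combination, which is what makes it usable on a full commit history.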

Some of the things you discussed can be done using the above:

Commit clusters:
Just find the patterns corresponding to a filename. You will find the other
files in the top patterns, along with the number of times they were edited
together.
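
For the simplest case (pairs of files), the commit cluster idea reduces to a co-occurrence count, which can be sketched in Python as below. This is a plain counting stand-in rather than FPGrowth, and the name co_edited and the input shape (one list of file names per commit) are assumptions for the example.

```python
from collections import Counter


def co_edited(commits, filename, k=10):
    """Return the k files most often changed in the same commit as `filename`.

    `commits` is an iterable of file-name lists, one list per commit.
    """
    counts = Counter()
    for files in commits:
        if filename in files:
            # Count every other file in this commit once.
            counts.update(set(files) - {filename})
    return counts.most_common(k)


if __name__ == "__main__":
    commits = [
        ["DenseVector.java", "AbstractVector.java", "VectorTest.java"],
        ["DenseVector.java", "AbstractVector.java"],
        ["TimingStatistics.java"],
    ]
    print(co_edited(commits, "DenseVector.java"))
```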

Developer clusters:
For each file or folder, create the transactions from the commit history,
with each line containing just the users who edited the file.
Run FPGrowth to find the top K patterns.
For each file or folder you will get the groups of people, sorted by the
count of their occurrence.
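
One possible reading of the developer cluster step, sketched in Python: for a given folder, each file under it contributes one transaction holding the users who edited that file, and groups of users are then counted across those transactions. The counting is brute force rather than FPGrowth, and the name developer_groups, the (user, path) input shape and the prefix match on the folder are all assumptions for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def developer_groups(edits, folder, k=10):
    """Return the k most frequent groups of users under `folder`.

    `edits` is an iterable of (user, path) pairs taken from the commit
    history. Enumerating all groups is exponential in the number of
    users per file, so this is only suitable for small inputs.
    """
    users_by_file = defaultdict(set)
    for user, path in edits:
        if path.startswith(folder):
            users_by_file[path].add(user)

    counts = Counter()
    for users in users_by_file.values():
        for size in range(2, len(users) + 1):
            for group in combinations(sorted(users), size):
                counts[group] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    # "thirddev" is a made-up user name.
    edits = [
        ("robinanil", "math/A.java"), ("gsingers", "math/A.java"),
        ("robinanil", "math/B.java"), ("gsingers", "math/B.java"),
        ("thirddev", "math/B.java"), ("robinanil", "core/C.java"),
    ]
    print(developer_groups(edits, "math/"))
```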

Commit labelling: this is tricky. You need to identify features in a commit
that tell whether it's a fix, a feature or a patch. You can get these by
looking at the code or the commit description.

Once you have created such training data, with multiple instances of the
form label => features, all you need to do is run the Bayes classifier.
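
To show what label => features training data and the classification step look like, here is a tiny standalone naive Bayes sketch in Python (multinomial, with add-one smoothing). It is not Mahout's Bayes classifier, just an illustration; the function names train and classify and the example features are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (label, [feature, ...]) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for label, features in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab


def classify(model, features):
    """Return the label with the highest log posterior for `features`."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Add-one smoothing over the vocabulary seen in training.
        denom = sum(feature_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for f in features:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Made-up features, e.g. words from commit descriptions.
    examples = [
        ("fix", ["fix", "bug", "npe"]),
        ("fix", ["fix", "crash"]),
        ("feature", ["add", "new", "classifier"]),
        ("feature", ["add", "support"]),
    ]
    model = train(examples)
    print(classify(model, ["fix", "bug"]))
```

The hard part stays where the text says it is: deciding which features of a commit to extract, and getting labelled examples in the first place.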

Robin
