Moving the discussion to mahout-user.

Cool stuff, Eirik.

Here is an idea of what I have in mind. Mahout has a frequent pattern mining algorithm (FPGrowth). FPGrowth needs a list of transactions, which is essentially what a commit history is: every commit is a transaction. Consider this one:

    date, time  robinanil
    MAHOUT-300 <http://issues.apache.org/jira/browse/MAHOUT-300> First wave of perf improvements
    M /lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java
    M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
    M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
    M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
    M /lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
    M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseVector.java
    M /lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/VectorTest.java
    M /lucene/mahout/trunk/utils/src/main/java/org/apache/mahout/benchmark/VectorBenchmarks.java

First, turn the transaction into a single line in, say, comma-separated format, as follows. You will need some clever processing to convert the above into this:

    robinanil, hour of day, day of week, common, math, benchmark, TimingStatistics, names .. other classes, java

Then you have a lot of these lines, one transaction for every commit, and each of the words is an entity. Run FPGrowth over this and you will get the top 50 (say) patterns for each entity. If you fetch the top patterns for the user robinanil, you will find patterns like:

    robinanil => {robinanil, clustering, math, java, 100}
                 {robinanil, classifier, algorithm, 80}
                 ....
    gsingers => {gsingers, clustering, matrix, parameter, 200}

What this gives you is the frequently occurring patterns for each user, sorted (descending) by the number of times each has occurred. This is like saying who worked on what, and for how much time. You can also choose a folder (you have to prevent collisions between duplicate folder names by, say, adding the complete path instead of just the name).

Some of the things you discussed can be done using the above:

Commit clusters: just find the patterns corresponding to a filename. You will find the other files in the top patterns, along with the number of times they were edited together.

Developer clusters: for each file or folder, create transactions from the commit history in which each line contains just the users who edited the file. Run FPGrowth to find the top K patterns for each file or folder, and you will get the groups of people, sorted by the count of their occurrence.

Commit labelling: this is tricky. You need to identify features in a commit that tell whether it is a fix, a feature or a patch. You can get those by looking at the code or the commit description. Once you create such training data, which has multiple instances of the form label => features, all you need to do is run the Bayes classifier.

Robin
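
To make the commit-to-transaction step above a bit more concrete, here is a minimal Python sketch of one way it might be done. It is only an illustration, not anything Mahout provides: the function names (commit_to_transaction, package_segments, to_line), the hour=/day= item prefixes, and the choice to keep only the directory names after org/apache/mahout are all assumptions. The timestamp in the example is made up, since the commit above does not give one. The resulting lines would still have to be fed to FPGrowth separately.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Assumption: only the package segments after this prefix are useful items.
PACKAGE_ROOT = ("org", "apache", "mahout")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def package_segments(parts):
    """Return the directory names between org/apache/mahout and the file name."""
    n = len(PACKAGE_ROOT)
    for i in range(len(parts) - n + 1):
        if tuple(parts[i:i + n]) == PACKAGE_ROOT:
            return list(parts[i + n:-1])
    return []  # path is outside the package tree; contribute no package items


def commit_to_transaction(author, when, changed_paths):
    """Flatten one commit into a list of unique items, in first-seen order."""
    items = [author, f"hour={when.hour}", f"day={DAYS[when.weekday()]}"]
    for raw in changed_paths:
        path = PurePosixPath(raw)
        items.extend(package_segments(path.parts))
        items.append(path.stem)  # class name, e.g. DenseVector
        if path.suffix:
            items.append(path.suffix.lstrip("."))  # file type, e.g. java
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(items))


def to_line(items):
    """Render a transaction as one comma-separated line."""
    return ",".join(items)


if __name__ == "__main__":
    # Hypothetical timestamp; the paths are taken from the commit shown above.
    example = commit_to_transaction(
        "robinanil",
        datetime(2010, 2, 22, 14, 5),
        [
            "/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/common/TimingStatistics.java",
            "/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java",
        ],
    )
    print(to_line(example))
```

Prefixing the hour and day items keeps them from colliding with class or package names that happen to be plain numbers or words. The same idea would apply to folders if full paths were used as items instead of bare names.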