Hi, I've been playing around a bit with the FPGrowth implementation, and I have some questions.
Is it intentional that all frequent itemsets including a given item are mined for each (frequent) item? For example, consider a simple example from FPGrowthTest.java: X (occurs 12 times) Y (occurs 4 times) X Y (occurs 10 times) The expected result is: [(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))] (This can be read as "When considering patterns with Y, [Y] and [X. Y] were found to occur 14 and 10 times, respectively..") Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth is that you can avoid this by only mining for frequent sets that include the current base item and more-frequent items. Seems like that is not taken advantage of here... Is there a reason for this, or is this a bug? Related to this is the grouping strategy - currently, when creating batches of items to assign to reducers, the N most-frequent items form a group, then the next N form a group, and so on. Intuitively, it makes more sense to round-robin the examples as I would expect the "hardest" items to (roughly) band together in frequency-order. Was there a reason behind this choice? I am also a little confused by the reporting of patterns (similar to Vipul Pandey in MAHOUT-617). To again use an example from the tests: [(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)), (X,([X],33), ([X, Y],21), ([X, Y, Z],11))] What is the rationale for not reporting [X, Z], for example? I understand how you could generate these later, but why would you have to? I have also found two bugs; one is that the default number of groups is mis-advertised. The other is that sometimes children are multiply-added to nodes in conditional FP trees (so a node is its own sibling). These are in Jira with patches as MAHOUT-885 and MAHOUT-886. In general, this code seems like it's more complicated (and difficult to follow) than it could be. Part of that is simply due to the object-avoidance, but I suspect that a lot is related to planned improvements (for example I don't see where FP-Bonsai pruning is implemented, but comments indicate someone is/was working on it). Would folks be open to a major refactoring in here, or are people actively working on fleshing bits of this out with new functionality? -tom