On Mon, Nov 14, 2011 at 1:58 PM, Tom Pierce <t...@cloudera.com> wrote:
> Hi, > > I've been playing around a bit with the FPGrowth implementation, and I > have some questions. > > Is it intentional that all frequent itemsets including a given item > are mined for each (frequent) item? For example, consider a simple > example from FPGrowthTest.java: > > X (occurs 12 times) > Y (occurs 4 times) > X Y (occurs 10 times) > > The expected result is: > > [(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))] > > (This can be read as "When considering patterns with Y, [Y] and [X. Y] > were found to occur 14 and 10 times, respectively..") > > Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth > is that you can avoid this by only mining for frequent sets that > include the current base item and more-frequent items. Seems like > that is not taken advantage of here... Is there a reason for this, or > is this a bug? > No, the mining here is to lookup top N patterns for each attribute > > Related to this is the grouping strategy - currently, when creating > batches of items to assign to reducers, the N most-frequent items form > a group, then the next N form a group, and so on. Intuitively, it > makes more sense to round-robin the examples as I would expect the > "hardest" items to (roughly) band together in frequency-order. Was > there a reason behind this choice? > I cant recall the rationale, but the basic idea was this type of partitioning allowed all patterns to be generated. Round robin will miss some See the paper http://infolab.stanford.edu/~echang/recsys08-69.pdf > > I am also a little confused by the reporting of patterns (similar to > Vipul Pandey in MAHOUT-617). To again use an example from the tests: > > [(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)), > (X,([X],33), ([X, Y],21), ([X, Y, Z],11))] > > What is the rationale for not reporting [X, Z], for example? I > understand how you could generate these later, but why would you have > to? > Ah. XZ - 11 is a redundant pattern. We try to mine only closed subsets. so XYZ-11 already has that information. > > I have also found two bugs; one is that the default number of groups > is mis-advertised. The other is that sometimes children are > multiply-added to nodes in conditional FP trees (so a node is its own > sibling). These are in Jira with patches as MAHOUT-885 and > MAHOUT-886. > I will check this out. Seems like a bug. > > In general, this code seems like it's more complicated (and difficult > to follow) than it could be. Part of that is simply due to the > object-avoidance, but I suspect that a lot is related to planned > improvements (for example I don't see where FP-Bonsai pruning is > implemented, but comments indicate someone is/was working on it). > Would folks be open to a major refactoring in here, or are people > actively working on fleshing bits of this out with new functionality? > The current implementation tries to minimize memory usage by a lot. More objects you create more performance hit it takes. But I am open to refactoring the code to make it more readable. FP Bonsai is implemented as a single function that prunes the nodes before mining. You will find it as a small function there. Its not a significant piece of code. Its already checked in I believe > > -tom >