Hi,

I've been playing around a bit with the FPGrowth implementation, and I
have some questions.

Is it intentional that all frequent itemsets including a given item
are mined for each (frequent) item?  For example, consider a simple
example from FPGrowthTest.java:

X      (occurs 12 times)
Y      (occurs 4 times)
X Y   (occurs 10 times)

The expected result is:

[(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))]

(This can be read as "When considering patterns with Y, [Y] and [X. Y]
were found to occur 14 and 10 times, respectively..")

Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth
is that you can avoid this by only mining for frequent sets that
include the current base item and more-frequent items.  Seems like
that is not taken advantage of here...  Is there a reason for this, or
is this a bug?

Related to this is the grouping strategy - currently, when creating
batches of items to assign to reducers, the N most-frequent items form
a group, then the next N form a group, and so on.  Intuitively, it
makes more sense to round-robin the examples as I would expect the
"hardest" items to (roughly) band together in frequency-order.  Was
there a reason behind this choice?

I am also a little confused by the reporting of patterns (similar to
Vipul Pandey in MAHOUT-617).  To again use an example from the tests:

[(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)),
(X,([X],33), ([X, Y],21), ([X, Y, Z],11))]

What is the rationale for not reporting [X, Z], for example?  I
understand how you could generate these later, but why would you have
to?

I have also found two bugs; one is that the default number of groups
is mis-advertised.  The other is that sometimes children are
multiply-added to nodes in conditional FP trees (so a node is its own
sibling).  These are in Jira with patches as MAHOUT-885 and
MAHOUT-886.

In general, this code seems like it's more complicated (and difficult
to follow) than it could be.  Part of that is simply due to the
object-avoidance, but I suspect that a lot is related to planned
improvements (for example I don't see where FP-Bonsai pruning is
implemented, but comments indicate someone is/was working on it).
Would folks be open to a major refactoring in here, or are people
actively working on fleshing bits of this out with new functionality?

-tom

Reply via email to