From http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html:
<snip>
        Given a huge transaction list, the algorithm finds all unique 
features (field values) and eliminates those features whose frequency in the 
whole dataset is less than minSupport. Using these remaining features N, we 
find the top K closed patterns for each of them, generating a total of NxK 
patterns. FPGrowth Algorithm is a generic implementation, we can use any Object 
type to denote a feature. 
</snip>

I'm a little confused about what constitutes a feature.  What I want to do is 
mine query logs to find the most frequently occurring related searches.  I 
realize this problem can be modeled in a bunch of different ways (e.g. 
recommenders, clustering), but I'm interested in using PFP for the time being.  
So, beyond the actual query, what other things should I consider as features?  
Also, the docs seem to indicate I should pass in tokens; should those really be 
features?  In other words, if I want a phrase such as "foo bar" to be a single 
feature, I should somehow concatenate its tokens, right?
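
To make the question concrete, here is a rough sketch of what I mean by 
concatenating. It is plain Python, not Mahout code. It assumes that queries are 
grouped by some session id (my assumption, the log may not have one), that one 
transaction is the set of queries in a session, and that PFP would take one 
transaction per line with a comma between items. The names query_to_feature and 
sessions_to_transactions are made up for illustration.

```python
import re

def query_to_feature(query):
    """Lower-case a query and join its words so the phrase is one item."""
    words = re.findall(r"\w+", query.lower())
    return "_".join(words)

def sessions_to_transactions(rows):
    """Group (session_id, query) pairs into one item list per session.

    Each session yields a sorted list of unique features, so a repeated
    query counts only once within a transaction.
    """
    sessions = {}
    for session_id, query in rows:
        feature = query_to_feature(query)
        if feature:  # skip queries with no word characters
            sessions.setdefault(session_id, set()).add(feature)
    return [sorted(items) for _, items in sorted(sessions.items())]

# Made-up sample rows, not real log data.
rows = [
    ("s1", "Foo Bar"),
    ("s1", "foo bar baz"),
    ("s2", "foo  bar"),
    ("s2", "qux"),
]
transactions = sessions_to_transactions(rows)
# One transaction per line, items separated by commas (assumed input format).
lines = [",".join(t) for t in transactions]
```

With this, "foo bar" becomes the single item foo_bar, so the space inside the 
phrase can no longer be mistaken for an item separator.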

Finally, it seems like some useful utilities would be HTTPD Logs -> Vectors, 
Solr Logs -> Vectors, etc.  In other words, take in a log file and a pattern 
(a regex with matching subgroups), extract the various pieces from each line, 
and create a vector sequence file, all in an M/R way.  Does anyone already have 
that or something similar?
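
For the extraction step, something like the following is what I have in mind. 
Again this is only a Python sketch of the per-line logic a mapper might run; it 
does not write vectors or sequence files. The pattern assumes Apache common log 
format and a query parameter named q, and the names LOG_PATTERN, parse_line and 
extract_query are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: Apache common log format. Each named group is one "piece".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line, pattern=LOG_PATTERN):
    """Return a dict of the named groups, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None

def extract_query(path, param="q"):
    """Pull the decoded search string out of a request path, if present."""
    values = parse_qs(urlsplit(path).query).get(param)
    return values[0] if values else None

# Made-up sample line for illustration only.
sample = ('127.0.0.1 - - [10/Oct/2009:13:55:36 -0700] '
          '"GET /solr/select?q=foo+bar HTTP/1.0" 200 2326')
fields = parse_line(sample)
query = extract_query(fields["path"])
```

Passing the pattern in as an argument is the point: the same function would 
cover HTTPD logs, Solr logs, or anything else that a regex with groups can 
describe.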

-Grant

