From http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html:

<snip>
Given a huge transaction list, the algorithm finds all unique features (field values) and eliminates those features whose frequency in the whole dataset is less than minSupport. Using the remaining N features, we find the top K closed patterns for each of them, generating a total of NxK patterns. The FPGrowth algorithm is a generic implementation; we can use any Object type to denote a feature.
</snip>
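To check that I'm reading this right, here's roughly my mental model of that first step, the minSupport filter (plain Java, just a sketch of the idea, nothing Mahout-specific, and the names are mine):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MinSupportFilter {

  // Each transaction is the set of features that occur together
  // (e.g., the terms/fields of one query). Keep only the features
  // that appear in at least minSupport transactions.
  public static Set<String> frequentFeatures(List<Set<String>> transactions, int minSupport) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (Set<String> txn : transactions) {
      for (String feature : txn) {
        Integer c = counts.get(feature);
        counts.put(feature, c == null ? 1 : c + 1);
      }
    }
    Set<String> kept = new HashSet<String>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) { // drop anything below minSupport
        kept.add(e.getKey());
      }
    }
    return kept;
  }
}

And then, as the page says, the top K closed patterns get mined per surviving feature.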
I'm a little confused about what constitutes a feature. What I want to do is mine query logs to surface the most frequently occurring related searches. I realize this problem can be modeled in a bunch of different ways (e.g., recommenders, clustering), but I'm interested in using PFP for the time being.

So, beyond the actual query, what other things should I consider as features? Also, the docs seem to indicate I should pass in tokens; should those really be features? In other words, if I want a phrase such as "foo bar" to be a single feature, I should somehow concatenate the tokens, right?

Finally, it seems like some useful utilities would be HTTPD Logs -> Vectors, Solr Logs -> Vectors, etc., right? In other words, take in a log file and a pattern (a regex with matching subgroups), pull out the various pieces of each line, and create a vector sequence file, all in an M/R way, right? Anyone already have that or something similar?
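For that log utility, here's roughly what I'm picturing per log line (again just a sketch; the log format, regex, and class/field names below are made up, and in practice this would sit inside a mapper that emits the transaction):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryLogToTransaction {

  // Made-up log format for illustration: "<timestamp> <userId> q=<query string>"
  private static final Pattern LINE = Pattern.compile("^(\\S+)\\s+(\\S+)\\s+q=(.*)$");

  // One log line -> one transaction (a list of features for PFP).
  // The whole query string is kept as a single feature, so "foo bar" stays one feature.
  public static List<String> toTransaction(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.matches()) {
      return Collections.emptyList(); // skip lines the pattern doesn't recognize
    }
    List<String> features = new ArrayList<String>();
    features.add("query=" + m.group(3).trim());
    features.add("user=" + m.group(2));
    return features;
  }

  public static void main(String[] args) {
    // Prints: [query=foo bar, user=u123]
    System.out.println(toTransaction("2010-05-01T12:00:00 u123 q=foo bar"));
  }
}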
-Grant