I have to clarify this as many have asked this before. Mahout's
Implementation is Top K FPGrowth that finds closed patterns
if {1} - 3 is a pattern {1, 2}- 3 is another top pattern of item 1
{1, 2} - 3 is a closed pattern of {1} -3 as there is no information that the
former gives that that the latter gives extra, this gives significant boost
in speed for the Mahout FP-Growth algorithm. If the closed pattern is not
there, then there should be a problem with implementation otherwise its an
algorithm choice. I am forgetting the paper form which I implemented this. I
will post that when I find it.
Robin
On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <[email protected]>wrote:
> Hi.
>
> I was trying to test the fpgrowth algorithm on a toy example and I kind of
> fail
> to get the correct results ... so it'd be nice if someone could take a look
> at
> my (admittedly very ad-hoc) code and tell me whether there's something
> wrong with it. I'm running the mahout svn version.
>
> I tried to run the following example (input):
>
> 1 3 4
> 2 3 5
> 1 2 3 5
> 2 5
> 1 2 3 5
>
> With the following code:
> public static void main(String[] args) {
> int minSupport = 2;
> int maxHeapSize = 100;
>
> String input = "inputfile";
> String output = "output";
>
> FPGrowth<String> fp = new FPGrowth<String>();
> FileSystem fs = new RawLocalFileSystem();
> Configuration conf = new Configuration();
>
> String pattern = "[\\ ]";
> try {
> fs = FileSystem.get(URI.create(output), conf);
>
> SequenceFile.Writer writer = null;
> writer = new SequenceFile.Writer(fs, conf, new Path(output),
> Text.class, TopKStringPatterns.class);
>
> Charset encoding = Charset.forName("UTF-8");
>
> List<Pair<String, Long>> generateFList = null;
> generateFList = fp.generateFList(new StringRecordIterator(new
> FileLineIterable(new File(input), encoding, false),
> pattern), minSupport);
>
> StringRecordIterator transactions = null;
>
> transactions = new StringRecordIterator(new FileLineIterable(new
> File(input), encoding, false), pattern);
>
>
> List<Text> keyList = new LinkedList<Text>();
> List<TopKStringPatterns> valueList = new
> LinkedList<TopKStringPatterns>();
>
> StringOutputCollector<Text, TopKStringPatterns> collector = new
> StringOutputCollector<Text, TopKStringPatterns>(
> keyList, valueList);
> StringOutputConverter soc = new
> StringOutputConverter(collector);
>
> StatusUpdater updater = new StatusUpdater() {
> @Override
> public void update(String status) {
> System.out.println(status);
> }
> };
>
> fp.generateTopKFrequentPatterns(transactions, generateFList,
> minSupport, maxHeapSize, null, soc, updater);
> writer.close();
> fs.close();
>
> System.out.println("list.size: " + valueList.size());
> HashSet<List<String>> unique = new HashSet<List<String>>();
> for (int i = 0; i < valueList.size(); i++) {
>
> System.out.println(keyList.get(i) + " / " +
> valueList.get(i));
> Iterator<Pair<List<String>, Long>> iterator =
> valueList.get(i).iterator();
> while (iterator.hasNext()) {
>
> unique.add(iterator.next().getFirst());
> }
> }
> // print a bit more clearly
> Iterator<List<String>> iterator = unique.iterator();
> while (iterator.hasNext()) {
> System.out.println(iterator.next());
> }
>
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
>
> I wrote another collector class (I know, that was probably not necessary
> but I
> wanted to use the API a bit):
>
> public class StringOutputCollector<K extends Writable, V extends Writable>
> implements OutputCollector<K, V> {
>
> private final List<K> keyList;
>
>
> private final List<V> valueList;
>
> public StringOutputCollector(List<K> keyList, List<V> valueList) {
> this.keyList = keyList;
> this.valueList = valueList;
> }
>
> @Override
> public final void collect(K key, V value) throws IOException {
> keyList.add(key);
> valueList.add(value);
> }
>
> }
>
> The output I get is the following:
> 1 / ([1, 3],3), ([1, 2, 3, 5],2)
> 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
> 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
> 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
>
> But I think there's more frequent itemsets to find in these transactions
> (with
> min support 2):
>
> {1} 3
> {2} 4
> {3} 4
> {5} 4
> {1, 2} 2
> {1, 3} 3
> {1, 5} 2
> {2, 3} 3
> {2, 5} 4
> {3, 5} 3
> {1, 2, 3} 2
> {1, 2, 5} 3
> {1, 3, 5} 2
> {2, 3, 5} 3
> {1, 2, 3, 5} 2
>
> Also, some of them are duplicates with the mahout code. I'm not sure
> whether
> it's supposed to be like that.
>
> I also tried another textbook example and that didn't produce the same
> results
> either.
>
> So am I doing something wrong with the topk selection (that's what I maybe
> could think of as a stupid mistake of mine)?
>
> Any hints would be appreciated.
>
> R
>