Hello, I just wanted to introduce myself. I am an MSc Computer Science student at the University of Victoria. My research over the past year has focused on developing and implementing an Apriori-based frequent itemset mining algorithm for mining large data sets at low support levels.
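For anyone unfamiliar with the pair-counting (k = 2) step, here is a minimal single-machine sketch in Python of Apriori-style frequent pair mining. It is only an illustration of the general technique, not the optimized Hadoop implementation described in the report; the function name `frequent_pairs` and the toy transactions are made up for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: count} for item pairs whose relative support is >= min_support.

    Hypothetical helper for illustration; min_support is a fraction in (0, 1].
    """
    min_count = min_support * len(transactions)

    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    # Pass 2: count only pairs built from frequent items (Apriori pruning:
    # a pair cannot be frequent unless both of its items are frequent).
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c for p, c in pair_counts.items() if c >= min_count}


# Toy data, invented for this example.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "d"],
]
print(frequent_pairs(transactions, 0.5))  # {('a', 'b'): 3}
```

At very low support levels almost every item survives the first pass, so the pair-counting pass dominates the cost; that is the step a distributed implementation would need to spread across the cluster.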
https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en

The main finding of the above report is that support levels as low as 0.001% on the webdocs (1.4GB) dataset can be calculated efficiently. On a 100-core cluster, all frequent pairs (k = 2) can be calculated in approximately 6 minutes.

I currently have an optimized Hadoop implementation and algorithm for generating frequent pairs, and I am extending this work to itemsets of any length. The analysis of the extended approach will be complete within the next two weeks.

Would you be interested in moving forward with such an implementation as a GSoC project? If so, any comments or feedback would be very much appreciated. If you are interested, I can create a proposal and submit it to your issue tracker when it comes back online.

Thanks,
Neal.