[ 
https://issues.apache.org/jira/browse/MAHOUT-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-658.
------------------------------

    Resolution: Later

I presume this was a GSoC proposal that did not result in a project. I am going 
to mark it as "Later" for now as a result. Looks like a fine idea IMHO.

> UH-Mine algorithm for frequent pattern mining of uncertain data
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-658
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-658
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4, 0.5
>            Reporter: Yarco Hayduk
>              Labels: gsoc, gsoc11, gsoc2011, mahout-gsoc-11
>
> Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain 
> data
> Student Name: Yaroslav Hayduk
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Abstract
> Frequent pattern mining detects frequent patterns in data, which in turn 
> permits data analysts determine interesting correlations in the dataset. In 
> uncertain data mining, we suspect, but cannot guarantee, the presence or 
> absence of an item. Currently, Apache
> Mahout does not contain algorithm implementations capable of mining frequent 
> patterns from uncertain data. Hence, I propose to implement UH-Mine during 
> GSoC 2011, used for frequent pattern mining of uncertain data.
> Motivation
> There are many real-life situations, where we observe uncertain data, such as 
> in temperature and wind speed readings, patient diagnosis, and satellite 
> imaging. Due to the probabilistic nature of this data, it takes more time and 
> resources to mine it. Current state of-the-art frequent pattern mining of 
> uncertain data algorithms do not provide sufficient performance results, as 
> most of them are not crafted to execute in parallel.
> In uncertain data mining, we suspect, but cannot guarantee, the presence or 
> absence of an item. For example, when a doctor is 90% sure that a patient 
> suffers from a particular disease. A video lecture by Jian Pei [3] contains a 
> thorough overview of the uncertain data problem domain as well as elaborates 
> on the process of frequent pattern mining of uncertain data.
> Comparison of UF-Growth and UH-Mine
> The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way 
> the transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data 
> structure, to store a compact representation of the transaction database, 
> which contains frequency information of all frequent items. The FP-Tree, 
> however, does not store existential probabilities, associated with items. 
> Hence, Leung et al. [4] created a data structure, called UF-Tree, which is 
> based on the FP-tree. Each node in the UF-Tree stores an item, its expected 
> support, as well as the number of occurrence of such expected support for 
> each item. To merge the transaction with the child node in UF-Tree, UF-Growth 
> requires both the item and its corresponding existential probability to 
> match. Hence, the algorithm arrives with the UF-Tree, having a much lower 
> compression ratio, that in the FP-Tree, constructed by FP-Growth.
> As such, I propose to adapt and implement the UH-Mine algorithms to the 
> MapReduce programming model. The UH-Struct structure uses the linkage 
> behaviour among transactions corresponding to a branch of the 
> FP-Tree(UF-Tree) without actually creating a projected database.
> This approach is better than FP-Tree even in the deterministic case, when 
> compression from FP-Tree is not high. This turns out to be particularly true 
> for the uncertain case, as discussed earlier. The UH-mine [1] algorithm 
> provides the best trade-offs both in terms of running time and memory usage.
> The UH-Mine algorithm works as follows: it
> 1. prunes the initial DB such that all singleton infrequent items are 
> removed; 
> 2. divides the pruned DB into equal chunks; 
> 3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine 
> frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains 
> pointers to transaction items. At each step, the algorithm adjusts these 
> pointers and does not incur the overheads, associated with the FP(UF)-Tree 
> construction;
> 4. joins the results and 
> 5. scans the pruned DB once again to remove false positives and obtain the 
> actual counts.
> Benefits for the Mahout community
> a) My work adds an algorithm implementation for discovering frequent patterns 
> in uncertain data 
> b) If my algorithm implementation proves to be fast, it would permit future 
> algorithm developers to retrofit my UH-Mine implementation to create 
> H-Mine[5]. H-Mine is an alternative algorithm, which mines frequent patterns 
> from precise data.
> Timeline
> Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed 
> of 3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting 
> UH-Mine to MapReduce. 
> Weeks 8: Summit for mid-term evaluation 
> Weeks 9 - 11: Finish-up with my implementation, refactor code smells, 
> identify and fix performance issues 
> Weeks 11 - 12: Code cleaning, documenting and testing.
> Biography 
> My name is Yarco Hayduk and I am a Comp Sci graduate student at the 
> University of Manitoba, Canada. I'm very interested in concurrency and data 
> mining. Currently, I am working on a similar project, which adopts the 
> UFP-Growth[6] algorithm to MapReduce. I'm using the Parallel FP-Growth 
> algorithm as a starting point for implementing UFP-Growth. I also implemented 
> UH-Mine and H-Mine algorithms in C. Thus, I would use it as a reference for 
> my proposed UH-Mine implementation on MapReduce.
> References
> [1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern 
> mining
> with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
> Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. 
> ACM.
> 10.1145/1557019.1557030.
> [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without 
> candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference 
> on management
> of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
> [3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, 
> challenges,
> methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed 
> on
> March 11, 2011.
> [4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. 
> Efficient mining
> of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE 
> International Conference on Data Mining Workshops, pages 489--494, Omaha, NE, 
> USA, October 2007.
> IEEE. 10.1109/ICDMW.2007.84.
> [5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and 
> Dongqing Yang.
> H-Mine: Hyper-structure mining of frequent patterns in large databases. In 
> ICDM '01:
> 1st IEEE International Conference on Data Mining, pages 441--448, San Jose, 
> California,
> USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
> [6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining 
> from uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia 
> Conference on Knowledge
> Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. 
> Springer.
> 10.1007/978-3-642-13657-351.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to