[jira] [Updated] (MAHOUT-658) UH-Mine algorithm for frequent pattern mining of uncertain data

Yarco Hayduk (JIRA) Thu, 07 Apr 2011 21:25:52 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yarco Hayduk updated MAHOUT-658:
--------------------------------

    Description: 
Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data

Student Name: Yaroslav Hayduk
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:

Abstract
Frequent pattern mining detects frequent patterns in data, which in turn 
permits data analysts determine interesting correlations in the dataset. In 
uncertain data mining, we suspect, but cannot guarantee, the presence or 
absence of an item. Currently, Apache
Mahout does not contain algorithm implementations capable of mining frequent 
patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC 
2011, used for frequent pattern mining of uncertain data.

Motivation
There are many real-life situations, where we observe uncertain data, such as 
in temperature and wind speed readings, patient diagnosis, and satellite 
imaging. Due to the probabilistic nature of this data, it takes more time and 
resources to mine it. Current state of-the-art frequent pattern mining of 
uncertain data algorithms do not provide sufficient performance results, as 
most of them are not crafted to execute in parallel.
In uncertain data mining, we suspect, but cannot guarantee, the presence or 
absence of an item. For example, when a doctor is 90% sure that a patient 
suffers from a particular disease. A video lecture by Jian Pei [3] contains a 
thorough overview of the uncertain data problem domain as well as elaborates on 
the process of frequent pattern mining of uncertain data.

Comparison of UF-Growth and UH-Mine
The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the 
transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data 
structure, to store a compact representation of the transaction database, which 
contains frequency information of all frequent items. The FP-Tree, however, 
does not store existential probabilities, associated with items. Hence, Leung 
et al. [4] created a data structure, called UF-Tree, which is based on the 
FP-tree. Each node in the UF-Tree stores an item, its expected support, as well 
as the number of occurrence of such expected support for each item. To merge 
the transaction with the child node in UF-Tree, UF-Growth requires both the 
item and its corresponding existential probability to match. Hence, the 
algorithm arrives with the UF-Tree, having a much lower compression ratio, that 
in the FP-Tree, constructed by FP-Growth.
As such, I propose to adapt and implement the UH-Mine algorithms to the 
MapReduce programming model. The UH-Struct structure uses the linkage behaviour 
among transactions corresponding to a branch of the FP-Tree(UF-Tree) without 
actually creating a projected database.
This approach is better than FP-Tree even in the deterministic case, when 
compression from FP-Tree is not high. This turns out to be particularly true 
for the uncertain case, as discussed earlier. The UH-mine [1] algorithm 
provides the best trade-offs both in terms of running time and memory usage.

The UH-Mine algorithm works as follows: it
1. prunes the initial DB such that all singleton infrequent items are removed; 
2. divides the pruned DB into equal chunks; 
3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine 
frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers 
to transaction items. At each step, the algorithm adjusts these pointers and 
does not incur the overheads, associated with the FP(UF)-Tree construction;
4. joins the results and 
5. scans the pruned DB once again to remove false positives and obtain the 
actual counts.

Benefits for the Mahout community
a) My work adds an algorithm implementation for discovering frequent patterns 
in uncertain data 
b) If my algorithm implementation proves to be fast, it would permit future 
algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5]. 
H-Mine is an alternative algorithm, which mines frequent patterns from precise 
data.

Timeline
Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of 
3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine 
to MapReduce. 
Weeks 8: Summit for mid-term evaluation 
Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify 
and fix performance issues 
Weeks 11 - 12: Code cleaning, documenting and testing.

Biography 
My name is Yarco Hayduk and I am a Comp Sci graduate student at the University 
of Manitoba, Canada. I'm very interested in concurrency and data mining. 
Currently, I am working on a similar project, which adopts the UFP-Growth[6] 
algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a 
starting point for implementing UFP-Growth. I also implemented UH-Mine and 
H-Mine algorithms in C. Thus, I would use it as a reference for my proposed 
UH-Mine implementation on MapReduce.

References
[1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern 
mining
with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. 
ACM.
10.1145/1557019.1557030.
[2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without 
candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference 
on management
of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
[3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, 
challenges,
methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
March 11, 2011.
[4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient 
mining
of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International 
Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October 
2007.
IEEE. 10.1109/ICDMW.2007.84.
[5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing 
Yang.
H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM 
'01:
1st IEEE International Conference on Data Mining, pages 441--448, San Jose, 
California,
USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
[6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from 
uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on 
Knowledge
Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. 
Springer.
10.1007/978-3-642-13657-351.


  was:
Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data

Student Name: Sergey Bartunov
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:

Abstract
Frequent pattern mining detects frequent patterns in data, which in turn 
permits data analysts determine interesting correlations in the dataset. In 
uncertain data mining, we suspect, but cannot guarantee, the presence or 
absence of an item. Currently, Apache
Mahout does not contain algorithm implementations capable of mining frequent 
patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC 
2011, used for frequent pattern mining of uncertain data.

Motivation
There are many real-life situations, where we observe uncertain data, such as 
in temperature and wind speed readings, patient diagnosis, and satellite 
imaging. Due to the probabilistic nature of this data, it takes more time and 
resources to mine it. Current state of-the-art frequent pattern mining of 
uncertain data algorithms do not provide sufficient performance results, as 
most of them are not crafted to execute in parallel.
In uncertain data mining, we suspect, but cannot guarantee, the presence or 
absence of an item. For example, when a doctor is 90% sure that a patient 
suffers from a particular disease. A video lecture by Jian Pei [3] contains a 
thorough overview of the uncertain data problem domain as well as elaborates on 
the process of frequent pattern mining of uncertain data.

Comparison of UF-Growth and UH-Mine
The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the 
transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data 
structure, to store a compact representation of the transaction database, which 
contains frequency information of all frequent items. The FP-Tree, however, 
does not store existential probabilities, associated with items. Hence, Leung 
et al. [4] created a data structure, called UF-Tree, which is based on the 
FP-tree. Each node in the UF-Tree stores an item, its expected support, as well 
as the number of occurrence of such expected support for each item. To merge 
the transaction with the child node in UF-Tree, UF-Growth requires both the 
item and its corresponding existential probability to match. Hence, the 
algorithm arrives with the UF-Tree, having a much lower compression ratio, that 
in the FP-Tree, constructed by FP-Growth.
As such, I propose to adapt and implement the UH-Mine algorithms to the 
MapReduce programming model. The UH-Struct structure uses the linkage behaviour 
among transactions corresponding to a branch of the FP-Tree(UF-Tree) without 
actually creating a projected database.
This approach is better than FP-Tree even in the deterministic case, when 
compression from FP-Tree is not high. This turns out to be particularly true 
for the uncertain case, as discussed earlier. The UH-mine [1] algorithm 
provides the best trade-offs both in terms of running time and memory usage.

The UH-Mine algorithm works as follows: it
1. prunes the initial DB such that all singleton infrequent items are removed; 
2. divides the pruned DB into equal chunks; 
3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine 
frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers 
to transaction items. At each step, the algorithm adjusts these pointers and 
does not incur the overheads, associated with the FP(UF)-Tree construction;
4. joins the results and 
5. scans the pruned DB once again to remove false positives and obtain the 
actual counts.

Benefits for the Mahout community
a) My work adds an algorithm implementation for discovering frequent patterns 
in uncertain data 
b) If my algorithm implementation proves to be fast, it would permit future 
algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5]. 
H-Mine is an alternative algorithm, which mines frequent patterns from precise 
data.

Timeline
Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of 
3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine 
to MapReduce. 
Weeks 8: Summit for mid-term evaluation 
Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify 
and fix performance issues 
Weeks 11 - 12: Code cleaning, documenting and testing.

Biography 
My name is Yarco Hayduk and I am a Comp Sci graduate student at the University 
of Manitoba, Canada. I'm very interested in concurrency and data mining. 
Currently, I am working on a similar project, which adopts the UFP-Growth[6] 
algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a 
starting point for implementing UFP-Growth. I also implemented UH-Mine and 
H-Mine algorithms in C. Thus, I would use it as a reference for my proposed 
UH-Mine implementation on MapReduce.

References
[1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern 
mining
with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. 
ACM.
10.1145/1557019.1557030.
[2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without 
candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference 
on management
of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
[3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, 
challenges,
methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
March 11, 2011.
[4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient 
mining
of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International 
Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October 
2007.
IEEE. 10.1109/ICDMW.2007.84.
[5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing 
Yang.
H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM 
'01:
1st IEEE International Conference on Data Mining, pages 441--448, San Jose, 
California,
USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
[6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from 
uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on 
Knowledge
Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. 
Springer.
10.1007/978-3-642-13657-351.



> UH-Mine algorithm for frequent pattern mining of uncertain data
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-658
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-658
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4, 0.5
>            Reporter: Yarco Hayduk
>              Labels: gsoc, gsoc11, gsoc2011, mahout-gsoc-11
>
> Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain 
> data
> Student Name: Yaroslav Hayduk
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Abstract
> Frequent pattern mining detects frequent patterns in data, which in turn 
> permits data analysts determine interesting correlations in the dataset. In 
> uncertain data mining, we suspect, but cannot guarantee, the presence or 
> absence of an item. Currently, Apache
> Mahout does not contain algorithm implementations capable of mining frequent 
> patterns from uncertain data. Hence, I propose to implement UH-Mine during 
> GSoC 2011, used for frequent pattern mining of uncertain data.
> Motivation
> There are many real-life situations, where we observe uncertain data, such as 
> in temperature and wind speed readings, patient diagnosis, and satellite 
> imaging. Due to the probabilistic nature of this data, it takes more time and 
> resources to mine it. Current state of-the-art frequent pattern mining of 
> uncertain data algorithms do not provide sufficient performance results, as 
> most of them are not crafted to execute in parallel.
> In uncertain data mining, we suspect, but cannot guarantee, the presence or 
> absence of an item. For example, when a doctor is 90% sure that a patient 
> suffers from a particular disease. A video lecture by Jian Pei [3] contains a 
> thorough overview of the uncertain data problem domain as well as elaborates 
> on the process of frequent pattern mining of uncertain data.
> Comparison of UF-Growth and UH-Mine
> The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way 
> the transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data 
> structure, to store a compact representation of the transaction database, 
> which contains frequency information of all frequent items. The FP-Tree, 
> however, does not store existential probabilities, associated with items. 
> Hence, Leung et al. [4] created a data structure, called UF-Tree, which is 
> based on the FP-tree. Each node in the UF-Tree stores an item, its expected 
> support, as well as the number of occurrence of such expected support for 
> each item. To merge the transaction with the child node in UF-Tree, UF-Growth 
> requires both the item and its corresponding existential probability to 
> match. Hence, the algorithm arrives with the UF-Tree, having a much lower 
> compression ratio, that in the FP-Tree, constructed by FP-Growth.
> As such, I propose to adapt and implement the UH-Mine algorithms to the 
> MapReduce programming model. The UH-Struct structure uses the linkage 
> behaviour among transactions corresponding to a branch of the 
> FP-Tree(UF-Tree) without actually creating a projected database.
> This approach is better than FP-Tree even in the deterministic case, when 
> compression from FP-Tree is not high. This turns out to be particularly true 
> for the uncertain case, as discussed earlier. The UH-mine [1] algorithm 
> provides the best trade-offs both in terms of running time and memory usage.
> The UH-Mine algorithm works as follows: it
> 1. prunes the initial DB such that all singleton infrequent items are 
> removed; 
> 2. divides the pruned DB into equal chunks; 
> 3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine 
> frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains 
> pointers to transaction items. At each step, the algorithm adjusts these 
> pointers and does not incur the overheads, associated with the FP(UF)-Tree 
> construction;
> 4. joins the results and 
> 5. scans the pruned DB once again to remove false positives and obtain the 
> actual counts.
> Benefits for the Mahout community
> a) My work adds an algorithm implementation for discovering frequent patterns 
> in uncertain data 
> b) If my algorithm implementation proves to be fast, it would permit future 
> algorithm developers to retrofit my UH-Mine implementation to create 
> H-Mine[5]. H-Mine is an alternative algorithm, which mines frequent patterns 
> from precise data.
> Timeline
> Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed 
> of 3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting 
> UH-Mine to MapReduce. 
> Weeks 8: Summit for mid-term evaluation 
> Weeks 9 - 11: Finish-up with my implementation, refactor code smells, 
> identify and fix performance issues 
> Weeks 11 - 12: Code cleaning, documenting and testing.
> Biography 
> My name is Yarco Hayduk and I am a Comp Sci graduate student at the 
> University of Manitoba, Canada. I'm very interested in concurrency and data 
> mining. Currently, I am working on a similar project, which adopts the 
> UFP-Growth[6] algorithm to MapReduce. I'm using the Parallel FP-Growth 
> algorithm as a starting point for implementing UFP-Growth. I also implemented 
> UH-Mine and H-Mine algorithms in C. Thus, I would use it as a reference for 
> my proposed UH-Mine implementation on MapReduce.
> References
> [1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern 
> mining
> with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
> Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. 
> ACM.
> 10.1145/1557019.1557030.
> [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without 
> candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference 
> on management
> of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
> [3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, 
> challenges,
> methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed 
> on
> March 11, 2011.
> [4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. 
> Efficient mining
> of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE 
> International Conference on Data Mining Workshops, pages 489--494, Omaha, NE, 
> USA, October 2007.
> IEEE. 10.1109/ICDMW.2007.84.
> [5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and 
> Dongqing Yang.
> H-Mine: Hyper-structure mining of frequent patterns in large databases. In 
> ICDM '01:
> 1st IEEE International Conference on Data Mining, pages 441--448, San Jose, 
> California,
> USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
> [6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining 
> from uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia 
> Conference on Knowledge
> Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. 
> Springer.
> 10.1007/978-3-642-13657-351.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-658) UH-Mine algorithm for frequent pattern mining of uncertain data

Reply via email to