[
https://issues.apache.org/jira/browse/MAHOUT-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yarco Hayduk updated MAHOUT-658:
--------------------------------
Description:
Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data
Student Name: Yaroslav Hayduk
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:
Abstract
Frequent pattern mining detects frequent patterns in data, which in turn
permits data analysts determine interesting correlations in the dataset. In
uncertain data mining, we suspect, but cannot guarantee, the presence or
absence of an item. Currently, Apache
Mahout does not contain algorithm implementations capable of mining frequent
patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC
2011, used for frequent pattern mining of uncertain data.
Motivation
There are many real-life situations, where we observe uncertain data, such as
in temperature and wind speed readings, patient diagnosis, and satellite
imaging. Due to the probabilistic nature of this data, it takes more time and
resources to mine it. Current state of-the-art frequent pattern mining of
uncertain data algorithms do not provide sufficient performance results, as
most of them are not crafted to execute in parallel.
In uncertain data mining, we suspect, but cannot guarantee, the presence or
absence of an item. For example, when a doctor is 90% sure that a patient
suffers from a particular disease. A video lecture by Jian Pei [3] contains a
thorough overview of the uncertain data problem domain as well as elaborates on
the process of frequent pattern mining of uncertain data.
Comparison of UF-Growth and UH-Mine
The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the
transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data
structure, to store a compact representation of the transaction database, which
contains frequency information of all frequent items. The FP-Tree, however,
does not store existential probabilities, associated with items. Hence, Leung
et al. [4] created a data structure, called UF-Tree, which is based on the
FP-tree. Each node in the UF-Tree stores an item, its expected support, as well
as the number of occurrence of such expected support for each item. To merge
the transaction with the child node in UF-Tree, UF-Growth requires both the
item and its corresponding existential probability to match. Hence, the
algorithm arrives with the UF-Tree, having a much lower compression ratio, that
in the FP-Tree, constructed by FP-Growth.
As such, I propose to adapt and implement the UH-Mine algorithms to the
MapReduce programming model. The UH-Struct structure uses the linkage behaviour
among transactions corresponding to a branch of the FP-Tree(UF-Tree) without
actually creating a projected database.
This approach is better than FP-Tree even in the deterministic case, when
compression from FP-Tree is not high. This turns out to be particularly true
for the uncertain case, as discussed earlier. The UH-mine [1] algorithm
provides the best trade-offs both in terms of running time and memory usage.
The UH-Mine algorithm works as follows: it
1. prunes the initial DB such that all singleton infrequent items are removed;
2. divides the pruned DB into equal chunks;
3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine
frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers
to transaction items. At each step, the algorithm adjusts these pointers and
does not incur the overheads, associated with the FP(UF)-Tree construction;
4. joins the results and
5. scans the pruned DB once again to remove false positives and obtain the
actual counts.
Benefits for the Mahout community
a) My work adds an algorithm implementation for discovering frequent patterns
in uncertain data
b) If my algorithm implementation proves to be fast, it would permit future
algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5].
H-Mine is an alternative algorithm, which mines frequent patterns from precise
data.
Timeline
Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of
3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine
to MapReduce.
Weeks 8: Summit for mid-term evaluation
Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify
and fix performance issues
Weeks 11 - 12: Code cleaning, documenting and testing.
Biography
My name is Yarco Hayduk and I am a Comp Sci graduate student at the University
of Manitoba, Canada. I'm very interested in concurrency and data mining.
Currently, I am working on a similar project, which adopts the UFP-Growth[6]
algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a
starting point for implementing UFP-Growth. I also implemented UH-Mine and
H-Mine algorithms in C. Thus, I would use it as a reference for my proposed
UH-Mine implementation on MapReduce.
References
[1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern
mining
with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010.
ACM.
10.1145/1557019.1557030.
[2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without
candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference
on management
of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
[3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems,
challenges,
methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
March 11, 2011.
[4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient
mining
of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International
Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October
2007.
IEEE. 10.1109/ICDMW.2007.84.
[5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing
Yang.
H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM
'01:
1st IEEE International Conference on Data Mining, pages 441--448, San Jose,
California,
USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
[6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from
uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on
Knowledge
Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010.
Springer.
10.1007/978-3-642-13657-351.
was:
Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data
Student Name: Sergey Bartunov
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:
Abstract
Frequent pattern mining detects frequent patterns in data, which in turn
permits data analysts determine interesting correlations in the dataset. In
uncertain data mining, we suspect, but cannot guarantee, the presence or
absence of an item. Currently, Apache
Mahout does not contain algorithm implementations capable of mining frequent
patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC
2011, used for frequent pattern mining of uncertain data.
Motivation
There are many real-life situations, where we observe uncertain data, such as
in temperature and wind speed readings, patient diagnosis, and satellite
imaging. Due to the probabilistic nature of this data, it takes more time and
resources to mine it. Current state of-the-art frequent pattern mining of
uncertain data algorithms do not provide sufficient performance results, as
most of them are not crafted to execute in parallel.
In uncertain data mining, we suspect, but cannot guarantee, the presence or
absence of an item. For example, when a doctor is 90% sure that a patient
suffers from a particular disease. A video lecture by Jian Pei [3] contains a
thorough overview of the uncertain data problem domain as well as elaborates on
the process of frequent pattern mining of uncertain data.
Comparison of UF-Growth and UH-Mine
The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the
transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data
structure, to store a compact representation of the transaction database, which
contains frequency information of all frequent items. The FP-Tree, however,
does not store existential probabilities, associated with items. Hence, Leung
et al. [4] created a data structure, called UF-Tree, which is based on the
FP-tree. Each node in the UF-Tree stores an item, its expected support, as well
as the number of occurrence of such expected support for each item. To merge
the transaction with the child node in UF-Tree, UF-Growth requires both the
item and its corresponding existential probability to match. Hence, the
algorithm arrives with the UF-Tree, having a much lower compression ratio, that
in the FP-Tree, constructed by FP-Growth.
As such, I propose to adapt and implement the UH-Mine algorithms to the
MapReduce programming model. The UH-Struct structure uses the linkage behaviour
among transactions corresponding to a branch of the FP-Tree(UF-Tree) without
actually creating a projected database.
This approach is better than FP-Tree even in the deterministic case, when
compression from FP-Tree is not high. This turns out to be particularly true
for the uncertain case, as discussed earlier. The UH-mine [1] algorithm
provides the best trade-offs both in terms of running time and memory usage.
The UH-Mine algorithm works as follows: it
1. prunes the initial DB such that all singleton infrequent items are removed;
2. divides the pruned DB into equal chunks;
3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine
frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers
to transaction items. At each step, the algorithm adjusts these pointers and
does not incur the overheads, associated with the FP(UF)-Tree construction;
4. joins the results and
5. scans the pruned DB once again to remove false positives and obtain the
actual counts.
Benefits for the Mahout community
a) My work adds an algorithm implementation for discovering frequent patterns
in uncertain data
b) If my algorithm implementation proves to be fast, it would permit future
algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5].
H-Mine is an alternative algorithm, which mines frequent patterns from precise
data.
Timeline
Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of
3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine
to MapReduce.
Weeks 8: Summit for mid-term evaluation
Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify
and fix performance issues
Weeks 11 - 12: Code cleaning, documenting and testing.
Biography
My name is Yarco Hayduk and I am a Comp Sci graduate student at the University
of Manitoba, Canada. I'm very interested in concurrency and data mining.
Currently, I am working on a similar project, which adopts the UFP-Growth[6]
algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a
starting point for implementing UFP-Growth. I also implemented UH-Mine and
H-Mine algorithms in C. Thus, I would use it as a reference for my proposed
UH-Mine implementation on MapReduce.
References
[1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern
mining
with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010.
ACM.
10.1145/1557019.1557030.
[2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without
candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference
on management
of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
[3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems,
challenges,
methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
March 11, 2011.
[4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient
mining
of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International
Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October
2007.
IEEE. 10.1109/ICDMW.2007.84.
[5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing
Yang.
H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM
'01:
1st IEEE International Conference on Data Mining, pages 441--448, San Jose,
California,
USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
[6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from
uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on
Knowledge
Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010.
Springer.
10.1007/978-3-642-13657-351.
> UH-Mine algorithm for frequent pattern mining of uncertain data
> ---------------------------------------------------------------
>
> Key: MAHOUT-658
> URL: https://issues.apache.org/jira/browse/MAHOUT-658
> Project: Mahout
> Issue Type: New Feature
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.4, 0.5
> Reporter: Yarco Hayduk
> Labels: gsoc, gsoc11, gsoc2011, mahout-gsoc-11
>
> Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain
> data
> Student Name: Yaroslav Hayduk
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Abstract
> Frequent pattern mining detects frequent patterns in data, which in turn
> permits data analysts determine interesting correlations in the dataset. In
> uncertain data mining, we suspect, but cannot guarantee, the presence or
> absence of an item. Currently, Apache
> Mahout does not contain algorithm implementations capable of mining frequent
> patterns from uncertain data. Hence, I propose to implement UH-Mine during
> GSoC 2011, used for frequent pattern mining of uncertain data.
> Motivation
> There are many real-life situations, where we observe uncertain data, such as
> in temperature and wind speed readings, patient diagnosis, and satellite
> imaging. Due to the probabilistic nature of this data, it takes more time and
> resources to mine it. Current state of-the-art frequent pattern mining of
> uncertain data algorithms do not provide sufficient performance results, as
> most of them are not crafted to execute in parallel.
> In uncertain data mining, we suspect, but cannot guarantee, the presence or
> absence of an item. For example, when a doctor is 90% sure that a patient
> suffers from a particular disease. A video lecture by Jian Pei [3] contains a
> thorough overview of the uncertain data problem domain as well as elaborates
> on the process of frequent pattern mining of uncertain data.
> Comparison of UF-Growth and UH-Mine
> The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way
> the transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data
> structure, to store a compact representation of the transaction database,
> which contains frequency information of all frequent items. The FP-Tree,
> however, does not store existential probabilities, associated with items.
> Hence, Leung et al. [4] created a data structure, called UF-Tree, which is
> based on the FP-tree. Each node in the UF-Tree stores an item, its expected
> support, as well as the number of occurrence of such expected support for
> each item. To merge the transaction with the child node in UF-Tree, UF-Growth
> requires both the item and its corresponding existential probability to
> match. Hence, the algorithm arrives with the UF-Tree, having a much lower
> compression ratio, that in the FP-Tree, constructed by FP-Growth.
> As such, I propose to adapt and implement the UH-Mine algorithms to the
> MapReduce programming model. The UH-Struct structure uses the linkage
> behaviour among transactions corresponding to a branch of the
> FP-Tree(UF-Tree) without actually creating a projected database.
> This approach is better than FP-Tree even in the deterministic case, when
> compression from FP-Tree is not high. This turns out to be particularly true
> for the uncertain case, as discussed earlier. The UH-mine [1] algorithm
> provides the best trade-offs both in terms of running time and memory usage.
> The UH-Mine algorithm works as follows: it
> 1. prunes the initial DB such that all singleton infrequent items are
> removed;
> 2. divides the pruned DB into equal chunks;
> 3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine
> frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains
> pointers to transaction items. At each step, the algorithm adjusts these
> pointers and does not incur the overheads, associated with the FP(UF)-Tree
> construction;
> 4. joins the results and
> 5. scans the pruned DB once again to remove false positives and obtain the
> actual counts.
> Benefits for the Mahout community
> a) My work adds an algorithm implementation for discovering frequent patterns
> in uncertain data
> b) If my algorithm implementation proves to be fast, it would permit future
> algorithm developers to retrofit my UH-Mine implementation to create
> H-Mine[5]. H-Mine is an alternative algorithm, which mines frequent patterns
> from precise data.
> Timeline
> Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed
> of 3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting
> UH-Mine to MapReduce.
> Weeks 8: Summit for mid-term evaluation
> Weeks 9 - 11: Finish-up with my implementation, refactor code smells,
> identify and fix performance issues
> Weeks 11 - 12: Code cleaning, documenting and testing.
> Biography
> My name is Yarco Hayduk and I am a Comp Sci graduate student at the
> University of Manitoba, Canada. I'm very interested in concurrency and data
> mining. Currently, I am working on a similar project, which adopts the
> UFP-Growth[6] algorithm to MapReduce. I'm using the Parallel FP-Growth
> algorithm as a starting point for implementing UFP-Growth. I also implemented
> UH-Mine and H-Mine algorithms in C. Thus, I would use it as a reference for
> my proposed UH-Mine implementation on MapReduce.
> References
> [1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern
> mining
> with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
> Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010.
> ACM.
> 10.1145/1557019.1557030.
> [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without
> candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference
> on management
> of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
> [3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems,
> challenges,
> methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed
> on
> March 11, 2011.
> [4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao.
> Efficient mining
> of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE
> International Conference on Data Mining Workshops, pages 489--494, Omaha, NE,
> USA, October 2007.
> IEEE. 10.1109/ICDMW.2007.84.
> [5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and
> Dongqing Yang.
> H-Mine: Hyper-structure mining of frequent patterns in large databases. In
> ICDM '01:
> 1st IEEE International Conference on Data Mining, pages 441--448, San Jose,
> California,
> USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
> [6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining
> from uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia
> Conference on Knowledge
> Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010.
> Springer.
> 10.1007/978-3-642-13657-351.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira