spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules

meng Tue, 18 Aug 2015 12:55:23 -0700

Repository: spark
Updated Branches:
  refs/heads/master c1840a862 -> f5ea39129



[SPARK-9900] [MLLIB] User guide for Association Rules

Updates FPM user guide to include Association Rules.

Author: Feynman Liang <fli...@databricks.com>

Closes #8207 from feynmanliang/SPARK-9900-arules.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea3912
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea3912
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea3912

Branch: refs/heads/master
Commit: f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b
Parents: c1840a8
Author: Feynman Liang <fli...@databricks.com>
Authored: Tue Aug 18 12:53:57 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Aug 18 12:53:57 2015 -0700

----------------------------------------------------------------------
 docs/mllib-frequent-pattern-mining.md           | 130 +++++++++++++++++--
 docs/mllib-guide.md                             |   1 +
 .../mllib/fpm/JavaAssociationRulesSuite.java    |   2 +-
 3 files changed, 118 insertions(+), 15 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md 
b/docs/mllib-frequent-pattern-mining.md
index 8ea4389..6c06550 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following 
(hyper-)parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) 
implements the
-FP-growth algorithm.
-It take a `JavaRDD` of transactions, where each transaction is an `Iterable` 
of items of a generic type.
+[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth)
+implements the FP-growth algorithm.  It take an `RDD` of transactions,
+where each transaction is an `Iterable` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
 
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
+
 
 {% highlight scala %}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
 
-val transactions: RDD[Array[String]] = ...
+val transactions: RDD[Array[String]] = sc.parallelize(Seq(
+  "r z h k p",
+  "z y x w v u t s",
+  "s x o n r",
+  "x z y m t s q e",
+  "z",
+  "x z y r q t p")
+  .map(_.split(" ")))
 
 val fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -60,29 +72,48 @@ val model = fpg.run(transactions)
 model.freqItemsets.collect().foreach { itemset =>
   println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
+
+val minConfidence = 0.8
+model.generateAssociationRules(minConfidence).collect().foreach { rule =>
+  println(
+    rule.antecedent.mkString("[", ",", "]")
+      + " => " + rule.consequent .mkString("[", ",", "]")
+      + ", " + rule.confidence)
+}
 {% endhighlight %}
 
 </div>
 
 <div data-lang="java" markdown="1">
 
-[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
-FP-growth algorithm.
-It take an `RDD` of transactions, where each transaction is an `Array` of 
items of a generic type.
-Calling `FPGrowth.run` with transactions returns an
+[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html)
+implements the FP-growth algorithm.  It take a `JavaRDD` of
+transactions, where each transaction is an `Array` of items of a generic
+type.  Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
 
 {% highlight java %}
+import java.util.Arrays;
 import java.util.List;
 
-import com.google.common.base.Joiner;
-
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
 import org.apache.spark.mllib.fpm.FPGrowth;
 import org.apache.spark.mllib.fpm.FPGrowthModel;
 
-JavaRDD<List<String>> transactions = ...
+JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList(
+  Arrays.asList("r z h k p".split(" ")),
+  Arrays.asList("z y x w v u t s".split(" ")),
+  Arrays.asList("s x o n r".split(" ")),
+  Arrays.asList("x z y m t s q e".split(" ")),
+  Arrays.asList("z".split(" ")),
+  Arrays.asList("x z y r q t p".split(" "))), 2);
 
 FPGrowth fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -90,7 +121,78 @@ FPGrowth fpg = new FPGrowth()
 FPGrowthModel<String> model = fpg.run(transactions);
 
 for (FPGrowth.FreqItemset<String> itemset: 
model.freqItemsets().toJavaRDD().collect()) {
-   System.out.println("[" + Joiner.on(",").join(s.javaItems()) + "], " + 
s.freq());
+  System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
+}
+
+double minConfidence = 0.8;
+for (AssociationRules.Rule<String> rule
+    : model.generateAssociationRules(minConfidence).toJavaRDD().collect()) {
+  System.out.println(
+    rule.javaAntecedent() + " => " + rule.javaConsequent() + ", " + 
rule.confidence());
+}
+{% endhighlight %}
+
+</div>
+</div>
+
+## Association Rules
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[AssociationRules](api/scala/index.html#org.apache.spark.mllib.fpm.AssociationRules)
+implements a parallel rule generation algorithm for constructing rules
+that have a single item as the consequent.
+
+{% highlight scala %}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.fpm.AssociationRules
+import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
+
+val freqItemsets = sc.parallelize(Seq(
+  new FreqItemset(Array("a"), 15L),
+  new FreqItemset(Array("b"), 35L),
+  new FreqItemset(Array("a", "b"), 12L)
+));
+
+val ar = new AssociationRules()
+  .setMinConfidence(0.8)
+val results = ar.run(freqItemsets)
+
+results.collect().foreach { rule =>
+  println("[" + rule.antecedent.mkString(",")
+    + "=>"
+    + rule.consequent.mkString(",") + "]," + rule.confidence)
+}
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+[AssociationRules](api/java/org/apache/spark/mllib/fpm/AssociationRules.html)
+implements a parallel rule generation algorithm for constructing rules
+that have a single item as the consequent.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
+import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset;
+
+JavaRDD<FPGrowth.FreqItemset<String>> freqItemsets = 
sc.parallelize(Arrays.asList(
+  new FreqItemset<String>(new String[] {"a"}, 15L),
+  new FreqItemset<String>(new String[] {"b"}, 35L),
+  new FreqItemset<String>(new String[] {"a", "b"}, 12L)
+));
+
+AssociationRules arules = new AssociationRules()
+  .setMinConfidence(0.8);
+JavaRDD<AssociationRules.Rule<String>> results = arules.run(freqItemsets);
+
+for (AssociationRules.Rule<String> rule: results.collect()) {
+  System.out.println(
+    rule.javaAntecedent() + " => " + rule.javaConsequent() + ", " + 
rule.confidence());
 }
 {% endhighlight %}
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 7851175..0573e5d 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -48,6 +48,7 @@ This lists functionality included in `spark.mllib`, the main 
MLlib API.
 * [Feature extraction and transformation](mllib-feature-extraction.html)
 * [Frequent pattern mining](mllib-frequent-pattern-mining.html)
   * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth)
+  * [association rules](mllib-frequent-pattern-mining.html#association-rules)
   * [PrefixSpan](mllib-frequent-pattern-mining.html#prefix-span)
 * [Evaluation Metrics](mllib-evaluation-metrics.html)
 * [Optimization (developer)](mllib-optimization.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/mllib/src/test/java/org/apache/spark/mllib/fpm/JavaAssociationRulesSuite.java
----------------------------------------------------------------------
diff --git 
a/mllib/src/test/java/org/apache/spark/mllib/fpm/JavaAssociationRulesSuite.java 
b/mllib/src/test/java/org/apache/spark/mllib/fpm/JavaAssociationRulesSuite.java
index b3815ae..d7c2cb3 100644
--- 
a/mllib/src/test/java/org/apache/spark/mllib/fpm/JavaAssociationRulesSuite.java
+++ 
b/mllib/src/test/java/org/apache/spark/mllib/fpm/JavaAssociationRulesSuite.java
@@ -49,7 +49,7 @@ public class JavaAssociationRulesSuite implements 
Serializable {
     JavaRDD<FPGrowth.FreqItemset<String>> freqItemsets = 
sc.parallelize(Lists.newArrayList(
       new FreqItemset<String>(new String[] {"a"}, 15L),
       new FreqItemset<String>(new String[] {"b"}, 35L),
-      new FreqItemset<String>(new String[] {"a", "b"}, 18L)
+      new FreqItemset<String>(new String[] {"a", "b"}, 12L)
     ));
 
     JavaRDD<AssociationRules.Rule<String>> results = (new 
AssociationRules()).run(freqItemsets);


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules

Reply via email to