zhangyouhua created SPARK-6381:
----------------------------------

             Summary: add Apriori algorithm to MLLib
                 Key: SPARK-6381
                 URL: https://issues.apache.org/jira/browse/SPARK-6381
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.3.1
            Reporter: zhangyouhua
             Fix For: 1.4.0


[~mengxr]
There are many algorithms for association rule mining, for example FPGrowth
and Apriori. These are classic machine learning algorithms and are very
useful in big data mining. The FPGrowth implementation in Spark 1.3 can
handle very large data sets, but it needs to build an FP-tree before mining
frequent itemsets. So when the transaction data is small and sparse and
minSupport is large, we can choose the Apriori algorithm instead.
How can the Apriori algorithm be parallelized?
1. Generate frequent items by filtering the input data using the minimal
   support level.
     private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item]
2. Generate frequent itemsets by running Apriori; the extraction is done on
   each partition.
 2.1 Create candidateSet from kFreqItems and k.
     private def createCandidateSet[Item: ClassTag](kFreqItems: Array[(Array[Item], Long)], k: Int)
 2.2 Create kFreqItems by scanning the data set and counting the itemsets in
     candidateSet.
     private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)]
 2.3 Filter dataSet by candidateSet.
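
The steps above can be sketched on plain Scala collections to show the
level-wise loop. This is only an illustration under assumptions, not the
proposed MLlib implementation: it uses Seq and Set instead of RDD and Array,
fixes the item type to String, ignores partitioning, and reads step 2.3 as
"drop transactions that contain no candidate". The object name AprioriSketch
and the run method are hypothetical; the other method names mirror the
signatures listed above.

```scala
object AprioriSketch {
  type Itemset = Set[String]

  // Step 1: count single items and keep those that occur at least minCount times.
  def genFreqItems(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] =
    data.flatMap(_.toSeq)
      .groupBy(identity)
      .map { case (item, occurrences) => (Set(item), occurrences.size.toLong) }
      .filter { case (_, count) => count >= minCount }

  // Step 2.1: join frequent (k-1)-itemsets into k-item candidates and prune
  // every candidate that has an infrequent (k-1)-subset.
  def createCandidateSet(prevFreq: Set[Itemset], k: Int): Set[Itemset] =
    for {
      a <- prevFreq
      b <- prevFreq
      c = a | b
      if c.size == k && c.subsets(k - 1).forall(prevFreq.contains)
    } yield c

  // Step 2.2: count each candidate over the transactions and keep the frequent ones.
  def scanDataSet(data: Seq[Itemset], candidates: Set[Itemset],
                  minCount: Long): Map[Itemset, Long] =
    candidates.iterator
      .map(c => c -> data.count(t => c.subsetOf(t)).toLong)
      .filter { case (_, count) => count >= minCount }
      .toMap

  // Hypothetical driver: repeat steps 2.1 to 2.3 until no frequent itemsets remain.
  def run(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] = {
    var remaining = data
    var current = genFreqItems(remaining, minCount)
    var all = current
    var k = 2
    while (current.nonEmpty) {
      val candidates = createCandidateSet(current.keySet, k)
      // Step 2.3 (assumed meaning): a transaction that contains no k-candidate
      // cannot support any larger itemset, so it is dropped from later scans.
      remaining = remaining.filter(t => candidates.exists(_.subsetOf(t)))
      current = scanDataSet(remaining, candidates, minCount)
      all ++= current
      k += 1
    }
    all
  }
}
```

Calling AprioriSketch.run(transactions, minCount) returns every frequent
itemset with its support count. In a distributed version, the count in
scanDataSet would presumably become a per-partition count followed by a
reduce, with candidateSet shipped to the workers.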



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)