[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-6381.
------------------------------
       Resolution: Duplicate
    Fix Version/s:     (was: 1.4.0)

(Don't set fix version, and 1.3.1 does not exist.)

Search JIRA first please. This was already implemented in SPARK-4001 as FP-growth. See also SPARK-2432.

> add Apriori algorithm to MLLib
> ------------------------------
>
>                 Key: SPARK-6381
>                 URL: https://issues.apache.org/jira/browse/SPARK-6381
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: zhangyouhua
>
> [~mengxr]
> There are many algorithms for association rule mining, for example FPGrowth,
> Apriori, and so on. These are classic machine learning algorithms and are
> very useful in big data mining. The FPGrowth implementation in Spark 1.3 can
> handle very large data sets, but it needs to build an FP-tree before mining
> frequent items. So when the transaction data is smaller and sparse and
> minSupport is larger, we can choose the Apriori algorithm.
> How can the Apriori algorithm be parallelized?
> 1. Generate frequent items by filtering the input data using the minimal
> support level.
> private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item]
> 2. Generate frequent itemsets by building Apriori; the extraction is done on
> each partition.
> 2.1 Create candidateSet from kFreqItems and k.
> private def createCandidateSet[Item: ClassTag](kFreqItems: Array[(Array[Item], Long)], k: Int)
> 2.2 Create kFreqItems from candidateSet by scanning dataSet.
> private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)]
> 2.3 Filter dataSet by candidateSet.
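For readers following the steps in the quoted description, here is a minimal sketch of the level-wise Apriori loop. It is not the reporter's code and not anything in MLlib: it runs on plain Scala collections instead of RDDs, fixes the item type to String, and drops the partitioner argument. The helper names mirror the quoted signatures, but their bodies are assumptions. `filterDataSet` in particular is only one possible reading of step 2.3, and `AprioriSketch` and `run` are hypothetical names introduced for illustration.

```scala
object AprioriSketch {
  type Itemset = Set[String]

  // Step 1: keep single items that occur in at least minCount transactions.
  // Each transaction is a Set, so an item is counted once per transaction.
  def genFreqItems(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] =
    data.flatten
      .groupBy(identity)
      .map { case (item, occurrences) => Set(item) -> occurrences.size.toLong }
      .filter { case (_, count) => count >= minCount }

  // Step 2.1: join frequent (k-1)-itemsets into k-item candidates, then prune
  // every candidate that has an infrequent (k-1)-subset.
  def createCandidateSet(prevFrequent: Set[Itemset], k: Int): Set[Itemset] = {
    val joined = for {
      a <- prevFrequent
      b <- prevFrequent
      merged = a | b
      if merged.size == k
    } yield merged
    joined.filter(c => c.subsets(k - 1).forall(prevFrequent.contains))
  }

  // Step 2.2: scan the data set, count each candidate, keep the frequent ones.
  def scanDataSet(data: Seq[Itemset], candidates: Set[Itemset],
                  minCount: Long): Map[Itemset, Long] =
    candidates.iterator
      .map(c => c -> data.count(t => c.subsetOf(t)).toLong)
      .filter { case (_, count) => count >= minCount }
      .toMap

  // Step 2.3 (assumed reading): drop transactions that contain none of the
  // frequent k-itemsets; they cannot support any larger frequent itemset.
  def filterDataSet(data: Seq[Itemset], frequent: Set[Itemset]): Seq[Itemset] =
    data.filter(t => frequent.exists(_.subsetOf(t)))

  // Level-wise loop: grow k until no frequent k-itemset remains.
  def run(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] = {
    var current = genFreqItems(data, minCount)
    var remaining = filterDataSet(data, current.keySet)
    var all = current
    var k = 2
    while (current.nonEmpty) {
      val candidates = createCandidateSet(current.keySet, k)
      current = scanDataSet(remaining, candidates, minCount)
      remaining = filterDataSet(remaining, current.keySet)
      all = all ++ current
      k += 1
    }
    all
  }
}
```

Calling `AprioriSketch.run(transactions, minCount)` returns a map from each frequent itemset to its transaction count. A distributed version along the lines of the quoted signatures would presumably replace `Seq` with `RDD` and broadcast the candidate set for the scan step, but that is a guess at the intended design.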
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org