[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258697#comment-14258697 ] zhangyouhua commented on SPARK-4001: apriori and FPGrowth are commonly algorithm,there are two mainly method.one is calculate frequently item set,another is calculate association rules.here we have implementate formar,the latter will be submit as soon as possible. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Attachments: Distributed frequent item mining algorithm based on Spark.pptx Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258695#comment-14258695 ] zhangyouhua commented on SPARK-4001: apriori and FPGrowth are commonly algorithm,there are two mainly method.one is calculate frequently item set,another is calculate association rules.here we have implementate formar,the latter will be submit as soon as possible. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Attachments: Distributed frequent item mining algorithm based on Spark.pptx Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258696#comment-14258696 ] zhangyouhua commented on SPARK-4001: apriori and FPGrowth are commonly algorithm,there are two mainly method.one is calculate frequently item set,another is calculate association rules.here we have implementate formar,the latter will be submit as soon as possible. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Attachments: Distributed frequent item mining algorithm based on Spark.pptx Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14236501#comment-14236501 ] Jacky Li commented on SPARK-4001: - Sure, Xiangrui. I will update it on next Monday while in office. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233067#comment-14233067 ] Xiangrui Meng commented on SPARK-4001: -- [~jackylk] Could you share some performance testing results? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226562#comment-14226562 ] Jacky Li commented on SPARK-4001: - Thanks for your suggestion, Daniel. Here is the current status. 1. Currently I have implemented apriori and fp-growth by referring to YAFIM (http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and PFP (http://dl.acm.org/citation.cfm?id=1454027) For apriori, currently there are two versions implemented, one using broadcast variable and another one using cartisian join of two RDD, I am testing them using mushroom and webdoc open dataset (http://fimi.ua.ac.be/data/) to check the performance of them before deciding which one to contribute to MLlib. I have updated the code in the PR (https://github.com/apache/spark/pull/2847), you are welcome to check it and try in your use case. 2. For the input part, currently the apriori algo is taking {{RDD\[Array\[String\]\]}} as the input dataset, but not containing basket_id or user_id. I am not sure whether it can easily fit into your use case. Can you give more detail of how you want to use it in collaborative filtering contexts? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226793#comment-14226793 ] Daniel Erenrich commented on SPARK-4001: First a minor point: I'd suggest making it take an RDD of arrays of ints? Holding tons of strings around seems wasteful. The user can maintain a map from strings to ints. Or else can we just make the array contain comparables? So the main issue is whether we are trying to do frequent itemset mining or association rule construction. I argue the latter is more common an operation and while the second requires the first there's no great extra cost to doing both. I'm actually unfamiliar with what can be done with just the former and not the latter. If you equate baskets with users the connection between association rules and collaborative filtering becomes very clear. I want to feed in someone's, say, movie viewing history and get reccomendations of the form you watched X and Y so you'll really like Z (where X,Y,Z is a frequent itemset). The API could be made to match. Give me all the things this user bought in the format I described above and the prediction mode is here are all of things this person bought please apply as many rules as you can. If a user does care about the frequent item sets that would be additionally stored inside the model. The alternative here is to make a set of frequent itemset miners and then make an association rule learner that takes their output. The only downside is that that suffers some perf loss (requiring an additional pass). I'll gladly write this version if we decide that's the way we should go. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225728#comment-14225728 ] Daniel Erenrich commented on SPARK-4001: I was about to start coding something like this when I noticed this ticket. What's the status here? Association rule algorithms in general (and apriori in particular) are useful in collaborative filtering contexts (which mllib already has code for). As far as library cohesiveness, my though here is that we can frame the inputs to look near identical to the matrix facotarization code though with (basket_id, item_id) instead of (user_id, item_id). That input format would be inefficient though (so maybe we'd support a second more natural input format. This though would sidestep the concern the sklearn folks had. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178183#comment-14178183 ] wangfei commented on SPARK-4001: Thanks Sean Owen for explaining! Frequent itemset algorithm works by scanning the input data set, there is no probabilistic model in nature. To answer Xiangrui Meng’s earlier questions: 1. These algorithm is used for finding major patterns / association rules in a data set. For a real use case, some analytic applications in telecom domain use them to find subscriber behavior from the data set combining service record, network traffic record, and demographic data. Please refer to this Chinese article for example: http://www.ss-lw.com/wxxw-361.html And, sometimes we use frequent itemset algorithm for preparing features input to other algorithm which selects feature and do other ML task like training a classifier, like this paper: http://dl.acm.org/citation.cfm?id=1401922, 2. Since Apriori is a basic algorithm for frequent itemset mining, I am not aware of any parallel implementation for it. But I think the algorithm fits Spark’s data parallel model since it only need to scan the input data set. And for FP-Growth, I do know there is a Parallel FP-Growth from Haoyuan Li: http://dl.acm.org/citation.cfm?id=1454027 . I think I probably will refer to this paper to implement FP-Growth in Spark 3. The Apriori computation complexity is about O(N*k) where N is the number of item in input data and k is the depth of the frequent item tree to search. FP-Grwoth complexity is about O(N), it is more efficient comparing to Apriori. For space efficiency, FP-growth is also more efficient than Apriori. But in case of smaller data and if frequent itemset is more, Apriori is more efficient. This is because FP-Growth need to construct a FP Tree out of the input data set, and it needs some time. And another advantage of Apriori is that it can output association rules while FP-Growth can not. Although these two algorithms are basic algo (FP-Growth is more complex), I think it will be handy if mllib can include them since there is no frequent itemset mining algo in Spark yet, and especially in distributed environment. Please suggest how to handle this issue. Thanks a lot. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178200#comment-14178200 ] Jacky Li commented on SPARK-4001: - I accidentally use wangfei (my college) 's account to send the last comment, sorry for the inconvenience. Thanks @Sean Owen for explaining! Frequent itemset algorithm works by scanning the input data set, there is no probabilistic model in nature. To answer @Xiangrui Meng’s earlier questions: 1. These algorithm is used for finding major patterns / association rules in a data set. For a real use case, some analytic applications in telecom domain use them to find subscriber behavior from the data set combining service record, network traffic record, and demographic data. Please refer to this Chinese article for example: http://www.ss-lw.com/wxxw-361.html And, sometimes we use frequent itemset algorithm for preparing features input to other algorithm which selects feature and do other ML task like training a classifier, like this paper: http://dl.acm.org/citation.cfm?id=1401922, 2. Since Apriori is a basic algorithm for frequent itemset mining, I am not aware of any parallel implementation for it. But I think the algorithm fits Spark’s data parallel model since it only need to scan the input data set. And for FP-Growth, I do know there is a Parallel FP-Growth from Haoyuan Li: http://dl.acm.org/citation.cfm?id=1454027 . I think I probably will refer to this paper to implement FP-Growth in Spark 3. The Apriori computation complexity is about O(N*k) where N is the number of item in input data and k is the depth of the frequent item tree to search. FP-Grwoth complexity is about O(N), it is more efficient comparing to Apriori. For space efficiency, FP-growth is also more efficient than Apriori. But in case of smaller data and if frequent itemset is more, Apriori is more efficient. This is because FP-Growth need to construct a FP Tree out of the input data set, and it needs some time. And another advantage of Apriori is that it can output association rules while FP-Growth can not. Although these two algorithms are basic algo (FP-Growth is more complex), I think it will be handy if mllib can include them since there is no frequent itemset mining algo in Spark yet, and especially in distributed environment. Please suggest how to handle this issue. Thanks a lot. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177081#comment-14177081 ] Xiangrui Meng commented on SPARK-4001: -- [~jackylk] Could you provide references and comment on its use cases? I'm not familiar with the database context. How does it compare to freq-item algorithms, e.g., http://dl.acm.org/citation.cfm?id=762473 ? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177640#comment-14177640 ] Xiangrui Meng commented on SPARK-4001: -- No. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177643#comment-14177643 ] Jacky Li commented on SPARK-4001: - Maybe there is a misunderstand, I do not mean to use it in database by saying transactional data set. Apriori is the classic algorithm for freq-item mining, and the frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the data set. This has applications in domains such as market basket analysis for cross-selling and up-selling. I am not sure whether there is a freq-item algorithm in the mllib, so actually I am implementing Apriori and FP-Growth (not include in this JIRA thread). Will it be useful if contribute to mllib? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177658#comment-14177658 ] Xiangrui Meng commented on SPARK-4001: -- I'm asking because I'm not very familiar with this algorithm. I need some references to understand the following: 1) how important/popular the algorithm is (paper, use cases) 2) whether it is straight-forward to implement it in parallel 3) what is the storage and computation complexity 4) whether there are alternatives and how they compare For example, for the one I mentioned (A simple algorithm for finding frequent elements in streams and bags): 1) it has 400 references on google scholar / it finds frequent items that appears above a threshold in two passes 2) it is very easy to implement in parallel 3) storage is O(1/p) and two-passes, where p is threshold on the frequency Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177707#comment-14177707 ] Sean Owen commented on SPARK-4001: -- FWIW I do perceive Apriori to be *the* basic frequent itemset algorithm. I think this is the original paper -- at least it was on Wikipedia and looks like the right time / author: http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf It is very simple, and probably what you'd cook up if you invented a solution to the problem: http://en.wikipedia.org/wiki/Apriori_algorithm Frequent itemset is not quite the same as a frequent item algorithm. From a bunch of sets of items, it tries to determine which subsets occur frequently. FP-Growth is the other itemset algorithm I have ever heard of. It's more sophisticated. I don't have a paper reference. If you're going to implement frequent itemsets, I think these are the two to start with. That said I perceive frequent itemsets to be kind of 90s and I have never had to use it myself. That is not to say they don't have use, and hey they're simple. I suppose my problem with this type of technique is that it's not really telling you whether the set occurred unusually frequently, just that it did in absolute terms. There is not a probabilistic element to these. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177801#comment-14177801 ] Xiangrui Meng commented on SPARK-4001: -- Thanks [~srowen]! That's helpful! I searched on Google and found some discussion in scikit-learn: 1. https://github.com/scikit-learn/scikit-learn/issues/2662 2. https://github.com/scikit-learn/scikit-learn/issues/2872 They didn't accept the patch for couple reasons: a) relevance, b) integration with other components. [~jackylk] Could you comment on the real use cases that you are aware of and the complexity of the algorithm? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176262#comment-14176262 ] Apache Spark commented on SPARK-4001: - User 'jackylk' has created a pull request for this issue: https://github.com/apache/spark/pull/2847 Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org