Repository: madlib
Updated Branches:
  refs/heads/master d62e5516b -> e0f76db8b


add caution on run-times to assoc rules user docs re: max itemset size usage


Project: http://git-wip-us.apache.org/repos/asf/madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/e0f76db8
Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/e0f76db8
Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/e0f76db8

Branch: refs/heads/master
Commit: e0f76db8bf2d7ca478d972cef302939b6f2babb5
Parents: d62e551
Author: Frank McQuillan <fmcquil...@pivotal.io>
Authored: Tue Sep 18 15:02:18 2018 -0700
Committer: Frank McQuillan <fmcquil...@pivotal.io>
Committed: Tue Sep 18 15:02:18 2018 -0700

----------------------------------------------------------------------
 .../modules/assoc_rules/assoc_rules.sql_in | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/madlib/blob/e0f76db8/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in b/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in
index ec3c330..bcd5464 100644
--- a/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in
+++ b/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in
@@ -161,6 +161,12 @@ Given a frequent itemset \f$ A \f$ generated from the Apriori algorithm, and all
 subsets \f$ B \f$ , we generate rules such that
 \f$ B \Rightarrow (A - B) \f$ meets minimum confidence requirements.
 
+@note Beware of combinatorial explosion. The Apriori algorithm can potentially
+generate a huge number of rules, even for fairly simple data sets, resulting
+in run-times that are unreasonably long. To avoid this, it is recommended
+to cap the maximum itemset size to a small number to start with, then
+increase it gradually. <em>Support</em> and <em>confidence</em> values are
+parameters that can also be used to control rule generation.
 
 @anchor syntax
 @par Function Syntax
@@ -257,14 +263,16 @@ This generates all association rules that satisfy the specified minimum
 \c conviction columns are calculated as described earlier.
 </dd>
 
- <dt>verbose</dt>
+ <dt>verbose (optional)</dt>
 <dd>BOOLEAN, default: FALSE. Determines if details are printed for each
 iteration as the algorithm progresses.</dd>
 
- <dt>max_itemset_size</dt>
+ <dt>max_itemset_size (optional)</dt>
 <dd>INTEGER, default: generate itemsets of all sizes. Determines the maximum size of frequent
 itemsets that are used for generating association rules. Must be 2 or more.
- This parameter can be used to reduce run time for data sets where itemset size is large. </dd>
+ This parameter can be used to reduce run time for data sets where itemset size is large,
+ which is a common situation. If your query is not returning or is running too long,
+ try using a lower value for this parameter.</dd>
 </dl>
 
 
@@ -338,7 +346,8 @@ Result:
 (7 rows)
 </pre>
 
--# Limit association rules generated from itemsets of size at most 2:
+-# Limit association rules generated from itemsets of size at most 2. This parameter is
+a good way to reduce long run times.
 <pre class="example">
 SELECT * FROM madlib.assoc_rules( .25,            -- Support
                                   .5,             -- Confidence