[ https://issues.apache.org/jira/browse/MADLIB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1288:
------------------------------------
    Description:
Story

As a data scientist,
I want to default itemset size to 10,
so that assoc rules does not run for a long time.

Details

We have had some complaints about how long assoc rules runs. This could be due to the implementation or to wrong parameter settings by the user, but it may also be due to a combinatorial explosion in the number of generated rules.

The R param `maxlen` defaults to 10
https://cran.r-project.org/web/packages/arules/arules.pdf
see page 10 "apriori - mining associations with apriori"
which is the same as the madlib param `max_itemset_size`
http://madlib.apache.org/docs/latest/group__grp__assoc__rules.html

From the arules documentation:

"If the minimum support is chosen too low for the dataset, then the algorithm will try to create an extremely large set of itemsets/rules. This will result in very long run time and eventually the process will run out of memory. To prevent this, the default maximal length of itemsets/rules is restricted to 10 items (via the parameter element `maxlen=10`)..."

Interface

Stays the same. The allowed values for `max_itemset_size` are:
* any number 2 or more
* if not specified, set to 10 (default)
* if the user wants all itemsets, they can specify a large number such as 1000 or 10000

Acceptance

1) Set the `max_itemset_size` parameter to 100 and run on a data set that creates rules with more than 10 items.
2) Set `max_itemset_size` to `NULL` and re-run; confirm that the default max rule size limit of 10 is respected.
3) Set the `max_itemset_size` parameter to 10 and check that it creates the same rules as in 2) above.

  was:
Story

As a data scientist,
I want to default itemset size to 10,
so that assoc rules does not run for a long time.

Details

We have had some complaints about how long assoc rules runs. This could be due to the implementation or to wrong parameter settings by the user, but it may also be due to a combinatorial explosion in the number of generated rules.
The R param `maxlen` defaults to 10
https://cran.r-project.org/web/packages/arules/arules.pdf
see page 10 "apriori - mining associations with apriori"
which is the same as the madlib param `max_itemset_size`
http://madlib.apache.org/docs/latest/group__grp__assoc__rules.html

From the arules documentation:

"If the minimum support is chosen too low for the dataset, then the algorithm will try to create an extremely large set of itemsets/rules. This will result in very long run time and eventually the process will run out of memory. To prevent this, the default maximal length of itemsets/rules is restricted to 10 items (via the parameter element `maxlen=10`)..."

Interface

Stays the same. The allowed values for `max_itemset_size` are:
* any number 2 or more
* if not specified, set to 10 (default)
* can also accept `ALL` as an input, which means generate itemsets of all sizes (this is the current behavior in 1.15.1)

Acceptance

1) Set the `max_itemset_size` parameter to 100 and run on a data set that creates rules with more than 10 items.
2) Set `max_itemset_size` to `NULL` and re-run; confirm that the default max rule size limit is respected.


> Set max itemset size to 10 by default in assoc rules
> ----------------------------------------------------
>
>                 Key: MADLIB-1288
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1288
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Association Rules
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.16
>
>
> Story
> As a data scientist,
> I want to default itemset size to 10,
> so that assoc rules does not run for a long time.
> Details
> We have had some complaints about how long assoc rules runs. This could be
> due to the implementation or to wrong parameter settings by the user, but it
> may also be due to a combinatorial explosion in the number of generated rules.
> The R param `maxlen` defaults to 10
> https://cran.r-project.org/web/packages/arules/arules.pdf
> see page 10 "apriori - mining associations with apriori"
> which is the same as the madlib param `max_itemset_size`
> http://madlib.apache.org/docs/latest/group__grp__assoc__rules.html
> From the arules documentation:
> "If the minimum support is chosen too low for the dataset,
> then the algorithm will try to create an extremely large set of
> itemsets/rules. This will result in
> very long run time and eventually the process will run out of memory. To
> prevent this, the default
> maximal length of itemsets/rules is restricted to 10 items (via the parameter
> element `maxlen=10`)..."
> Interface
> Stays the same. The allowed values for `max_itemset_size` are:
> * any number 2 or more
> * if not specified, set to 10 (default)
> * if the user wants all itemsets, they can specify a large number such as
> 1000 or 10000
> Acceptance
> 1) Set the `max_itemset_size` parameter to 100 and run on a data set that
> creates rules with more than 10 items.
> 2) Set `max_itemset_size` to `NULL` and re-run; confirm that the default max
> rule size limit of 10 is respected.
> 3) Set the `max_itemset_size` parameter to 10 and check that it creates the
> same rules as #2 above.
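
For illustration only, the parameter handling listed under Interface could be sketched roughly as below. This is not MADlib code: `resolve_max_itemset_size` and `candidate_itemset_upper_bound` are hypothetical helper names, Python's `None` stands in for SQL `NULL`, and the second function only computes a worst-case count of possible itemsets to show why an uncapped size can explode.

```python
from math import comb

# Default proposed in this issue (matches the arules `maxlen` default).
DEFAULT_MAX_ITEMSET_SIZE = 10


def resolve_max_itemset_size(value=None):
    """Hypothetical helper: map the user-supplied value to the effective cap."""
    if value is None:
        # NULL / not specified: fall back to the default of 10.
        return DEFAULT_MAX_ITEMSET_SIZE
    if value < 2:
        raise ValueError("max_itemset_size must be 2 or more")
    return value


def candidate_itemset_upper_bound(n_items, max_size):
    """Worst-case number of itemsets of size 1..max_size over n_items distinct items."""
    top = min(n_items, max_size)
    return sum(comb(n_items, k) for k in range(1, top + 1))


if __name__ == "__main__":
    # Compare the default cap, an explicit 100, and an explicit 10 on 30 items.
    for requested in (None, 100, 10):
        cap = resolve_max_itemset_size(requested)
        print(requested, cap, candidate_itemset_upper_bound(30, cap))
```

Under these assumptions, passing `None` and passing 10 give the same cap, which mirrors acceptance steps 2) and 3), while a large value such as 100 effectively removes the cap for a data set with fewer distinct items than that.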