[
https://issues.apache.org/jira/browse/MADLIB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826384#comment-16826384
]
Frank McQuillan edited comment on MADLIB-1288 at 4/25/19 7:23 PM:
------------------------------------------------------------------
LGTM, see the PR for tests.
Also see the attached doc: the change does improve speed, roughly in proportion
to the itemset size.
was (Author: fmcquillan):
LGTM , see PR for tests
> Set max itemset size to 10 by default in assoc rules
> ----------------------------------------------------
>
> Key: MADLIB-1288
> URL: https://issues.apache.org/jira/browse/MADLIB-1288
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Association Rules
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.16
>
> Attachments: assoc-rules-scale.pdf
>
>
> Story
> As a data scientist,
> I want to default itemset size to 10,
> so that assoc rules does not run for a long time.
> Details
> We have had some complaints about how long assoc rules takes to run. The cause
> could be the implementation or wrong parameter settings by the user, but it
> may also be a combinatorial explosion in the number of generated rules.
> The R `arules` param `maxlen` defaults to 10; see page 10, "apriori - mining
> associations with apriori", in
> https://cran.r-project.org/web/packages/arules/arules.pdf
> It is the same as the MADlib param `max_itemset_size`:
> http://madlib.apache.org/docs/latest/group__grp__assoc__rules.html
> The arules documentation gives the reason for the default:
> "If the minimum support is chosen too low for the dataset,
> then the algorithm will try to create an extremely large set of
> itemsets/rules. This will result in very long run time and eventually the
> process will run out of memory. To prevent this, the default maximal length
> of itemsets/rules is restricted to 10 items (via the parameter element
> `maxlen=10`)..."
> Interface
> Stays the same. The allowed values for `max_itemset_size` are:
> * any integer 2 or greater
> * if not specified, it is set to 10 (default)
> * if the user wants all itemsets, they can specify a big number like 1000 or
> 10000
> Acceptance
> 1) Set the `max_itemset_size` parameter to 100 and run a data set that creates
> rules with more than 10 items.
> 2) Set `max_itemset_size` to `NULL` and re-run; confirm that the default
> maximum rule size of 10 is respected.
> 3) Set the `max_itemset_size` parameter to 10 and check that it creates the
> same rules as step 2 above.
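The three acceptance steps above can be illustrated with a small Python sketch. It is not MADlib code: `frequent_itemsets`, `DEFAULT_MAX_ITEMSET_SIZE`, and the toy baskets are hypothetical names and data chosen for illustration, and the sketch mines frequent itemsets rather than full rules. It only shows how a size cap with a `NULL`-means-10 default might behave.

```python
DEFAULT_MAX_ITEMSET_SIZE = 10  # default proposed in this issue


def frequent_itemsets(transactions, min_support, max_itemset_size=None):
    """Toy level-wise miner (an illustration, not MADlib's implementation).

    Returns sorted tuples of items whose support is >= min_support,
    never growing an itemset beyond max_itemset_size items.
    """
    if max_itemset_size is None:  # plays the role of SQL NULL
        max_itemset_size = DEFAULT_MAX_ITEMSET_SIZE
    if max_itemset_size < 2:
        raise ValueError("max_itemset_size must be 2 or more")

    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if b.issuperset(itemset)) / len(baskets)

    # Level 1: frequent single items.
    level = [(i,) for i in items if support((i,)) >= min_support]
    result = list(level)
    size = 1
    while level and size < max_itemset_size:
        # Extend each frequent itemset with a strictly larger item, so each
        # candidate is generated exactly once.
        candidates = [s + (i,) for s in level for i in items if i > s[-1]]
        level = [c for c in candidates if support(c) >= min_support]
        result.extend(level)
        size += 1
    return result


# Hypothetical data: 11 items that always occur together in 3 of 4 baskets.
baskets = [list("abcdefghijk")] * 3 + [["a", "b"]]

uncapped = frequent_itemsets(baskets, 0.5, max_itemset_size=100)   # step 1
defaulted = frequent_itemsets(baskets, 0.5, max_itemset_size=None)  # step 2
explicit = frequent_itemsets(baskets, 0.5, max_itemset_size=10)    # step 3

print(max(len(s) for s in uncapped))   # largest itemset with the cap at 100
print(max(len(s) for s in defaulted))  # largest itemset with the default cap
print(defaulted == explicit)           # steps 2 and 3 should agree
```

With this data the cap of 100 lets the single 11-item itemset through, while the default and the explicit value of 10 stop one level earlier and return identical results.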