Frank McQuillan created MADLIB-1274:
---------------------------------------

             Summary: Association rules hangs/errors out for toy example
                 Key: MADLIB-1274
                 URL: https://issues.apache.org/jira/browse/MADLIB-1274
             Project: Apache MADlib
          Issue Type: Bug
          Components: Module: Association Rules
            Reporter: Frank McQuillan


Error observed on:
* Postgres 9.6
* Greenplum Database 5.9.0

This is a small AWS single node GP, 4 segments on a machine with 8 VCPUs, and plenty of available memory:

[gpadmin@ip-172-21-0-246 RetailDemo]$ cat /proc/meminfo
MemTotal:       62711428 kB
MemFree:        59786076 kB
MemAvailable:   60281836 kB

Load data

```
DROP TABLE IF EXISTS order_items;

CREATE TABLE order_items(
    itemid INTEGER,
    orderid INTEGER,
    productid INTEGER,
    quantity INTEGER,
    productname TEXT);

INSERT INTO order_items VALUES
    (5, 1044, 9, 3, 'Kirby cukes'),
    (11, 37, 2, 3, 'Ooopsi Cola'),
    (12, 37, 21, 3, 'black radish'),
    (15, 37, 49, 3, 'Leg of lamb'),
    (18, 37, 37, 3, 'Uggo Waffles'),
    (20, 37, 76, 3, 'Happy Valley White Peaches'),
    (21, 37, 29, 3, 'Breakstone Whole Milk Cottage Cheese'),
    (22, 37, 25, 3, 'ugli fruit'),
    (4, 1044, 44, 3, 'ground beef'),
    (6, 1044, 17, 3, 'napa'),
    (9, 1044, 10, 3, 'dill'),
    (13, 37, 21, 3, 'black radish'),
    (24, 37, 47, 3, 'Ball Park Franks'),
    (25, 37, 69, 3, 'Ball Park Mustard'),
    (26, 37, 64, 3, 'Ballpark Hot Dog Rolls'),
    (27, 1044, 47, 3, 'Ball Park Franks'),
    (28, 1044, 69, 3, 'Ball Park Mustard'),
    (29, 1044, 64, 3, 'Ballpark Hot Dog Rolls'),
    (30, 1044, 70, 3, 'Homer''s Strawberry Jam'),
    (31, 1044, 71, 3, 'Mr Peanut Peanut Butter'),
    (32, 37, 71, 3, 'Mr Peanut Peanut Butter'),
    (33, 37, 70, 3, 'Homer''s Strawberry Jam'),
    (1, 1044, 1, 3, 'Pivotal Apple Juice'),
    (3, 1044, 77, 3, 'Pivotal Baked Beans'),
    (14, 37, 53, 3, 'Old Zurich Swiss Cheese'),
    (17, 37, 49, 3, 'Leg of lamb'),
    (19, 37, 18, 3, 'california navels'),
    (2, 1044, 41, 3, '12" Dinner Plates'),
    (7, 1044, 32, 3, 'Vermot Extra Sharp Cheddar'),
    (8, 1044, 71, 3, 'Mr Peanut Peanut Butter'),
    (10, 1044, 39, 3, 'Pivotal Soft and Smooth 24 pack'),
    (16, 37, 22, 3, 'triple wahsed spinach'),
    (23, 37, 61, 3, 'Brooklyn Bagel 6 pack');
```

(1) Run assoc rules:

```
SELECT * FROM madlib.assoc_rules( .25,
                                  .5,
                                  'orderid',
                                  'productid',
                                  'order_items',
                                  NULL,
                                  TRUE
                                );
```

This call does not return.

(2) Run assoc rules with an output table specified:

```
SELECT * FROM madlib.assoc_rules(.10,              -- Support
                                 .10,              -- Confidence
                                 'orderid',        -- Transaction id col
                                 'productname',    -- Product col
                                 'order_items',    -- Input data
                                 'pivotalmarkets', -- Output data
                                 TRUE);            -- Verbose
```

results in error:

```
InternalError: (psycopg2.InternalError) plpy.Error: the output schema does not exist
CONTEXT:  Traceback (most recent call last):
  PL/Python function "assoc_rules", line 31, in <module>
    'NULL'
  PL/Python function "assoc_rules", line 107, in assoc_rules
  PL/Python function "assoc_rules", line 21, in __assert
PL/Python function "assoc_rules"
 [SQL: "SELECT * FROM madlib.assoc_rules(.10, -- Support\n .10, -- Confidence\n 'orderid', -- Transaction id col\n 'productname', -- Product col\n 'order_items', -- Input data\n 'pivotalmarkets', -- Output data\n TRUE); -- Verbose"]
```

Other info on failure on GP:

```
The original table was distributed randomly.
If distributed by trans_id, the code completes. I get no assoc_rules, but it doesn’t run forever.
If test_data is distributed randomly, the function returns, but there are no assoc_rules.
So the behavior is different depending upon the table distribution.
There may be a tiny data set issue where there are no rules that meet the support and confidence thresholds.
```
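
As a rough sanity check on the threshold remark, the sketch below recomputes item support directly from the (orderid, productid) pairs in the INSERT statement. It is plain Python and does not call MADlib; the names `ROWS`, `transactions`, `item_support`, and `count_itemsets_two_baskets` are illustrative only, and the itemset count assumes an itemset is "frequent" when it appears in at least one of the two orders (which holds for any support threshold at or below 0.5). Whether this has anything to do with the hang is an assumption, not something the report confirms.

```python
from collections import defaultdict

# (orderid, productid) pairs copied from the INSERT statement in the report.
ROWS = [
    (1044, 9), (37, 2), (37, 21), (37, 49), (37, 37), (37, 76), (37, 29),
    (37, 25), (1044, 44), (1044, 17), (1044, 10), (37, 21), (37, 47),
    (37, 69), (37, 64), (1044, 47), (1044, 69), (1044, 64), (1044, 70),
    (1044, 71), (37, 71), (37, 70), (1044, 1), (1044, 77), (37, 53),
    (37, 49), (37, 18), (1044, 41), (1044, 32), (1044, 71), (1044, 39),
    (37, 22), (37, 61),
]


def transactions(rows):
    """Group product ids into one set per order (duplicate rows collapse)."""
    baskets = defaultdict(set)
    for orderid, productid in rows:
        baskets[orderid].add(productid)
    return dict(baskets)


def item_support(baskets):
    """Fraction of orders that contain each single product."""
    n_orders = len(baskets)
    counts = defaultdict(int)
    for items in baskets.values():
        for item in items:
            counts[item] += 1
    return {item: count / n_orders for item, count in counts.items()}


def count_itemsets_two_baskets(baskets):
    """Count distinct non-empty itemsets contained in at least one order.

    Inclusion-exclusion over exactly two baskets A and B:
    (2^|A| - 1) + (2^|B| - 1) - (2^|A & B| - 1).
    """
    if len(baskets) != 2:
        raise ValueError("this helper only handles exactly two orders")
    a, b = baskets.values()
    return (2 ** len(a) - 1) + (2 ** len(b) - 1) - (2 ** len(a & b) - 1)


baskets = transactions(ROWS)
support = item_support(baskets)

print("orders:", len(baskets))
for orderid, items in sorted(baskets.items()):
    print("order", orderid, "distinct products:", len(items))
print("lowest single-item support:", min(support.values()))
print("itemsets present in at least one order:",
      count_itemsets_two_baskets(baskets))
```

On this data there are only two orders, so every product has support of at least 0.5 and clears both the 0.25 and 0.10 thresholds used in the calls above. That sits awkwardly with the guess that no rules meet the thresholds, and the number of qualifying itemsets is large for such a small table, so it may be worth repeating the test with more orders before drawing conclusions.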