Craig Ringer wrote:
> Allan Kamau wrote:
>> Hi all,
>> I have a list of purchases (market basket) and I would like to select
>> non redundant longest possible patterns by eliminating
>> (creating/populating other table to contain only non redandant itemsets)
>> purchases having item lists which are fully included in at least one
>> other purchase.
> 
> Here's a possibly slow and surely ugly solution (I think it's right,
> though I haven't done more than passing testing):
> 
> 
> 
> CREATE VIEW togo_as_arr AS
>   SELECT a.tid,
>     ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item)
>     AS items
>   FROM togo a GROUP BY tid;
> 
> SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by
> FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b
> WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items;

Alternately:

-- Helps with the massively repeated subquery below
CREATE INDEX togo_by_tid_and_item ON togo(tid,item);

-- Find any `a' for which `item_from_a_is_in_b' is
-- true for all items in `a'
SELECT a_tid AS is_redundant, b_tid AS contained_by
FROM (
  -- For every item in every pair of purchases,
  -- determine whether the item in purchase `a'
  -- was also in purchase `b'.
  SELECT
    a.tid AS a_tid,
    b.tid AS b_tid,
    a.item AS item,
    EXISTS(
      -- Was this item from `a' also in the `b' purchase?
      SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item
    ) AS item_from_a_is_in_b
  FROM togo a INNER JOIN togo b ON (a.tid <> b.tid)
  GROUP BY a.tid, b.tid, a.item) AS item_containment
GROUP BY a_tid, b_tid
HAVING every(item_from_a_is_in_b);


... which avoids the array building, but is actually considerably slower
on the trivial test data. That's not too surprising given that this
approach requires a subquery:

   SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item

for EVERY item to be tested. Twice, actually, as each record appears in
both `a' and `b' positions.

I'd be very interested to see what happened on real world test data,
especially compared to doing the array accumulation based query off a
temporary table instead of a view.

I suspect it'll depend on the average number of items per purchase -
lots of items per purchase and the array building cost will dominate.
That's really just a guess, though.

I'm sure there's a properly smart way to do this that I just can't
figure out, but this is the best I've come up with so far.

--
Craig Ringer

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply via email to