Re: [GENERAL] Indexing queries with bit masks

2010-05-02 Thread Mike Christensen
Hey thanks..  I thought I'd share the method I came up with for updating
subscriptions.  Basically, as far as my code is concerned the DB uses a
bitmask (at least for updates) but I abstract it through a function.  First
off, I have a little helper function so I don't repeat the same code a bunch
of times:

CREATE OR REPLACE FUNCTION KPC_UpdateEmailPreferenceHelper(_enable boolean,
_userid uuid, _type smallint)
   RETURNS void AS
   $BODY$
  BEGIN
 IF _enable THEN
INSERT INTO EmailPreferences (UserId, NotificationType) SELECT
_userid, _type
WHERE NOT EXISTS (SELECT 1 FROM EmailPreferences WHERE UserId =
_userid AND NotificationType = _type);
 ELSE
DELETE FROM EmailPreferences WHERE UserId = _userid AND
NotificationType = _type;
 END IF;
  END;
   $BODY$
   LANGUAGE 'plpgsql' VOLATILE
   COST 100;

Then through my code I call this one with a bitmask of which notifications
the user wants to subscribe to:

CREATE OR REPLACE FUNCTION KPC_UpdateEmailPreference(_userid uuid, _prefs
smallint)
  RETURNS void AS
$BODY$
  BEGIN
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  1  0, _userid,
1::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  2  0, _userid,
2::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  4  0, _userid,
3::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  8  0, _userid,
4::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  16  0, _userid,
5::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  32  0, _userid,
6::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  64  0, _userid,
7::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  128  0, _userid,
8::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  256  0, _userid,
9::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  512  0, _userid,
10::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  1024  0,
_userid, 11::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  2048  0,
_userid, 12::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  4096  0,
_userid, 13::smallint);
 PERFORM KPC_UpdateEmailPreferenceHelper (_prefs  8192  0,
_userid, 14::smallint);
  END;
   $BODY$
  LANGUAGE 'plpgsql' VOLATILE
  COST 100;

Seems to work pretty well, anyone have any feedback?

Mike

2010/5/1 Filip Rembiałkowski plk.zu...@gmail.com

 2010/4/30 Mike Christensen m...@kitchenpc.com:
  Ok I've been blatantly lying, err, purposely simplifying the problem for
 the
  sake of the original email :)
 
  I've read over the responses, and am actually now considering just not
 using
  any index at all.  Here's why:
 
  First, this actually isn't the only thing on the WHERE clause.  It will
 only
  query for users who are friends with you so it can notify them of your
  activities.  That's done via a weird JOIN on a table that holds all the
  friend relationships.  So in reality, it will only load maybe a hundred
  rows, or maybe a thousand every once in a while if you're way popular.
 If
  I'm not mistaken, it should use the index to narrow it down to the list
 of
  friends, and then use a sequential scan to weed out the ones who
 subscribe
  to that type of notification.
 
  Second, the only thing /ever/ that will do this query is the queue
 service
  whose job it is to process notifications (which are files dropped on the
  file system) and email people all day long.  This service handles one job
 at
  a time, and could potentially run on its own machine with its own
 read-only
  copy of the database.  Thus, even if it was a fairly slow query, it's not
  gonna bring down the rest of the site.
 
  Regarding the idea of putting an index on each bit, I thought about this
  earlier as well as just kinda cringed.  The users table gets updated
 quite a
  bit (last logon, session id, any time they change their profile info,
  etc)..  Too many indexes is bad.  I could just put the data in another
 table
  of course, which lead me to another idea.  Have a table called
 Subscriptions
  and have each row hold a user id and a notification type.  I could index
  both, and join on (Subscriptions.UserId = Users.UserId AND
  Subscriptions.Type = 8).  This would be pretty dang fast, however updates
  are kinda a royal pain.  When the user changes which types of
 subscriptions
  they want (via a list of checkboxes), I'd have to figure out which rows
 to
  delete and which new ones to insert.  However, I think I have an idea in
  mind for a PgSQL function you pass in the bitmask to and then it
  translates it to conditional deletes and inserts.
 
  A third idea I'm tossing around is just not worry about it.  Put the
 bitmask
  in the DB, but not filter on it.  Every friend would be loaded into the
  dataset, but the queue processor would just skip rows if they didn't
  subscribe to that 

Re: [GENERAL] Indexing queries with bit masks

2010-05-01 Thread Filip Rembiałkowski
2010/4/30 Mike Christensen m...@kitchenpc.com:
 Ok I've been blatantly lying, err, purposely simplifying the problem for the
 sake of the original email :)

 I've read over the responses, and am actually now considering just not using
 any index at all.  Here's why:

 First, this actually isn't the only thing on the WHERE clause.  It will only
 query for users who are friends with you so it can notify them of your
 activities.  That's done via a weird JOIN on a table that holds all the
 friend relationships.  So in reality, it will only load maybe a hundred
 rows, or maybe a thousand every once in a while if you're way popular.  If
 I'm not mistaken, it should use the index to narrow it down to the list of
 friends, and then use a sequential scan to weed out the ones who subscribe
 to that type of notification.

 Second, the only thing /ever/ that will do this query is the queue service
 whose job it is to process notifications (which are files dropped on the
 file system) and email people all day long.  This service handles one job at
 a time, and could potentially run on its own machine with its own read-only
 copy of the database.  Thus, even if it was a fairly slow query, it's not
 gonna bring down the rest of the site.

 Regarding the idea of putting an index on each bit, I thought about this
 earlier as well as just kinda cringed.  The users table gets updated quite a
 bit (last logon, session id, any time they change their profile info,
 etc)..  Too many indexes is bad.  I could just put the data in another table
 of course, which lead me to another idea.  Have a table called Subscriptions
 and have each row hold a user id and a notification type.  I could index
 both, and join on (Subscriptions.UserId = Users.UserId AND
 Subscriptions.Type = 8).  This would be pretty dang fast, however updates
 are kinda a royal pain.  When the user changes which types of subscriptions
 they want (via a list of checkboxes), I'd have to figure out which rows to
 delete and which new ones to insert.  However, I think I have an idea in
 mind for a PgSQL function you pass in the bitmask to and then it
 translates it to conditional deletes and inserts.

 A third idea I'm tossing around is just not worry about it.  Put the bitmask
 in the DB, but not filter on it.  Every friend would be loaded into the
 dataset, but the queue processor would just skip rows if they didn't
 subscribe to that event.  In other words, move the problem down to the
 business layer.  The drawback is potentially large number of rows are
 loaded, serialized, etc into memory that will just be ignored.  But of
 course the DB is probably a read-only copy and it's not even close to the
 bottle neck of the email queue under heavy load, so it's probably a
 non-issue.  If mailing is slow, I just add more queue services..

 I'm exploring all these ideas.  I predict using the bitwise AND on the where
 clause isn't gonna be the worst design ever, and it's sure easier to
 implement than a table of subscriptions.  What do you guys think?

I would say normalize. Which means that I like your separate table
idea best.
It's clear, obvious, and 3NF - conforming solution.
Changing the set of subscriptions with delete-update-insert combo is
not so bad as you would think.
Encapsulating it in some kind of functional API looks nice too.

Filip

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Indexing queries with bit masks

2010-04-30 Thread Mike Christensen
I want a column in my Users table that will keep track of which types of
notifications the user wants to subscribe to.  There's probably about 10
different types, so I don't want to have 10 boolean columns because this
seems kinda hacky and makes adding new types more work.  So I'm thinking
about using a 32bit integer type and storing the data as a bitmask.

When a certain event happens, let's say event 4, I need to query for which
users to notify.  So I'll be doing something like:

SELECT UserId FROM Users WHERE Subscriptions  8;

(I haven't checked this syntax but I'm assuming that's how you do it)..

My question is say there's a million rows in the Users table.  If I have an
index on Subscriptions, will this index be used in the above query?  Is
there another good way to make this query super fast, or is my approach
totally dumb?  I haven't implemented this yet so I'm open to new clever
ideas.  Thanks!!

Mike


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Tom Lane
Mike Christensen m...@kitchenpc.com writes:
 When a certain event happens, let's say event 4, I need to query for which
 users to notify.  So I'll be doing something like:

 SELECT UserId FROM Users WHERE Subscriptions  8;

 My question is say there's a million rows in the Users table.  If I have an
 index on Subscriptions, will this index be used in the above query?

No.  At least not with a standard btree index.

I'm not exactly sure that an index would be helpful at all --- it seems
like the selectivity of this condition won't be very good anyway, will
it?  The more popular notifications will be subscribed to by a large
fraction of the user base.  Maybe it'd be useful to index unpopular
notifications, but how often will you be searching for those?

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Peter Hunsberger
On Fri, Apr 30, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Mike Christensen m...@kitchenpc.com writes:
 When a certain event happens, let's say event 4, I need to query for which
 users to notify.  So I'll be doing something like:

 SELECT UserId FROM Users WHERE Subscriptions  8;

 My question is say there's a million rows in the Users table.  If I have an
 index on Subscriptions, will this index be used in the above query?

 No.  At least not with a standard btree index.

 I'm not exactly sure that an index would be helpful at all --- it seems
 like the selectivity of this condition won't be very good anyway, will
 it?  The more popular notifications will be subscribed to by a large
 fraction of the user base.  Maybe it'd be useful to index unpopular
 notifications, but how often will you be searching for those?


We've got some similar columns (though nothing with any major number
of rows), so this is interesting...

If all subscriptions are roughly equal in popularity then any single
select should give ~ 10% of the data.  That would seem to be selective
enough that you'd really want an index?  If so, any answers to the
OP's main question; what would be the most efficient way to handle
this type of thing?

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Tom Lane
Peter Hunsberger peter.hunsber...@gmail.com writes:
 If all subscriptions are roughly equal in popularity then any single
 select should give ~ 10% of the data.  That would seem to be selective
 enough that you'd really want an index?

My personal rule of thumb is that 10% is around the threshold where
indexes stop being very helpful.  At that selectivity, you're going
to be having to read every page of the table anyway, and it's not
clear that the extra I/O to read the index is going to get repaid in
CPU savings.  (Now if the table+index are fully cached in RAM, the
threshold's probably a bit higher, but there still is not reason to
think that an index is going to make for a huge improvement.)

 If so, any answers to the OP's main question; what would be the most
 efficient way to handle this type of thing?

Well, btree's right out for indexing bit selections.  In principle you
could maybe do something with a GIN index, but I don't think we ship
any prefab GIN opclasses for this.

[ thinks for a bit ]

The best idea that comes to mind offhand is to not use an integer, but a
boolean array, such that the queries look like

select ... where subscriptions[4];

This already gives you one big advantage, which is that you're not
hard-wiring an assumption about how many notification types there can
ever be.  What I would then do is build a separate partial index for
each subscription column, ie,

create index ... where subscriptions[1];
create index ... where subscriptions[2];
.. etc ..

Now this only works as long as the queries are referencing explicit
constant subscription numbers, else the planner won't be able to
match the WHERE clause to any of the partial indexes.  But if that
is a reasonable restriction for your app then it seems like it
should work.

The main disadvantage of this is that you need N indexes, which could
get a bit expensive if the table is updated heavily.  But you don't need
to bother maintaining indexes corresponding to subscriptions that are
too popular to be worth indexing, so some of that could be bought back
by careful index selection.

Another point is that the partial indexes could be created on some other
column(s) and thereby serve double duty.  This depends on the details of
your typical queries though.  Is the subscriptions[] clause usually used
by itself, or together with additional WHERE conditions?

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Peter Hunsberger
On Fri, Apr 30, 2010 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Hunsberger peter.hunsber...@gmail.com writes:
 If all subscriptions are roughly equal in popularity then any single
 select should give ~ 10% of the data.  That would seem to be selective
 enough that you'd really want an index?

 My personal rule of thumb is that 10% is around the threshold where
 indexes stop being very helpful.  At that selectivity, you're going
 to be having to read every page of the table anyway, and it's not
 clear that the extra I/O to read the index is going to get repaid in
 CPU savings.  (Now if the table+index are fully cached in RAM, the
 threshold's probably a bit higher, but there still is not reason to
 think that an index is going to make for a huge improvement.)

 If so, any answers to the OP's main question; what would be the most
 efficient way to handle this type of thing?

Ok, that makes sense, which immediately makes me wonder if partitions
might make sense for this use case?  In particular if there really are
only 10 different types?

[...]

 The best idea that comes to mind offhand is to not use an integer, but a
 boolean array, such that the queries look like

        select ... where subscriptions[4];


Interesting idea.  That might be worth testing for some of my use cases

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Mike Christensen
Ok I've been blatantly lying, err, purposely simplifying the problem for the
sake of the original email :)

I've read over the responses, and am actually now considering just not using
any index at all.  Here's why:

First, this actually isn't the only thing on the WHERE clause.  It will only
query for users who are friends with you so it can notify them of your
activities.  That's done via a weird JOIN on a table that holds all the
friend relationships.  So in reality, it will only load maybe a hundred
rows, or maybe a thousand every once in a while if you're way popular.  If
I'm not mistaken, it should use the index to narrow it down to the list of
friends, and then use a sequential scan to weed out the ones who subscribe
to that type of notification.

Second, the only thing /ever/ that will do this query is the queue service
whose job it is to process notifications (which are files dropped on the
file system) and email people all day long.  This service handles one job at
a time, and could potentially run on its own machine with its own read-only
copy of the database.  Thus, even if it was a fairly slow query, it's not
gonna bring down the rest of the site.

Regarding the idea of putting an index on each bit, I thought about this
earlier as well as just kinda cringed.  The users table gets updated quite a
bit (last logon, session id, any time they change their profile info,
etc)..  Too many indexes is bad.  I could just put the data in another table
of course, which lead me to another idea.  Have a table called Subscriptions
and have each row hold a user id and a notification type.  I could index
both, and join on (Subscriptions.UserId = Users.UserId AND
Subscriptions.Type = 8).  This would be pretty dang fast, however updates
are kinda a royal pain.  When the user changes which types of subscriptions
they want (via a list of checkboxes), I'd have to figure out which rows to
delete and which new ones to insert.  However, I think I have an idea in
mind for a PgSQL function you pass in the bitmask to and then it
translates it to conditional deletes and inserts.

A third idea I'm tossing around is just not worry about it.  Put the bitmask
in the DB, but not filter on it.  Every friend would be loaded into the
dataset, but the queue processor would just skip rows if they didn't
subscribe to that event.  In other words, move the problem down to the
business layer.  The drawback is potentially large number of rows are
loaded, serialized, etc into memory that will just be ignored.  But of
course the DB is probably a read-only copy and it's not even close to the
bottle neck of the email queue under heavy load, so it's probably a
non-issue.  If mailing is slow, I just add more queue services..

I'm exploring all these ideas.  I predict using the bitwise AND on the where
clause isn't gonna be the worst design ever, and it's sure easier to
implement than a table of subscriptions.  What do you guys think?

Mike

On Fri, Apr 30, 2010 at 9:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Peter Hunsberger peter.hunsber...@gmail.com writes:
  If all subscriptions are roughly equal in popularity then any single
  select should give ~ 10% of the data.  That would seem to be selective
  enough that you'd really want an index?

 My personal rule of thumb is that 10% is around the threshold where
 indexes stop being very helpful.  At that selectivity, you're going
 to be having to read every page of the table anyway, and it's not
 clear that the extra I/O to read the index is going to get repaid in
 CPU savings.  (Now if the table+index are fully cached in RAM, the
 threshold's probably a bit higher, but there still is not reason to
 think that an index is going to make for a huge improvement.)

  If so, any answers to the OP's main question; what would be the most
  efficient way to handle this type of thing?

 Well, btree's right out for indexing bit selections.  In principle you
 could maybe do something with a GIN index, but I don't think we ship
 any prefab GIN opclasses for this.

 [ thinks for a bit ]

 The best idea that comes to mind offhand is to not use an integer, but a
 boolean array, such that the queries look like

select ... where subscriptions[4];

 This already gives you one big advantage, which is that you're not
 hard-wiring an assumption about how many notification types there can
 ever be.  What I would then do is build a separate partial index for
 each subscription column, ie,

create index ... where subscriptions[1];
create index ... where subscriptions[2];
.. etc ..

 Now this only works as long as the queries are referencing explicit
 constant subscription numbers, else the planner won't be able to
 match the WHERE clause to any of the partial indexes.  But if that
 is a reasonable restriction for your app then it seems like it
 should work.

 The main disadvantage of this is that you need N indexes, which could
 get a bit expensive if the table is updated heavily.  But