Re: [GENERAL] Build in function to verify email addresses
I'm pretty sure this isn't as simple as you think it is; I'd suggest having a good read of: https://stackoverflow.com/a/201378/216229 Chris On 16/11/2017 07:56, Nick Dro wrote: I believe that every information system needs to send emails. Currently PostgreSQL doesn't have a function which takes TEXT and returns true if it's a valid email address (x...@yyy.com / .co.ZZ). Do you believe such a function should exist in PostgreSQL, or is it best to let every user implement their own? -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
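To make the linked answer concrete: a minimal sketch of what a built-in-ish check could look like as a domain. The pattern below is illustrative only (an assumption, not a recommendation) — as the Stack Overflow answer explains, a regex that truly matches RFC 5322 addresses is enormous, and the only real validation is delivering a mail.

CREATE DOMAIN email_simple AS text
  CHECK (VALUE ~ '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$');

CREATE TABLE subscriber (
    address email_simple NOT NULL
);

Anything that survives this check is merely plausibly shaped like an address, nothing more.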
Re: [GENERAL] options for logical replication plugins?
On 15/03/2017 19:18, Andres Freund wrote: Hi, On 2017-03-15 18:29:06 +, Chris Withers wrote: Shame the decoding has to be done on the server-side rather than the client side. Why? Requiring compiled extensions to be installed on the server is always going to be a pain, especially in a multi-tenant environment or something with Amazon or Google's cloud offerings. You can't filter on the client side. You don't have any catalog information available, so you'd have to transport a lot of metadata and/or decode to a verbose default format. Just to check my understanding: a logical replication stream is per-database, right? I'd imagine that for many uses of this functionality, having the "whole stream" in an efficient, compressed format that could be decoded on the client side would be preferable, with any filtering or reshaping done there. cheers, Chris
[GENERAL] options for logical replication plugins?
Hi All, What are the current "best" options for logical replication plugins? I saw wal2json and one that serialized to protobuf instead, and then of course there's bottledwater, but none of these smell particularly production ready. Shame the decoding has to be done on the server-side rather than the client side. Chris
[GENERAL] json aggregation question
Hi All, Given the following table:

# create table thing (id serial, tags jsonb);
# \d thing
      Table "public.thing"
 Column |  Type   |                     Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('thing_id_seq'::regclass)
 tags   | jsonb   |

...and the following data:

insert into thing (tags) values ('{"tag1": ["val1", "val2"], "tag2": ["t2val1"]}');
insert into thing (tags) values ('{"tag1": ["val3", "val1"], "tag2": ["t2val1"]}');
insert into thing (tags) values ('{"tag1": ["val2", "val1"], "tag2": ["t2val2"]}');

How can I aggregate the results of a query that equates to "show me the number of matching rows and the set of tag1 values that have a tag2 value of t2val1"? The closest I can get is:

# select count(*), json_agg(tags) from thing where tags->'tag2' ? 't2val1';
 count |                                            json_agg
-------+-------------------------------------------------------------------------------------------------
     2 | [{"tag1": ["val1", "val2"], "tag2": ["t2val1"]}, {"tag1": ["val3", "val1"], "tag2": ["t2val1"]}]
(1 row)

...but I really want:

 count |           tag1
-------+--------------------------
     2 | ["val1", "val2", "val3"]
(1 row)

cheers, Chris
Re: [GENERAL] json aggregation question
Thanks, this is closer, but regex really scares me for something like this... On 28/02/2017 17:19, Yasin Sari wrote: Hi Chris, Maybe there is another, better solution: 1. send values into jsonb_array_elements to get the elements (lateral join) 2. distinct to eliminate duplicates 3. regexp_replace to remove malformed array literals 4. cast into text array

SELECT count(distinct tags),
       string_to_array(regexp_replace(string_agg(distinct elem::text, ','), '\[*\"*\s*\]*', '', 'g'), ',') AS list
FROM thing AS t,
     jsonb_array_elements(t.tags->'tag1') elem
WHERE tags->'tag2' ? 't2val1';

 count |       tag1
-------+------------------
     2 | {val1,val2,val3}

On Tue, 28 Feb 2017 at 19:22, Chris Withers <ch...@simplistix.co.uk> wrote: [original question quoted above]
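A regex-free alternative sketch (an assumption on my part, not from the thread): use jsonb_array_elements_text, which yields plain text values, so there are no array literals to clean up with regexp_replace. Counting distinct ids keeps the row count honest despite the lateral join fan-out.

SELECT count(DISTINCT t.id)     AS count,
       array_agg(DISTINCT elem) AS tag1
FROM thing t,
     jsonb_array_elements_text(t.tags->'tag1') AS elem
WHERE t.tags->'tag2' ? 't2val1';

Against the sample data this should give count = 2 and tag1 = {val1,val2,val3}.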
Re: [GENERAL] default representation of null in psql
On 12/12/2016 14:33, Adrian Klaver wrote: On 12/11/2016 11:34 PM, Chris Withers wrote: On 01/12/2016 12:12, Francisco Olarte wrote: On Thu, Dec 1, 2016 at 12:56 PM, Chris Withers wrote: So, first observation: if I make room nullable, the exclude constraint does not apply to rows that have a room of null. I guess that's to be expected, right? I would expect it, given:

n=> select null=null, null<>null, not (null=null);
 ?column? | ?column? | ?column?
----------+----------+----------
          |          |
(1 row)

Those are nulls. Yes, it's a shame psql has the same repr for null and empty string ;-)

test=# select NULL;
 ?column?
----------

(1 row)

test=# \pset null 'NULL'
Null display is "NULL".
test=# select NULL;
 ?column?
----------
 NULL
(1 row)

Sure, so perhaps the default should change? Of course, no one has yet offered anything on the question I was really hoping for help with. Working with the exclude constraint example from https://www.postgresql.org/docs/current/static/rangetypes.html:

CREATE EXTENSION btree_gist;
CREATE TABLE room_reservation (
    room text,
    during tsrange,
    EXCLUDE USING GIST (room WITH =, during WITH &&)
);

Next question: if lots of rows have open-ended periods (eg: [, 2010-01-01 15:00) or [2010-01-01 14:00,)), how does that affect the performance of the btree gist index backing the exclude constraint? Tom Lane made a comment on here but never followed up with a definitive answer. Can anyone else help? cheers, Chris
Re: [GENERAL] schema advice for event stream with tagging and filtering
On 16/08/2016 15:10, Ilya Kazakevich wrote: An event is a row with a primary key along the lines of (colo, host, category) and an associated set of tags, where each tag has a type and a value (eg: {"color": "red", "owner": "fred", "status": "open"...}). What about a simple table with several columns and an hstore field for tags? BTW, the "pure SQL" approach here is to use separate tables: Tags(TagId, TagName) and TagValues(EventId, TagId, Value). Well, maybe, but none of us wants to do that ;-) But in this case it will be painful to filter events by tag values directly, so only a separate denormalized OLAP table should be used in "pure SQL" :) I don't understand the second half of this, I'm afraid... PostgreSQL, however, supports key-value based hstore. Right, but hstore only allows single values for each key, if I understand correctly? Okay, so that leaves me with a jsonb "tags" column with a gin index, but I still have a couple of choices. So, in order to best answer these types of queries: - show me a list of tag types and the count of the number of events of that type - show me all events that have tag1=x, tag2=y and does not have tag3 ...which of the following is going to be most performant:

# SELECT '{"tag1":["v1", "v2", "v3"]}'::jsonb @> '{"tag1": ["v1"]}'::jsonb;
 ?column?
----------
 t
(1 row)

# SELECT '[{"tag1":"v1"}, {"tag1": "v2"}, {"tag1": "v3"}]'::jsonb @> '[{"tag1": "v1"}]'::jsonb;
 ?column?
----------
 t
(1 row)

So, should I go for a tag name that maps to a list of values for that tag, or should I go for a sequence of one-entry mappings of tag name to tag value? cheers, Chris
Re: [GENERAL] btree gist indices, null and open-ended tsranges
On 01/12/2016 12:12, Francisco Olarte wrote: On Thu, Dec 1, 2016 at 12:56 PM, Chris Withers wrote: So, first observation: if I make room nullable, the exclude constraint does not apply to rows that have a room of null. I guess that's to be expected, right? I would expect it, given:

n=> select null=null, null<>null, not (null=null);
 ?column? | ?column? | ?column?
----------+----------+----------
          |          |
(1 row)

Those are nulls. Yes, it's a shame psql has the same repr for null and empty string ;-)

n=> select (null=null) is null, (null<>null) is null, (not (null=null)) is null;
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | t        | t
(1 row)

I.e., the same happens with a nullable unique column: you can have one of each not-null value and as many nulls as you want. SQL null is a strange beast. Sure, I think that was the answer I was expecting but not hoping for... However, my "next question" was the one I was really hoping for help with. Working with the exclude constraint example from https://www.postgresql.org/docs/current/static/rangetypes.html:

CREATE EXTENSION btree_gist;
CREATE TABLE room_reservation (
    room text,
    during tsrange,
    EXCLUDE USING GIST (room WITH =, during WITH &&)
);

Next question: if lots of rows have open-ended periods (eg: [, 2010-01-01 15:00) or [2010-01-01 14:00,)), how does that affect the performance of the btree gist index backing the exclude constraint? Tom Lane made a comment on here but never followed up with a definitive answer. Can anyone else help? cheers, Chris
[GENERAL] btree gist indices, null and open-ended tsranges
Hi All, Working with the exclude constraint example from https://www.postgresql.org/docs/current/static/rangetypes.html:

CREATE EXTENSION btree_gist;
CREATE TABLE room_reservation (
    room text,
    during tsrange,
    EXCLUDE USING GIST (room WITH =, during WITH &&)
);

So, first observation: if I make room nullable, the exclude constraint does not apply to rows that have a room of null. I guess that's to be expected, right? Next question: if lots of rows have open-ended periods (eg: [, 2010-01-01 15:00) or [2010-01-01 14:00,)), how does that affect the performance of the btree gist index backing the exclude constraint? cheers, Chris
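On the first observation, this is easy to see for yourself (a sketch, assuming the docs table above with room left nullable): both inserts succeed even though the ranges overlap, because NULL = NULL evaluates to null rather than true, so the exclusion constraint never finds a conflict.

INSERT INTO room_reservation (room, during)
    VALUES (NULL, '[2010-01-01 14:00, 2010-01-01 16:00)');
INSERT INTO room_reservation (room, during)
    VALUES (NULL, '[2010-01-01 15:00, 2010-01-01 17:00)');

This mirrors how a nullable UNIQUE column admits any number of NULLs.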
Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
Hey Tom, I appreciate you're busy, but did you ever get a chance to look at this? On 19/09/2016 08:40, Chris Withers wrote: On 16/09/2016 15:29, Tom Lane wrote: Chris Withers writes: On 16/09/2016 14:54, Igor Neyman wrote: So, what is the value for "end ts", when the record is inserted (the range just started)? It's open ended, so the period is [start_ts, ) I've not looked at the GiST range opclass, but I would not be surprised if having lots of those is pretty destructive to the index's ability to be selective about && searches. If that's so, that's a little disappointing... (I'd have thought the special case end value (open ended) and the ending type (inclusive/exclusive) would just be sentinel values) How would I verify your suspicions? cheers, Chris
Re: [GENERAL] dumping table contents in a sensible order
On 16/11/2016 01:05, Adrian Klaver wrote: INSERT 0 1 ERROR: insert or update on table "table_one" violates foreign key constraint "table_one_parent_id_fkey" DETAIL: Key (parent_id)=(xxx) is not present in table "table_one". So, the problem appears to be that table_one is self-referential by way of a parent_id field. How can I either: - dump the table in an insertable order? Don't use --inserts; instead let the data be entered via COPY (the default), which does it in a single transaction. That fixed it, many thanks. I guess that'll teach me to use an answer from StackOverflow without fully understanding the details... Chris
[GENERAL] dumping table contents in a sensible order
Hi All, I have a database that I want to dump three tables from, for use in development. They form a subset of the data, so I was dumping like this:

pg_dump thedatabase --inserts -t table_one -t depends_on_table_one -t depends_on_previous_two > dump.sql

However, when I try to load this using the following:

psql thedatabase_dev < dump.sql

I get the following:

SET
ERROR: unrecognized configuration parameter "lock_timeout"
SET
SET
SET
SET
ERROR: unrecognized configuration parameter "row_security"
SET
SET
SET
ERROR: relation "table_one" already exists
ALTER TABLE
ERROR: relation "depends_on_previous_two" already exists
ALTER TABLE
ERROR: relation "depends_on_previous_two_id_seq" already exists
ALTER TABLE
ALTER SEQUENCE
ALTER TABLE
INSERT 0 1
...
INSERT 0 1
ERROR: insert or update on table "table_one" violates foreign key constraint "table_one_parent_id_fkey"
DETAIL: Key (parent_id)=(xxx) is not present in table "table_one".

So, the problem appears to be that table_one is self-referential by way of a parent_id field. How can I either: - dump the table in an insertable order? - have the load only apply the foreign key constraint at the end of each table import? cheers, Chris
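For the second option, one possible sketch (assuming PostgreSQL 9.4+ and the constraint name from the error message above) is to make the foreign key deferrable, so it is only checked at commit rather than per row:

ALTER TABLE table_one
    ALTER CONSTRAINT table_one_parent_id_fkey
    DEFERRABLE INITIALLY DEFERRED;

BEGIN;
-- \i dump.sql  -- inserts may now arrive in any order within the transaction
COMMIT;         -- the FK is checked here, once all rows are present

This is a workaround, not a substitute for letting pg_dump use COPY, which sidesteps the ordering problem entirely.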
Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
On 16/09/2016 15:29, Tom Lane wrote: Chris Withers writes: On 16/09/2016 14:54, Igor Neyman wrote: So, what is the value for "end ts", when the record is inserted (the range just started)? It's open ended, so the period is [start_ts, ) I've not looked at the GiST range opclass, but I would not be surprised if having lots of those is pretty destructive to the index's ability to be selective about && searches. If that's so, that's a little disappointing... (I'd have thought the special case end value (open ended) and the ending type (inclusive/exclusive) would just be sentinel values) How would I verify your suspicions? cheers, Chris
Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
On 16/09/2016 10:26, John R Pierce wrote: On 9/16/2016 2:23 AM, John R Pierce wrote: wait, what is a tsrange? the standard textsearch data types in postgres are tsvector and tsquery, never mind, I should have known, its a timestamp range. ... when you do updates, are you changing any of the indexed fields, or just "value"? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that. Of course, the adds are just inserting new rows. cheers, Chris
[GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
Hi All, I have quite a few tables that follow a pattern like this:

     Table "public.my_model"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 period | tsrange           | not null
 key    | character varying | not null
 value  | integer           |
Indexes:
    "my_model_pkey" PRIMARY KEY, btree (period, key)
    "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, key WITH =)
Check constraints:
    "my_model_period_check" CHECK (period <> 'empty'::tsrange)

So, a primary key of a period column and one or more other columns (usually int or string), an exclude constraint to prevent overlaps, and a check constraint to prevent empty ranges. However, I'm hitting performance problems on moderate bulk inserts and updates, with ~700k rows taking around 13 minutes. Profiling my python code suggests that most of the time is being taken by Postgres (9.4 in this case...). What can I do to speed things up? Is there a different type of index I can use to achieve the same exclude constraint? Is there something I can do to have the index changes only done on the commit of the bulk batches? cheers, Chris
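On the last question, one assumed workaround for bulk loads (a sketch, not something from this thread): drop the exclusion constraint, load, then re-add it, so the overlap check runs once over the whole table rather than row by row. Note the re-add will fail, rolling back the transaction, if the batch introduced an overlap, and other writers are locked out for the duration.

BEGIN;
ALTER TABLE my_model DROP CONSTRAINT my_model_period_key_excl;
-- ... bulk INSERT / UPDATE here ...
ALTER TABLE my_model
    ADD CONSTRAINT my_model_period_key_excl
    EXCLUDE USING gist (period WITH &&, key WITH =);
COMMIT;

This works because DDL is transactional in PostgreSQL, so a failed re-add leaves the original state intact.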
Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
On 16/09/2016 14:54, Igor Neyman wrote: On 16/09/2016 10:26, John R Pierce wrote: On 9/16/2016 2:23 AM, John R Pierce wrote: wait, what is a tsrange? the standard textsearch data types in postgres are tsvector and tsquery, never mind, I should have known, its a timestamp range. ... when you do updates, are you changing any of the indexed fields, or just "value"? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that. Of course, the adds are just inserting new rows. So, what is the value for "end ts", when the record is inserted (the range just started)? It's open ended, so the period is [start_ts, ) Chris
Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
On 16/09/2016 12:00, John R Pierce wrote: On 9/16/2016 3:46 AM, Chris Withers wrote: when you do updates, are you changing any of the indexed fields, or just "value"? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that. That's expensive, as it has to reindex that row, and range indexes are more expensive than timestamp indexes. Modifying the primary key is kind of a violation of one of the basic rules of relational databases, as it means the row can't be referenced by another table. Right, but these rows have no natural primary key. Would it help if I just added an auto-incrementing integer key? Would that make a positive difference or would it just be a wasted column? I expect the expensive one is the constraint that ensures no periods overlap for the given key. I'm not sure how that can be done short of a full scan for each update/insert. Indeed, I wonder if making the constraint deferrable might help for the bulk case? It might actually perform better if you write the index with the key first, as presumably the key is invariant? You mean: PRIMARY KEY, btree (key1, key2, period) as opposed to PRIMARY KEY, btree (period, key)? Interesting, I'd assumed postgres would optimise that under the covers... Chris
[GENERAL] schema advice for event stream with tagging and filtering
Hi All, What would be the best schema to use when looking to implement an event stream with tagging and filtering? An event is a row with a primary key along the lines of (colo, host, category) and an associated set of tags, where each tag has a type and a value (eg: {"color": "red", "owner": "fred", "status": "open"...}). Events come in as streams of creates/updates via clusters of HTTP posts to a web app. We want to display events as a faceted list, with the facets being colo, host, category and each of the tag types that exist, with counts against each value for each facet (in an ideal world). The idea is to be able to click a facet to include or exclude it from the results. Furthermore, we want to do some other filtering (only show events tagged for the team of the current user, exclude everything for this colo between time x and y, etc). I've deliberately tried to be abstract here as I'm trying to ask a question rather than proposing a solution that might have problems; if there's any more information that would help, please let me know! cheers, Chris
Re: [GENERAL] schema advice for event stream with tagging and filtering
On 16/08/2016 14:29, Ilya Kazakevich wrote: Hi, An event is a row with a primary key along the lines of (colo, host, category) and an associated set of tags, where each tag has a type and a value (eg: {"color": "red", "owner": "fred", "status": "open"...}). What about a simple table with several columns and an hstore field for tags? Interesting! You may also normalize it (move hosts and categories to separate tables). Why? These form part of the primary key for the event... Indexes should help you with fast filtering, or you may load data from this table to a denormalized OLAP table once a day and build an index there to speed up queries. What kind of index is recommended here? The kind of queries would be: - show me a list of tag types and the count of the number of events of that type - show me all events that have tag1=x, tag2=y and does not have tag3 cheers, Chris
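To make the jsonb-with-GIN variant under discussion concrete, here is a sketch (table and index names are assumed, not from the thread):

CREATE TABLE event (
    colo     text NOT NULL,
    host     text NOT NULL,
    category text NOT NULL,
    tags     jsonb NOT NULL DEFAULT '{}',
    PRIMARY KEY (colo, host, category)
);
CREATE INDEX event_tags_gin ON event USING gin (tags);

-- "has tag1=x and tag2=y, and does not have tag3":
SELECT * FROM event
WHERE tags @> '{"tag1": "x", "tag2": "y"}'
  AND NOT tags ? 'tag3';

The @> containment test can use the GIN index; the NOT ? clause is applied as a filter on top.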
Re: [GENERAL] connection file descriptors created with identical number after process fork on mac
On 04/08/2016 00:20, Tom Lane wrote: Chris Withers writes: I'm writing some multi-process code in Python and trying to make sure I open a new connection for each process. Here's the really cut down code: ... What's really surprising to me is the output on a mac:

$ python psycopg2_multiprocess.py
44276 child fd: 13
44277 child fd: 13
44278 child fd: 13
44279 child fd: 13

The getpid() output indicates that the connect() call is being made inside a different process each time, yet the connection appears to still be using the same fd. FD numbers are process-local in all flavors of Unix. The above only proves that all of these processes had FDs 0..12 open already, which doesn't seem terribly surprising. Thanks, that's certainly good news! How can I convince myself, from the client side, that I really have got a new connection and not somehow ended up with one that's been passed on as part of the fork? cheers, Chris
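One way to check from the client side (a sketch, not from the thread): ask the server which backend process is serving each connection. Run this on each child's connection; distinct connections are served by distinct server backends, so the PIDs should all differ, regardless of what FD numbers the client happens to use.

SELECT pg_backend_pid();

If two "connections" returned the same backend PID, they would in fact be sharing one socket, which is exactly the post-fork hazard being worried about here.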
[GENERAL] connection file descriptors created with identical number after process fork on mac
Hi All, I'm writing some multi-process code in Python and trying to make sure I open a new connection for each process. Here's the really cut down code:

"""
import os, time
import psycopg2
from multiprocessing import Pool

def init():
    conn = psycopg2.connect("dbname=...host=localhost")
    print os.getpid(), ' child fd:', conn.fileno()

if __name__ == '__main__':
    pool = Pool(initializer=init)
    time.sleep(30)
"""

What's really surprising to me is the output on a mac:

$ python psycopg2_multiprocess.py
44276 child fd: 13
44277 child fd: 13
44278 child fd: 13
44279 child fd: 13

The getpid() output indicates that the connect() call is being made inside a different process each time, yet the connection appears to still be using the same fd. conn.fileno() is basically (long int)PQsocket(self->pgconn);: https://github.com/psycopg/psycopg2/blob/master/psycopg/connection_type.c#L898 Is there something I'm missing about file descriptors on Macs or is something bad happening here? Chris
Re: [GENERAL] create create of database of a particular name
On 09/03/2016 14:53, Adrian Klaver wrote: On 03/09/2016 05:49 AM, Chris Withers wrote: Hi All, Django, a popular python web framework, runs its unit tests by creating a 'test_'-prefixed database and then dropping it at the end of the test run. FYI in Django 1.8+ you can prevent that: https://docs.djangoproject.com/en/1.8/topics/testing/overview/#the-test-database " New in Django 1.8: You can prevent the test databases from being destroyed by adding the --keepdb flag to the test command. This will preserve the test database between runs. If the database does not exist, it will first be created. Any migrations will also be applied in order to keep it up to date. " ...and we're on 1.7 :-( Chris
[GENERAL] create of database of a particular name
Hi All, Django, a popular python web framework, runs its unit tests by creating a 'test_'-prefixed database and then dropping it at the end of the test run. Is there any way I can grant a user in postgres the ability only to create a database of a particular name? cheers, Chris
[GENERAL] foreign key to "some rows" of a second table
Hi All, So, I have a table that looks like this: CREATE TABLE config ( region varchar(10), name varchar(10), value varchar(40) ); Another looks like this: CREATE TABLE tag ( host varchar(10), type varchar(10), value varchar(10) ); What's the best way to set up a constraint on the 'config' table such that the 'region' column can only contain values that exist in the 'tag' table's value column where the 'type' is 'region'? cheers, Chris
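One common workaround — a sketch assuming the DDL above, with the constraint names invented here: declare (type, value) unique on tag, add a fixed-value tag_type column to config, and point a composite foreign key at the pair. Rows in tag with any other type can then never be referenced.

```sql
ALTER TABLE tag ADD CONSTRAINT tag_type_value_key UNIQUE (type, value);

ALTER TABLE config
    ADD COLUMN tag_type varchar(10) NOT NULL DEFAULT 'region'
        CHECK (tag_type = 'region'),
    ADD CONSTRAINT config_region_fkey
        FOREIGN KEY (tag_type, region) REFERENCES tag (type, value);
```

The CHECK constraint pins tag_type to 'region', so the foreign key can only ever match tag rows of that type.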
Re: [GENERAL] union/difference/intersection question
On 08/12/2015 11:51, David Rowley wrote: On 9 December 2015 at 00:44, Chris Withers <mailto:ch...@simplistix.co.uk>> wrote: Hi All, I hope this is quite a simple one... I have a fixtures table containing home_club and away_club, and I'd like to select the distinct list of clubs contained in all rows but in either the home_club or away_club columns. How would I do that? Use UNION: select home_club from fixtures UNION select away_club from fixtures; Fantastic, thanks. That reminds me of a more general question, say I have a query which returns: key='a', value=1 key='b', value=2 ...and another query, which returns: key='b', value=3 key='c', value=4 (obviously the real world version of this has many more rows, and indeed, more key columns and more value columns...) So, ultimately I want to turn these two queries into inserts for keys like 'a', updates for keys like 'b' and deletes for keys like 'c'. What's the most efficient way of doing this? Three queries to get the three sets, then a bulk insert, a set of updates and a bulk delete? cheers, Chris -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
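The insert/update/delete split can often be collapsed into a single statement with data-modifying CTEs (PostgreSQL 9.1+), avoiding three separate round trips. A sketch with hypothetical tables target (holding the old rows) and incoming (holding the new ones):

```sql
WITH updated AS (
    -- keys present in both: take the new value
    UPDATE target t
       SET value = i.value
      FROM incoming i
     WHERE t.key = i.key
), deleted AS (
    -- keys only in target: remove them
    DELETE FROM target t
     WHERE NOT EXISTS (SELECT 1 FROM incoming i WHERE i.key = t.key)
)
-- keys only in incoming: add them
INSERT INTO target (key, value)
SELECT i.key, i.value
  FROM incoming i
 WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.key = i.key);
```

All three parts see the same snapshot of target taken at statement start, so the INSERT's NOT EXISTS is not confused by the DELETE running in the same statement.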
[GENERAL] aggregation question
Hi All, I hope this is quite a simple one... I have a fixtures table containing home_club and away_club, and I'd like to select the distinct list of clubs contained in all rows but in either the home_club or away_club columns. How would I do that? thanks for the help! Chris
[GENERAL] "trust" authentication in pg_hba.conf
Hi All, What's the default contents of pg_hba.conf that postgres ships with? I've been told it contains 'trust' for all local connections. Is this wise? Anyone who can get a shell on your database server can connect to any database as any user? cheers, Chris
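For reference, the defaults vary: initdb without an -A option does generate trust entries for local connections (with a warning), while most distribution packages ship something stricter such as peer/ident. A more cautious local setup might look like this (a sketch — methods and addresses should be adapted to the site):

```
# TYPE  DATABASE  USER  ADDRESS        METHOD
local   all       all                  peer
host    all       all   127.0.0.1/32   md5
host    all       all   ::1/128        md5
```

With peer, a local connection is only allowed when the OS username matches the database role, which closes the "anyone with a shell" hole for local sockets.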
Re: [GENERAL] current_query='IDLE' in pg_stat_activity
On 23/11/2015 10:33, paramjib baruah wrote: I am trying to check current running sql queries in postgres through pg_stat_activity .However, in the current_query column I could see only "IDLE" . I am not able to see the query . That would suggest no query is running, what suggests to you that there are queries you should be expecting to see? cheers, Chris
[GENERAL] using a postgres table as a multi-writer multi-updater queue
Hi All, I wondered if any of you could recommend best practices for using a postgres table as a queue. Roughly speaking, 100-200 workers will vomit rows at rates of a few hundred per second into the table, leaving the status as new, and then as many workers as needed to keep up with the load will plough through the queue changing the status to something other than new. My naive implementation would be something along the lines of:

CREATE TABLE event (
    ts     timestamp,
    event  char(40),
    status char(10),
    CONSTRAINT pkey PRIMARY KEY (ts, event)
);

...with writers doing INSERT or COPY to get data into the table and readers doing something like:

SELECT * FROM event WHERE status='new' LIMIT 1000 FOR UPDATE;

...so, grabbing batches of 1,000, working on them and then setting their status. But, am I correct in thinking that SELECT FOR UPDATE will not prevent multiple workers selecting the same rows? Anyway, is this approach reasonable? If so, what tweaks/optimisations should I be looking to make? If it's totally wrong, how should I be looking to approach the problem? cheers, Chris
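On the locking question: plain SELECT ... FOR UPDATE does stop two workers from claiming the same rows, but the second worker blocks waiting on the first's locks rather than skipping ahead to unclaimed rows. SKIP LOCKED (PostgreSQL 9.5+) makes exactly this pattern work; before that, advisory locks are the usual substitute. A sketch of claiming a batch in one statement, using the table above:

```sql
-- Claim up to 1000 unclaimed rows, skipping any another worker holds locked.
UPDATE event e
   SET status = 'processing'
  FROM (SELECT ts, event
          FROM event
         WHERE status = 'new'
         ORDER BY ts
         LIMIT 1000
           FOR UPDATE SKIP LOCKED) batch
 WHERE (e.ts, e.event) = (batch.ts, batch.event)
RETURNING e.ts, e.event;
```

Each worker gets a disjoint batch without blocking on its siblings; when processing finishes, a second UPDATE flips the claimed rows from 'processing' to 'done' (or deletes them).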
Re: [GENERAL] scaling postgres
On 04/08/2015 09:11, Seref Arikan wrote: I work in healthcare and patient centric records let me consider multiple servers for lots and lots of patients. The engineering team from instagram has been sharing their experience with postgres, which is possible due to their domain model. I'll have a google, but do you have the Instagram links you're thinking of? Nope, sorry, Google is your friend :) For the benefit of the archives: http://instagram-engineering.tumblr.com/post/40781627982/handling-growth-with-postgres-5-tips-from http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram cheers, Chris
Re: [GENERAL] scaling postgres
On 03/08/2015 08:34, Seref Arikan wrote: At what point does postgres stop scaling? What happens when the computational load no longer fits on one machine? What are the options then? I think it is hard to come up with blanket responses to generic questions such as "What happens when the computational load no longer fits on one machine?" Of course, I guess I'm looking out for the pain points that people commonly hit with chunky postgres installs... I work in healthcare and patient centric records let me consider multiple servers for lots and lots of patients. The engineering team from instagram has been sharing their experience with postgres, which is possible due to their domain model. I'll have a google, but do you have the Instagram links you're thinking of? cheers, Chris
Re: [GENERAL] scaling postgres - can child tables be in a different tablespace?
On 03/08/2015 08:40, Jony Cohen wrote: Servers nowadays reach very impressive write speeds and at rather low prices - it's simpler to split the write to 2 tables on different tablespaces/devices than 2 servers. This raises an interesting question: can a child table be in a different tablespace to its parent and other children of that parent? cheers, Chris
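For what it's worth, the TABLESPACE clause can be combined with INHERITS, so each child can live on its own device independently of the parent. A sketch with invented names (assumes the tablespace already exists):

```sql
-- Hypothetical tablespace, created beforehand with e.g.:
--   CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
CREATE TABLE measurement (
    ts      timestamptz NOT NULL,
    payload text
);

-- The child carries its own TABLESPACE clause:
CREATE TABLE measurement_2015_08 (
    CHECK (ts >= '2015-08-01' AND ts < '2015-09-01')
) INHERITS (measurement) TABLESPACE fast_ssd;
```

Queries against the parent still see all children; only the child's storage moves.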
Re: [GENERAL] scaling postgres
On 03/08/2015 08:15, Jony Cohen wrote: SSD disks are cheap these days but they don't like repeated writes/deletes so it might cause problems down the line (hence my first RAM recommendation) as for keeping the raw data - you could easily do it if you use partitions, if you have daily partitions inheriting from a master table you can quickly access the last day (or even several days) but do take note that a full table scan takes time and you'll need to manage writing to the right partition yourself (not that hard - simply insert to _) if you can write the data daily, keeping it will not add any real load (a little on parse times for queries that access the master table) Interesting, you seem a lot less fussed by these numbers than I am, which is good to hear! At what point does postgres stop scaling? What happens when the computational load no longer fits on one machine? What are the options then? cheers, Chris
[GENERAL] scaling postgres
Hi All, I'm curious if there are recommendations for scaling postgres to what, for me, seems like "a lot" of data... The app in question currently writes around 1.5 billion rows into a table before rolling them up into tables that have a few million roll up rows each. That 1.5 billion row table is emptied and refilled each day, so we're talking about quite high write as well as quite high read. Where can I find good examples/docs of how to scale postgres for this kind of data load? What sort of hardware would I be looking to spec? Okay, now this app may well eventually want to progress to storing those 1.5 billion rows per day. Is that feasible with postgres? If not, what storage and processing solutions would people recommend for that kind of data load? cheers, Chris
Re: [GENERAL] pgbench tps drop from 5000 to 37 going from localhost to a server 13ms away
On 24/07/2015 22:51, Jeff Janes wrote: starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 1 This is your problem. There is only one row in the pgbench_branch table, and every transaction has to update that one row. This is inherently a serialized event. Indeed it was! One solution is to just use a large scale on the benchmark so that they update random pgbench_branch rows, rather than all updating the same row: pgbench -i -s50 With a scale of 1000, everything except the END took roughly the latency time. Interestingly, the END still seems to take more, when threads/clients are really ramped up (100 vs 8). Why would that be? Alternatively, you could write a custom file so that all 7 commands are sent down in one packet. How would you restructure the sql so as to make that happen? cheers, Chris
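On the one-packet idea: pgbench of this era treats each line of a custom script file as one query, so folding the whole transaction onto a single line should collapse the seven round trips into one. A sketch — whether your pgbench version really coalesces it this way is worth verifying; the variables mirror the standard TPC-B script:

```
\set nbranches 1 * :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN; UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; SELECT abalance FROM pgbench_accounts WHERE aid = :aid; UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); END;
```

Saved as, say, tpcb_oneline.sql and run with pgbench -f tpcb_oneline.sql -c8 -j8 -T10, the whole transaction then costs roughly one network round trip instead of seven.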
Re: [GENERAL] pgbench tps drop from 5000 to 37 going from localhost to a server 13ms away
On 24/07/2015 19:21, Jan Lentfer wrote: I've been doing some lightweight load testing with “pgbench -c8 -j8 -T10” When run locally on the postgres server I've been testing, this gives around 5000tps When I do it from a server that has a 13ms ping latency, it drops to 37tps. This is using the default pgbench script, is it to be expected? If so, why? That seems to be a large drop. On the other hand 13 ms is also a very large network latency. On a LAN you're usually in the sub-ms area. So going from e.g. 0.2 ms to 13 ms is a 65-fold decrease. What is the network topology like? 10G between two colos, a switch at each end. What's interesting is how it goes when adjusting the number of threads/connections. As a baseline, I did one of each. As expected, this gave around 10 tps (7 statements in the standard file * 13ms latency gives around 0.1s per transaction). This behaviour continued up to -c3 -j3, linearly achieving roughly 10tps per client/thread:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 3
number of threads: 3
duration: 10 s
number of transactions actually processed: 317
latency average: 94.637 ms
tps = 31.494417 (including connections establishing)
tps = 31.668794 (excluding connections establishing)
statement latencies in milliseconds:
0.002016 \set nbranches 1 * :scale
0.000438 \set ntellers 10 * :scale
0.000379 \set naccounts 10 * :scale
0.000489 \setrandom aid 1 :naccounts
0.000404 \setrandom bid 1 :nbranches
0.000413 \setrandom tid 1 :ntellers
0.000470 \setrandom delta -5000 5000
13.061975 BEGIN;
13.174287 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
13.127691 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
14.552413 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
14.109375 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
13.113028 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
13.256063 END;

Now, at -c4 and -j4 the problem becomes apparent:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 10 s
number of transactions actually processed: 362
latency average: 110.497 ms
tps = 35.912951 (including connections establishing)
tps = 36.111849 (excluding connections establishing)
statement latencies in milliseconds:
0.001917 \set nbranches 1 * :scale
0.000511 \set ntellers 10 * :scale
0.000384 \set naccounts 10 * :scale
0.000525 \setrandom aid 1 :naccounts
0.000406 \setrandom bid 1 :nbranches
0.000472 \setrandom tid 1 :ntellers
0.000483 \setrandom delta -5000 5000
13.063624 BEGIN;
13.170928 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
13.122450 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
16.532138 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
28.090450 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
13.112207 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
13.229887 END;

We've dropped down to 9 tps per client/thread, and it's the update statements that are growing in time; here's the worst case:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
duration: 10 s
number of transactions actually processed: 382
latency average: 261.780 ms
tps = 37.310918 (including connections establishing)
tps = 37.517192 (excluding connections establishing)
statement latencies in milliseconds:
0.001798 \set nbranches 1 * :scale
0.000437 \set ntellers 10 * :scale
0.000385 \set naccounts 10 * :scale
0.000597 \setrandom aid 1 :naccounts
0.000369 \setrandom bid 1 :nbranches
0.000437 \setrandom tid 1 :ntellers
0.000401 \setrandom delta -5000 5000
13.064963 BEGIN;
13.192241 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
13.121914 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
83.994516 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
113.638228 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
13.133390 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
13.288079 END;

What on earth could be causing those updates to now take getting on for an order of magnitude more than the latency to the server? cheers, Chris
[GENERAL] pgbench tps drop from 5000 to 37 going from localhost to a server 13ms away
Hi all, I've been doing some lightweight load testing with “pgbench -c8 -j8 -T10” When run locally on the postgres server I've been testing, this gives around 5000tps When I do it from a server that has a 13ms ping latency, it drops to 37tps. This is using the default pgbench script, is it to be expected? If so, why? cheers, Chris
Re: [GENERAL] how do I disable automatic start on mac os x?
Sachin Srivastava wrote: Will setting RunAtLoad to false not stop that? RunAtLoad = false means the daemon is loaded but not executed/launched. If you don't want it to load at system startup, remove the plist from that directory (i.e. /Library/LaunchDaemons). I don't want postgres to start on machine start, but I do want it to start when I click the "start server" icon in the Postgres application folder. If I delete this plist, will that still work? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk
Re: [GENERAL] how do I disable automatic start on mac os x?
Sachin Srivastava wrote: Iirc, you have to change the RunAtLoad key from true to false. Sachin; can you confirm please? Yes. Though the daemon is loaded during startup. Will setting RunAtLoad to false not stop that? Chris
Re: [GENERAL] how do I disable automatic start on mac os x?
Dave Page wrote: Iirc, you have to change the RunAtLoad key from true to false. I feel slightly stupid asking this, but this is a Mac, surely there's some pretty shiny GUI for editing things like this? (even Windows has the "Services" stuff in Control Panel) What am I missing? Chris
Re: [GENERAL] how do I disable automatic start on mac os x?
Dave Page wrote: On Mon, Jan 4, 2010 at 1:38 PM, Chris Withers wrote: Naoko Reeves wrote: Try check this directory: Library: StartupItems: postgres-plus-8.xx there should be script to start automatically. /Library/StartupItems is empty. I think putting stuff in there is an old-fashioned way of doing things that went out with Mac OS X 10.4, and I'm glad to say it looks like the latest postgres doesn't install like this.. Look in /Library/LaunchDaemons/ Ah, there it is... but now what do I do with it to disable the auto-start? Chris
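Assuming the plist filename matches the launchd label found earlier (worth confirming with ls /Library/LaunchDaemons), launchctl can disable the job without deleting anything:

```shell
# -w records the disabled state so it survives reboots;
# "launchctl load -w" on the same path re-enables it later.
sudo launchctl unload -w /Library/LaunchDaemons/com.edb.launchd.postgresql-8.4.plist
```

This leaves the plist in place, so a manual "start server" that loads the same job should still work.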
Re: [GENERAL] how do I disable automatic start on mac os x?
Naoko Reeves wrote: Try check this directory: Library: StartupItems: postgres-plus-8.xx there should be script to start automatically. /Library/StartupItems is empty. I think putting stuff in there is an old-fashioned way of doing things that went out with Mac OS X 10.4, and I'm glad to say it looks like the latest postgres doesn't install like this.. ...however, that still leaves me back at square 1 with respect to disabling the auto-startup of postgres. Chris
[GENERAL] how do I disable automatic start on mac os x?
Hi All, I installed postgres from the enterprisedb-maintained one-click installer at http://www.postgresql.org/download/macosx. However, it starts automatically when the machine boots up, and I'd like it only to start when I manually start it. I've found an entry in launchd: $ sudo launchctl list | grep post com.edb.launchd.postgresql-8.4 ...but I can't find where this comes from: $ sudo find / -name com.edb.launchd.postgresql-8.4 $ Where do I find this entry to disable it? (and, perhaps showing my Mac-ignorance, is there a GUI for launchd configuration that I'm somehow missing?) cheers, Chris