Re: [PERFORM] Inlining of functions (doing LIKE on an array)
> From: pgsql-performance-ow...@postgresql.org
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
>
> Hello,
>
> I am trying to implement an efficient "like" over a text[]. I see a lot of
> people have tried before me and I learnt a lot through the forums. The
> result of my search is that a query like the following is optimal:
>
> select count(*)
> from claims
> where (select count(*)
>        from unnest("ICD9_DGNS_CD") x_
>        where x_ like '427%'
>       ) > 0

Hi,
are you using GIN indexes?
http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns

Moreover, your query can still be optimized:

=> select count(*)
   from claims
   where exists (select *
                 from unnest("ICD9_DGNS_CD") x_
                 where x_ like '427%')

regards,

Marc Mamin

> So I figured I'd create a function to encapsulate the concept:
>
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
> RETURNS bigint
> AS 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
>
> This works functionally, but performs like crap: full table scan, and it
> cannot make use of any index, it seems. Basically, it feels like PG can't
> inline that function.
>
> I have been trying all evening to find a way to rewrite it to trick the
> compiler/planner into inlining. I tried the operator approach for example,
> but performance is again not good.
> create function rlike(text,text)
> returns bool as 'select $2 like $1' language sql strict immutable;
>
> create operator ``` (procedure = rlike, leftarg = text,
>                      rightarg = text, commutator = ```);
>
> CREATE OR REPLACE FUNCTION MyLike(text[], text)
> RETURNS boolean
> AS 'select $2 ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
>
> And by not good, I mean that on my table of 2M+ rows, the "native" query
> takes 3s, the function version takes 9s, and the operator version (via the
> function, or through the operator directly) takes 15s.
>
> Any ideas or pointers?
>
> Thank you,
> Laurent Hasson

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
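One detail worth checking on the ArrayLike function above (an editor's sketch, not from the thread): PostgreSQL only inlines SQL functions under fairly strict conditions, and declaring a function STRICT is a well-known inlining blocker. A variant without STRICT (and without LEAKPROOF, which requires superuser anyway) gives the planner a chance to substitute the function body into the calling query:

```sql
-- same body as ArrayLike above, but without STRICT/LEAKPROOF;
-- dropping STRICT may allow the planner to inline the subquery
CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
RETURNS bigint
AS 'select count(*) from unnest($1) a where a like $2'
LANGUAGE SQL IMMUTABLE;
```

Whether inlining actually happens still depends on the call site, so EXPLAIN should be used to verify.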
Re: [PERFORM] gin performance issue.
> -----Original Message-----
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Freitag, 5. Februar 2016 16:07
>
> > http://explain.depesz.com/s/wKv7
> > Postgres Version 9.3.10 (Linux)
> >
> > Hello,
> > this is a large daily table that only gets bulk inserts (200-400 /day)
> > with no updates.
> > After rebuilding the whole table, the Bitmap Index Scan on
> > r_20160204_ix_toprid falls under 1 second (from 800)
> >
> > Fastupdate is using the default, but autovacuum is disabled on that
> > table which contains 30 Mio rows.
>
> Pre-9.5, it's a pretty bad idea to disable autovacuum on a GIN index,
> because then the "pending list" only gets flushed when it exceeds
> work_mem. (Obviously, using a large work_mem setting makes this
> worse.)
>
> regards, tom lane

Hello,

knowing what the problem is doesn't really help here:

- autovacuum will not run, as these are insert-only tables
- according to this post, auto-analyze would also do the job:
  http://postgresql.nabble.com/Performance-problem-with-gin-index-td5867870.html
  It seems that this information is missing in the doc.
  But sadly it doesn't trigger in our case either, as we run manual analyzes
  during the data processing just after the imports.
- manual vacuum is just too expensive here

Hence disabling fastupdate seems to be our only option.

I hope this problem will help push up the 9.5 upgrade on our todo list :)

Ideally, we would then like to flush the pending list unconditionally after
the imports. I guess we could achieve something similar by modifying the
analyze scale factor and gin_pending_list_limit before/after the (bulk)
imports, but having the possibility to flush it per SQL would be better.
Is this a reasonable feature wish?

And a last question: how does the index update work with bulk (COPY)
inserts?
- without a pending list: is it like a per-row trigger, or will the index be
  taken care of afterwards?
- with small pending lists: is there a concurrency problem, or can both
  tasks cleanly work in parallel?
best regards,

Marc Mamin
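The "flush per SQL" wish above did land in a later release (an editorial sketch, assuming PostgreSQL 9.6+, where gin_clean_pending_list() was added, and using the index name from this thread):

```sql
-- flush the pending list of a GIN index on demand (PostgreSQL 9.6+)
SELECT gin_clean_pending_list('r_20160204_ix_toprid'::regclass);

-- alternatively, disable the pending list entirely for this index,
-- as discussed above ("disabling fastupdate")
ALTER INDEX r_20160204_ix_toprid SET (fastupdate = off);
```

With fastupdate off, each inserted row updates the index directly, which slows inserts but avoids the unbounded pending-list growth described here.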
[PERFORM] gin performance issue.
http://explain.depesz.com/s/wKv7
Postgres Version 9.3.10 (Linux)

Hello,

this is a large daily table that only gets bulk inserts (200-400 /day) with
no updates. After rebuilding the whole table, the Bitmap Index Scan on
r_20160204_ix_toprid falls under 1 second (from 800).

Fastupdate is using the default, but autovacuum is disabled on that table,
which contains 30 Mio rows.
Another peculiarity is that the cardinality of the indexed column is very
high. The average count per distinct value is only 2.7.

I'm not sure what the problem is. Does the missing vacuum affect the GIN
index sanity further than not cleaning the pending list?
As I understand it, this list will be merged into the index automatically
when it gets full, independently from the vacuum setting.

Can it be an index bloating issue?

And last but not least, can I reduce the problem by configuration?

regards,

Marc Mamin
Re: [PERFORM] Slow query help
>I ask your help to solve a slow query which is taking more than 14 seconds
>to be executed.
>Maybe I am asking too much both from you and especially from postgresql, as
>it is really huge, involving 16 tables.
>
>Explain:
>http://explain.depesz.com/s/XII9
>
>Schema:
>http://adj.com.br/erp/data_schema/

Hello,

It seems that you don't pay much attention to column alignment, e.g.:
http://adj.com.br/erp/data_schema/tables/ERP_PUBLIC_sys_person.html

This probably won't make any significant difference in your case, but this
is something to be aware of when dealing with large tables.
Here is a good starting link for this topic:
http://stackoverflow.com/questions/12604744/does-the-order-of-columns-in-a-postgres-table-impact-performance

regards,

Marc Mamin
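The padding effect mentioned above can be made visible with pg_column_size() (an illustrative sketch, not from the thread): an int4 placed before an int8 forces 4 bytes of alignment padding that disappear when the wider column comes first.

```sql
-- the first row layout wastes 4 padding bytes before the int8;
-- ordering columns widest-first removes them
SELECT pg_column_size(ROW(1::int4, 2::int8, 3::int4)) AS int4_first,
       pg_column_size(ROW(2::int8, 1::int4, 3::int4)) AS int8_first;
```

Over tens of millions of rows, a few saved bytes per row can add up to a measurable difference in table size and I/O.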
Re: [PERFORM] GroupAggregate and Integer Arrays
> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
> -- Add two arrays.
>   select ARRAY (
>     SELECT coalesce($1[i],0) + coalesce($2[i],0)
>     FROM (
>       select generate_series(least(array_lower($1, 1), array_lower($2, 1)),
>                              greatest(array_upper($1, 1), array_upper($2, 1)),
>                              1) AS i
>     ) sub
>     GROUP BY i
>     ORDER BY i
>   );
> $$ LANGUAGE sql STRICT IMMUTABLE;

It seems that both the GROUP BY and the ORDER BY are superfluous and add
some cycles.

regards,

Marc Mamin
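A simplified variant of the function above with the grouping and sorting removed (an editor's sketch: generate_series emits i in ascending order, and each i appears exactly once, so ARRAY() preserves element order without GROUP BY / ORDER BY):

```sql
CREATE OR REPLACE FUNCTION array_add(int[], int[]) RETURNS int[] AS $$
  -- element-wise sum over the union of both index ranges,
  -- treating missing elements as 0
  SELECT ARRAY(
    SELECT coalesce($1[i], 0) + coalesce($2[i], 0)
    FROM generate_series(least(array_lower($1, 1), array_lower($2, 1)),
                         greatest(array_upper($1, 1), array_upper($2, 1))) AS i
  );
$$ LANGUAGE sql STRICT IMMUTABLE;
```

STRICT keeps the NULL-argument case cheap: if either array is NULL, the result is NULL without executing the body.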
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Hello,

> From: pgsql-performance-ow...@postgresql.org
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Mike Broers
> Sent: Dienstag, 7. Juli 2015 18:28
> To: Tom Lane
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] wildcard text filter switched to boolean column,
> performance is way worse
>
> After bumping up work_mem from 12MB to 25MB that last materialize is
> indeed hashing and this cut the query time by about 60%. Thanks, this was
> very helpful and gives me something else to look for when troubleshooting
> explains.
>
> On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers wrote:
> Thanks, very informative! I'll experiment with work_mem settings and
> report back.
>
> On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane wrote:
> Mike Broers writes:
> > I had a query that was filtering with a wildcard search of a text field
> > for %SUCCESS%. The query took about 5 seconds and was running often, so
> > I wanted to improve it. I suggested that the engineers include a new
> > boolean column for successful status. They implemented the requested
> > field, but the query that filters on that new column runs very long (I
> > kill it after letting it run for about an hour). Can someone help me
> > understand why that is the case and how to resolve it?
> It's hashing the subplan output in the first case and not the second:
>
> > Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369)
> >                   (actual time=4699.619..4699.869 rows=1 loops=1)
> >   Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> >           ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))
>
> vs
>
> > Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
> >   Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> >           ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))
> >                                                            ^
>
> Presumably, the new more-accurate rows count causes the planner to realize
> that the hash table will exceed work_mem so it doesn't choose to hash ...
> but for your situation, you'd rather it did, because what you're getting
> instead is a Materialize node that spills to disk (again, because the data
> involved exceeds work_mem) and that's a killer for this query. You should
> be able to get back the old behavior if you raise work_mem enough.
>
> Another idea you might think about is changing the OR'd IN conditions
> to a single IN over a UNION ALL of the subselects. I'm not really sure if
> that would produce a better plan, but it's worth trying if it wouldn't
> require too much app-side contortion.

Hello,

you might try to use a CTE to first collect the IDs to exclude, and join
them to your main table. This should result in an anti join pattern.
Something like:

WITH IDS as (
  SELECT U1."lead_id" AS "lead_id"
  FROM "event" U1
  WHERE U1."event_type" = 'type_1'
  UNION
  ( SELECT U1."lead_id" AS "lead_id"
    FROM "event" U1
    WHERE U1."event_type" = 'type_2'
    INTERSECT
    SELECT U1."lead_id" AS "lead_id"
    FROM "event" U1
    WHERE successful
  )
)
SELECT *
FROM lead
LEFT OUTER JOIN IDS ON (lead.id = IDS.lead_id)
WHERE IDS.lead_id IS NULL;

regards,

Marc Mamin

> regards, tom lane
[PERFORM] Re: Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
>Hi everyone,
>
>I've written a new open source tool for easily parallelising SQL scripts
>in postgres. [obligatory plug: https://github.com/gbb/par_psql ]
>
>Using it, I'm seeing a problem that I've also seen in other postgres
>projects involving high degrees of parallelisation in the last 12 months.
>
>Basically:
>
>- I have machines here with up to 16 CPU cores and 128GB memory, very fast
>  SSDs and controller etc, carefully configured kernel/postgresql.conf for
>  high performance.
>
>- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff
>  ...), e.g. almost up to 16x performance improvement.
>
>- Non-DB stuff like GDAL, python etc. parallelise nearly perfectly.
>
>- HOWEVER calls to CPU-intensive user-defined pl/pgsql functions (e.g.
>  SELECT myfunction(some_stuff)) do not parallelise well, even when they
>  are independently defined functions, or accessing tables in a read-only
>  way. They hit a limit of 2.5x performance improvement relative to
>  single-CPU performance (pg9.4) and merely 2x performance (pg9.3)
>  regardless of how many CPU cores I throw at them. This is about 6 times
>  slower than I'm expecting.
>
>I can't see what would be locking. It seems like it's the pl/pgsql
>environment itself that is somehow locking or incurring some huge
>frictional costs. Whether I use independently defined functions,
>independent source tables, independent output tables, makes no difference
>whatsoever, so it doesn't feel 'lock-related'. It also doesn't seem to be
>WAL/synchronisation related, as the machines I'm using can hit absurdly
>high pgbench rates, and I'm using unlogged tables for output.
>
>Take a quick peek here:
>https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md
>
>I'm wondering what I'm missing here. Any ideas?
>
>Graeme.
auto_explain might help giving some insight into what's going on:
http://www.postgresql.org/docs/9.4/static/auto-explain.html

Regards,

Marc Mamin
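A session-level auto_explain setup for this case could look like the following (an editor's sketch; LOAD typically requires superuser, and log_nested_statements is the key setting here since it also logs the statements executed inside pl/pgsql functions):

```sql
-- load the module for this session and log every statement's plan,
-- including statements nested inside pl/pgsql function calls
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_nested_statements = on;
SET auto_explain.log_analyze = on;  -- adds run-time timings (with overhead)
```

The plans then appear in the server log, which should show where the time goes inside the function calls.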
Re: [PERFORM] Performance issues
>Hi Team,
>
>This is the EXPLAIN ANALYZE for one of the views: S_V_D_CAMPAIGN_HIERARCHY:
>
>Rows Removed by Join Filter: 3577676116

That's quite a lot.
You're possibly missing a clause in a join, resulting in a cross join.
It is also helpful to put your result here: http://explain.depesz.com/

regards,

Marc Mamin

>===
>
>Nested Loop  (cost=33666.96..37971.39 rows=1 width=894) (actual time=443.556..966558.767 rows=45360 loops=1)
>  Join Filter: (tp_exec.touchpoint_execution_id = valid_executions.touchpoint_execution_id)
>  Rows Removed by Join Filter: 3577676116
>  CTE valid_executions
>    ->  Hash Join  (cost=13753.53..31711.17 rows=1 width=8) (actual time=232.571..357.749 rows=52997 loops=1)
>          Hash Cond: ((s_f_touchpoint_execution_status_history_1.touchpoint_execution_id = s_f_touchpoint_execution_status_history.touchpoint_execution_id) AND ((max(s_f_touchpoint_execution_status_history_1.creation_dt)) = s_f_touchpoint_execution_status_history.creation_dt))
>          ->  HashAggregate  (cost=6221.56..6905.66 rows=68410 width=16) (actual time=139.713..171.340 rows=76454 loops=1)
>                ->  Seq Scan on s_f_touchpoint_execution_status_history s_f_touchpoint_execution_status_history_1  (cost=0.00..4766.04 rows=291104 width=16) (actual time=0.006..38.582 rows=291104 loops=1)
>          ->  Hash  (cost=5493.80..5493.80 rows=135878 width=16) (actual time=92.737..92.737 rows=136280 loops=1)
>                Buckets: 16384  Batches: 1  Memory Usage: 6389kB
>                ->  Seq Scan on s_f_touchpoint_execution_status_history  (cost=0.00..5493.80 rows=135878 width=16) (actual time=0.012..55.078 rows=136280 loops=1)
>                      Filter: (touchpoint_execution_status_type_id = ANY ('{3,4}'::integer[]))
>                      Rows Removed by Filter: 154824
>  ->  Nested Loop Left Join  (cost=1955.80..6260.19 rows=1 width=894) (actual time=31.608..3147.015 rows=67508 loops=1)
>        ->  Nested Loop  (cost=1955.67..6260.04 rows=1 width=776) (actual time=31.602..2912.625 rows=67508 loops=1)
>              ->  Nested Loop Left Join  (cost=1955.54..6259.87 rows=1 width=658) (actual time=31.595..2713.696 rows=72427 loops=1)
>                    ->  Nested Loop Left Join  (cost=1955.40..6259.71 rows=1 width=340) (actual time=31.589..2532.926 rows=72427 loops=1)
>                          ->  Nested Loop Left Join  (cost=1955.27..6259.55 rows=1 width=222) (actual time=31.581..2354.662 rows=72427 loops=1)
>                                ->  Nested Loop  (cost=1954.99..6259.24 rows=1 width=197) (actual time=31.572..2090.104 rows=72427 loops=1)
>                                      ->  Nested Loop  (cost=1954.71..6258.92 rows=1 width=173) (actual time=31.562..1802.857 rows=72427 loops=1)
>                                            Join Filter: (camp_exec.campaign_id = wave.campaign_id)
>                                            Rows Removed by Join Filter: 243
>                                            ->  Nested Loop  (cost=1954.42..6254.67 rows=13 width=167) (actual time=31.551..1468.718 rows=72670 loops=1)
>                                                  ->  Hash Join  (cost=1954.13..6249.67 rows=13 width=108) (actual time=31.525..402.039 rows=72670 loops=1)
>                                                        Hash Cond: ((tp_exec.touchpoint_id = tp.touchpoint_id) AND (wave_exec.wave_id = tp.wave_id))
>                                                        ->  Hash Join  (cost=1576.83..4595.51 rows=72956 width=90) (actual time=26.254..256.328 rows=72956 loops=1)
>                                                              Hash Cond: (tp_exec.wave_execution_id = wave_exec.wave_execution_id)
>                                                              ->  Seq Scan on s_d_touchpoint_execution tp_exec  (cost=0.00..1559.56 rows=72956 width=42) (actual time=0.005..76.099 rows=72956 loops=1)
>                                                              ->  Hash  (cost=1001.37..1001.37 rows=46037 width=56) (actual time=26.178..26.178 rows=46037 loops=1)
>                                                                    Buckets: 8192  Batches: 1  Memory Usage: 4104kB
>                                                                    ->  Seq Scan on s_d_wave_execution wave_exec  (cost=0.00..1001.37 rows=46037 width=56) (actual time=0.006..10.388 rows=46037 loops=1)
>                                                        ->  Hash  (cost=212.72..212.72 rows=10972 width=26) (actual time=5.252..
Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT
>Hi, thanks for your follow-up questions.
>- postgres version is 9.1.13
>- the number of rows (in this latest instance) is 28,474,842
>- I've clustered and vacuum-full-ed and analyzed this table frequently,
>  attempting to troubleshoot this. (Running vacuum full on the whole
>  catalog seems a little excessive, and unlikely to help.)

Hi,

I mean the pg_* tables. When working with temp objects and a high number of
classes, regular vacuum may not clean them efficiently. It is not a bad
idea to run a manual VACUUM FULL VERBOSE on the largest of those from time
to time to verify that they don't grow out of control; this normally
requires only a few seconds. The verbose output of vacuum full sometimes
returns interesting information...

For the ANALYZE performance, I guess that these are the most relevant ones:
pg_statistic, pg_class, pg_attribute, pg_index, pg_constraint.

regards,

Marc Mamin

>- no other processes are likely to be interfering; nothing other than
>  PostgreSQL runs on this machine (except for normal OS processes and New
>  Relic server monitoring service); concurrent activity in PostgreSQL is
>  low-level and unrelated, and this effect is observed systematically
>  whenever this kind of operation is performed on this table
>- no override for this table; the system default_statistics_target is 100
>  (the default)
>- yes, an ANALYZE is performed on the temp table after the COPY and before
>  the INSERT
>- no index on the temp table (but I'm scanning the whole thing anyway).
>  There are indexes on f_foo as detailed in my original post, and I expect
>  the PK to make the WHERE NOT EXISTS filtering efficient (as it filters
>  on exactly all columns of the PK) ... but even if it didn't, I would
>  expect that to only slow down the actual insert execution, not the
>  EXPLAIN.
>Cheers,
>Gulli
>
>On Wed, Mar 4, 2015 at 8:10 PM, Marc Mamin wrote:
>
>>Hi,
>>we are seeing cases of EXPLAIN INSERT INTO foo SELECT ...
>>taking over an hour, with disk I/O utilization (percent of time device is
>>busy) at 100% the whole time, although I/O bandwidth is not saturated.
>>This is on PostgreSQL 9.1.13.
>>
>>What could cause this? Note that there is no ANALYZE. Is it possible that
>>the SELECT is actually executed, in planning the INSERT?
>>
>>When executing the INSERT itself (not EXPLAIN) immediately afterwards,
>>that logs a "temporary file" message, but the EXPLAIN invocation does not
>>(though the disk I/O suggests that a large on-disk sort is occurring):
>>
>>LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp6016.0", size 744103936
>>STATEMENT: INSERT INTO f_foo SELECT
>>[...]
>>
>>During that actual execution, there's a lower disk I/O utilization (though
>>a higher MB/s rate).
>>
>>Charts of disk I/O utilization and rate are at
>>http://postimg.org/image/628h6jmn3/ ... the solid 100% span is the EXPLAIN
>>statement, ending at 6:13:30pm, followed by the INSERT statement ending at
>>6:32pm. Metrics are collected by New Relic; their definition of I/O
>>utilization is at https://discuss.newrelic.com/t/disk-i-o-metrics/2900
>>
>>Here's the EXPLAIN statement:
>>
>>LOG: duration: 3928778.823 ms statement: EXPLAIN INSERT INTO f_foo SELECT
>>       t_foo.fk_d1,
>>       t_foo.fk_d2,
>>       t_foo.fk_d3,
>>       t_foo.fk_d4,
>>       t_foo.fk_d5,
>>       t_foo.fk_d6,
>>       t_foo.value
>>FROM t_foo
>>WHERE NOT (EXISTS
>>       (SELECT *
>>        FROM f_foo
>>        WHERE f_foo.fk_d2 = t_foo.fk_d2
>>          AND f_foo.fk_d3 = t_foo.fk_d3
>>          AND f_foo.fk_d4 = t_foo.fk_d4
>>          AND f_foo.fk_d5 = t_foo.fk_d5
>>          AND f_foo.fk_d6 = t_foo.fk_d6
>>          AND f_foo.fk_d1 = t_foo.fk_d1))
>>
>>(where t_foo is a temp table previously populated using COPY, and the NOT
>>EXISTS subquery refers to the same table we are inserting into)
>>
>>Here's the EXPLAIN output:
>>
>>Insert on f_foo  (cost=8098210.50..9354519.69 rows=1 width=16)
>>  ->  Merge Anti Join  (cost=8098210.50..9354519.69 rows=1 width=16)
>>        Merge Cond: ((t_foo.fk_d2 = public.f_foo.fk_d2) AND
>>                     (t_foo.fk_d3 = public.f_foo.fk_d3) AND
>>                     (t_foo.fk_d4 = public.f_foo.fk_d4) AND
>>                     (t_foo.fk_d5 = public.f_foo.fk_d5) AND
>>                     (t_foo.fk_d6 = public.f_foo.fk_d6) AND
>>                     (t_foo.fk_d1 = public.f_foo.fk_d1))
>>        ->  Sort  (cost=3981372.25..
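The catalog maintenance suggested in the reply above could look like this (an editor's sketch; VACUUM FULL takes an exclusive lock on each table, so this belongs in a quiet maintenance window):

```sql
-- compact and re-analyze the catalog tables most relevant to planning,
-- watching the VERBOSE output for signs of bloat
VACUUM FULL VERBOSE ANALYZE pg_statistic;
VACUUM FULL VERBOSE ANALYZE pg_class;
VACUUM FULL VERBOSE ANALYZE pg_attribute;
VACUUM FULL VERBOSE ANALYZE pg_index;
VACUUM FULL VERBOSE ANALYZE pg_constraint;
```

On a healthy catalog each statement finishes in seconds, matching the expectation stated in the reply.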
Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT
>[...] on INSERT into this and a couple of other tables with tens of
>millions of rows.
>Any ideas?
>Thanks, best regards,
>- Gulli

Hi,

I've no clue for the time required by EXPLAIN, but some more information is
probably relevant to find an explanation:

- postgres version
- number of rows inserted by the query
- how clean is your catalog in regard to vacuum (can you run a vacuum full
  verbose & analyze on it, and then retry the EXPLAIN statement?)
- any other process that may interfere, e.g. while locking some catalog
  tables?
- statistics target?
- is your temp table analyzed?
- any index on it?

We have about 300'000 entries in our pg_class table, and I've never seen
such an issue.

regards,

Marc Mamin
Re: [ADMIN] [PERFORM] '' Vs NULL
>>Hi
>>
>>2015-02-09 12:22 GMT+01:00 sridhar bamandlapally :
>>
>>Hi All
>>
>>We are testing our Oracle compatible business applications on PostgreSQL
>>database,
>>
>>the issue we are facing is '' Vs NULL
>>
>>In Oracle '' (empty string) and NULL are treated as NULL
>>
>>but, in PostgreSQL '' is not treated as NULL
>>
>>I need some implicit way in PostgreSQL where '' can be treated as NULL
>
>It is not possible in PostgreSQL. PostgreSQL respects the ANSI SQL
>standard - Oracle does not.
>
>Regards
>
>Pavel
>
>p.s. theoretically you can overwrite a type's operators to support the
>Oracle behavior, but you should not be sure about unexpected negative side
>effects.

A clean way would be to disallow empty strings on the PG side.
This is somewhat cumbersome depending on how dynamic your model is, and it
adds some load on your db though.

ALTER TABLE tablename ADD CONSTRAINT tablename_not_empty_ck
  CHECK (false = ((colname1 = '') OR (colname2 = '') OR (colname3 = '') ...));

-- and to ensure compatibility with your app or migration:

CREATE OR REPLACE FUNCTION tablename_setnull_trf()
RETURNS trigger AS
$BODY$
BEGIN
  -- for all *string* columns
  NEW.colname1 := NULLIF(NEW.colname1, '');
  NEW.colname2 := NULLIF(NEW.colname2, '');
  NEW.colname3 := NULLIF(NEW.colname3, '');
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER tablename_setnull_tr
  BEFORE INSERT OR UPDATE ON tablename
  FOR EACH ROW EXECUTE PROCEDURE tablename_setnull_trf();

You can query the pg catalog to generate all required statements.
A possible issue is the order in which triggers are fired, when more than
one exists for a given table:
"If more than one trigger is defined for the same event on the same
relation, the triggers will be fired in alphabetical order by trigger name"
( http://www.postgresql.org/docs/9.3/static/trigger-definition.html )

regards,

Marc Mamin
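The catalog query mentioned above ("generate all required statements") could be sketched like this (hypothetical table name 'tablename'; one generated line per string column, to be pasted into the trigger body):

```sql
-- emit one NULLIF assignment per string column of "tablename"
SELECT format('NEW.%I := NULLIF(NEW.%I, %L);', column_name, column_name, '')
FROM information_schema.columns
WHERE table_name = 'tablename'
  AND data_type IN ('text', 'character varying', 'character')
ORDER BY ordinal_position;
```

The same SELECT can be adapted to generate the CHECK-constraint predicate by string-aggregating `(%I = '')` terms with OR.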
Re: [PERFORM] Why is GIN index slowing down my query?
AlexK987 writes:
>>> I've created a GIN index on an INT[] column, but it slows down the
>>> selects. Here is my table:
>>
>>> create table talent(person_id INT NOT NULL,
>>>                     skills INT[] NOT NULL);
>>
>>> insert into talent(person_id, skills)
>>> select generate_series, array[0, 1] || generate_series
>>> from generate_series(3, 1048575);
>>
>>> create index talent_skills on talent using gin(skills);
>>
>>> analyze talent;
>>
>>> Here is my select:
>>
>>> explain analyze
>>> select * from talent
>>> where skills <@ array[1, 15]
>>
>>Well, that's pretty much going to suck given that data distribution.
>>Since "1" is a member of every last entry, the GIN scan will end up
>>examining every entry, and then rejecting all of them as not being
>>true subsets of [1,15].
>
>This is equivalent and fast:
>
>explain analyze
>WITH rare AS (
>  select * from talent
>  where skills @> array[15])
>select * from rare
>where skills @> array[1]
>-- (with changed operator)
>
>You might vary your query according to an additional table that keeps the
>occurrence count of all skills.
>Not really pretty though.

I wonder if in such cases, the Bitmap Index Scan could discard entries that
would result in a table scan and use them only in the recheck part:

explain
select * from talent
where skills @> array[1]

Seq Scan on talent  (cost=0.00..21846.16 rows=1048573 width=37)
  Filter: (skills @> '{1}'::integer[])
Re: [PERFORM] Why is GIN index slowing down my query?
AlexK987 writes:
>> I've created a GIN index on an INT[] column, but it slows down the
>> selects. Here is my table:
>
>> create table talent(person_id INT NOT NULL,
>>                     skills INT[] NOT NULL);
>
>> insert into talent(person_id, skills)
>> select generate_series, array[0, 1] || generate_series
>> from generate_series(3, 1048575);
>
>> create index talent_skills on talent using gin(skills);
>
>> analyze talent;
>
>> Here is my select:
>
>> explain analyze
>> select * from talent
>> where skills <@ array[1, 15]
>
>Well, that's pretty much going to suck given that data distribution.
>Since "1" is a member of every last entry, the GIN scan will end up
>examining every entry, and then rejecting all of them as not being
>true subsets of [1,15].

This is equivalent and fast:

explain analyze
WITH rare AS (
  select * from talent
  where skills @> array[15])
select * from rare
where skills @> array[1]
-- (with changed operator)

You might vary your query according to an additional table that keeps the
occurrence count of all skills.
Not really pretty though.

regards,

Marc Mamin
Re: [PERFORM] Query performance
>I have an events table that records page views and purchases (type =
>'viewed' or type = 'purchased'). I have a query that figures out "people
>who bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours
>ago when it started taking hours to complete. Vacuum/analyze didn't help.
>Turned out there was one session_id that had 400k rows in the system.
>Deleting that made the query performant again.
>
>Is there anything I can do to make the query work better in cases like
>that? Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted
>correctly in the email.
>https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

here are 2 variations that should be somewhat faster.

It seems you may have duplicate (site_id, session_id, product_id) rows,
which would falsify the result. In that case you'll need some more logic in
the query.
select
  '82503' as product_id,
  e2.site_id,
  e2.product_id,
  count(nullif(e2.type='viewed', false)) view_count,
  count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
  e1.product_id = '82503' and
  e2.product_id != '82503'
group by e2.product_id, e2.site_id;

OR:

WITH SALL as (
  select
    e2.site_id,
    e2.product_id,
    count(nullif(e2.type='viewed', false)) view_count,
    count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where e1.product_id = '82503'
  group by e2.product_id, e2.site_id
)
SELECT '82503' as product_id_1, site_id, product_id, view_count, purchase_count
FROM SALL
WHERE product_id != '82503';

regards,

Marc Mamin

>explain select
>   e1.product_id,
>   e2.site_id,
>   e2.product_id,
>   count(nullif(e2.type='viewed', false)) view_count,
>   count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
>   e1.product_id = '82503' and
>   e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
>
>                               QUERY PLAN
>
> GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>         Sort Key: e1.product_id, e2.product_id, e2.site_id
>         ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
>                     Recheck Cond: (product_id = '82503'::citext)
>                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>                           Index Cond: (product_id = '82503'::citext)
>               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
>                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
>                     Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
>                                   Table "public.events"
>   Column    |           Type           |                     Modifiers
>-------------+--------------------------+----------------------------------------------------
> id          | bigint                   | not null default nextval('events_id_seq'::regclass)
> user_id     | citext                   |
> session_id  | citext                   | not null
> product_id  | citext                   | not null
> site_id     | citext                   | not null
> type        | text                     | not null
> happened_at | timestamp with time zone | not null
> created_at  | timestamp with time zone | not null
>Indexes:
>    "events_pkey" PRIMARY KEY, btree (id)
>    "events_product_id_site_id_idx" btree (product_id, site_id)
>    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>Check constraints:
>    "events_session_id_check" CHECK (length(session_id::text) < 255)
>    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
>    "events_user_id_check" CHECK (length(user_id::text) < 255)
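If duplicates do exist, one possible way to neutralize them (an editor's sketch, not from the thread) is to collapse the events to distinct rows before joining and counting:

```sql
-- hypothetical deduplication step: collapse repeated
-- (site_id, session_id, product_id, type) events before aggregating
WITH dedup AS (
  SELECT DISTINCT site_id, session_id, product_id, type
  FROM events
)
SELECT e2.site_id, e2.product_id,
       count(nullif(e2.type = 'viewed', false))    AS view_count,
       count(nullif(e2.type = 'purchased', false)) AS purchase_count
FROM dedup e1
JOIN dedup e2 ON e1.session_id = e2.session_id AND e1.type = e2.type
WHERE e1.product_id = '82503'
  AND e2.product_id != '82503'
GROUP BY e2.product_id, e2.site_id;
```

The DISTINCT pass costs extra, so it is only worth it if duplicates can actually occur.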
Re: [PERFORM] query a table with lots of coulmns
>At first, thanks for your fast and comprehensive help.
>
>The structure of my cache table is
>
>a text, b text NOT NULL, c text, d text, e timestamp without time zone
>DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
>s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0
>
>additional constraints: primary key (b), Unique(b), Unique(a)
>Indexes: Index on a, Index on b

This looks redundant. e.g. you don't need a unique index on b if you
already have a primary key on it.
Can you post the complete table definition?

...

>One remark which might help: overall 90 - 95 % of the s1-s512 columns
>are 0. I am only interested in columns not equal to 0. Perhaps it would
>make sense to use an array or json and enumerate only the values not
>equal to 0.

Could you change that to replace 0 values with NULLs?
This would greatly reduce your table space, as Postgres is very efficient
about NULL storage: it marks all null values in a bitmap within the row
header, so you just need about one bit per NULL instead of 4 bytes for
zeros, and hence get rid of your I/O issue.

regards,

Marc Mamin

From: pgsql-performance-ow...@postgresql.org
[pgsql-performance-ow...@postgresql.org] on behalf of Björn Wittich
[bjoern_witt...@gmx.de]
Sent: Samstag, 20. September 2014 09:19
To: Josh Berkus; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query a table with lots of coulmns

At first, thanks for your fast and comprehensive help.

The structure of my cache table is

a text, b text NOT NULL, c text, d text, e timestamp without time zone
DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
s3 integer DEFAULT 0, ...
, s512 integer DEFAULT 0

additional constraints: primary key (b), Unique(b), Unique(a)
Indexes: Index on a, Index on b

This table has 30 Mio rows (will increase to 50 Mio in future).

My working table is

b text, g integer
Indexes on b and g

This table has 5 Mio rows.

Scenario:

What I want to achieve:

SELECT s1, s2, s3, ..., s512, g, d
FROM <cache table> INNER JOIN <working table> USING (b)
ORDER BY g

The inner join will match at least 95 % of the rows of the smaller work
table, in this example 4.75 mio rows.

Running this query takes several hours until I receive the first results.
Query analyzing shows that the execution plan is doing 2 seq table scans on
the cache and work tables.

When I divide this huge statement into

SELECT s1, s2, s3, ..., s512, g, d
FROM <cache table> INNER JOIN <working table> USING (b)
WHERE g BETWEEN 1 and 10000 ORDER BY g,

SELECT s1, s2, s3, ..., s512, g, d
FROM <cache table> INNER JOIN <working table> USING (b)
WHERE g BETWEEN 10001 and 20000 ORDER BY g,

...

(I can do this because g is a unique and continuous id from 1 to N)

the result is fast, but firing parallel requests (4-8 times parallel) slows
down the retrieval. The execution plan changes when adding "BETWEEN 1 and
10000" to use the indexes.

One remark which might help: overall 90 - 95 % of the s1-s512 columns are
0. I am only interested in columns not equal to 0. Perhaps it would make
sense to use an array or json and enumerate only the values not equal to 0.

Statistics on the large table:
table size: 80 GB
toast-tablesize: 37 GB
size of indexes: 17 GB

Thanks for your help and ideas

Björn

Am 19.09.2014 23:40, schrieb Josh Berkus:
> On 09/19/2014 04:51 AM, Björn Wittich wrote:
>> I am relatively new to postgres. I have a table with 500 columns and
>> about 40 mio rows. I call this a cache table, where one column is a
>> unique key (indexed) and the 499 columns (type integer) are some values
>> belonging to this key.
>>
>> Now I have a second (temporary) table (only 2 columns, one is the key of
>> my cache table) and I want to do an inner join between my temporary
>> table and the large cache table and export all matching rows.
I found out >> that the performance increases when I limit the join to lots of small >> parts. >> But it seems that the database needs a lot of disk I/O to gather all 499 >> data columns. >> Is there a possibility to tell the database that all these columns are >> always treated as tuples and I always want to get the whole row? Perhaps >> the disk organization could then be optimized? > PostgreSQL is already a row store, which means by default you're getting > all of the columns, and the columns are stored physically adjacent to > each other. > > If requesting only 1 or two columns is faster than requesting all of > them, that's pretty much certainly due to transmission time, not disk > IO. Otherwise, please post your schema (well, a truncated version) and > your queries. > > BTW, in cases like yours I've used an INT array instead of 500 columns to > good effect; it works slightly better with PostgreSQL's compression. > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
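Marc's zeros-to-NULLs suggestion could be sketched as follows. This is an untested sketch with made-up names (`cache_table` stands for the poster's table, which really has s1 .. s512):

```sql
-- NULLIF(x, 0) yields NULL when x = 0, so zero counters end up in the
-- row's null bitmap (~1 bit each) instead of costing 4 bytes apiece.
ALTER TABLE cache_table ALTER COLUMN s1 DROP DEFAULT,
                        ALTER COLUMN s2 DROP DEFAULT;  -- DEFAULT 0 would reintroduce zeros

UPDATE cache_table
SET s1 = NULLIF(s1, 0),
    s2 = NULLIF(s2, 0);   -- ... repeat for s3 .. s512

-- check the effect (a VACUUM FULL is needed to actually reclaim the space):
SELECT pg_size_pretty(pg_total_relation_size('cache_table'));
```

Queries that are "only interested in columns not equal to 0" then simply test `s1 IS NOT NULL`.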
Re: [PERFORM] Cursor + upsert (astronomical data)
[Craig] >>If you haven't looked at clustering algorithms yet, you might want to do so. >>Your problem is a special case of clustering, where you have a large number >>of small clusters. A good place to start is the overview on Wikipedia: >>http://en.wikipedia.org/wiki/Cluster_analysis According to this list, your method is similar to http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme, but with what seem to be some logical errors. >The simplest approach I could think of is that I process each row of the 3e8 >rows >sequentially and ask: >Do I have an identifier in the radius of 1 arcsec? >No: Generate one and assign me to it. >Yes: Update it and assign me to it. The update is done as a weighted average – >I keep track of >how many observations the identifier has been computed from. The result is that the >identifier will have the average coordinates of all the observations it identifies >– it will >be the center. Let's say you have 2 single measures on a line at arcsec 1 and 2.1, which hence correspond to 2 ipix_cat. Now add a new measure at 1.9: as you choose any of the possible adjacent ipix_cat without considering the least distance, you may end up with an ipix_cat at 1.45 which is less than 1 arcsec away from the next one. Moreover you have raised the weight of both ipix_cat, which increases the lock probability when trying a multithreaded update. The max distance between 2 observations belonging to a same ipix_cat tends to 2 arcsec with your method. If this is ok, you should probably modify your method so that the 2 first points of my example would have merged to a single ipix_cat. You could use your weight for this: increase your search radius to 2 arcsec and then reject the candidates located between 1 and 2 arcsec depending on their weight. The additional workload might be compensated by the smaller number of ipix_cat that you will have.
Re: [PERFORM] Building multiple indexes on one table.
>From: pgsql-performance-ow...@postgresql.org >[pgsql-performance-ow...@postgresql.org] on behalf of Claudio >Freire [klaussfre...@gmail.com] >Sent: Friday, 18 July 2014 01:21 >To: Chris Ruprecht >Cc: pgsql-performance@postgresql.org >Subject: Re: [PERFORM] Building multiple indexes on one table. > >On Thu, Jul 17, 2014 at 7:47 PM, Chris Ruprecht wrote: >> Is there any way that I can build multiple indexes on one table without >> having to scan the table multiple times? For small tables, that's probably >> not an issue, but if I have a 500 GB table that I need to create 6 indexes >> on, I don't want to read that table 6 times. >> Nothing I could find in the manual other than reindex, but that's not >> helping, since it only rebuilds indexes that are already there and I don't >> know if that reads the table once or multiple times. If I could create >> indexes inactive and then run reindex, which then reads the table once, I >> would have a solution. But that doesn't seem to exist either. > >Just build them with separate but concurrent connections, and the >scans will be synchronized so it will be only one. > >Btw, reindex rebuilds one index at a time, so what I do is issue a >separate reindex for each index in parallel, to avoid the repeated >scans as well. > >Just make sure you've got the I/O and CPU capacity for it (you'll be >writing many indexes at once, so there is a lot of I/O). Index creation on large tables is mostly CPU bound as long as no swap occurs. I/O may be an issue when all your indexes are similar, e.g. all on single int4 columns; in other cases the writes will not all take place concurrently. To reduce I/O due to swap, you can consider increasing maintenance_work_mem on the connections/sessions that build the indexes. regards, Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
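Claudio's advice (one connection per index, so the synchronized sequential scans share a single pass over the heap) might look like this sketch; table, column, and index names are made up:

```sql
-- Run each CREATE INDEX in its OWN connection, started at the same time.
-- Postgres synchronizes the concurrent sequential scans, so the 500 GB
-- heap is effectively read once for all builds instead of once per index.

-- session 1:
SET maintenance_work_mem = '2GB';   -- per-session; reduces spilling sorts to disk
CREATE INDEX big_table_a_idx ON big_table (col_a);

-- session 2, started concurrently:
SET maintenance_work_mem = '2GB';
CREATE INDEX big_table_b_idx ON big_table (col_b);
```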
[PERFORM] fragmention issue with ext4: e4defrag?
Hello, Does anyone have experience using defragmentation tools on Linux against tablespaces? We are facing fragmentation problems with Postgres instances having a few TB of data (RAID 5). I/O throughput decreased from 300 MB/s to 160. - We first moved some schemas to separate servers. After that we still have 150'000 tables in 1.5 TB. - Now we are in the process of running VACUUM FULL on historical tables which are not written anymore. This seems to improve the I/O considerably. Our remaining issue is that the free space fragmentation is still suboptimal, so that fragmentation will probably start again soon. Would it make sense to use a tool like e4defrag (http://www.linux.org/threads/online-defragmentation.4121/) in order to defrag the free space? And how safe is it to use such a tool against a running Postgres instance? many thanks, Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] CREATE TABLE AS WITH FREEZE ?
Hello, Does anything speak against adding a "WITH FREEZE" option to "CREATE TABLE AS", similar to the new COPY FREEZE feature? best regards, Marc Mamin
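For context, a sketch of the COPY FREEZE path the proposal mirrors; `archive_copy` and `live_table` are made-up names, and the CTAS syntax at the end is the hypothetical feature, not existing SQL:

```sql
-- Existing COPY FREEZE: the target must be created or truncated in the
-- same transaction, then rows are written already frozen, avoiding the
-- later hint-bit / anti-wraparound rewrite of the whole table.
BEGIN;
CREATE TABLE archive_copy (LIKE live_table);
COPY archive_copy FROM '/tmp/live_table.csv' WITH (FREEZE);
COMMIT;

-- The proposed, currently non-existing equivalent:
-- CREATE TABLE archive_copy WITH FREEZE AS SELECT * FROM live_table;
```

CREATE TABLE AS already satisfies the "created in the same transaction" precondition by construction, which is what makes the extension plausible.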
Re: [PERFORM] stable and immutable functions in GROUP BY clauses.
> > Hello, > > Stable and immutable functions do not improve performance when used within > the GROUP BY clause. > Here, the function will be called for each row. > > To avoid it, I can replace the function by its arguments within GROUP BY. Shame on me! This is of course bullsh... It has nothing to do with immutability and only applies to a few cases, e.g.: it's fine for select x+1 ... group by x, but not for select x^2 ... group by x Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
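The x+1 vs. x^2 distinction can be illustrated with a tiny made-up example:

```sql
-- Substituting the expression's argument into GROUP BY is only safe when
-- distinct arguments always yield distinct expression values (x+1),
-- not when several arguments map to the same value (x^2).
CREATE TEMP TABLE t (x int);
INSERT INTO t VALUES (-2), (2);

SELECT x + 1 FROM t GROUP BY x;      -- two groups: -1 and 3 (same as GROUP BY x+1)
SELECT x ^ 2 FROM t GROUP BY x;      -- two rows, both showing 4
SELECT x ^ 2 FROM t GROUP BY x ^ 2;  -- one row: grouping by the value itself
```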
[PERFORM] stable and immutable functions in GROUP BY clauses.
Hello, Stable and immutable functions do not improve performance when used within the GROUP BY clause. Here, the function will be called for each row. To avoid it, I can replace the function by its arguments within GROUP BY. Maybe this hint is worth a note within the documentation on Function Volatility. I have a situation where queries are generated by the application, and it would be a pain to extend the "query builder" in order to avoid this performance issue. So I wonder if it would be possible for the query planner to recognize such cases and optimize the query internally? best regards, Marc Mamin here an example to highlight the possible performance loss: create temp table ref ( i int, r int); create temp table val ( i int, v int); insert into ref select s,s%2 from generate_series(1,1)s; insert into val select s,s%2 from generate_series(1,1)s; create or replace function getv(int) returns int as $$ select v+1 from val where i=$1; $$ language SQL stable; explain analyze select getv(r) from ref group by r; Total runtime: 5.928 ms explain analyze select getv(r) from ref group by getv(r); Total runtime: 3980.012 ms -- and more reasonably with an index: create unique index val_ux on val(i); explain analyze select getv(r) from ref group by r; Total runtime: 4.278 ms explain analyze select getv(r) from ref group by getv(r); Total runtime: 68.758 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)
From: Stefan Keller [sfkel...@gmail.com] >Sent: Saturday, 20 July 2013 01:55 > >Hi Marc > >Thanks a lot for your hint! > >You mean doing a "SET track_counts (true);" for the whole session? No, I mean ALTER TABLE <table> ALTER <column> SET STATISTICS 0; And remove the existing statistics: DELETE FROM pg_catalog.pg_statistic WHERE starelid='<table>'::regclass AND staattnum = (SELECT attnum FROM pg_attribute WHERE attrelid = '<table>'::regclass AND attname = '<column>'::name) But you should first try to find out which proportion of your ts queries are faster when using a table scan, as they will probably not happen anymore afterwards! (Except if further columns on your table 'FullTextSearch' are considered by the planner) >That would be ok if it would be possible just for the gin index. > >It's obviously an issue of the planner estimation costs. >The data I'm speaking about ("movies") has a text attribute which has >a length of more than 8K so it's obviously having to do with >detoasting. >But the thoughts about @@ operators together with this GIN index seem >also to be valid. > >I hope this issue is being tracked in preparation for 9.3. > >Regards, Stefan > > >2013/7/19 Marc Mamin : >> >>> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ >>> plainto_tsquery('english', 'good'); >>> >>> It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB >>> zipped). >>> The planner obviously always chooses a table scan >> >> >> Hello, >> >> A probable reason for the time difference is the cost for decompressing >> toasted content. >> At least in 8.3, the planner was not good at estimating it. >> >> I'm getting better overall performance since I've stopped collecting statistics >> on tsvectors. >> An alternative would have been to disallow compression on them. >> >> I'm aware this is a drastic way and would not recommend it without testing. >> The benefit may depend on the type of data you are indexing.
>> In our use case these are error logs with many Java stack traces, hence with >> many poorly discriminative lexemes. >> >> see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us >> as a comment on >> http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net >> >> regards, >> >> Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)
> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ > plainto_tsquery('english', 'good'); > > It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped). > The planner obviously always chooses a table scan Hello, A probable reason for the time difference is the cost for decompressing toasted content. At least in 8.3, the planner was not good at estimating it. I'm getting better overall performance since I've stopped collecting statistics on tsvectors. An alternative would have been to disallow compression on them. I'm aware this is a drastic way and would not recommend it without testing. The benefit may depend on the type of data you are indexing. In our use case these are error logs with many Java stack traces, hence with many poorly discriminative lexemes. see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us as a comment on http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net regards, Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
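The "disallow compression" alternative mentioned above can be done per column. This is an untested sketch; the table and column names are taken from the quoted query:

```sql
-- EXTERNAL = keep the value out-of-line in TOAST but uncompressed,
-- trading disk space for cheaper detoasting at scan time.
-- Only affects newly written rows; existing rows keep their storage
-- until they are rewritten (e.g. by an UPDATE or a table rewrite).
ALTER TABLE FullTextSearch
  ALTER COLUMN content_tsv_gin SET STORAGE EXTERNAL;
```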
Re: [PERFORM] Trying to eliminate union and sort
Hello, you may want to try starting with some CTEs that first retrieve the required subsets. Adding ordering within those CTEs might also improve the timing of the following sort/join operations. (sorry for the top posting) Something like: WITH T1 AS ( SELECT id, typeid, backup_id, mycolumn1, mycolumn2 FROM table1 t WHERE t.external_id IN ('6544', '2234', '2', '4536') ORDER BY mycolumn2, id ? ), TYPES AS (SELECT DISTINCT typeid FROM T1), T3_OTHERS AS ( SELECT id, otherid FROM table3 JOIN TYPES ON table3.id = TYPES.typeid -- ORDER BY id ? ) SELECT T1.id, T1.mycolumn1, T3_OTHERS.otherid, T3_2.otherid, T1.mycolumn2 AS mycolumn2 FROM T1 LEFT OUTER JOIN T3_OTHERS ON T1.typeid = T3_OTHERS.id LEFT OUTER JOIN table2 t2 ON (t2.real_id = T1.backup_id OR t2.real_id = T1.id) LEFT OUTER JOIN table3 T3_2 ON t2.third_id = T3_2.id ORDER BY T1.mycolumn2, T1.id regards, Marc Mamin From: pgsql-performance-ow...@postgresql.org [pgsql-performance-ow...@postgresql.org] on behalf of Brian Fehrle [bri...@consistentstate.com] Sent: Monday, 15 July 2013 18:12 To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Trying to eliminate union and sort On 07/12/2013 04:43 PM, Josh Berkus wrote: >> As for the counts on the tables: >> table1 3,653,472 >> table2 2,191,314 >> table3 25,676,589 >> >> I think it's safe to assume right now that any resulting joins are not >> one-to-one > Hmmm? How is doing a subselect in the SELECT clause even working, then? > Oh my, this is sad. The query in all returns 9,955,729 rows, so the sub queries are run on each of these resulting rows; however in this entire result set, subquery 1 returns 16 distinct rows and subquery 2 returns 63 different rows, but those sub queries are run over 10 million times to return these few distinct rows. So it's running many times, but returning the same small set of data over and over again.
- Brian F -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Are bitmap index scans slow to start?
>Rebuilding the index might help, as it would put all the leaf pages holding >values for session_id=27 adjacent to each other, so they would read from disk >faster. But with a name like >"session_id", I don't know how long such >clustering would last though. >If I'm right about the index disk-read time, then switching to a plain index >scan rather than a bitmap index scan would make no difference--either way the >data has to come off the disk. >>I'd prefer a >>strategy that allowed fast performance the first time, rather than slow the >>first time and extremely fast subsequently. Hello, if the index is only used to locate rows for a single session_id, you may consider splitting it into a set of partial indexes. e.g. create index i_0 on foo(session_id) where session_id%4 = 0; create index i_1 on foo(session_id) where session_id%4 = 1; create index i_2 on foo(session_id) where session_id%4 = 2; create index i_3 on foo(session_id) where session_id%4 = 3; (can be built in parallel using separate connections) Then you will have to ensure that all your WHERE clauses also contain the index condition: WHERE session_id = 27 AND session_id%4 = 27%4 regards, Marc Mamin
Re: [PERFORM] A very long running query....
Hello, isn't the first test superfluous here? > where extract('day' from message_copies.msg_date_rec) = 17 > and date_trunc('day',message_copies.msg_date_rec) = '2012-07-17' > Here is the index: > > CREATE INDEX idx_message_copies_wk2_date_src_pos_partial > ON feed_all_y2012m07.message_copies_wk2 > USING btree > (date_trunc('day'::text, msg_date_rec), > src_id, > (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || > pos_georef4::text)) > TABLESPACE archive > WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || > pos_georef4::text) IS NOT NULL > OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || > pos_georef4::text) = ''::text; the georef test can be simplified using coalesce: > and (message_copies.pos_georef1 || message_copies.pos_georef2 || > message_copies.pos_georef3 || message_copies.pos_georef4) <> '' > and not (message_copies.pos_georef1 || message_copies.pos_georef2 || > message_copies.pos_georef3 || message_copies.pos_georef4) is null => and coalesce ( (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4), '') <> '' In order to avoid this test at query time you might add a boolean column message_copies.has_georef, and keep it up to date with a before insert or update trigger. This will allow you to shorten your index definition and simplify the planner's task a little bit. Moreover it will speed up your query in cases where the index doesn't get used. As Tom already mentioned, it may make sense not to concatenate the georef parts within the index, but to keep them separated, or even to keep them in different indexes.
Which is best depends on the other queries running against this table. HTH, Marc Mamin -Original Message- From: pgsql-performance-ow...@postgresql.org on behalf of Ioannis Anagnostopoulos Sent: Sat 7/21/2012 1:56 AM To: Tom Lane Cc: Claudio Freire; pgsql-performance@postgresql.org Subject: Re: [PERFORM] A very long running query On 21/07/2012 00:10, Tom Lane wrote: > Claudio Freire writes: >> Looking at this: >> "-> Index Scan using >> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 >> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual >> time=62.124..5486270.845 rows=387524 loops=1)" >> " Index Cond: ((date_trunc('day'::text, >> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) >> AND (src_id = 1))" >> " Filter: ((date_part('day'::text, >> msg_date_rec) = 17::double precision) AND (NOT (pos_georef1)::text >> || (pos_georef2)::text) || (pos_georef3)::text) || >> (pos_georef4)::text) IS NULL)) AND (pos_georef1)::text || >> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) >> <> ''::text))" > I think the real problem is that the planner has no hope of doing > anything very accurate with such an unwieldy filter condition. I'd look > at ways of making the filter conditions simpler, perhaps by recasting > the data representation. In particular, that's a horridly bad way of > asking whether some columns are empty, which I gather is the intent. > If you really want to do it just like that, creating an index on the > concatenation expression would guide ANALYZE to collect some stats about > it, but it would probably be a lot more efficient to put together an AND > or OR of tests on the individual columns.
> > regards, tom lane So what you suggest is to forget altogether the concatenation of georef1/2/3/4 and instead alter my query with something like: georef1 is not null and not georef1 = '' etc. for georef2, 3 and 4. That would require altering my index to have the four georef columns separately in it rather than as a concatenation, and so on for the partial index part. And a final thing: you seem to imply that the indexes are used by the analyzer to collect statistics even if they are not used in queries. So an index serves not only as a way to speed up targeted queries but also to provide better statistics to the analyzer? Kind Regards Yiannis -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
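Marc's pre-computed flag could be sketched as below. This is a hypothetical, untested sketch; the table and column names are taken from the quoted query, the function and trigger names are made up:

```sql
ALTER TABLE message_copies ADD COLUMN has_georef boolean;

CREATE OR REPLACE FUNCTION message_copies_set_has_georef() RETURNS trigger AS $$
BEGIN
  -- true when the concatenated georef parts are neither NULL nor empty
  NEW.has_georef := coalesce(NEW.pos_georef1 || NEW.pos_georef2
                          || NEW.pos_georef3 || NEW.pos_georef4, '') <> '';
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER message_copies_georef_flag
  BEFORE INSERT OR UPDATE ON message_copies
  FOR EACH ROW EXECUTE PROCEDURE message_copies_set_has_georef();

-- the WHERE clause and the partial index predicate can then test the
-- single boolean:  ... AND has_georef
```

This moves the cost of the unwieldy concatenation test from every query to a one-time evaluation per row write, which is also easier for the planner to estimate.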
Re: [PERFORM] PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the fastest way to clean each non-empty table and reset unique identifier column of empty ones.
Stanislaw Pankevich wrote: >>> PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the >>> fastest way to clean each >>> non-empty table and reset unique identifier column of empty ones Hello, 2 'exotic' ideas: - use dblink_send_query to do the job in multiple threads (I doubt this really could be faster) - have prepared empty tables in a separate schema, and a "garbage schema": ALTER TABLE x SET SCHEMA garbage; ALTER TABLE prepared.x SET SCHEMA "current"; You would be ready for the next test, but would still have to clean the garbage and move new empty tables to prepared for the next-but-one test in the background. best regards, Marc Mamin >>> >>> I wonder, what is the fastest way to accomplish this kind of task in >>> PostgreSQL. I am interested in >>> the fastest solutions ever possible. >>> I need the fastest cleaning strategy for such case working on PostgreSQL >>> both 8 and 9. >>> >>> I see the following approaches: >>> >>> 1) Truncate each table. It is too slow, I think, especially for empty >>> tables. >> Did you actually try it? That's the king's way to performance questions! >> Truncating a single table is done in a matter of microseconds, particularly >> if it is not big. >> Do you have tens of thousands of tables? > Actually, 10-100 tables. >> You could of course run a SELECT 1 FROM table LIMIT 1, but again I don't >> think that this will be considerably faster than just truncating the table. > > Exactly this query is much faster, believe me. You can see my latest > results on https://github.com/stanislaw/truncate-vs-count. Ok, I believe you. My quick tests showed that a single truncate (including transaction and client-server roundtrip via UNIX sockets) takes some 10 to 30 milliseconds. Multiply that with 100, and you end up with just a few seconds at most. Or what did you measure? I guess you run that deletion very often so that it is painful.
Still I think that the biggest performance gain is to be had by using PostgreSQL's features (truncate several tables in one statement, ...). Try to bend your Ruby framework! Yours, Laurenz Albe -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
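Laurenz's "truncate several tables in one statement" point can be sketched as follows (t1 .. t3 are placeholder names for the 10-100 test tables):

```sql
-- One statement, one round trip, for all tables at once.
-- RESTART IDENTITY (available since 8.4) also resets the sequences owned
-- by the tables, covering the "reset unique identifier" requirement.
TRUNCATE TABLE t1, t2, t3 RESTART IDENTITY;
```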
Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)
> -Original Message- > From: Pavel Stehule [mailto:pavel.steh...@gmail.com] > > 2012/6/26 Marc Mamin : > > > >>> On 22/06/12 09:02, Maxim Boguk wrote: > > > >>> Maybe I'm completely wrong, but I always assumed that the access > speed to an array element in PostgreSQL should be close to constant > time. > >>> But in tests I found that access speed degrades as O(N) of array > size. > > > >>> Is that behaviour correct? > > yes - access to position n means in postgresql - skip n-1 elements Hmmm... how many elements need to be skipped here? SELECT _array[1] FROM t2; I wonder if the time is rather spent first retrieving the array itself, before accessing its elements. regards, Marc Mamin > > Regards > > Pavel > > > > > > >> From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper > Krogh > > > >> Default column storage is to "compress it, and store in TOAST" with > large values. > >> This is what is causing the shift. Try to change the column storage > of the column > >> to EXTERNAL instead and rerun the test.
> > > > > > Hello, > > > > I've repeated your test in a simplified form: > > you are right :-( > > > > create table t1 ( _array int[]); > > alter table t1 alter _array set storage external; > > insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5)); > > > > create table t2 ( _array int[]); > > alter table t2 alter _array set storage external; > > insert into t2 SELECT ARRAY(SELECT * FROM > generate_series(1,500)); > > > > explain analyze SELECT _array[1] FROM t1; > > Total runtime: 0.125 ms > > > > explain analyze SELECT _array[1] FROM t2; > > Total runtime: 8.649 ms > > > > > > best regards, > > > > Marc Mamin > > > > > > > > -- > > Sent via pgsql-performance mailing list (pgsql- > performa...@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)
>> On 22/06/12 09:02, Maxim Boguk wrote: >> Maybe I'm completely wrong, but I always assumed that the access speed to an >> array element in PostgreSQL should be close to constant time. >> But in tests I found that access speed degrades as O(N) of array size. >> Is that behaviour correct? > From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper Krogh > Default column storage is to "compress it, and store in TOAST" with large > values. > This is what is causing the shift. Try to change the column storage of the > column > to EXTERNAL instead and rerun the test. Hello, I've repeated your test in a simplified form: you are right :-( create table t1 ( _array int[]); alter table t1 alter _array set storage external; insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5)); create table t2 ( _array int[]); alter table t2 alter _array set storage external; insert into t2 SELECT ARRAY(SELECT * FROM generate_series(1,500)); explain analyze SELECT _array[1] FROM t1; Total runtime: 0.125 ms explain analyze SELECT _array[1] FROM t2; Total runtime: 8.649 ms best regards, Marc Mamin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] text search: tablescan cost for a tsvector
> From: Robert Haas [mailto:robertmh...@gmail.com] > Sent: Wed 2/29/2012 7:32 > > On Mon, Feb 6, 2012 at 6:05 AM, Marc Mamin wrote: > > without analyze: http://explain.depesz.com/s/6At > > with analyze:http://explain.depesz.com/s/r3B ... > The problem seems to be that the cost estimator doesn't know that > detoasting is expensive. Hello, Tom Lane has started a follow-up thread on the hackers list. Detoasting is indeed the main obstacle, but I've repeated my test using plain storage and the planner still chooses (systematically?) the slowest query. It seems that I bumped into 2 different issues at the same time. http://archives.postgresql.org/pgsql-hackers/2012-02/msg00896.php Background: our reporting system offers amongst others time histograms combined with FTS filtering on error occurrences (imported from error logs). It is hence not unusual that given search terms are found within a majority of the documents... best regards, Marc Mamin
[PERFORM] text search: tablescan cost for a tsvector
Hello, I have quite systematically better performance with the text search when I disable the statistics collection for the tsvector column. So I wonder if such statistics ever make sense. Here a testcase: The table contains 200'000 tsvectors, and the lexeme 'fooblablabla' exists in all of them: Without statistics, the planner decides as expected for the gin index. After analyze, it switches to a table scan, which is also expected, but the query is 3 times slower. My first thought was that the Bitmap Heap Scan was really fast as the searched term is always at the first position. So I repeated the test with an additional search term at the last position, but without significant change: (result from the 6th test below) without analyze: http://explain.depesz.com/s/6At with analyze: http://explain.depesz.com/s/r3B best regards, Marc Mamin Here all my results, always one of the fastest from a few runs. CREATE TABLE tsv_test ( id bigserial NOT NULL, v tsvector ); The test query: explain analyze select id from tsv_test where v @@ 'lexeme3179'::tsquery UNION ALL select id from tsv_test where v @@ 'lexeme5'::tsquery UNION ALL select id from tsv_test where v @@ 'fooblablabla'::tsquery The results A) on first lexeme 1) without indexes without analyze: http://explain.depesz.com/s/bOv 2) alter table tsv_test add constraint tsv_test_pk primary key(id); http://explain.depesz.com/s/9QQ (same as previous); 3) create index tsv_gin on tsv_test using gin(v); http://explain.depesz.com/s/r4M <= fastest 4) ANALYZE tsv_test (id); http://explain.depesz.com/s/MyC (same as previous); 5) ANALYZE tsv_test; http://explain.depesz.com/s/qu3S B) on last lexeme 6) create table tsv_test2 as select id, v||'zzthisisalongerlexemethisisalongerlexeme'::tsvector from tsv_test; explain analyze select id from tsv_test2 where v @@ 'zzthisisalongerlexemethisisalongerlexeme'::tsquery http://explain.depesz.com/s/6At ANALYZE tsv_test2; http://explain.depesz.com/s/r3B test data: insert into tsv_test (v) select
cast('fooblablabla' || ' lexeme'||s%2|| ' lexeme'||s%3|| ' lexeme'||s%4|| ' lexeme'||s%4|| ' lexeme'||s%5|| ' lexeme'||s%6|| ' lexeme'||s%7|| ' lexeme'||s%8|| ' lexeme'||s%9|| ' lexeme'||s%10 || ' lexeme2'||s%11 || ' lexeme3'||s%12 || ' lexeme'||s%11 || ' lexeme2'||s%12 || ' lexeme3'||s%22 || ' lexeme'||s%12 || ' lexeme2'||s%13 || ' lexeme3'||s%32 || ' lexeme'||s%13 || ' lexeme2'||s%14 || ' lexeme3'||s%42 || ' lexeme'||s%14 || ' lexeme2'||s%15 || ' lexeme3'||s%52 || ' lexeme'||s%15 || ' lexeme2'||s%16 || ' lexeme3'||s%62 || ' lexeme'||s%16 || ' lexeme2'||s%17 || ' lexeme3'||s%72 || ' lexeme'||s%17 || ' lexeme2'||s%18 || ' lexeme3'||s%82 || ' lexeme'||s%18 || ' lexeme2'||s%19 || ' lexeme3'||s%92 || ' lexeme'||s%19 || ' lexeme2'||s%10 || ' lexeme3'||s%15 || ' lexeme'||s%12 || ' lexeme2'||s%71 || ' lexeme3'||s%16 || ' lexeme'||s%20 || ' lexeme2'||s%81 || ' lexeme3'||s%17 || ' lexeme'||s%35 || ' lexeme2'||s%91 || ' lexeme3'||s%18 || ' lexeme'||s%100 || ' lexeme2'||s%110 || ' lexeme3'||s%120 || ' lexeme'||s%110 || ' lexeme2'||s%120 || ' lexeme3'||s%220 || ' lexeme'||s%120 || ' lexeme2'||s%130 || ' lexeme3'||s%320 || ' lexeme'||s%130 || ' lexeme2'||s%140 || ' lexeme3'||s%420 || ' lexeme'||s%140 || ' lexeme2'||s%150 || ' lexeme3'||s%520 || ' lexeme'||s%150 || ' lexeme2'||s%160 || ' lexeme3'||s%620 || ' lexeme'||s%160 || ' lexeme2'||s%170 || ' lexeme3'||s%720 || ' lexeme'||s%170 || ' lexeme2'||s%180 || ' lexeme3'||s%820 || ' lexeme'||s%180 || ' lexeme2'||s%190 || ' lexeme3'||s%920 || ' lexeme'||s%190 || ' lexeme2'||s%100 || ' lexeme3'||s%150 || ' lexeme'||s%120 || ' lexeme2'||s%710 || ' lexeme3'||s%160 || ' lexeme'||s%200 || ' lexeme2'||s%810 || ' lexeme3'||s%170 || ' lexeme'||s%350 || ' lexeme2'||s%910 || ' lexeme3'||s%180 as tsvector) FROM generate_series(1,10) s UNION ALL select cast('fooblablabla' || ' thisisalongerlexemethisisalongerlexeme'||s%2|| ' thisisalongerlexemethisisalongerl
Re: [PERFORM] How to remove a table statistics ?
Hello, Some more tests have shown that removing the statistics just moves the performance issue to other places. The main issue here is a bad design, so I'd better focus on that than lose too much time with the current situation. But this raises an interesting question on how/where Postgres stores statistics on functional indexes. In pg_statistic there is information on the column content, but I couldn't find stats on the function result, which is fully computed only during the index creation. I guess that the planner would need to know at least the function cost to weigh the benefit of such an index. In my case I would set the function cost to 200 ... I have also tried to reduce random_page_cost to "2", and it seems to help in a few cases. (anonymized) explain analyze SELECT min(msoffset) as t, coalesce(pipelinecall_id,-2) as pid from aserrorlist_20120125 l WHERE 1 = 1 AND msoffset >= 132750300 AND my_func('foo',20120125,l.id, l.header_9_10_id, l.categories_id, l.firstline_id) @@ to_aserrcfg_search_tsq($KUKU$lexeme_1 ! lexeme_2$KUKU$) group by ridcount,pipelinecall_id,coalesce(toplevelrid,msoffset::varchar); without stats: http://explain.depesz.com/s/qPg with stats: http://explain.depesz.com/s/88q aserr_20120125_tvi: GIN Index on my_func(.,.,.,.,.,.) best regards, Marc Mamin > -Original Message- > From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- > ow...@postgresql.org] On Behalf Of Josh Berkus > Sent: Tuesday, 31 January 2012 19:44 > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to remove a table statistics ? > > On 1/31/12 3:50 AM, Marc Mamin wrote: > > Hello, > > > > I have a weird table, upon which the queries are much faster when no > > statistics were collected. > > > > Is there a way to delete statistics information for a table ? > > I've tried ALTER.. set STATISTICS 0 and then run ANALYZE, but it > seems > > that old statistics are kept this way. > > Can I delete entries directly in pg_statistic ?
> > (Postgresql 9.1)
>
> You can, but it won't do any good; autovacuum will replace them.
>
> It would be better to fix the actual query plan issue. If you can, post
> the query plans with and without statistics (EXPLAIN ANALYZE, please)
> here.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
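The function cost Marc mentions can be declared on the function itself; a minimal sketch (the argument types of my_func are guessed from the query above, and 200 is the value suggested in the post):

```sql
-- A higher declared cost makes the planner more willing to prefer an
-- index that precomputes the function result over re-evaluating it:
ALTER FUNCTION my_func(text, integer, bigint, bigint, bigint, bigint)
    COST 200;
```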
[PERFORM] How to remove a table statistics ?
Hello,

I have a weird table, upon which the queries are much faster when no statistics were collected.

Is there a way to delete statistics information for a table ? I've tried ALTER .. SET STATISTICS 0 and then ran ANALYZE, but it seems that old statistics are kept this way. Can I delete entries directly in pg_statistic ? (Postgresql 9.1)

Short background info: one of the table's indexes is a GIN index on a tsvector-returning function, which is very costly. Once the table is analyzed, the query planner often ignores this index in favour of another one, hence triggering this function too often. I'll fix that model, but am first looking for a quick way to restore performance on our production servers.

best regards,

Marc Mamin
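For reference, the two approaches mentioned look roughly like this (table and column names are hypothetical; as the reply in this thread notes, autovacuum will re-create the entries):

```sql
-- Keep ANALYZE from collecting statistics for a column:
ALTER TABLE weird_table ALTER COLUMN some_col SET STATISTICS 0;
ANALYZE weird_table;

-- Brute force (superuser only; autovacuum will repopulate these rows):
DELETE FROM pg_statistic
WHERE starelid = 'weird_table'::regclass;
```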
Re: [PERFORM] Duplicate deletion optimizations
Yes, but it should become a bit slower if you fix your code :-)

    where t_imp.id is null and test.id=t_imp.id;
    =>
    where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

    (after the first update)
    create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null;

regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Jochen Erwied
Sent: Sat 1/7/2012 12:57
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations

Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process has to insert a few thousand rows into this
>> table, but sometimes the process has to insert an already existing row
>> (based on the values in the triplet (t_value, t_record, output_id)). In
>> this case, the row must be updated with the new count value. I've tried
>> some solutions given in this stackoverflow question [1], but the
>> insertion rate is always too low for my needs.

I did check the following in a loop, starting with an empty table, and inserting/updating 5 random unique entries. After 15 minutes I've got about 10 million records, each loop takes about 3 seconds. After 30 minutes the table contains approx. 18 million entries, time per loop only slightly increased. After 90 minutes the database has about 30 million entries. The speed has dropped to about 15-20 seconds per loop, but the server is doing lots of other queries in parallel, so with an unloaded server the updates should still take less than 10 seconds.
The generator runs in perl, and generates records for a maximum of 100 million different entries:

    use strict;

    srand time;
    my $i = 0;
    open FD, ">data.in";
    for (1..5)
    {
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
    }
    close FD;

The SQL-script looks like this:

    \timing on
    begin;
    create temp table t_imp(id bigint, t_value integer, t_record integer, output_id integer, count bigint);
    \copy t_imp (t_value, t_record, output_id, count) from 'data.in'
    --an index is not really needed, table is in memory anyway
    --create index t_imp_ix on t_imp(t_value,t_record,output_id);

    -- find matching rows
    update t_imp
           set id=test.id
           from test
           where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=
                 (test.t_value,test.t_record,test.output_id);
    -- update matching rows using primary key
    update test
           set count=t_imp.count
           from t_imp
           where t_imp.id is null and test.id=t_imp.id;
    -- insert missing rows
    insert into test(t_value,t_record,output_id,count)
           select t_value,t_record,output_id,count
           from t_imp
           where id is null;
    commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 GHz, table and indices stored on a SSD)

Table statistics:

    relid             | 14332525
    schemaname        | public
    relname           | test
    seq_scan          | 8
    seq_tup_read      | 111541821
    idx_scan          | 149240169
    idx_tup_fetch     | 117901695
    n_tup_ins         | 30280175
    n_tup_upd         | 0
    n_tup_del         | 0
    n_tup_hot_upd     | 0
    n_live_tup        | 30264431
    n_dead_tup        | 0
    last_vacuum       |
    last_autovacuum   |
    last_analyze      |
    last_autoanalyze  | 2012-01-07 12:38:49.593651+01
    vacuum_count      | 0
    autovacuum_count  | 0
    analyze_count     | 0
    autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

--
Jochen Erwied    | home: joc...@erwied.eu          +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 | work: j...@mbs-software.de      +49-2151-7294-24, FAX: -50
D-45470 Muelheim | mobile: jochen.erw...@vodafone.de      +49-173-5404164
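Putting Marc's correction together with Jochen's script, the update steps would read as follows (a sketch; table and column names are taken from the thread):

```sql
-- mark the staging rows that already exist in the big table
UPDATE t_imp SET id = test.id
FROM test
WHERE (t_imp.t_value, t_imp.t_record, t_imp.output_id)
    = (test.t_value, test.t_record, test.output_id);

-- update matching rows via the primary key (note: id IS NOT NULL)
UPDATE test SET count = t_imp.count
FROM t_imp
WHERE t_imp.id IS NOT NULL
  AND test.id = t_imp.id;

-- insert the rows that had no match
INSERT INTO test (t_value, t_record, output_id, count)
SELECT t_value, t_record, output_id, count
FROM t_imp
WHERE id IS NULL;
```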
Re: [PERFORM] Postgres array parser
> Yes, it would be great, but I haven't found such a function, which
> splits a 2-dimensional array into rows =) Maybe we'll modify an existing
> function, but unfortunately we have tried the hstore type and functions in
> postgres and we see significant performance improvements. So we only
> need to convert existing data into hstore and I think this is a good
> solution.

I haven't tested hstore yet, but I would be interested to find out if it still performs better with custom "numeric" aggregates on the hstore values.

I've made a short "proof of concept" test with a custom key/value type to achieve such an aggregation. Something like:

    SELECT x, distinct_sum( (currency,amount)::keyval ) AS overview
    FROM ...
    GROUP BY x

    x   currency   amount
    a   EUR        15.0
    a   EUR         5.0
    a   CHF         7.5
    b   USD        12.0

    =>

    x   overview
    -   --------
    a   {(EUR,20.0), (CHF,7.5)}
    b   {(USD,12.0)}

regards,

Marc Mamin

> On 12/14/2011 11:21 AM, Marc Mamin wrote:
> > Hello,
> >
> > For such cases (see below), it would be nice to have an unnest
> > function that only affects the first array dimension.
> >
> > Something like
> >
> > unnest(ARRAY[[1,2],[2,3]], SLICE=1)
> > =>
> > unnest
> > ------
> > [1,2]
> > [2,3]
> >
> > With this function, I imagine that the following sql function
> > might beat the plpgsql FOREACH version.
> >
> > CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[], in_input_nr numeric)
> > RETURNS numeric AS
> > $BODY$
> >    SELECT u[1][2]
> >    FROM unnest($1, SLICE=1) u
> >    WHERE u[1][1]=in_input_nr
> >    LIMIT 1;
> > $BODY$
> > LANGUAGE sql IMMUTABLE;
> >
> > best regards,
> >
> > Marc Mamin
> >
> >> -----Original Message-----
> >> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Pavel Stehule
> >> Sent: Dienstag, 13.
Dezember 2011 15:43 > >> To: Aleksej Trofimov > >> Cc: pgsql-performance@postgresql.org > >> Subject: Re: [PERFORM] Postgres array parser > >> > >> Hello > >> > >> 2011/12/13 Aleksej Trofimov: > >>> We have tried foreach syntax, but we have noticed performance > >> degradation: > >>> Function with for: 203ms > >>> Function with foreach: ~250ms: > >>> > >>> there is functions code: > >>> CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[], > >> in_input_nr > >>> numeric) > >>> RETURNS numeric AS > >>> $BODY$ > >>> declare i numeric[]; > >>> BEGIN > >>> FOREACH i SLICE 1 IN ARRAY in_inputs > >>> LOOP > >>> if i[1] = in_input_nr then > >>> return i[2]; > >>> end if; > >>> END LOOP; > >>> > >>> return null; > >>> END; > >>> $BODY$ > >>> LANGUAGE plpgsql VOLATILE > >>> COST 100; > >>> > >>> CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[], > >> in_input_nr > >>> numeric) > >>> RETURNS numeric AS > >>> $BODY$ > >>> declare > >>> size int; > >>> BEGIN > >>> size = array_upper(in_inputs, 1); > >>> IF size IS NOT NULL THEN > >>> > >>> FOR i IN 1 .. size LOOP > >>> if in_inputs[i][1] = in_input_nr then > >>> return in_inputs[i][2]; > >>> end if; > >>> END LOOP; > >>> END IF; > >>> > >>> return null; > >>> END; > >>> $BODY$ > >>> LANGUAGE plpgsql VOLATILE > >>> COST 100; > >>> > >>> > >>> On 12/13/2011 04:02 PM, Pavel Stehule wrote: > >>>> Hello > >>>> > >>>> do you know FOREACH IN ARRAY statement in 9.1 > >>>> > >>>> this significantly accelerate iteration over array > >>>> > >>>> > >>>> http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1- > foreach- > >> in-array/ > >>>> > >>>> > >>>> 2011/12/13 Aleksej Trofimov: > >>>>> Hello, I wanted to
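Converting the {{key,value}} arrays to hstore, as discussed above, turns the looped search into a single operator call; a minimal sketch (assuming the hstore contrib module is installed):

```sql
-- {{1,5},{2,6},{3,7}} stored as an hstore instead of a 2-D array:
-- the per-key lookup is then just the -> operator plus a cast.
SELECT ('1=>5, 2=>6, 3=>7'::hstore -> '2')::numeric;  -- 6
```

hstore keys and values are text, so numeric values must be cast on the way in and out; whether that overhead matters compared to the array loop is exactly what would need testing.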
Re: [PERFORM] Postgres array parser
Hello,

For such cases (see below), it would be nice to have an unnest function that only affects the first array dimension.

Something like:

    unnest(ARRAY[[1,2],[2,3]], SLICE=1)
    =>
    unnest
    ------
    [1,2]
    [2,3]

With this function, I imagine that the following sql function might beat the plpgsql FOREACH version:

    CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[], in_input_nr numeric)
    RETURNS numeric AS
    $BODY$
       SELECT u[1][2]
       FROM unnest($1, SLICE=1) u
       WHERE u[1][1]=in_input_nr
       LIMIT 1;
    $BODY$
    LANGUAGE sql IMMUTABLE;

best regards,

Marc Mamin

> -----Original Message-----
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Pavel Stehule
> Sent: Dienstag, 13. Dezember 2011 15:43
> To: Aleksej Trofimov
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Postgres array parser
>
> Hello
>
> 2011/12/13 Aleksej Trofimov :
> > We have tried foreach syntax, but we have noticed performance degradation:
> > Function with for: 203ms
> > Function with foreach: ~250ms
> >
> > there is the functions code:
> > CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[], in_input_nr numeric)
> > RETURNS numeric AS
> > $BODY$
> > declare i numeric[];
> > BEGIN
> >   FOREACH i SLICE 1 IN ARRAY in_inputs
> >   LOOP
> >        if i[1] = in_input_nr then
> >           return i[2];
> >        end if;
> >   END LOOP;
> >
> >   return null;
> > END;
> > $BODY$
> > LANGUAGE plpgsql VOLATILE
> > COST 100;
> >
> > CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[], in_input_nr numeric)
> > RETURNS numeric AS
> > $BODY$
> > declare
> >   size int;
> > BEGIN
> >   size = array_upper(in_inputs, 1);
> >   IF size IS NOT NULL THEN
> >
> >     FOR i IN 1 ..
size LOOP > > if in_inputs[i][1] = in_input_nr then > > return in_inputs[i][2]; > > end if; > > END LOOP; > > END IF; > > > > return null; > > END; > > $BODY$ > > LANGUAGE plpgsql VOLATILE > > COST 100; > > > > > > On 12/13/2011 04:02 PM, Pavel Stehule wrote: > >> > >> Hello > >> > >> do you know FOREACH IN ARRAY statement in 9.1 > >> > >> this significantly accelerate iteration over array > >> > >> > >> http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-foreach- > in-array/ > >> > >> > >> > >> 2011/12/13 Aleksej Trofimov: > >>> > >>> Hello, I wanted to ask according such a problem which we had faced > with. > >>> We are widely using postgres arrays like key->value array by doing > like > >>> this: > >>> > >>> {{1,5},{2,6},{3,7}} > >>> > >>> where 1,2,3 are keys, and 5,6,7 are values. In our pgSql functions > we are > >>> using self written array_input(array::numeric[], key::numeric) > function > >>> which makes a loop on whole array and searches for key like > >>> FOR i IN 1 .. size LOOP > >>> if array[i][1] = key then > >>> return array[i][2]; > >>> end if; > >>> END LOOP; > >>> > >>> But this was a good solution until our arrays and database had > grown. So > >>> now > >>> FOR loop takes a lot of time to find value of an array. > >>> > >>> And my question is, how this problem of performance could be > solved? We > >>> had > >>> tried pgperl for string parsing, but it takes much more time than > our > >>> current solution. Also we are thinking about self-written C++ > function, > >>> may > >>> be someone had implemented this algorithm before? 
> >>> > >> you can use indexes or you can use hstore > >> > >> Regards > >> > >> Pavel Stehule > >> > >>> -- > >>> Best regards > >>> > >>> Aleksej Trofimov > >>> > >>> > >>> -- > >>> Sent via pgsql-performance mailing list > >>> (pgsql-performance@postgresql.org) > >>> To make changes to your subscription: > >>> http://www.postgresql.org/mailpref/pgsql-performance > > > > > > It is strange - on my comp FOREACH is about 2x faster > > postgres=# select inp
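Since the wished-for unnest(..., SLICE=1) does not exist, a similar one-statement lookup can be had with generate_subscripts() (available since 8.4); a sketch of input_value along those lines:

```sql
-- Iterate over the first dimension's subscripts instead of slicing:
-- $1 is the {{key,value},...} array, $2 the key to look up.
CREATE OR REPLACE FUNCTION input_value_gs(in_inputs numeric[], in_input_nr numeric)
RETURNS numeric AS
$BODY$
   SELECT $1[i][2]
   FROM generate_subscripts($1, 1) AS i
   WHERE $1[i][1] = $2
   LIMIT 1;
$BODY$
LANGUAGE sql IMMUTABLE;
```

Whether this beats the plpgsql FOR/FOREACH loops would have to be measured on the same data as the timings quoted above.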
Re: [PERFORM] Postgres for a "data warehouse", 5-10 TB
Hi,

> (see attachment)

Under high concurrency you may expect that your data is already in. In such a case you'd better do nothing at all:

    begin

      select dat=a_dat from t where id=a_id into test;

      if test is null then
        begin
          insert into t (id, dat) values (a_id, a_dat);
        exception when unique_violation then
          update t set dat = a_dat where id = a_id and dat <> a_dat;
          return 0;
        end;
      elsif not test then
        update t set dat = a_dat where id = a_id;
        return 0;
      end if;

      return 1;

best regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Robert Klemme
Sent: Tue 9/13/2011 6:34
To: Marti Raudsepp
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgres for a "data warehouse", 5-10 TB

On Tue, Sep 13, 2011 at 5:13 PM, Marti Raudsepp wrote:
> On Tue, Sep 13, 2011 at 00:26, Robert Klemme wrote:
>> In the case of PG this particular example will work:
>> 1. TX inserts new PK row
>> 2. TX tries to insert same PK row => blocks
>> 1. TX commits
>> 2. TX fails with PK violation
>> 2. TX does the update (if the error is caught)
>
> That goes against the point I was making in my earlier comment. In
> order to implement this error-catching logic, you'll have to allocate
> a new subtransaction (transaction ID) for EVERY ROW you insert.

I don't think so. You only need to catch the error (see attachment). Or does this create a subtransaction?

> If you're going to be loading billions of rows this way, you will invoke
> the wrath of the "vacuum freeze" process, which will seq-scan all
> older tables and re-write every row that it hasn't touched yet. You'll
> survive it if your database is a few GB in size, but in the terabyte
> land that's unacceptable. Transaction IDs are a scarce resource there.

Certainly. But it's not needed as far as I can see.

> In addition, such blocking will limit the parallelism you will get
> from multiple inserters.

Yes, I mentioned the speed issue.
But regardless of the solution for MySQL's "INSERT .. ON DUPLICATE KEY UPDATE" which Igor mentioned, you will have the locking problem anyhow if you plan to insert concurrently into the same table and be robust.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
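Marc's fragment above, completed into a callable function (a sketch; the table t(id, dat) and the return convention are taken from his snippet):

```sql
CREATE OR REPLACE FUNCTION merge_row(a_id bigint, a_dat text)
RETURNS int AS $$
DECLARE
    test boolean;
BEGIN
    -- NULL: no row yet (or dat IS NULL); true: same value; false: differs
    SELECT dat = a_dat INTO test FROM t WHERE id = a_id;

    IF test IS NULL THEN
        BEGIN
            INSERT INTO t (id, dat) VALUES (a_id, a_dat);
        EXCEPTION WHEN unique_violation THEN
            -- a concurrent session inserted the row first
            UPDATE t SET dat = a_dat WHERE id = a_id AND dat <> a_dat;
            RETURN 0;
        END;
    ELSIF NOT test THEN
        UPDATE t SET dat = a_dat WHERE id = a_id;
        RETURN 0;
    END IF;

    RETURN 1;
END;
$$ LANGUAGE plpgsql;
```

The point of the pre-check is to pay the subtransaction cost of the EXCEPTION block only when an insert is actually attempted.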
Re: [PERFORM] Query improvement
> On Mon, May 2, 2011 at 10:54 PM, Mark wrote:
> > but the result has been worse than before. By the way is there a possibility
> > to create a better query with the same effect?
> > I have tried more queries, but this one has got the best performance yet.
>
> Well, this seems to be the worst part:
>
> (SELECT page_id FROM mediawiki.page WHERE page_id IN
>     (SELECT page_id FROM mediawiki.page
>      WHERE (titlevector @@ (to_tsquery('fotbal'))))
>  OR page_id IN
>     (SELECT p.page_id from mediawiki.page p, mediawiki.revision r,
>         (SELECT old_id FROM mediawiki.pagecontent
>          WHERE (textvector @@ (to_tsquery('fotbal')))) ss
>      WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))

'OR' statements often generate complicated plans. You should try to rewrite your query with a UNION clause. Using explicit joins may also help the planner:

    SELECT page_id
    FROM mediawiki.page
    WHERE (titlevector @@ (to_tsquery('fotbal')))

    UNION

    SELECT p.page_id
    FROM mediawiki.page p
    JOIN mediawiki.revision r on (p.page_id=r.rev_page)
    JOIN mediawiki.pagecontent ss on (r.rev_id=ss.old_id)
    WHERE (ss.textvector @@ (to_tsquery('fotbal')))

HTH,

Marc Mamin
Re: [PERFORM] UNION and bad performance
Hello,

UNION will remove all duplicates, so the result additionally needs to be sorted. Anyway, for performance issues, you should always start the investigation with EXPLAIN ANALYZE.

regards,

Marc Mamin

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Olivier Pala
Sent: Donnerstag, 9. Dezember 2010 11:52
To: pgsql-performance@postgresql.org
Cc: Olivier Pala
Subject: [PERFORM] UNION and bad performance

Hi,

I have a performance trouble with a UNION query.

First I have this view:

    SELECT a,b,c FROM table1, table2 WHERE jointure AND condition

Result: 150 000 records in 1~2 s

Then, I add a UNION into the same view:

    SELECT a,b,c FROM table1, table2 WHERE jointure AND condition
    UNION
    SELECT a,b,c FROM table3

Result: 150 200 records in 6~7 s

Why do I have bad performance for only 200 added records?

Thanks

SGBD: Postgres 8.3 and 8.4
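If duplicates between the two branches are acceptable (or impossible by construction), UNION ALL avoids the sort/deduplication step entirely; a sketch using the placeholders from the post above:

```sql
-- UNION ALL simply concatenates the two result sets,
-- skipping the duplicate-elimination that plain UNION performs:
SELECT a, b, c FROM table1, table2 WHERE jointure AND condition
UNION ALL
SELECT a, b, c FROM table3;
```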
Re: [PERFORM] hashed subplan 5000x slower than two sequential operations
Another point: would a conditional index help ?

    on articles (context_key) where indexed

regards,

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Marc Mamin
Sent: Wed 12/8/2010 9:06
To: Shrirang Chitnis; Bryce Nesbitt; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

Hello,

are the tables freshly analyzed, with a sufficient default_statistics_target ?

You may try to get a better plan by rewriting the query as a UNION to get rid of the OR clause. Something like (not tested):

    SELECT contexts.context_key
    FROM contexts
    JOIN articles ON (articles.context_key=contexts.context_key)
    WHERE contexts.parent_key = 392210
      AND articles.indexed

    UNION

    SELECT context_key
    FROM (
       SELECT contexts.context_key
       FROM contexts
       JOIN collection_data ON (contexts.context_key = collection_data.context_key)
       WHERE collection_data.collection_context_key = 392210
    ) foo
    JOIN articles ON (foo.context_key=articles.context_key)
    WHERE articles.indexed;

I've had one similar problem where there was no way for the planner to notice that the query would systematically return very few rows. Here, my last resort was to disable some planner methods within the given transaction.

regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Shrirang Chitnis
Sent: Wed 12/8/2010 8:05
To: Bryce Nesbitt; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

Bryce,

The two queries are different:

You are looking for contexts.context_key in the first query

    WHERE (contexts.parent_key = 392210
        OR contexts.context_key IN
           (SELECT collection_data.context_key
            FROM collection_data
            WHERE collection_data.collection_context_key = 392210)

but the second query has contexts.parent_key

    WHERE (contexts.parent_key = 392210
        OR contexts.parent_key IN (392210,392210,395073,1304250))

Is contexts.context_key an indexed field?
contexts.parent_key certainly seems to be. HTH, Shrirang Chitnis Sr. Manager, Applications Development HOV Services Office: (866) 808-0935 Ext: 39210 shrirang.chit...@hovservices.com www.hovservices.com The information contained in this message, including any attachments, is attorney privileged and/or confidential information intended only for the use of the individual or entity named as addressee. The review, dissemination, distribution or copying of this communication by or to anyone other than the intended addressee is strictly prohibited. If you have received this communication in error, please immediately notify the sender by replying to the message and destroy all copies of the original message. -Original Message- From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Bryce Nesbitt Sent: Thursday, December 09, 2010 12:24 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] hashed subplan 5000x slower than two sequential operations Can you help me understand how to optimize the following. 
There's a subplan which in this case returns 3 rows, but it is really expensive:

    =# explain analyze
    SELECT contexts.context_key
    FROM contexts
    JOIN articles ON (articles.context_key=contexts.context_key)
    WHERE (contexts.parent_key = 392210
        OR contexts.context_key IN
           (SELECT collection_data.context_key
            FROM collection_data
            WHERE collection_data.collection_context_key = 392210)
    )
    AND articles.indexed;

                                  QUERY PLAN
    Hash Join  (cost=83054.41..443755.45 rows=261077 width=4) (actual time=4362.143..6002.808 rows=28 loops=1)
      Hash Cond: (articles.context_key = contexts.context_key)
      ->  Seq Scan on articles  (cost=0.00..345661.91 rows=522136 width=4) (actual time=0.558..3953.002 rows=517356 loops=1)
            Filter: indexed
      ->  Hash  (cost=69921.25..69921.25 rows=800493 width=4) (actual time=829.501..829.501 rows=31 loops=1)
            ->  Seq Scan on contexts  (cost=14.31..69921.25 rows=800493 width=4) (actual time=1.641..829.339 rows=31 loops=1)
                  Filter: ((parent_key = 392210) OR (hashed subplan))
                  SubPlan
                    ->  Index Scan using collection_data_context_key_index on collection_data  (cost=0.00..14.30 rows=6 width=4) (actual time=0.018..0.023 rows=3 loops=1)
                          Index Cond: (collection_context_key = 392210)
    Total runtime: 6002.976 ms
    (11 rows)
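The conditional index suggested in the follow-up would be restricted to exactly the rows this query filters on (articles.indexed), which could replace the large seq scan on articles; a sketch (index name is made up):

```sql
-- Partial index: only rows with indexed = true are stored,
-- matching the "AND articles.indexed" predicate of the query.
CREATE INDEX articles_indexed_context_key
    ON articles (context_key)
    WHERE indexed;
```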
Re: [PERFORM] hashed subplan 5000x slower than two sequential operations
Hello, are the table freshly analyzed, with a sufficient default_statistics_target ? You may try to get a better plan while rewriting the query as an UNION to get rid of the OR clause. Something like (not tested): SELECT contexts.context_key FROM contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE contexts.parent_key = 392210 AND articles.indexed UNION SELECT context_key FROM ( SELECT contexts.context_key FROM contexts JOIN collection_data ON ( contexts.context_key = collection_data .context_key) WHERE collection_data.collection_context_key = 392210) ) foo JOIN articles ON (foo.context_key=contexts.context_key) WHERE articles.indexed ; I've had one similar problem where there was no way for the planner to notice that the query would systematically return very few rows. Here, my last resort was to disable some planner methods within the given transaction. regards, Marc Mamin -Ursprüngliche Nachricht- Von: pgsql-performance-ow...@postgresql.org im Auftrag von Shrirang Chitnis Gesendet: Mi 12/8/2010 8:05 An: Bryce Nesbitt; pgsql-performance@postgresql.org Betreff: Re: [PERFORM] hashed subplan 5000x slower than two sequential operations Bryce, The two queries are different: You are looking for contexts.context_key in first query WHERE (contexts.parent_key = 392210 OR contexts.context_key IN (SELECT collection_data.context_key FROM collection_data WHERE collection_data.collection_context_key = 392210) but second query has context.parent_key WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN (392210,392210,395073,1304250)) Is the contexts.context_key an indexed field? contexts.parent_key certainly seems to be. HTH, Shrirang Chitnis Sr. 
Manager, Applications Development HOV Services Office: (866) 808-0935 Ext: 39210 shrirang.chit...@hovservices.com www.hovservices.com The information contained in this message, including any attachments, is attorney privileged and/or confidential information intended only for the use of the individual or entity named as addressee. The review, dissemination, distribution or copying of this communication by or to anyone other than the intended addressee is strictly prohibited. If you have received this communication in error, please immediately notify the sender by replying to the message and destroy all copies of the original message. -Original Message- From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Bryce Nesbitt Sent: Thursday, December 09, 2010 12:24 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] hashed subplan 5000x slower than two sequential operations Can you help me understand how to optimize the following. There's a subplan which in this case returns 3 rows, but it is really expensive: = explain analyze SELECT contexts.context_key FROM contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE (contexts.parent_key = 392210 OR contexts.context_key IN (SELECT collection_data.context_key FROM collection_data WHERE collection_data.collection_context_key = 392210) ) AND articles.indexed ; QUERY PLAN Hash Join (cost=83054.41..443755.45 rows=261077 width=4) (actual time=4362.143..6002.808 rows=28 loops=1) Hash Cond: (articles.context_key = contexts.context_key) -> Seq Scan on articles (cost=0.00..345661.91 rows=522136 width=4) (actual time=0.558..3953.002 rows=517356 loops=1) Filter: indexed -> Hash (cost=69921.25..69921.25 rows=800493 width=4) (actual time=829.501..829.501 rows=31 loops=1) -> Seq Scan on contexts (cost=14.31..69921.25 rows=800493 width=4) (actual time=1.641..829.339 rows=31 loops=1) Filter: ((parent_key = 392210) OR (hashed subplan)) SubPlan -> Index Scan using 
collection_data_context_key_index on collection_data (cost=0.00..14.30 rows=6 width=4) (actual time=0.018..0.023 rows=3 loops=1) Index Cond: (collection_context_key = 392210) Total runtime: 6002.976 ms (11 rows) = explain analyze SELECT contexts.context_key FROM contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN (392210,392210,395073,1304250)) AND articles.indexed ;
Re: [PERFORM] anti-join chosen even when slower than old plan
Hello,

Just a short thought:

Is it imaginable to compare the prognoses of the plans with the actual results and somehow log the worst cases ?

a) to help the DBA locate bad statistics and queries
b) as an additional information source for the planner

This could possibly affect the parameters of your formula on the fly.

best regards,

Marc Mamin
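Part (a) of this idea is approximated by the auto_explain contrib module (available since 8.4): it logs the actual plan of slow statements, including real row counts next to the estimates when log_analyze is enabled, though it does not compare the two automatically. A sketch of the postgresql.conf settings (thresholds are made up):

```
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'   # log plans of statements slower than this
auto_explain.log_analyze = on          # include actual row counts and timings
```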
Re: [PERFORM] CREATE INDEX as bottleneck
No, CONCURRENTLY is there to improve table availability during index creation, but it degrades performance.

best regards,

Marc Mamin

-----Original Message-----
From: Alex Hunsaker [mailto:bada...@gmail.com]
Sent: Donnerstag, 11. November 2010 19:55
To: Marc Mamin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] CREATE INDEX as bottleneck

On Thu, Nov 11, 2010 at 06:41, Marc Mamin wrote:
> There are a few places in our data flow where we have to wait for index
> creation before being able to distribute the process on multiple threads
> again.

Would CREATE INDEX CONCURRENTLY help here?
[PERFORM] CREATE INDEX as bottleneck
Hello,

In the last years, we have successfully managed to cope with our data growth using partitioning and splitting large aggregation tasks over multiple threads. The partitioning is done logically by our application server, thus avoiding trigger overhead.

There are a few places in our data flow where we have to wait for index creation before being able to distribute the process on multiple threads again. With the expected growth, CREATE INDEX will probably become a severe bottleneck for us.

Is there any chance to see major improvement on it in the near future ? I guess the question is naive, but why can't postgres use multiple threads for a large sort operation ?

best regards,

Marc Mamin
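Until the sort itself can be parallelized, the main per-session knob for index build speed is the memory available to the sort; a sketch (names and the 2GB value are hypothetical):

```sql
-- More sort memory means fewer on-disk merge passes during the build:
SET maintenance_work_mem = '2GB';
CREATE INDEX my_idx ON my_table (my_col);
RESET maintenance_work_mem;
```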
Re: [PERFORM] optimizing query with multiple aggregates
Hello,

I didn't try it, but the following should be slightly faster:

    COUNT( CASE WHEN field >= x AND field < y THEN true END )

instead of

    SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END )

HTH,

Marc Mamin

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Nikolas Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug Cole
Cc: pgsql-performance
Subject: Re: [PERFORM] optimizing query with multiple aggregates

So you've got a query like:

    SELECT SUM(CASE WHEN field >= 0  AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
           SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
           SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as twentyToThirty,
    ...
    FROM bigtable

My guess is this is forcing a whole bunch of if checks and you're getting cpu bound. Could you try something like:

    SELECT SUM(CASE WHEN field >= 0  AND field < 10 THEN count ELSE 0 END) as zeroToTen,
           SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
           SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as twentyToThirty,
    ...
    FROM (SELECT field, count(*) AS count FROM bigtable GROUP BY field)

which will allow a hash aggregate? You'd do a hash aggregate on the whole table which should be quick and then you'd summarize your bins.

This all supposes that you don't want to just query postgres's column statistics.

On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole wrote:

On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
> > I have a reporting query that is taking nearly all of its time in aggregate
> > functions and I'm trying to figure out how to optimize it. The query takes
> > approximately 170ms when run with "select *", but when run with all the
> > aggregate functions the query takes 18 seconds. The slowness comes from our
> > attempt to find distribution data using selects of the form:
> >
> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> >
> > repeated across many different x,y values and fields to build out several
> > histograms of the data. The main culprit appears to be the CASE statement,
> > but I'm not sure what to use instead. I'm sure other people have had
> > similar queries and I was wondering what methods they used to build out data
> > like this?
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin

Unless I'm misunderstanding you, that would require breaking each bin into a separate sql statement and since I'm trying to calculate more than 100 bins between the different fields any improvement in the aggregate functions would be overwhelmed by the cost of the actual query, which is about 170ms.

Thanks,
Doug
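For evenly spaced bins, the built-in width_bucket() can replace the whole CASE cascade with a single GROUP BY; a sketch (table name from the thread, bounds and bin count made up):

```sql
-- 10 equal-width bins over [0, 100); values below 0 land in bucket 0,
-- values >= 100 in bucket 11, so out-of-range rows are still visible.
SELECT width_bucket(field, 0, 100, 10) AS bin,
       count(*)
FROM bigtable
GROUP BY bin
ORDER BY bin;
```

This computes all histogram bins in one pass and typically hash-aggregates, which addresses the CPU-bound CASE evaluation described above.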
Re: [PERFORM] Very specialised query
Hello Matthew,

Another idea:

Are your objects limited to some smaller ranges of your whole interval ? If yes, you may possibly reduce the ranges to search for by using an additional table with the min(start) / max(end) of each object...

Marc Mamin
Re: [PERFORM] Very specialised query
>> WHERE (l2.start BETWEEN l1.start AND l1.end
>>    OR
>>    l1.start BETWEEN l2.start AND l2.end
>>    )

> Yes, that's another way to calculate an overlap. However, it turns out to not be that fast.
> The problem is that OR there, which causes a bitmap index scan, as the leaf of a nested loop join,
> which can be rather slow.

Ok, then splitting these checks into 2 queries with UNION is better. But I often read that BETWEEN is faster than using 2 comparison operators. Here I guess that a combined index on (start, end) makes sense:

    .. WHERE l2.start BETWEEN l1.start AND l1.end ..
    UNION
    .. WHERE l1.start BETWEEN l2.start AND l2.end ..

The first clause being equivalent to

    AND l1.start <= l2.end
    AND l1.end   >= l2.start
    AND l1.start <= l2.start

I don't know how you have to deal with the limit conditions...

Marc Mamin
Re: [PERFORM] Very specialised query
Hello,

If your data are mostly static and you have a few main objects, maybe you can gain something by defining conditional (partial) indexes for those, plus one for the rest, and then slicing the query:

create index o_1x on X (start, end, id) where object_id = 1;
create index o_2x on X (start, end, id) where object_id = 2;
create index o_3x on X (start, end, id) where object_id = 3;
create index o_4x on X (start, end, id) where object_id = 4;
...
create index o_restx on X (start, end, id) where object_id not in (1,2,3,4, ...);

I'm not sure that putting everything in one index and using the BETWEEN clause as in my example is the best method, though.

Marc Mamin

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 1
AND (l2.start BETWEEN l1.start AND l1.end
     OR
     l1.start BETWEEN l2.start AND l2.end
    )
AND l2.start <> l2.end  -- if required
AND l1.id <> l2.id      -- if required
UNION ALL
... WHERE l1.objectid = 2 ...
UNION ALL
... WHERE l1.objectid not in (1,2,3,4, ...)
[PERFORM] temp_tablespaces and RAID
Hello,

To improve performance, I would like to try moving the temp_tablespaces locations off of our RAID system. Is that good practice?

Thanks,

Marc Mamin
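For reference, pointing temporary files at a separate spindle is just a matter of creating a tablespace on the other disk and listing it in temp_tablespaces. A sketch — the path and tablespace name are illustrative:

```sql
-- The directory must already exist and be owned by the postgres OS user.
CREATE TABLESPACE temp_space LOCATION '/mnt/scratch/pg_temp';

-- Per session, or set in postgresql.conf to apply to all sessions:
SET temp_tablespaces = 'temp_space';
```

If several locations are listed, temporary files are spread among them round-robin, which can itself help when sorts spill to disk.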
Re: [PERFORM] Delete performance again
Hi,

Maybe you can try this syntax. I'm not sure, but it may perform better:

delete from company_alias USING comprm
where company_alias.company_id = comprm.id

Cheers,

Marc
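The same delete is often written with an IN subquery; comparing the two plans shows which join strategy the planner picks for each form. A sketch, reusing the table and column names from the post above:

```sql
-- Caution: EXPLAIN ANALYZE actually executes the statement,
-- so wrap the test in a transaction and roll it back.
BEGIN;
EXPLAIN ANALYZE
DELETE FROM company_alias
USING comprm
WHERE company_alias.company_id = comprm.id;
ROLLBACK;

BEGIN;
EXPLAIN ANALYZE
DELETE FROM company_alias
WHERE company_id IN (SELECT id FROM comprm);
ROLLBACK;
```

Whichever plan shows a hash or merge join over the two tables (rather than a nested loop with a repeated subplan) is usually the faster one on large inputs.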
Re: [PERFORM] join tables vs. denormalization by trigger
Hello,

I had a similar issue and - after testing - decided to merge tables B and C into a single table. In my case the resulting table contains a large proportion of NULLs, which limits the size increase... You'll have to do some testing with your data to evaluate the performance gain.

Hope this helps,

Marc

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Walter Mauritz
Sent: Tuesday, September 04, 2007 8:53 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] join tables vs. denormalization by trigger

Hi,

I wonder about differences in performance between two scenarios:

Background:
Table A, ~50,000 records
Table B, ~3,000,000 records (~20 cols)
Table C, ~30,000,000 records (~10 cols)
a query every 3 sec. with limit 10
Table C depends on Table B which depends on Table A, int8 foreign key, btree index

* consider it a read-only scenario (load data only at night, with time for vacuum analyze daily)
* I'm required to show records from Table C, but also with some (~5 cols) info from Table B
* the where clause always contains the foreign key to Table A
* the where clause may contain further 1-10 search parameters

Scenario A) simply inner join Table B + C

Scenario B) with use of a trigger on insert/update I could push the required information from table B down to table C -> so I would only need to select from table C.

My questions:
1) From your experience... how much faster (approximately), in percent, do you regard Scenario B compared to A?
2) any other tips for such a read-only scenario

Thx for any attention :-)
Walter
---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...
Hello,

Thank you for all your comments and recommendations.

I'm aware that the conditions for this benchmark are not ideal, mostly due to the lack of time to prepare it. We will also need an additional benchmark on a less powerful - more realistic - server to better understand the scalability of our application.

Our application is based on Java and generates dynamic reports from log file content. Dynamic means here that a report will be calculated from the Postgres data the first time it is requested (it will then be cached). Java is used to drive the data preparation and to handle/generate the report requests. This is much more an OLAP system than an OLTP one, at least as far as our performance concerns go.

Data preparation:

1) Parsing the log files with heavy use of Perl (regular expressions) to generate CSV files. Prepared statements also maintain reference tables in the DB. Postgres performance is not an issue for this first step.

2) Loading the CSV files with COPY. As around 70% of the data to load goes into a single daily table, we don't allow concurrent jobs for this step. We have between a few and a few hundred files to load into a single table; they are processed one after the other. A primary key is always defined; for the case when the required indexes are already built and the new data are above a given size, we use a "shadow" table instead (without the indexes), build the indexes after the import, and then replace the live table with the shadow one. For example, we have a table of 13 GB + 11 GB of indexes (5 of them).

Performance:
a) Is there an "ideal" size to consider for our CSV files (100 x 10 MB, or better 1 x 1 GB)?
b) maintenance_work_mem: I'll use around 1 GB as recommended by Stefan.

3) Data aggregation. This is the heaviest part for Postgres. On our current system some queries need over one hour, with phases of around 100% CPU use alternating with times of heavy I/O load when temporary results are written to / read from disk (pgsql_tmp).
During the aggregation, other Postgres activity is low (at least it should be), as this takes place at night. Currently we have a locking mechanism to avoid having more than one such query running concurrently. This may be too strict for the benchmark server, but it better reflects our current hardware capabilities.

Performance: here we should favor a single huge transaction and assume a low probability of another transaction requiring large sort space. Considering this, is it reasonable to set work_mem to 3 GB? (I guess I should raise this parameter dynamically before running the aggregation queries.)

4) Queries (report generation). We have only a few requests which are not satisfactory while requiring large sort operations. The data are structured in different aggregation levels (minutes, hours, days) with logical time-based partitions in order to limit the data size to compute for a given report. Moreover, we can scale our infrastructure by using different or dedicated Postgres servers for different customers. Smaller customers may share the same instance, each of them having its own schema (the lock mechanism for large aggregations applies to a whole Postgres instance, not to a single customer). The benchmark will help us to plan such distribution.

During the benchmark, we will probably not have more than 50 non-idle connections simultaneously. It is a bit too early for us to fine-tune this part. The benchmark will mainly focus on steps 1 to 3.

During the benchmark, the DB will reach a size of about 400 GB, simulating 3 different customers, with the data quite equally split across 3 schemas.

I will post our configuration(s) later on.

Thanks again for all your valuable input.

Marc Mamin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
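Raising work_mem only in the session that runs the nightly aggregation avoids granting 3 GB to every connection. A sketch — the 3 GB figure is from the post; the surrounding statements are illustrative, and on some platforms/versions work_mem has a lower hard ceiling:

```sql
-- In the aggregation session only:
SET work_mem = '3GB';    -- equivalently: SET work_mem = 3145728;  (value in KB)

-- ... run the large aggregation queries here ...

RESET work_mem;          -- back to the server default for this session
```

Since work_mem is the budget per sort/hash operation (a single complex query may use it several times over), the per-session SET is also easier to reason about than a server-wide setting.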
[PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...
Postgres configuration for 64 CPUs, 128 GB RAM...

Hello,

We have the opportunity to benchmark our application on a large server. I have to prepare the Postgres configuration and I'd appreciate some comments on it, as I am not experienced with servers of such a scale. Moreover, the configuration should be fail-proof, as I won't be able to attend the tests.

Our application (Java + Perl) and Postgres will run on the same server, whereas the application activity is low when Postgres has large transactions to process. There is a large gap between our current production server (Linux, 4 GB RAM, 4 CPUs) and the benchmark server; one of the targets of this benchmark is to verify the scalability of our application. And you have no reason to be envious, as the server doesn't belong to us :-)

Thanks for your comments,

Marc Mamin

Postgres version: 8.2.1

Server specifications:
Sun SPARC Enterprise M8000 Server: http://www.sun.com/servers/highend/m8000/specs.xml
File system: http://en.wikipedia.org/wiki/ZFS

Planned configuration:

# we don't expect more than 150 parallel connections,
# but I suspect a leak in our application that leaves some idle connections open
max_connections = 2000
ssl = off

# maximum allowed
shared_buffers = 262143

# on our current best production server with 4 GB RAM (not dedicated to Postgres), work_mem is set to 600 MB;
# this limitation is probably the bottleneck for our application, as the files in pgsql_tmp grow up to 15 GB
# during large aggregations (we have a locking mechanism to avoid parallel processing of such transactions)
work_mem = 31457280 # (30 GB)

# index creation time is also an issue for us; the process locks other large processes too.
# our largest table so far is 13 GB + 11 GB of indexes
maintenance_work_mem = 31457280 # (30 GB)

# more than the max number of tables + indexes expected during the benchmark
max_fsm_relations = 10
max_fsm_pages = 180

# don't know if I should modify this.
# seems to be sufficient on our production servers
max_stack_depth = 2MB

# vacuum will be done by hand between each test session
autovacuum = off

# required to analyse the benchmark
log_min_duration_statement = 1000

max_prepared_transactions = 100

# seems to be required to drop schemas/roles containing a large number of objects
max_locks_per_transaction = 128

# I use the defaults for the bgwriter as I couldn't find recommendations on those
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

# WAL
fsync = on
# use default
#wal_sync_method

# we are using 32 on our production system
wal_buffers = 64

# we didn't do any testing with this parameter until now, but this shouldn't be a relevant
# point as our performance focus is on large transactions
commit_delay = 0

# CHECKPOINT
# xlog will be on a separate disk
checkpoint_segments = 256
checkpoint_timeout = 5min
[PERFORM] tuning a function to insert/retrieve values from a reference table
Hello,

I have a simple id/value table, and a function that returns the id of a given value, inserting the latter if not yet present. The probability that a value already exists in the reference table is very high.

Different DB users may have their own reference table with different content, but as the table definition is identical, I've defined a public function to maintain these tables.

Can I optimize this function by:

a) removing the EXCEPTION clause (is there an underlying lock that prevents concurrent inserts?)

b) declaring the function IMMUTABLE? Although it may insert a new row, the returned id is invariant for a given user. (I don't really understand the holdability of immutable functions; are the results cached only for the lifetime of a prepared statement, or can they be shared by different sessions?)

Thanks,

Marc

-- Table definition:

create table ref_table (
  id serial NOT NULL,
  v varchar NOT NULL,
  constraint ref_table_pk primary key (id)
) without oids;

create unique index ref_table_uk on ref_table(v);

-- Function:

CREATE OR REPLACE FUNCTION public.get_or_insert_value("varchar") RETURNS INT AS
$BODY$
DECLARE
  id_value INT;
BEGIN
  SELECT INTO id_value id FROM ref_table WHERE v = $1;
  IF FOUND THEN
    RETURN id_value;
  ELSE -- new value to be inserted
    DECLARE
      rec record;
    BEGIN
      FOR rec IN INSERT INTO ref_table (v) VALUES ($1) RETURNING id LOOP
        RETURN rec.id;
      END LOOP;
    EXCEPTION -- concurrent access?
      WHEN unique_violation THEN
        RETURN (SELECT id FROM ref_table WHERE v = $1);
    END;
  END IF;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
[PERFORM] copy from performance on large tables with indexes
Hello,

Postgres: 8.2
OS: Linux, 4 CPUs, 4 GB RAM, RAID 1, 32-bit system
work_mem: 600 MB

I have some tables which may become quite large (currently up to 6 GB). I initially fill them using COPY FROM (files).

The import is fast enough, as I only have a primary key on the table: about 18 minutes (over 300 MB/minute).

Then I need 5 additional indexes on it. Creation time: 30 minutes.

Subsequently I compute some aggregations, which need 4 hours and 30 minutes of additional time.

And now the problem: if I get additional data for the table, the import becomes much slower due to the indexes (about 30 times slower!). The performance degradation is probably due to the fact that all indexes are too large to be kept in memory. Moreover, I guess that the index fill factors are too high (90%). During this second import, I have about 20% iowait time.

The usual solution is to drop the indexes before the second import and rebuild them afterwards, but I feel uncomfortable doing this, as I don't know how the system will react if some SELECT statements occur while the indexes are missing. I can hardly avoid this.

So my idea for the second import process:

1) make a copy of the table:
   create table B as select * from A;
   alter table B add constraint B_pk primary key (id);

2) import the new data into table B:
   copy B from file;

3) create the required indexes on B:
   create index Bix_1 on B ..
   create index Bix_2 on B ..
   create index Bix_3 on B ..
   create index Bix_4 on B ..
   create index Bix_5 on B ..

4) replace table A with table B:
   alter table A rename to A_trash;
   alter table B rename to A;
   drop table A_trash;
   (and rename the indexes to get the original state)

This seems to work, but with side effects. The only objects that refer to the tables are functions and indexes. If a function is called within the same session before and after the table renaming, the second attempt fails (or uses the table A_trash if it still exists). So I should close the session and start a new one before further processing.
Errors in other live sessions are acceptable, but maybe you know a way to avoid them?

And now a few questions :-)

- Do you see any issue that prevents this workflow from working?
- Is there any other side effect to take care of?
- What is the maximum acceptable value for the work_mem parameter for my configuration (see the complete configuration below)?
- Has anybody built a similar workflow?
- Could this be a feature request to extend the capabilities of COPY FROM?

Thanks for your time and attention,

Marc Mamin
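One way to narrow the window in which other sessions see a missing table is to do the swap of step 4 in a single transaction, since ALTER TABLE ... RENAME is transactional in PostgreSQL. A sketch, using the table names from the post:

```sql
BEGIN;
ALTER TABLE A RENAME TO A_trash;
ALTER TABLE B RENAME TO A;
COMMIT;  -- concurrent statements blocked on the lock now see the new A

-- Drop the old data separately, once nothing references it anymore:
DROP TABLE A_trash;
```

This does not cure the in-session plan-caching issue the post describes (a plpgsql function may still hold a reference to the old table's OID), but sessions that start a statement after the commit will consistently see the new table.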
Re: [PERFORM] Planner doing seqscan before indexed join
You may try to change the planner's opinion using subqueries. Something like:

select *
from eventactivity,
     (select *
      from keyword_incidents, eventmain, eventgeo
      where eventmain.incidentid = keyword_incidents.incidentid
        and eventgeo.incidentid = keyword_incidents.incidentid
        and (recordtext like '%JOSE CHAVEZ%')
     ) foo
where eventactivity.incidentid = foo.incidentid
order by foo.entrydate
limit 1;

HTH,

Marc

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dan Harris
Sent: Thursday, March 29, 2007 4:22 AM
To: PostgreSQL Performance
Subject: [PERFORM] Planner doing seqscan before indexed join

8.0.3 - Linux 2.6.18.. Freshly vacuumed and analyzed

This database has been humming along fine for a while now, but I've got one of those sticky queries that is taking much too long to finish. After some digging, I've found that the planner is choosing to apply a necessary seq scan to the table. Unfortunately, it's scanning the whole table, when it seems that it could have joined it to a smaller table first and reduced the amount of rows it would have to scan dramatically (70 million to about 5,000).

The table "eventactivity" has about 70 million rows in it, with an index on "incidentid". The table "keyword_incidents" is a temporary table and has incidentid as its primary key. It contains 5125 rows. eventmain and eventgeo both have 2.1 million.

My hope is that I can convince the planner to do the join to keyword_incidents *first* and then do the seq scan for the LIKE condition. Instead, it seems that it's seqscanning the whole 70 million rows first and then doing the join, which takes a lot longer than I'd like to wait for it. Or, maybe I'm misreading the explain output?
Thanks again

-Dan

-

Here's the query:

explain analyze
select *
from keyword_incidents, eventactivity, eventmain, eventgeo
where eventmain.incidentid = keyword_incidents.incidentid
  and eventgeo.incidentid = keyword_incidents.incidentid
  and eventactivity.incidentid = keyword_incidents.incidentid
  and ( recordtext like '%JOSE CHAVEZ%' )
order by eventmain.entrydate limit 1;
---
 Limit (cost=2388918.07..2388918.08 rows=1 width=455) (actual time=81771.186..81771.292 rows=26 loops=1)
   -> Sort (cost=2388918.07..2388918.08 rows=1 width=455) (actual time=81771.180..81771.215 rows=26 loops=1)
         Sort Key: eventmain.entrydate
         -> Nested Loop (cost=0.00..2388918.06 rows=1 width=455) (actual time=357.389..81770.982 rows=26 loops=1)
               -> Nested Loop (cost=0.00..2388913.27 rows=1 width=230) (actual time=357.292..81767.385 rows=26 loops=1)
                     -> Nested Loop (cost=0.00..2388909.33 rows=1 width=122) (actual time=357.226..81764.501 rows=26 loops=1)
                           -> Seq Scan on eventactivity (cost=0.00..2388874.46 rows=7 width=84) (actual time=357.147..81762.582 rows=27 loops=1)
                                 Filter: ((recordtext)::text ~~ '%JOSE CHAVEZ%'::text)
                           -> Index Scan using keyword_incidentid_pkey on keyword_incidents (cost=0.00..4.97 rows=1 width=38) (actual time=0.034..0.036 rows=1 loops=27)
                                 Index Cond: (("outer".incidentid)::text = (keyword_incidents.incidentid)::text)
                     -> Index Scan using eventgeo_incidentid_idx on eventgeo (cost=0.00..3.93 rows=1 width=108) (actual time=0.076..0.081 rows=1 loops=26)
                           Index Cond: (("outer".incidentid)::text = (eventgeo.incidentid)::text)
               -> Index Scan using eventmain_incidentid_idx on eventmain (cost=0.00..4.78 rows=1 width=225) (actual time=0.069..0.075 rows=1 loops=26)
                     Index Cond: (("outer".incidentid)::text = (eventmain.incidentid)::text)
 Total runtime: 81771.529 ms
(15 rows)
Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello Peter,

If you are dealing with timed data or similar, you may consider partitioning your table(s).

In order to deal with large data, I've built a "logical" partition system, whereby the target partition is defined by the date of my data (the date is part of the filenames that I import...).

Instead of using the Postgres partitioning framework, I keep the table boundaries within a reference table. Then I've built a function that takes the different query parameters as arguments (column list, where clause...). This function retrieves the list of tables to query from my reference table and builds the final query, binding the different subqueries from each partition with "UNION ALL". It also requires an additional reference table that describes the table columns (data type, behaviour, e.g. groupable, summable...).

This allowed me to replace many "delete" statements with "drop table" statements, which is probably the main advantage of the solution. The biggest issue was the implementation time ;-) but I'm really happy with the resulting performance.

HTH,

Marc

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Schuller
Sent: Friday, March 30, 2007 7:17 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

Hello,

I am looking to use PostgreSQL for storing some very simple flat data, mostly in a single table. The amount of data will be in the hundreds of gigabytes range. Each row is on the order of 100-300 bytes in size; in other words, small enough that I am expecting disk I/O to be seek bound (even if PostgreSQL reads a full pg page at a time, since a page is significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion performance, and the performance of trivial SELECT queries whose WHERE clause tests equality on one of the columns.
Other than absolute performance, an important goal is to be able to scale fairly linearly with the number of underlying disk drives. We are fully willing to take a disk seek per item selected, as long as it scales. To this end I have been doing some benchmarking to see whether the plan is going to be feasable. On a 12 disk hardware stripe, insertion performance does scale somewhat with concurrent inserters. However, I am seeing surprising effects with SELECT:s: a single selecter generates the same amount of disk activity as two concurrent selecters (I was easily expecting about twice). The query is simple: SELECT * FROM test WHERE value = 'xxx' LIMIT 1000; No ordering, no joins, no nothing. Selecting concurrently with two different values of 'xxx' yields the same amount of disk activity (never any significant CPU activity). Note that the total amount of data is too large to fit in RAM (> 500 million rows), and the number of distinct values in the value column is 1. The column in the WHERE clause is indexed. So my first question is - why am I not seeing this scaling? The absolute amount of disk activity with a single selecter is consistent with what I would expect from a SINGLE disk, which is completely expected since I never thought PostgreSQL would introduce disk I/O concurrency on its own. But this means that adding additional readers doing random-access reads *should* scale very well with 12 underlying disks in a stripe. (Note that I have seen fairly similar results on other RAID variants too, including software RAID5 (yes yes I know), in addition to the hardware stripe.) These tests have been done Linux 2.6.19.3 and PostgreSQL 8.1. Secondly, I am seeing a query plan switch after a certain threshold. 
Observe: perftest=# explain select * from test where val='7433' limit 1000; QUERY PLAN - Limit (cost=0.00..4016.50 rows=1000 width=143) -> Index Scan using test_val_ix on test (cost=0.00..206620.88 rows=51443 width=143) Index Cond: ((val)::text = '7433'::text) (3 rows) Now increasing to a limit of 1: perftest=# explain select * from test where val='7433' limit 1; QUERY PLAN -- Limit (cost=360.05..38393.36 rows=1 width=143) -> Bitmap Heap Scan on test (cost=360.05..196014.82 rows=51443 width=143) Recheck Cond: ((val)::text = '7433'::text) -> Bitmap Index Scan on test_val_ix (cost=0.00..360.05 rows=51443 width=0) Index Cond: ((val)::text = '7433'::text) (5 rows) The interesting part is that the latter query is entirely CPU bound (no disk I/O at all) for an extended period of time before even beginning to read data from disk. And when it *does* start performing disk I/O, the performance is about the same as for the other case. In other words, the change in query plan se
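Marc's reference-table approach above can be sketched in a few lines — a sketch only; the catalog table, column names, and date range are illustrative, not from his actual system:

```sql
-- One row per physical daily table, with its date boundaries:
CREATE TABLE partition_catalog (
  table_name text PRIMARY KEY,
  day_from   date NOT NULL,
  day_to     date NOT NULL
);

-- The query-building function selects only the partitions that
-- overlap the requested range, then glues them with UNION ALL:
SELECT table_name
FROM partition_catalog
WHERE day_from <= DATE '2007-03-31'
  AND day_to   >= DATE '2007-03-01';
```

Retiring old data then becomes `DROP TABLE` plus one `DELETE` from partition_catalog, instead of a bulk `DELETE` that leaves dead tuples behind.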
Re: [PERFORM] Looking for tips
Hi,

I have a similar application, but instead of adding new items to the db one at a time, I retrieve new IDs from a sequence (actually only every 10'000 times) and write a csv file from perl. When finished, I load all new records in one run with COPY.

hth,

Marc Mamin

From: [EMAIL PROTECTED] on behalf of Oliver Crosby
Sent: Wed 7/20/2005 3:50 AM
To: PFC
Cc: Sven Willenberger; Dawid Kuroczko; Kevin Grittner; [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Looking for tips

Sorry for the lack of specifics...

We have a file generated as a list of events, one per line. Suppose lines 1,2,3,5,7,11,etc were related, then the last one would specify that it's the last event. Gradually this gets assembled by a perl script and when the last event is encountered, it gets inserted into the db. For a given table, let's say it's of the form (a,b,c) where 'a' is a pkey, 'b' is indexed, and 'c' is other related information. The most common 'b' values are cached locally with the perl script to save us having to query the db. So what we end up having is:

if 'b' exists in cache, use cached 'a' value and continue
else if 'b' exists in the db, use the associated 'a' value and continue
else add a new line with 'b', return the new 'a' and continue

The local cache was a huge time saver with mysql. I've tried making a plpgsql function that handles everything in one step on the db side, but it didn't show any improvement. Time permitting, I'll try some new approaches with changing the scripts and queries, though right now I was just hoping to tune postgresql.conf to work better with the hardware available.

Thanks to everyone for your help. Very much appreciated.
[PERFORM] Query limitations (size, number of UNIONs ...)
Hello,

I've split my data into daily tables to keep them at an acceptable size. Now I have quite complex queries which may become very long if I need to query a large number of daily tables.

I've just made a first test, which resulted in a query 15 KB big and containing 63 UNIONs. The query plan in PgAdmin is about 100 KB big, with 800 lines :-)

The performance is not that bad, but I'm wondering if there are some Postgres limitations I should take care of with this strategy.

Thanks,

Marc
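An alternative to generating the UNIONs by hand is table inheritance with constraint exclusion (available from PostgreSQL 8.1), which lets the planner skip daily tables that cannot match the WHERE clause. A minimal sketch with illustrative names:

```sql
CREATE TABLE events (day date NOT NULL, payload text);

-- One child table per day, each with a CHECK constraint on its range:
CREATE TABLE events_20050101 (CHECK (day = DATE '2005-01-01')) INHERITS (events);
CREATE TABLE events_20050102 (CHECK (day = DATE '2005-01-02')) INHERITS (events);

SET constraint_exclusion = on;

-- Scans only events_20050102; the other children are excluded by their CHECKs.
SELECT count(*) FROM events WHERE day = DATE '2005-01-02';
```

Queries then stay short regardless of the number of daily tables, since the UNION over the children is implicit in the inheritance scan.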
[PERFORM] TIP 9: the planner will ignore... & datatypes
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match

But INT2, INT4, INT8 and "SERIAL" are considered to be a single datatype for this purpose, right?

Thanks,

Marc
Re: [PERFORM] Optimising queries involving unions
Hi,

I'm using a workaround for this kind of issue. Consider:

select A
from (select B from T1 where C
      union
      select B from T2 where C
      union
      select B from T3 where C
     ) foo
where D

versus:

select A from foo where C and D

(foo here being a view over the plain union; A, B, C, D being everything you want, and C and D may also include GROUP BY, ORDER BY, ...)

The first version will be handled correctly by the optimiser, whereas in the second version, Postgres will first build the whole UNION and then run the query on it.

In your case:

SELECT u.txt
FROM (
  SELECT l.id, l.txt
  FROM largetable1 l, smalltable t
  WHERE t.id = l.id
    AND t.foo = 'bar'
  UNION ALL
  SELECT l.id, l.txt
  FROM largetable2 l, smalltable t
  WHERE t.id = l.id
    AND t.foo = 'bar'
) u

I'm having large tables with identical structure, one per day. Instead of defining a view on all tables, I'm using functions that "distribute" my query over all tables. The only issue is that I need to define a type that matches the result structure, and each return type needs its own function.

Example (the first parameter is a schema name, the four next correspond to A, B, C, D):

create type T_i2_vc1 as (int_1 int, int_2 int, vc_1 varchar);

CREATE OR REPLACE FUNCTION vq_T_i2_vc1(varchar, varchar, varchar, varchar, varchar)
RETURNS setof T_i2_vc1 AS
$$
DECLARE
  result T_i2_vc1%rowtype;
  mviews RECORD;
  sql varchar;
  counter int;
BEGIN
  select into counter 1;
  -- loop on all daily tables
  FOR mviews IN SELECT distinct this_day FROM daylist order by plainday desc LOOP
    IF counter = 1 THEN
      select INTO sql 'SELECT '||mviews.this_day||' AS plainday, '||$2||' FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
    ELSE
      select INTO sql sql||' UNION ALL SELECT '||mviews.this_day||' AS plainday, '||$2||' FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
    END IF;
    select into counter counter + 1;
  END LOOP;
  select INTO sql 'SELECT '||$1||' FROM ('||sql||') foo '||$5;
  FOR result IN EXECUTE (sql) LOOP
    RETURN NEXT result;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;

Note: in your case the function should have a further parameter to join largetable(n) to
smalltable in the "sub queries".

HTH,

Marc

> I've got a query that I think the query optimiser should be able
> to work its magic on but it doesn't! I've had a look around and
> asked on the IRC channel and found that the current code doesn't
> attempt to optimise for what I'm asking it to do at the moment.
> Here's a bad example:
>
> SELECT u.txt
> FROM smalltable t, (
>   SELECT id, txt FROM largetable1
>   UNION ALL
>   SELECT id, txt FROM largetable2) u
> WHERE t.id = u.id
> AND t.foo = 'bar';
>
> I was hoping that "smalltable" would get moved up into the union,
> but it doesn't at the moment and the database does a LOT of extra
> work. In this case, I can manually do quite a couple of transforms
> to move things around and it does the right thing:
>
> SELECT txt
> FROM (
>   SELECT l.id as lid, r.id as rid, r.foo, l.txt
>   FROM largetable1 l, smalltable r
>   UNION ALL
>   SELECT l.id as lid, r.id as rid, r.foo, l.txt
>   FROM largetable1 l, smalltable r)
> WHERE foo = 'bar'
> AND lid = rid
>
> The optimiser is intelligent enough to move the where clauses up
> into the union and end up with a reasonably optimal query.
> Unfortunately, in real life, the query is much larger and reorganising
> everything manually isn't really feasible!
[PERFORM] tuning Postgres for large data import (using Copy from)
Hello,

I'd like to tune Postgres for large data imports (using COPY FROM).

Here are a few steps already done:

1) use 3 different disks for:
   -1: source data
   -2: index tablespaces
   -3: data tablespaces

2) define all foreign keys as initially deferred

3) tune some parameters:
   max_connections = 20
   shared_buffers = 3
   work_mem = 8192
   maintenance_work_mem = 32768
   checkpoint_segments = 12
   (I also modified the kernel accordingly)

4) run VACUUM regularly

The server runs RedHat and has 1 GB RAM.

In production (which may run on a better server), I plan to:
- import a few million rows per day,
- keep up to ca. 100 million rows in the db,
- delete older data.

I've seen a few postings on hash/btree indexes which say that hash indexes do not work very well in Postgres; currently, I only use btree indexes. Could I gain performance by using hash indexes as well?

How does Postgres handle concurrent COPY FROM on: the same table / different tables?

I'd be glad of any further suggestions on how to increase my performance.

Marc
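Point 2 above (initially deferred foreign keys) only takes effect when the load runs inside a transaction, so that the FK checks are postponed until commit. A minimal sketch — table and file names are illustrative:

```sql
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (
  id        int PRIMARY KEY,
  parent_id int REFERENCES parent (id)
              DEFERRABLE INITIALLY DEFERRED
);

BEGIN;
-- Load order no longer matters within the transaction:
COPY child  FROM '/data/child.csv'  WITH CSV;
COPY parent FROM '/data/parent.csv' WITH CSV;
COMMIT;  -- all deferred FK checks for child run here, at commit time
```

Running both COPY statements in one transaction also saves per-row commit overhead, which is usually the bigger win for bulk loads.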