Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread Marc Mamin



> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of 
> l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
> 
> Hello,
> 
> I am trying to implement an efficient "like" over a text[]. I see a lot of 
> people have tried before me and I learnt a lot through the forums. The 
> results of my search is that a query like the following is optimal:
> 
> select count(*) 
>   from claims
> where (select count(*) 
>   from unnest("ICD9_DGNS_CD") x_ 
>  where x_ like '427%'
>) > 0
> 

Hi,
are you using GIN indexes?

http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns 

moreover your query can still be optimized:
=>
select count(*) 
  from claims
where exists (select *
      from unnest("ICD9_DGNS_CD") x_ 
 where x_ like '427%'
   ) 

regards,

Marc Mamin

> So I figured I'd create a Function to encapsulate the concept:
> 
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
> RETURNS bigint
> AS 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> This works functionally, but performs like crap: full table scan, and cannot 
> make use of any index it seems. Basically, it feels like PG can't inline that 
> function.
> 
> I have been trying all evening to find a way to rewrite it to trick the 
> compiler/planner into inlining. I tried the operator approach for example, 
> but performance is again not good.
> 
> create function rlike(text,text) 
> returns bool as 'select $2 like $1' language sql strict immutable;
> create operator  ``` (procedure = rlike, leftarg = text, 
>   rightarg = text, commutator = ```);
> CREATE OR REPLACE FUNCTION MyLike(text[], text)
> RETURNS boolean
> AS 'select $2 ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> And by not good, I mean that on my table of 2M+ rows, the "native" query 
> takes 3s, while the function version takes 9s and the operator version takes 
> (via the function, or through the operator directly), takes 15s.
> 
> Any ideas or pointers?
> 
> 
> Thank you,
> Laurent Hasson


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] gin performance issue.

2016-02-08 Thread Marc Mamin


> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Freitag, 5. Februar 2016 16:07
 

> > http://explain.depesz.com/s/wKv7
> > Postgres Version 9.3.10 (Linux)
> > 
> > Hello,
> > this is a large daily table that only get bulk inserts (200-400 /days) with 
> > no update.
> > After rebuilding the whole table, the Bitmap Index Scan on
> > r_20160204_ix_toprid falls under 1 second (from 800)
> >
> > Fastupdate is using the default, but autovacuum is disabled on that
> > table which contains 30 Mio rows.


> Pre-9.5, it's a pretty bad idea to disable autovacuum on a GIN index,
> because then the "pending list" only gets flushed when it exceeds
> work_mem.  (Obviously, using a large work_mem setting makes this
> worse.)
> 
>   regards, tom lane


Hello,
knowing what the problem is don't really help here:

- auto vacuum will not run as these are insert only tables
- according to this post, auto analyze would also do the job:
  http://postgresql.nabble.com/Performance-problem-with-gin-index-td5867870.html
  It seems that this information is missing in the doc
  
  but it sadly neither triggers in our case as we have manual analyzes called 
during the dataprocesssing just following the imports.
  Manual vacuum is just too expensive here.
  
  Hence disabling fast update seems to be our only option. 
  
  I hope this problem will help push up the 9.5 upgrade on our todo list :)
  
  Ideally, we would then like to flush the pending list inconditionally after 
the imports. 
  I guess we could achieve something approaching while modifying the analyze 
scale factor  and gin_pending_list_limit
  before/after the (bulk) imports, but having the possibility to flush it per 
SQL would be better. 
  Is this a reasonable feature wish?
  
  And a last question: how does the index update work with bulk (COPY) inserts:
  without pending list: is it like a per row trigger or will the index be cared 
of afterwards ?
  with small pending lists : is there a concurrency problem, or can both tasks 
cleanly work in parallel ?
  
  best regards,
  
  Marc mamin
  
  


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] gin performance issue.

2016-02-05 Thread Marc Mamin
http://explain.depesz.com/s/wKv7
Postgres Version 9.3.10 (Linux)

Hello,
this is a large daily table that only get bulk inserts (200-400 /days) with no 
update.
After rebuilding the whole table, the Bitmap Index Scan on r_20160204_ix_toprid 
falls under 1 second (from 800)

Fastupdate is using the default, but autovacuum is disabled on that table which 
contains 30 Mio rows.
Another specificity is that the cardinality of the indexed column is very high. 
The average count per distinct values is only 2.7

I'm not sure what the problem is. Does the missing vacuum affect the gin index 
sanity further than not cleaning the pending list?
As I understand it, this list will be merged into the index automatically when 
it get full, independently from the vaccum setting.

Can it be an index bloating issue ?

and last but not least, can I reduce the problem by configuration ?

regards,

Marc Mamin



Re: [PERFORM] Slow query help

2016-01-07 Thread Marc Mamin


>I ask your help to solve a slow query which is taking more than 14 seconds to 
>be executed.
>Maybe I am asking too much both from you and specially from postgresql, as it 
>is really huge, envolving 16 tables.
>
>Explain:
>http://explain.depesz.com/s/XII9
>
>Schema:
>http://adj.com.br/erp/data_schema/

Hello,

It seems that you don't pay much attention to column alignment.
e.g. http://adj.com.br/erp/data_schema/tables/ERP_PUBLIC_sys_person.html

This probably won't make any significant difference in your case,
but this is something to be aware of when dealing with large tables.
here is a good starting link for this topic:
http://stackoverflow.com/questions/12604744/does-the-order-of-columns-in-a-postgres-table-impact-performance

regards,

Marc Mamin


Re: [PERFORM] GroupAggregate and Integer Arrays

2015-10-24 Thread Marc Mamin
> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
>-- Add two arrays.
>select
>   ARRAY (
>  SELECT coalesce($1[i],0) + coalesce($2[i],0)
>  FROM (
> select generate_series(least(array_lower($1, 1),array_lower($2,
> 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i
>  ) sub
>GROUP BY i
>ORDER BY i
>);
> $$ LANGUAGE sql STRICT IMMUTABLE;

it seems that both the GROUP and ORDER BY are superfluous and adding some 
cycles.

regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-08 Thread Marc Mamin
Hello,
> 
> 
> 
> 
> 
> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Mike Broers
> Sent: Dienstag, 7. Juli 2015 18:28
> To: Tom Lane
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] wildcard text filter switched to boolean column, 
> performance is way worse
> 
> After bumping up work_mem from 12MB to 25MB that last materialize is indeed 
> hashing and this cut the query time by about 60%.  Thanks, this was very 
> helpful and gives me something else to look for when troubleshooting 
> explains.  
> 
> 
> 
> On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers  wrote:
> Thanks, very informative! I'll experiment with work_mem settings and report 
> back.
> 
> On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane  wrote:
> Mike Broers  writes:
> > I had a query that was filtering with a wildcard search of a text field for
> > %SUCCESS%. The query took about 5 seconds and was running often so I wanted
> > to improve it.  I suggested that the engineers include a new boolean column
> > for successful status.  They implemented the requested field, but the query
> > that filters on that new column runs very long (i kill it after letting it
> > run for about an hour).  Can someone help me understand why that is the
> > case and how to resolve it?
> 
> It's hashing the subplan output in the first case and not the second:
> 
> > Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
> > time=4699.619..4699.869 rows=1 loops=1)
> > Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> > ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3
>
> vs
> 
> >   Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
> > Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> > ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
>^
> 
> Presumably, the new more-accurate rows count causes the planner to realize
> that the hash table will exceed work_mem so it doesn't choose to hash ...
> but for your situation, you'd rather it did, because what you're getting
> instead is a Materialize node that spills to disk (again, because the data
> involved exceeds work_mem) and that's a killer for this query.  You should
> be able to get back the old behavior if you raise work_mem enough.
> 
> Another idea you might think about is changing the OR'd IN conditions
> to a single IN over a UNION ALL of the subselects.  I'm not really sure if
> that would produce a better plan, but it's worth trying if it wouldn't
> require too much app-side contortion.


Hello,
you might try to use a CTE to first collect the IDs to exclude, and join them 
to your main table.
This should result in an anti join pattern.

Something like:

WITH IDS as (
SELECT U1."lead_id" AS "lead_id" 
FROM "event" U1 
WHERE U1."event_type" ='type_1'
UNION (
   SELECT U1."lead_id" AS "lead_id" 
   FROM "event" U1 
   WHERE U1."event_type" = 'type_2'
   INTERSECT
   SELECT U1."lead_id" AS "lead_id" 
   FROM "event" U1 
   WHERE successful
)
)
SELECT * FROM lead LEFT OUTER JOIN IDS ON (lead.id=IDS.lead_id)
WHERE IDS.lead_id IS NULL;

regards,

Marc Mamin






> regards, tom lane
> 
> 
>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Re: Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-03 Thread Marc Mamin

>
>Hi everyone,
>
>I've written a new open source tool for easily parallelising SQL scripts in 
>postgres.   [obligatory plug:   https://github.com/gbb/par_psql   ]
>
>Using it, I'm seeing a problem that I've also seen in other postgres projects 
>involving high degrees of parallelisation in the last 12 months.
>
>Basically:
>
>- I have machines here with up to 16 CPU cores and 128GB memory, very fast 
>SSDs and controller etc, carefully configured kernel/postgresql.conf for high 
>performance.
>
>- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), 
>e.g. almost up to 16x performance improvement.
>
>- Non-DB stuff like GDAL, python etc. parallelise nearly perfectly.
>
>- HOWEVER calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT 
>myfunction(some_stuff)) do not parallelise well, even when they are 
>independently defined functions, or accessing tables in a read-only way. They 
>hit a limit of 2.5x performance improvement relative to single-CPU performance 
>(pg9.4) and merely 2x performance (pg9.3) regardless of how many CPU cores I 
>throw at them. This is about 6 times slower than I'm expecting.
>
>
>I can't see what would be locking. It seems like it's the pl/pgsql environment 
>itself that is somehow locking or incurring some huge frictional costs. 
>Whether I use independently defined functions, independent source tables, 
>independent output tables, makes no difference whatsoever, so it doesn't feel 
>'lock-related'. It also doesn't seem to be WAL/synchronisation related, as the 
>machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged 
>tables for output.
>
>Take a quick peek here: 
>https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md
>
>I'm wondering what I'm missing here. Any ideas?
>
>Graeme.
>

auto explain might help giving some insight in what's going on: 
http://www.postgresql.org/docs/9.4/static/auto-explain.html

Regards, 
Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-16 Thread Marc Mamin

>Hi Team,
>
>This is the EXPLAIN ANALYZE for one of the view : S_V_D_CAMPAIGN_HIERARCHY:


>Rows Removed by Join Filter: 3577676116

That's quite a lot.
You're possibly missing a clause in a join, resulting in a cross join.
It is also helpful to put your result here:
http://explain.depesz.com/
regards,

Marc Mamin


>
>===
>
>
>Nested Loop  (cost=33666.96..37971.39 rows=1 width=894) (actual
>time=443.556..966558.767 rows=45360 loops=1)
>   Join Filter: (tp_exec.touchpoint_execution_id =
>valid_executions.touchpoint_execution_id)
>   Rows Removed by Join Filter: 3577676116
>   CTE valid_executions
> ->  Hash Join  (cost=13753.53..31711.17 rows=1 width=8) (actual
>time=232.571..357.749 rows=52997 loops=1)
>   Hash Cond:
>((s_f_touchpoint_execution_status_history_1.touchpoint_execution_id =
>s_f_touchpoint_execution_status_history.touchpoint_execution_id) AND ((max(s
>_f_touchpoint_execution_status_history_1.creation_dt)) =
>s_f_touchpoint_execution_status_history.creation_dt))
>   ->  HashAggregate  (cost=6221.56..6905.66 rows=68410 width=16)
>(actual time=139.713..171.340 rows=76454 loops=1)
> ->  Seq Scan on s_f_touchpoint_execution_status_history
>s_f_touchpoint_execution_status_history_1  (cost=0.00..4766.04 rows=291104
>width=16) (actual ti
>me=0.006..38.582 rows=291104 loops=1)
>   ->  Hash  (cost=5493.80..5493.80 rows=135878 width=16) (actual
>time=92.737..92.737 rows=136280 loops=1)
> Buckets: 16384  Batches: 1  Memory Usage: 6389kB
> ->  Seq Scan on s_f_touchpoint_execution_status_history
>(cost=0.00..5493.80 rows=135878 width=16) (actual time=0.012..55.078
>rows=136280 loops=1)
>   Filter: (touchpoint_execution_status_type_id = ANY
>('{3,4}'::integer[]))
>   Rows Removed by Filter: 154824
>   ->  Nested Loop Left Join  (cost=1955.80..6260.19 rows=1 width=894)
>(actual time=31.608..3147.015 rows=67508 loops=1)
> ->  Nested Loop  (cost=1955.67..6260.04 rows=1 width=776) (actual
>time=31.602..2912.625 rows=67508 loops=1)
>   ->  Nested Loop Left Join  (cost=1955.54..6259.87 rows=1
>width=658) (actual time=31.595..2713.696 rows=72427 loops=1)
> ->  Nested Loop Left Join  (cost=1955.40..6259.71
>rows=1 width=340) (actual time=31.589..2532.926 rows=72427 loops=1)
>   ->  Nested Loop Left Join  (cost=1955.27..6259.55
>rows=1 width=222) (actual time=31.581..2354.662 rows=72427 loops=1)
> ->  Nested Loop  (cost=1954.99..6259.24
>rows=1 width=197) (actual time=31.572..2090.104 rows=72427 loops=1)
>   ->  Nested Loop
>(cost=1954.71..6258.92 rows=1 width=173) (actual time=31.562..1802.857
>rows=72427 loops=1)
> Join Filter:
>(camp_exec.campaign_id = wave.campaign_id)
> Rows Removed by Join Filter:
>243
> ->  Nested Loop
>(cost=1954.42..6254.67 rows=13 width=167) (actual time=31.551..1468.718
>rows=72670 loops=1)
>   ->  Hash Join
>(cost=1954.13..6249.67 rows=13 width=108) (actual time=31.525..402.039
>rows=72670 loops=1)
> Hash Cond:
>((tp_exec.touchpoint_id = tp.touchpoint_id) AND (wave_exec.wave_id =
>tp.wave_id))
> ->  Hash Join
>(cost=1576.83..4595.51 rows=72956 width=90) (actual time=26.254..256.328
>rows=72956 loops=1)
>   Hash Cond:
>(tp_exec.wave_execution_id = wave_exec.wave_execution_id)
>   ->  Seq Scan
>on s_d_touchpoint_execution tp_exec  (cost=0.00..1559.56 rows=72956
>width=42) (actual time=0.005..76.099 rows=72956 loops=1)
>   ->  Hash
>(cost=1001.37..1001.37 rows=46037 width=56) (actual time=26.178..26.178
>rows=46037 loops=1)
> Buckets:
>8192  Batches: 1  Memory Usage: 4104kB
> ->  Seq
>Scan on s_d_wave_execution wave_exec  (cost=0.00..1001.37 rows=46037
>width=56) (actual time=0.006..10.388 rows=46037 loops=1)
> ->  Hash
>(cost=212.72..212.72 rows=10972 width=26) (actual time=5.252..

Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-05 Thread Marc Mamin


>Hi, thanks for your follow-up questions.
>- postgres version is 9.1.13
>- the number of rows (in this latest instance) is 28,474,842
>- I've clustered and vacuum-full-ed and analyzed this table frequently, 
>attempting to troubleshoot this. (Running vacuum full on the whole catalog 
>seems a little excessive, and unlikely to help.)

Hi,

I mean the pg_* tables. When working with temp objects and a high number of 
classes, regular vacuum may not clean them efficiently.
It is not a bad idea to run a vacuum full verbose manually on the largest of 
those from time to time to verify that they don't grow outer control.
And this normally requires a few seconds only.
The verbose output of vacuum full sometimes returns interesting information...
For the ANALYZE performance, I guess that these are the most relevant one:
 pg_statistic;
 pg_class;
 pg_attribute;
 pg_index;
 pg_constraint;

regards,

Marc Mamin



>- no other processes are likely to be interfering; nothing other than 
>PostgreSQL runs on this machine (except for normal OS processes and New Relic 
>server monitoring service); concurrent activity in PostgreSQL is low-level and 
>unrelated, and this effect is observed systematically whenever this kind of 
>operation is performed on this table
>- no override for this table; the system default_statistics_target is 100 (the 
>default)
>- yes, an ANALYZE is performed on the temp table after the COPY and before the 
>INSERT
>- no index on the temp table (but I'm scanning the whole thing anyway). There 
>are indexes on f_foo as detailed in my original post, and I expect the PK to 
>make the WHERE NOT EXISTS filtering efficient (as it filters on exactly all 
>columns of the PK) ... but even if it didn't, I would expect that to only slow 
>down the actual insert execution, not the EXPLAIN.
>Cheers,
>Gulli
>On Wed, Mar 4, 2015 at 8:10 PM, Marc Mamin  wrote:
>
>>Hi,
>>we are seeing cases of EXPLAIN INSERT INTO foo SELECT ... taking over an 
> hour, with disk I/O utilization (percent of time device is busy) at 100% the 
> whole time, although I/O bandwidth is not saturated. This is on PostgreSQL 
> 9.1.13.
>>What could cause this? Note that there is no ANALYZE. Is it possible that 
> the SELECT is actually executed, in planning the INSERT?
>>When executing the INSERT itself (not EXPLAIN) immediately afterwards, 
> that logs a "temporary file" message, but the EXPLAIN invocation does not 
> (though the disk I/O suggests that a large on-disk sort is occurring):
>>LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp6016.0", size 
> 744103936
>>STATEMENT:  INSERT INTO f_foo SELECT
>>[...]
>>During that actual execution, there's a lower disk I/O utilization 
> (though a higher MB/s rate).
>>Charts of disk I/O utilization and rate are at 
> http://postimg.org/image/628h6jmn3/ ... the solid 100% span is the EXPLAIN 
> statement, ending at 6:13:30pm, followed by the INSERT statement ending at 
> 6:32pm. Metrics are collected by New Relic; their definition of I/O 
> utilization is at https://discuss.newrelic.com/t/disk-i-o-metrics/2900
>>Here's the EXPLAIN statement:
>>LOG:  duration: 3928778.823 ms  statement: EXPLAIN INSERT INTO f_foo 
> SELECT
>>   t_foo.fk_d1,
>>   t_foo.fk_d2,
>>   t_foo.fk_d3,
>>   t_foo.fk_d4,
>>   t_foo.fk_d5,
>>   t_foo.fk_d6,
>>   t_foo.value
>>FROM t_foo
>>WHERE NOT (EXISTS
>> (SELECT *
>>  FROM f_foo
>>  WHERE f_foo.fk_d2 = t_foo.fk_d2
>>AND f_foo.fk_d3 = t_foo.fk_d3
>>AND f_foo.fk_d4 = t_foo.fk_d4
>>AND f_foo.fk_d5 = t_foo.fk_d5
>>AND f_foo.fk_d6 = t_foo.fk_d6
>>AND f_foo.fk_d1 = t_foo.fk_d1))
>>(where t_foo is a temp table previously populated using COPY, and the NOT 
> EXISTS subquery refers to the same table we are inserting into)
>>Here's the EXPLAIN output:
>>Insert on f_foo  (cost=8098210.50..9354519.69 rows=1 width=16)
>>  ->  Merge Anti Join  (cost=8098210.50..9354519.69 rows=1 width=16)
>>Merge Cond: ((t_foo.fk_d2 = public.f_foo.fk_d2) AND
>> (t_foo.fk_d3 = public.f_foo.fk_d3) AND
>> (t_foo.fk_d4 = public.f_foo.fk_d4) AND
>> (t_foo.fk_d5 = public.f_foo.fk_d5) AND
>> (t_foo.fk_d6 = public.f_foo.fk_d6) AND
>> (t_foo.fk_d1 = public.f_foo.fk_d1))
>>->  Sort  (cost=3981372.25..

Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-04 Thread Marc Mamin
t;on INSERT into this and a couple of other tables with tens of millions of rows.
>Any ideas?
>Thanks, best regards,
>- Gulli
>

Hi,
I've no clue for the time required by EXPLAIN
but some more information are probably relevant to find an explanation:

- postgres version
- number of rows inserted by the query
- how clean is your catalog in regard to vacuum
   ( can you run vacuum full verbose & analyze it, and then retry the analyze 
statement ?)
- any other process that may interfere, e.g. while locking some catalog tables ?
- statistic target ?
- is your temp table analyzed?
- any index on it ?

We have about 300'000 entries in our pg_class tables, and I've never seen such 
an issue.

regards,
Marc Mamin



Re: [ADMIN] [PERFORM] Vs NULL

2015-02-09 Thread Marc Mamin

>>Hi
>>
>>2015-02-09 12:22 GMT+01:00 sridhar bamandlapally :
>>
>>Hi All
>>
>>We are testing our Oracle compatible business applications on PostgreSQL 
>> database,
>>
>>the issue we are facing is  Vs NULL
>>
>>In Oracle '' () and NULL are treated as NULL
>>
>>but, in PostgreSQL ''  not treated as NULL
>>
>>I need some implicit way in PostgreSQL where '' can be 
>> treated as NULL

>It is not possible in PostgreSQL.  PostgreSQL respects ANSI SQL standard - 
>Oracle not.
>
>Regards
>
>Pavel
>
>p.s. theoretically you can overwrite a type operators to support Oracle 
>behave, but you should not be sure about unexpected negative side effects.


A clean way would be to disallow empty strings on the PG side.
This is somewhat combersome depending on how dynamic your model is
and add some last on your db though.


ALTER TABLE tablename ADD CONSTRAINT tablename_not_empty_ck
  CHECK (false= (colname1 IS NULL OR colname2 IS NULL OR colname3 IS NULL ...) 
IS NULL)

-- and to ensure compatibility with your app or migration:

CREATE OR REPLACE FUNCTION tablename_setnull_trf()
  RETURNS trigger AS
$BODY$
BEGIN
-- for all *string* columns
   NEW.colname1 = NULLIF (colname1,'');
   NEW.colname2 = NULLIF (colname2,'');
   NEW.colname3 = NULLIF (colname3,'');
RETURN NEW;
END;
$BODY$

CREATE TRIGGER tablename_setnull_tr
  BEFORE INSERT OR UPDATE
  ON tablename
  FOR EACH ROW
  EXECUTE PROCEDURE tablename_setnull_trf();

You can query the pg catalog to generate all required statements.
A possible issue is the order in which triggers are fired, when more than one 
exist for a given table:
"If more than one trigger is defined for the same event on the same relation, 
the triggers will be fired in alphabetical order by trigger name"
( http://www.postgresql.org/docs/9.3/static/trigger-definition.html )

regards,

Marc Mamin


Re: [PERFORM] Why is GIN index slowing down my query?

2015-02-02 Thread Marc Mamin
AlexK987  writes:
>>> I've created a GIN index on an INT[] column, but it slows down the selects.
>>> Here is my table:
>>
>>> create table talent(person_id INT NOT NULL,
>>> skills INT[] NOT NULL);
>>
>>> insert into talent(person_id, skills)
>>> select generate_series, array[0, 1] || generate_series
>>> from generate_series(3, 1048575);
>>
>>> create index talent_skills on talent using gin(skills);
>>
>>> analyze talent;
>>
>>> Here is my select:
>>
>>> explain analyze 
>>> select * from talent 
>>> where skills <@ array[1, 15]
>>
>>Well, that's pretty much going to suck given that data distribution.
>>Since "1" is a member of every last entry, the GIN scan will end up
>>examining every entry, and then rejecting all of them as not being
>>true subsets of [1,15].  
>
>This is equivalent and fast:
>
>explain analyze
>WITH rare AS (
> select * from talent
> where skills @> array[15])
>select * from rare
> where skills @> array[1]
> -- (with changed operator)
>
>You might variate your query according to an additional table that keeps the 
>occurrence count of all skills.
>Not really pretty though.

I wonder if in such cases, the Bitmap Index Scan could discard entries that 
would result in a table scan
and use them only in the recheck part:

explain
 select * from talent 
 where skills @> array[1]
 
 Seq Scan on talent  (cost=0.00..21846.16 rows=1048573 width=37)
   Filter: (skills @> '{1}'::integer[])


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Why is GIN index slowing down my query?

2015-02-02 Thread Marc Mamin
AlexK987  writes:
>> I've created a GIN index on an INT[] column, but it slows down the selects.
>> Here is my table:
>
>> create table talent(person_id INT NOT NULL,
>> skills INT[] NOT NULL);
>
>> insert into talent(person_id, skills)
>> select generate_series, array[0, 1] || generate_series
>> from generate_series(3, 1048575);
>
>> create index talent_skills on talent using gin(skills);
>
>> analyze talent;
>
>> Here is my select:
>
>> explain analyze 
>> select * from talent 
>> where skills <@ array[1, 15]
>
>Well, that's pretty much going to suck given that data distribution.
>Since "1" is a member of every last entry, the GIN scan will end up
>examining every entry, and then rejecting all of them as not being
>true subsets of [1,15].  

This is equivalent and fast:

explain analyze
WITH rare AS (
 select * from talent 
 where skills @> array[15])
select * from rare
 where skills @> array[1]
 -- (with changed operator)

You might variate your query according to an additional table that keeps the 
occurrence count of all skills.
Not really pretty though.

regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance

2015-01-25 Thread Marc Mamin

>I have an events table that records page views and purchases (type = 'viewed' 
>or type='purchased'). I have a query that figures out "people who 
>bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago 
>when it started taking hours to complete. Vacuum/analyze didn't help.  Turned 
>out there was one session_id that had 400k rows in the system. Deleting that 
>made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that? 
>Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly 
>in the email. 
>https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

here  are 2 variations that should be somewhat faster.

 It seems you may have duplicate (site_id,session_id,product_id)
 which would false the result. In that case you'll need some more logic in the 
query.

 select
'82503' as product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
e1.product_id = '82503' and
e2.product_id != '82503'
 group by  e2.product_id, e2.site_id;


 OR:

 WITH SALL as(
  select
 e2.site_id,
 e2.product_id,
 count(nullif(e2.type='viewed', false)) view_count,
 count(nullif(e2.type='purchased', false)) purchase_count
   from events e1
   join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
   where
 e1.product_id = '82503'
  group by  e2.product_id, e2.site_id
 )
 SELECT
'82503' as product_id_1,
site_id,
product_id,
view_count,
purchase_count
 FROM SALL
 WHERE product_id != '82503';


 regards,
 Marc Mamin



>explain  select
>   e1.product_id,
>   e2.site_id,
>   e2.product_id,
>   count(nullif(e2.type='viewed', false)) view_count,
>   count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
>   e1.product_id = '82503' and
>   e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
> QUERY PLAN
>
> GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
> Sort Key: e1.product_id, e2.product_id, e2.site_id
> ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>   ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
> rows=369 width=49)
> Recheck Cond: (product_id = '82503'::citext)
> ->  Bitmap Index Scan on events_product_id_site_id_idx  
> (cost=0.00..11.20 rows=369 width=0)
>   Index Cond: (product_id = '82503'::citext)
>   ->  Index Scan using events_session_id_type_product_id_idx on 
> events e2  (cost=0.56..51.28 rows=12 width=51)
> Index Cond: ((session_id = e1.session_id) AND (type = 
> e1.type))
> Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
>Table "public.events"
>   Column|   Type   |  Modifiers
>-+--+-
> id  | bigint   | not null default 
> nextval('events_id_seq'::regclass)
> user_id | citext   |
> session_id  | citext   | not null
> product_id  | citext   | not null
> site_id | citext   | not null
> type| text | not null
> happened_at | timestamp with time zone | not null
> created_at  | timestamp with time zone | not null
>Indexes:
>"events_pkey" PRIMARY KEY, btree (id)
>"events_product_id_site_id_idx" btree (product_id, site_id)
>"events_session_id_type_product_id_idx" btree (session_id, type, 
> product_id)
>Check constraints:
>"events_session_id_check" CHECK (length(session_id::text) < 255)
>"events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
> 'viewed'::text]))
>"events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>


Re: [PERFORM] query a table with lots of coulmns

2014-09-20 Thread Marc Mamin
>At first, thanks for your fast and comprehensive help.
>
>The structure of my cache table is
>
>a text , b text NOT NULL , c text , d text , e timestamp without
>timezone DEFAULT now(),  f text, s1 integer DEFAULT 0, s2 integer
>DEFAULT 0, s3 integer DEFAULT 0, ... ,s512 DEFAULT 0


>additional constraints: primary key (b)  , Unique(b), Unique(a)
>Indexes : Index on a, Index on b

This looks redundant. e.g. you don't need a unique index on b if you already 
have a primary key on it.
Can you post the complete table definition ?

...

>One remark which might help: overall 90 - 95 % of the s1-s512 columns
>are 0. I am only interested in columns not equals 0. Perhaps it would
>make sense to use and array of json and enumerate only values not equals 0.

Could you change that to replace 0 values with NULLs? 
This would greatly reduce your table space as Postgres is very efficient about 
NULLs storage:
It marks all null values in a bit map within the row header so you just need 
about one bit per null
instead of 4 bytes for zeros, and hence get rid of your I/O issue.

regards,

Marc Mamin

Von: pgsql-performance-ow...@postgresql.org 
[pgsql-performance-ow...@postgresql.org]" im Auftrag von "Björn 
Wittich [bjoern_witt...@gmx.de]
Gesendet: Samstag, 20. September 2014 09:19
An: Josh Berkus; pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] query a table with lots of coulmns

At first, thanks for your fast and comprehensive help.

The structure of my cache table is

a text , b text NOT NULL , c text , d text , e timestamp without
timezone DEFAULT now(),  f text, s1 integer DEFAULT 0, s2 integer
DEFAULT 0, s3 integer DEFAULT 0, ... ,s512 DEFAULT 0

additional constraints: primary key (b)  , Unique(b), Unique(a)
Indexes : Index on a, Index on b

This table has 30 Mio rows ( will increase to 50 Mio) in future

My working table is

b text, g integer

Indexes on b and c


This table has 5 Mio rows

Scenario:

What I want to achieve :

SELECT s1,s2,s3,...s512,g,d  from  INNER JOIN 
USING(b) ORDER BY g


The inner join will match at least 95 % of columns of the smaller
worktable in this example 4,75 mio rows.

Running this query takes several hours until I receive the first
results. Query analyzing shows that the execution plan is doing 2 seq
table scans on cache and work table.


When I divide this huge statement into

SELECT s1,s2,s3,...s512,g,d  from  INNER JOIN 
USING(b) WHERE g BETWEEN 1 and 1 ORDER BY g, SELECT
s1,s2,s3,...s512,g,d  from  INNER JOIN  USING(b)
WHERE g BETWEEN 10001 and 2 ORDER BY g, 

(I can do this because g i unique and continous id from 1 to N)

The result is fast but fireing parallel requests (4-8 times parallel)
slows down the retrieval.

Execution plan changes when adding "BETWEEN 1 and 1" to use the indexes.



One remark which might help: overall 90 - 95 % of the s1-s512 columns
are 0. I am only interested in columns not equals 0. Perhaps it would
make sense to use and array of json and enumerate only values not equals 0.

Statistics on the large table:
table size: 80 GB
toast-tablesize: 37 GB
size of indexes: 17 GB


Thanks for your help and ideas

Björn





Am 19.09.2014 23:40, schrieb Josh Berkus:
> On 09/19/2014 04:51 AM, Björn Wittich wrote:
>> I am relatively new to postgres. I have a table with 500 coulmns and
>> about 40 mio rows. I call this cache table where one column is a unique
>> key (indexed) and the 499 columns (type integer) are some values
>> belonging to this key.
>>
>> Now I have a second (temporary) table (only 2 columns one is the key of
>> my cache table) and I want  do an inner join between my temporary table
>> and the large cache table and export all matching rows. I found out,
>> that the performance increases when I limit the join to lots of small
>> parts.
>> But it seems that the databases needs a lot of disk io to gather all 499
>> data columns.
>> Is there a possibilty to tell the databases that all these colums are
>> always treated as tuples and I always want to get the whole row? Perhaps
>> the disk oraganization could then be optimized?
> PostgreSQL is already a row store, which means by default you're getting
> all of the columns, and the columns are stored physically adjacent to
> each other.
>
> If requesting only 1 or two columns is faster than requesting all of
> them, that's pretty much certainly due to transmission time, not disk
> IO.  Otherwise, please post your schema (well, a truncated version) and
> your queries.
>
> BTW, in cases like yours I've used a INT array instead of 500 columns to
> good effect; it works slightly better with PostgreSQL's compression.
>



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscripti

Re: [PERFORM] Cursor + upsert (astronomical data)

2014-07-27 Thread Marc Mamin
[Craig]
>>If you haven't looked at clustering algorithms yet, you might want to do so.
>>Your problem is a special case of clustering, where you have a large number
>>of small clusters.  A good place to start is the overview on Wikipedia:
>>http://en.wikipedia.org/wiki/Cluster_analysis

According to this list, your method is similar to 
http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme,
but with what seems to be some logical errors.

>The simplest approach I could think of is that I process each row of the 3e8 
>rows
>sequentially and ask:
>Do I have an identifier in the radius of 1 arcsec?
>No: Generate one and assign me to it.
>Yes: Update it and assigne me to it. The update is done as weighted average – 
>I keep the
>number of how many observations the identifier has been computed. The result 
>is that the
>identifier will have average coordinates of all the observations it identifies 
>– it will
>be the center.


Let say you have 2 single measures on a line at arcsec 1 and 2.1. which hence 
correspond to 2 ipix_cat.
Now add a new measure at 1.9: as you choose any of the possible adjacent 
ipix_cat whitout considering the least distance, you may end with an ipix_at at 
1.45 which is at less than 1 arcsec then the next one.
   Moreover you have raised the weight of both ipix_cat.
 Which increases the lock probability when trying a nutlithreaded upgrade.

The max distance between 2 observations belonging to a same ipix_cat tends to 2 
arcsec with your method. If this is ok, you should probbaly modify your method 
so that the 2 first points of my example would have megred to a single 
ipix_cat. You could use your weigth for this: increase your search radius to 
2arcsec and then reject the candidates located between 1 and 2 arsec depending 
on their weight. The additional work load might be compensated by the smaller 
number of ipix_cat that woul will have.







Re: [PERFORM] Building multiple indexes on one table.

2014-07-23 Thread Marc Mamin
>Von: pgsql-performance-ow...@postgresql.org 
>[pgsql-performance-ow...@postgresql.org]" im Auftrag von "Claudio 
>Freire [klaussfre...@gmail.com]
>Gesendet: Freitag, 18. Juli 2014 01:21
>An: Chris Ruprecht
>Cc: pgsql-performance@postgresql.org
>Betreff: Re: [PERFORM] Building multiple indexes on one table.
>
>On Thu, Jul 17, 2014 at 7:47 PM, Chris Ruprecht  wrote:
>> Is there any way that I can build multiple indexes on one table without 
>> having to scan the table multiple times? For small tables, that's probably 
>> not an issue, but if I have a 500 GB table that I need to create 6 indexes 
>> on, I don't want to read that table 6 times.
>> Nothing I could find in the manual other than reindex, but that's not 
>> helping, since it only rebuilds indexes that are already there and I don't 
>> know if that reads the table once or multiple times. If I could create 
>> indexes inactive and then run reindex, which then reads the table once, I 
>> would have a solution. But that doesn't seem to exist either.
>
>Just build them with separate but concurrent connections, and the
>scans will be synchronized so it will be only one.
>
>Btw, reindex rebuilds one index at a time, so what I do is issue
>separate reindex for each index in parallel, to avoid the repeated
>scans as well.
>
>Just make sure you've got the I/O and CPU capacity for it (you'll be
>writing many indexes at once, so there is a lot of I/O).

Index creation on large tables are mostly CPU bound as long as no swap occurs.
I/O may be an issue when all your indexes are similar; e.g. all on single int4 
columns.
in other cases the writes will not all take place concurrently.
To reduce I/O due to swap, you can consider increasing maintenance_work_mem on 
the connextions/sessionns
that build the indexes.

regards,

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] fragmention issue with ext4: e4defrag?

2014-07-02 Thread Marc Mamin
Hello,

Has anyone some experience using defragmentation tools on Linux against 
tablespaces ?

we are facing fragmentation problems with postgres instances having a few TB of 
data.
( RAID 5 )

I/O througput decreased from 300MB/s to 160.


- We first moved some schemas to separate servers.
  After that we still have  150'000 tables in 1.5 TB
  
- Now we are in the process of vacuuming FULL historical tables which are not 
written anymore. 
  This seems to improve the I/O considerably
  
Our remaining issue is that the free space fragmentíon is still suboptimal 
so that fragmention will probably start again soon.

Would it make sense to use a tool like e4defrag 
(http://www.linux.org/threads/online-defragmentation.4121/)
in order to defrag the free space ?
And how safe is it to use such a tool against a running postgres instance?



many thanks,

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] CREATE TABLE AS WITH FREEZE ?

2013-11-13 Thread Marc Mamin
Hello,

Does anything speaks again adding a  "WITH FREEZE" option to "CREATE TABLE AS" ,
similar to the new COPY FREEZE feature ?

best regards,

Marc Mamin





Re: [PERFORM] stable and immutable functions in GROUP BY clauses.

2013-09-16 Thread Marc Mamin

> 
> Hello,
> 
> Stable and immutable functions do not improve performance when used within 
> the GROUP BY clause.
> Here, the function will be called for each row.
> 
> To avoid it, I can replace the funtion by its arguments within GROUP BY.

Shame on me !
This is of course bullsh...  It has nothing to do with immutability and can 
only applies to few cases

e.g: it's fine for select x+1 ... group by  x,
but not for    select x^2 ... group by  x

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] stable and immutable functions in GROUP BY clauses.

2013-08-26 Thread Marc Mamin
Hello,

Stable and immutable functions do not improve performance when used within the 
GROUP BY clause.
Here, the function will be called for each row.

To avoid it, I can replace the funtion by its arguments within GROUP BY.

Maybe this hint is worth a note within the documentation on Function Volatility.

I have the situation where queries are generating by the application and it 
would be a pain to extend the "query builder"
in order to avoid this performance issue.
So I wonder if it would be possible for the query planner to recognize such 
cases and optimize the query internally ?

best regards,
Marc Mamin


here an example to highlight possible performance loss:

create temp table ref ( i int, r int);
create temp table val ( i int, v int);

insert into ref select s,s%2 from generate_series(1,1)s;
insert into val select s,s%2 from generate_series(1,1)s;

create or replace function getv(int) returns int as 
$$ select v+1 from val where i=$1; $$ language SQL stable;

explain analyze select  getv(r) from ref group by r;
Total runtime: 5.928 ms

explain analyze select  getv(r) from ref group by getv(r);
Total runtime: 3980.012 ms

-- and more reasonably with an index:

create unique index val_ux on val(i);

explain analyze select  getv(r) from ref group by r;
Total runtime: 4.278 ms

explain analyze select  getv(r) from ref group by getv(r);
Total runtime: 68.758 ms


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-20 Thread Marc Mamin



Von: Stefan Keller [sfkel...@gmail.com]
>Gesendet: Samstag, 20. Juli 2013 01:55
>
>Hi Marc
>
>Thanks a lot for your hint!
>
>You mean doing a "SET track_counts (true);" for the whole session?

No, 
I mean 

ALTER TABLE  ALTER   SET STATISTICS 0;

And remove existing statistics

DELETE FROM pg_catalog.pg_statistic 
where starelid='':: regclass
AND staattnum = (SELECT attnum FROM pg_attribute
WHERE attrelid = '':: regclass
AND  attname  =  ''::name
)

But you should first try to find out which proportion of your ts queries are 
faster 
when using a table scan as they will probably not happen anymore afterwards !
(Except if further columns on your table 'FullTextSearch' are considered by the 
planner)




>That would be ok if it would be possible just for the gin index.
>
>It's obviously an issue of the planner estimation costs.
>The data I'm speaking about ("movies") has a text attribute which has
>a length of more than 8K so it's obviously having to do with
>detoasting.
>But the thoughts about @@ operators together with this GIN index seem
>also to be valid.
>
>I hope this issue is being tracked in preparation for 9.3.
>
>Regards, Stefan
>
>
>2013/7/19 Marc Mamin :
>>
>>> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
>>> plainto_tsquery('english', 'good');
>>>
>>> It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB 
>>> zipped).
>>> The planner obviously always chooses table scan
>>
>>
>> Hello,
>>
>> A probable reason for the time difference is the cost for decompressing 
>> toasted content.
>> At least in 8.3, the planner was not good at estimating it.
>>
>> I'm getting better overall performances since I've stopped collect statistic 
>> on tsvectors.
>> An alternative would have been to disallow compression on them.
>>
>> I'm aware this is a drastic way and would not recommend it without testing. 
>> The benefit may depend on the type of data you are indexing.
>> In our use case these are error logs with many java stack traces, hence with 
>> many lexemes poorly discriminative.
>>
>> see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
>> as a comment on
>> http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net
>>
>> regards,
>>
>> Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-19 Thread Marc Mamin

> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
> plainto_tsquery('english', 'good');
>
> It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
> The planner obviously always chooses table scan


Hello,

A probable reason for the time difference is the cost for decompressing toasted 
content.
At lest in 8.3, the planner was not good at estimating it.

I'm getting better overall performances since I've stopped collect statistic on 
tsvectors.
An alternative would have been to disallow compression on them.

I'm aware this is a drastic way and would not recommend it without testing. The 
benefit may depend on the type of data you are indexing.
In our use case these are error logs with many java stack traces, hence with 
many lexemes poorly discriminative.

see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
as a comment on
http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Trying to eliminate union and sort

2013-07-15 Thread Marc Mamin
Hello,

you may want to try starting with some CTE that first retrieve required subsets.
adding ordering  within those CTE might also improve the timing of following 
sort/join operations.
(sorry for the top posting)

Something like:

WITH T1 AS (
  SELECT id, typeid, backup_id, mycolumn1, mycolumn2
  FROM table1 WHERE t.external_id IN ('6544', '2234', '2', '4536')
  ORDER BY mycolumn2, id ?
 ),
 
TYPES AS (SELECT DISTINCT typeid FROM T1),

T3_OTHERS AS ( SELECT id, otherid FROM table3 JOIN TYPES ON table3.id = 
TYPES.typeid
   -- Order BY id ?  
   ), 

SELECT
T1.id,
T1.mycolumn1,
T3_OTHERS.otherid,
T3_2.otherid,
T1.mycolumn2 AS mycolumn2

FROM T1 
LEFT OUTER JOIN T3_OTHERS ON T1.typeid = T3_OTHERS.id
LEFT OUTER JOIN table2 t2 ON (t2.real_id = T1.backup_id OR t2.real_id = t.id
LEFT OUTER JOIN table3 T3_2  ON t2.third_id = T3_2.id
ORDER BY T1.mycolumn2,T1.id

regards,
Marc Mamin


Von: pgsql-performance-ow...@postgresql.org 
[pgsql-performance-ow...@postgresql.org]" im Auftrag von "Brian 
Fehrle [bri...@consistentstate.com]
Gesendet: Montag, 15. Juli 2013 18:12
An: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Trying to eliminate union and sort

On 07/12/2013 04:43 PM, Josh Berkus wrote:
>> As for the counts on the tables:
>> table1  3,653,472
>> table2  2,191,314
>> table325,676,589
>>
>> I think it's safe to assume right now that any resulting joins are not
>> one-to-one
> Hmmm?  How is doing a subselect in the SELECT clause even working, then?
>
Oh my, this is sad. the query in all returns 9,955,729 rows, so the sub
queries are run on each of these resulting rows, however in this entire
result set, subquery 1 returns 16 distinct rows, subquery 2 returns 63
different rows, but those sub queries are run over 10 million times to
return these few distinct rows. So it's running many times, but
returning the same small set of data over and over again.

- Brian F




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Are bitmap index scans slow to start?

2013-02-28 Thread Marc Mamin

>Rebuilding the index might help, as it would put all the leaf pages holding 
>values for session_id=27 adjacent to each other, so they would read from disk 
>faster.  But with a name like >"session_id", I don't know how long such 
>clustering would last though.

>If I'm right about the index disk-read time, then switching to a plain index 
>scan rather than a bitmap index scan would make no difference--either way the 
>data has to come off the disk.



>>I'd prefer a
>>strategy that allowed fast performance the first time, rather than slow the
>>first time and extremely fast subsequently.

Hello,

if the index is only used to locate rows for single session_id, you may 
consider split it in a set of partial indexes.

e.g.
create index i_0 on foo where session_id%4 =0;
create index i_1 on foo where session_id%4 =1;
create index i_2 on foo where session_id%4 =2;
create index i_3 on foo where session_id%4 =3;

(can be built in parallel using separate threads)

Then you will have to ensure that all your WHERE clauses also contain the index 
condition:

WHERE session_id = 27 AND session_id%4 =27%4

regards,

Marc Mamin


Re: [PERFORM] A very long running query....

2012-07-21 Thread Marc Mamin


Hello,
isn't the first test superfluous here ?

>   where extract('day' from message_copies.msg_date_rec) = 17
>   and date_trunc('day',message_copies.msg_date_rec) = '2012-07-17'


> Here is the index:
> 
> CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
>   ON feed_all_y2012m07.message_copies_wk2
>   USING btree
>   (date_trunc('day'::text, msg_date_rec),
>   src_id,
>   (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text))
> TABLESPACE archive
>   WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text) IS NOT NULL 
>   OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text) = ''::text;


the georef test can be simplified using coalesce:

>  and (message_copies.pos_georef1 || message_copies.pos_georef2 || 
> message_copies.pos_georef3 || message_copies.pos_georef4) <> ''
>  and not (message_copies.pos_georef1 || message_copies.pos_georef2 || 
> message_copies.pos_georef3 || message_copies.pos_georef4) is null
  =>
  and coaesce ( 
(message_copies.pos_georef1 || message_copies.pos_georef2 || 
message_copies.pos_georef3 || message_copies.pos_georef4), 
 '') <> ''
  
In order to avoid this test at query time you might add a boolean column   
message_copies.pos.has_georef,
and keep it up to date  with a before insert or update trigger. This will allow 
to shorten your index definition and simplify the planner task a little bit.
Moreover it will fasten your query in cases when the index don't get used.

As Tom already mentioned it, it may make sense not to concatenate the georef 
within the index, but keep them separated, or even keep them in different 
indexes.
Which is the best depend on the other queries running against this table
  
HTH,

Marc Mamin
  


-Original Message-
From: pgsql-performance-ow...@postgresql.org on behalf of Ioannis 
Anagnostopoulos
Sent: Sat 7/21/2012 1:56 AM
To: Tom Lane
Cc: Claudio Freire; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] A very long running query
 
On 21/07/2012 00:10, Tom Lane wrote:
> Claudio Freire  writes:
>> Looking at this:
>> "->  Index Scan using
>> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
>> message_copies  (cost=0.00..19057.93 rows=52 width=32) (actual
>> time=62.124..5486270.845 rows=387524 loops=1)"
>> "  Index Cond: ((date_trunc('day'::text,
>> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
>> AND (src_id = 1))"
>> "  Filter: ((date_part('day'::text,
>> msg_date_rec) = 17::double precision) AND (NOT (pos_georef1)::text
>> || (pos_georef2)::text) || (pos_georef3)::text) ||
>> (pos_georef4)::text) IS NULL)) AND (pos_georef1)::text ||
>> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
>> <> ''::text))"
> I think the real problem is that the planner has no hope of doing
> anything very accurate with such an unwieldy filter condition.  I'd look
> at ways of making the filter conditions simpler, perhaps by recasting
> the data representation.  In particular, that's a horridly bad way of
> asking whether some columns are empty, which I gather is the intent.
> If you really want to do it just like that, creating an index on the
> concatenation expression would guide ANALYZE to collect some stats about
> it, but it would probably be a lot more efficient to put together an AND
> or OR of tests on the individual columns.
>
>   regards, tom lane
So what you suggest is to forget all together the concatenation of the 
georef1/2/3/4 and instead alter my query with something like:

georef1 is not null and not georeg1 = ''etc for georef2 3 and 4

That would require to alter my index and have the four georef columns 
separately in it and not as a concatenation and so on for the partial 
index part. And a final thing, you seem to imply that the indexes are 
used by the analyser to collect statistics even if they are not used. So 
an index serves not only as a way to speed up targeted queries but also 
to provide better statistics to the analyzer?

Kind Regards
Yiannis

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




Re: [PERFORM] PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the fastest way to clean each non-empty table and reset unique identifier column of empty ones.

2012-07-06 Thread Marc Mamin



Stanislaw Pankevich wrote:
>>>  PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the 
>>> fastest way to clean each
>>> non-empty table and reset unique identifier column of empty ones 

Hello, 

2 'exotic' ideas:

- use dblink_send_query to do the job in multiple threads (I doubt this really 
could be faster)
- have prepared empty tables in a separate schema, and a "garbage schema":

   ALTER TABLE x set schema garbage;
   ALTER TABLE prepared.x set schema "current";

you should be ready for the next test, 

but still have to clean garbage nad moved to prepared for the next but one in 
the background

best regards,

Marc Mamin





>>>
>>> I wonder, what is the fastest way to accomplish this kind of task in 
>>> PostgreSQL. I am interested in
>>> the fastest solutions ever possible.

>>> I need the fastest cleaning strategy for such case working on PostgreSQL 
>>> both 8 and 9.
>>>
>>> I see the following approaches:
>>>
>>> 1) Truncate each table. It is too slow, I think, especially for empty 
>>> tables.

>> Did you actually try it? That's the king's way to performance questions!
>> Truncating a single table is done in a matter of microseconds, particularly
>> if it is not big.
>> Do you have tens of thousands of tables?

> Actually, 10-100 tables.

>> You could of course run a SELECT 1 FROM table LIMIT 1, but again I don't
>> think that this will be considerably faster than just truncating the table.
> 
> Exactly this query is much faster, believe me. You can see my latest
> results on https://github.com/stanislaw/truncate-vs-count.

Ok, I believe you.

My quick tests showed that a sible truncate (including transaction and
client-server roundtrip via UNIX sockets takes some 10 to 30 milliseconds.

Multiply that with 100, and you end up with just a few seconds at most.
Or what did you measure?

I guess you run that deletion very often so that it is painful.

Still I think that the biggest performance gain is to be had by using
PostgreSQL's features (truncate several tables in one statement, ...).

Try to bend your Ruby framework!

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)

2012-06-26 Thread Marc Mamin


> -Original Message-
> From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
> 
> 2012/6/26 Marc Mamin :
> >
> >>> On 22/06/12 09:02, Maxim Boguk wrote:
> >
> >>> May be I completely wrong but I always assumed that the access
> speed to the array element in PostgreSQL should be close to constant
> time.
> >>> But in tests I found that access speed degrade as O(N) of array
> size.
> >
> >>> Is that behaviour is correct?
> 
> yes - access to n position means in postgresql - skip n-1 elements


Hmmm...

how many elements to be skipped here ?

SELECT _array[1] FROM t2;

I wonder if the time rather get spent in first retrieving the array itself 
before accessing its elements.

regards,

Marc Mamin

> 
> Regards
> 
> Pavel
> 
> >
> >
> >> From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper
> Krogh
> >
> >> Default column storage is to "compress it, and store in TOAST" with
> large values.
> >> This it what is causing the shift. Try to change the column storage
> of the column
> >> to EXTERNAL instead and rerun the test.
> >
> >
> > Hello,
> >
> > I've repeated your test in a simplified form:
> > you are right :-(
> >
> > create table t1 ( _array int[]);
> > alter table t1 alter _array set storage external;
> > insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5));
> >
> > create table t2 ( _array int[]);
> > alter table t2 alter _array set storage external;
> > insert into t2 SELECT ARRAY(SELECT * FROM
> generate_series(1,500));
> >
> > explain analyze SELECT _array[1] FROM t1;
> > Total runtime: 0.125 ms
> >
> > explain analyze SELECT _array[1] FROM t2;
> > Total runtime: 8.649 ms
> >
> >
> > best regards,
> >
> > Marc Mamin
> >
> >
> >
> > --
> > Sent via pgsql-performance mailing list (pgsql-
> performa...@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)

2012-06-26 Thread Marc Mamin

>> On 22/06/12 09:02, Maxim Boguk wrote: 

>> May be I completely wrong but I always assumed that the access speed to the 
>> array element in PostgreSQL should be close to constant time.
>> But in tests I found that access speed degrade as O(N) of array size.

>> Is that behaviour is correct?


> From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper Krogh

> Default column storage is to "compress it, and store in TOAST" with large 
> values. 
> This it what is causing the shift. Try to change the column storage of the 
> column
> to EXTERNAL instead and rerun the test. 


Hello,

I've repeated your test in a simplified form:
you are right :-(

create table t1 ( _array int[]);
alter table t1 alter _array set storage external;
insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5));

create table t2 ( _array int[]);
alter table t2 alter _array set storage external;
insert into t2 SELECT ARRAY(SELECT * FROM generate_series(1,500));

explain analyze SELECT _array[1] FROM t1;
Total runtime: 0.125 ms

explain analyze SELECT _array[1] FROM t2;
Total runtime: 8.649 ms


best regards,

Marc Mamin



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] text search: tablescan cost for a tsvector

2012-02-29 Thread Marc Mamin
> Von: Robert Haas [mailto:robertmh...@gmail.com]
> Gesendet: Mi 2/29/2012 7:32

>  
> On Mon, Feb 6, 2012 at 6:05 AM, Marc Mamin  wrote:
> > without analyze: http://explain.depesz.com/s/6At
> > with analyze:http://explain.depesz.com/s/r3B
... 
> The problem seems to be that the cost estimator doesn't know that
> detoasting is expensive.

Hello,

Tom Lane has started a follow up thread in the hacker list.
Detoasting is indeed the main obstacle, but I've repeated my test using plain 
storage
and the planer still choose (systematically?) the slowest query.
It seems that I bumped into 2 different issues at the same time.

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00896.php

Backround: 
Our reporting system offers amongst others time histograms 
combined with a FTS filtering on error occurences (imported from error logs), 
It is hence not unusual that given search terms are found within a majority of 
the documents...

best regards,

Marc Mamin


[PERFORM] text search: tablescan cost for a tsvector

2012-02-06 Thread Marc Mamin
Hello,

I have quite systematically better performance with the text search when
I disable the statistics collection for the tsvector column.
So I wonder if such statistics ever make sense.

Here a testcase:

The table contains 200'000 tsvector, whereas the lexeme 'fooblablabla'
exists in all tsvector:
Without statistics, the planner decide as expected for the gin index.
After analyze, it switch to a table scan which is also expected, but the
query is 3 times slower.

My first thought was that the Bitmap Heap Scan was really fast as the
searched term is always at the first position.
So I repeated the test with an additional search term at the last
position, but without significant change:

(result from the 6. test below)

without analyze: http://explain.depesz.com/s/6At
with analyze:http://explain.depesz.com/s/r3B


best regards,

Marc Mamin




Here all my results, always one of the fastest from a few runs.


CREATE TABLE tsv_test
(
  id bigserial NOT NULL,
  v tsvector
);





The test query:

explain analyze
select id from tsv_test where v @@ 'lexeme3179'::tsquery 
UNION ALL
select id from tsv_test where v @@ 'lexeme5'::tsquery
UNION ALL
select id from tsv_test where v @@ 'fooblablabla'::tsquery

The results

A) on first lexeme

1) without indexes without analyze:
   http://explain.depesz.com/s/bOv

2) alter table tsv_test add constraint tsv_test_pk primary key(id);
   http://explain.depesz.com/s/9QQ (same as previous);

3) create index tsv_gin on tsv_test using gin(v);
   http://explain.depesz.com/s/r4M <= fastest

4) ANALYZE tsv_test (id);
   http://explain.depesz.com/s/MyC (same as previous);

5) ANALYZE tsv_test;
   http://explain.depesz.com/s/qu3S 
   

B) on lastlexeme   

6) create table  tsv_test2 as select id,
   v||'zzthisisalongerlexemethisisalongerlexeme'::tsvector 
   from tsv_test;
   
   explain analyze
   select id from tsv_test2 where v @@
'zzthisisalongerlexemethisisalongerlexeme'::tsquery 
   
   http://explain.depesz.com/s/6At 
   
   ANALYZE tsv_test2;
   
   http://explain.depesz.com/s/r3B 



test data:

insert into tsv_test (v) 
select
cast('fooblablabla' ||
' lexeme'||s%2|| ' lexeme'||s%3|| ' lexeme'||s%4||
' lexeme'||s%4|| ' lexeme'||s%5|| ' lexeme'||s%6||
' lexeme'||s%7|| ' lexeme'||s%8|| ' lexeme'||s%9||
' lexeme'||s%10 || ' lexeme2'||s%11 || ' lexeme3'||s%12 ||
' lexeme'||s%11 || ' lexeme2'||s%12 || ' lexeme3'||s%22 ||
' lexeme'||s%12 || ' lexeme2'||s%13 || ' lexeme3'||s%32 ||
' lexeme'||s%13 || ' lexeme2'||s%14 || ' lexeme3'||s%42 ||
' lexeme'||s%14 || ' lexeme2'||s%15 || ' lexeme3'||s%52 ||
' lexeme'||s%15 || ' lexeme2'||s%16 || ' lexeme3'||s%62 ||
' lexeme'||s%16 || ' lexeme2'||s%17 || ' lexeme3'||s%72 ||
' lexeme'||s%17 || ' lexeme2'||s%18 || ' lexeme3'||s%82 ||
' lexeme'||s%18 || ' lexeme2'||s%19 || ' lexeme3'||s%92 ||
' lexeme'||s%19 || ' lexeme2'||s%10 || ' lexeme3'||s%15 ||
' lexeme'||s%12 || ' lexeme2'||s%71 || ' lexeme3'||s%16 ||
' lexeme'||s%20 || ' lexeme2'||s%81 || ' lexeme3'||s%17 ||
' lexeme'||s%35 || ' lexeme2'||s%91 || ' lexeme3'||s%18 ||
' lexeme'||s%100 || ' lexeme2'||s%110 || ' lexeme3'||s%120 ||
' lexeme'||s%110 || ' lexeme2'||s%120 || ' lexeme3'||s%220 ||
' lexeme'||s%120 || ' lexeme2'||s%130 || ' lexeme3'||s%320 ||
' lexeme'||s%130 || ' lexeme2'||s%140 || ' lexeme3'||s%420 ||
' lexeme'||s%140 || ' lexeme2'||s%150 || ' lexeme3'||s%520 ||
' lexeme'||s%150 || ' lexeme2'||s%160 || ' lexeme3'||s%620 ||
' lexeme'||s%160 || ' lexeme2'||s%170 || ' lexeme3'||s%720 ||
' lexeme'||s%170 || ' lexeme2'||s%180 || ' lexeme3'||s%820 ||
' lexeme'||s%180 || ' lexeme2'||s%190 || ' lexeme3'||s%920 ||
' lexeme'||s%190 || ' lexeme2'||s%100 || ' lexeme3'||s%150 ||
' lexeme'||s%120 || ' lexeme2'||s%710 || ' lexeme3'||s%160 ||
' lexeme'||s%200 || ' lexeme2'||s%810 || ' lexeme3'||s%170 ||
' lexeme'||s%350 || ' lexeme2'||s%910 || ' lexeme3'||s%180 
as tsvector)
FROM generate_series(1,10) s
UNION ALL
select
cast('fooblablabla' ||
' thisisalongerlexemethisisalongerlexeme'||s%2|| '
thisisalongerlexemethisisalongerl

Re: [PERFORM] How to remove a table statistics ?

2012-01-31 Thread Marc Mamin
Hello,
Some more tests have shown that removing the statistics just move the 
performance issue to other places.
The main issue here is a bad design, so I'd better focus on this than losing 
too much time with the current situation.
But this raises an interesting question on how/where does Postgres store 
statistics on functional indexes. 
in pg_statistics there are information on the column content, but I couldn't 
find stats on the function result which is fully computed only during the index 
creation.
I guess that the planner would need to know at least the function cost to 
weight the benefit of such an index. 
In my case I would set the function cost to 200 ...


I have also tried to reduce random_page_cost to "2", and it seems to help in a 
few cases.


(anonymized)

explain analyze
SELECT min(msoffset) as t, coalesce(pipelinecall_id,-2) as pid
 from  aserrorlist_20120125 l
 WHERE 1 = 1
 AND msoffset >= 132750300
AND my_func('foo',20120125,l.id, l.header_9_10_id, l.categories_id, 
l.firstline_id) @@  to_aserrcfg_search_tsq($KUKU$lexeme_1 ! lexeme_2$KUKU$)
 group by 
ridcount,pipelinecall_id,coalesce(toplevelrid,msoffset::varchar);


without stats: http://explain.depesz.com/s/qPg
with stats: http://explain.depesz.com/s/88q

aserr_20120125_tvi: GIN Index on my_func(.,.,.,.,.,.)

best regards,

Marc Mamin

> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of Josh Berkus
> Sent: Dienstag, 31. Januar 2012 19:44
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] How to remove a table statistics ?
> 
> On 1/31/12 3:50 AM, Marc Mamin wrote:
> > Hello,
> >
> > I have a weird table, upon with the queries are much faster when no
> > statics were collected.
> >
> > Is there a way to delete statistics information for a table ?
> > I've tried ALTER.. set STATISTICS 0 and then run ANALYZE, but it
> seems
> > that old statistics are kept this way.
> > Can I delete entries directly in pg_statistic ?
> > (Postgresql 9.1)
> 
> You can, but it won't do any good; autovaccum will replace them.
> 
> It would be better to fix the actual query plan issue.  If you can,
> post
> the query plans with and without statistics (EXPLAIN ANALYZE, please)
> here.
> 
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
> 
> --
> Sent via pgsql-performance mailing list (pgsql-
> performa...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] How to remove a table statistics ?

2012-01-31 Thread Marc Mamin
Hello, 

I have a weird table, upon with the queries are much faster when no
statics were collected. 

Is there a way to delete statistics information for a table ?
I've tried ALTER.. set STATISTICS 0 and then run ANALYZE, but it seems
that old statistics are kept this way.
Can I delete entries directly in pg_statistic ?
(Postgresql 9.1)


 short backgroud Info:
 
 One of the table index is a GIN on a tsvector returning function, which
is very costy.
 once analyzed, the query planner often ignore this index in favour of
other one, hence triggering this function too often.
 
 I'll fix that model, but am first looking for a quick way to restore
performance on our production servers.
 
 
 best regards,
 
 Marc Mamin


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Mamin
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
  =>
  where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

 (after the first updat)
 create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is 
not null.

regards,
Marc Mamin

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Duplicate deletion optimizations
 
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, 
rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id 
integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



Re: [PERFORM] Postgres array parser

2011-12-14 Thread Marc Mamin
> Yes, it would be great, but I haven't found such a function, which
> splits 2 dimensional array into rows =) Maybe we'll modify existing
> function, but unfortunately we have tried hstore type and function in
> postgres and we see a significant performance improvements. So we only
> need to convert existing data into hstore and I think this is a good
> solution.



I haven't tested hstore yet, but I would be interested to find out if it still 
better perform with custom "numeric" aggregates on the hstore values.

I've made a short "proof of concept"  test with a custom key/value type to 
achieve such an aggregation.
Something like:


   SELECT x, distinct_sum( (currency,amount)::keyval ) overview  FROM ... GROUP 
BY x

   x currency amount
   a  EUR   15.0
   a  EUR5.0
   a  CHF7.5
   b  USD   12.0
   =>

   x  overview
   -  ----
   a {(EUR,20.0), (CHF,7.5)}
   b {(USD,10.0)}


regards,

Marc Mamin

 
> On 12/14/2011 11:21 AM, Marc Mamin wrote:
> > Hello,
> >
> > For such cases (see below), it would be nice to have an unnest
> function that only affect the first array dimension.
> >
> > Something like
> >
> > unnest(ARRAY[[1,2],[2,3]], SLICE=1)
> > =>
> > unnest
> > --
> > [1,2]
> > [2,3]
> >
> >
> > With this function, I imagine that following sql function
> > might beat the plpgsql FOREACH version.
> >
> >
> > CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[],
> in_input_nr numeric)
> > RETURNS numeric AS
> > $BODY$
> >
> >SELECT u[1][2]
> >FROM unnest($1, SLICE =1) u
> >WHERE u[1][1]=in_input_nr
> >LIMIT 1;
> >
> > $BODY$
> > LANGUAGE sql IMMUTABLE;
> >
> >
> >
> > best regards,
> >
> > Marc Mamin
> >
> >
> >> -Original Message-
> >> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-
> performance-
> >> ow...@postgresql.org] On Behalf Of Pavel Stehule
> >> Sent: Dienstag, 13. Dezember 2011 15:43
> >> To: Aleksej Trofimov
> >> Cc: pgsql-performance@postgresql.org
> >> Subject: Re: [PERFORM] Postgres array parser
> >>
> >> Hello
> >>
> >> 2011/12/13 Aleksej Trofimov:
> >>> We have tried foreach syntax, but we have noticed performance
> >> degradation:
> >>> Function with for: 203ms
> >>> Function with foreach: ~250ms:
> >>>
> >>> there is functions code:
> >>> CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[],
> >> in_input_nr
> >>> numeric)
> >>>   RETURNS numeric AS
> >>> $BODY$
> >>> declare i numeric[];
> >>> BEGIN
> >>> FOREACH i SLICE 1 IN ARRAY in_inputs
> >>> LOOP
> >>>  if i[1] = in_input_nr then
> >>> return i[2];
> >>>  end if;
> >>> END LOOP;
> >>>
> >>> return null;
> >>> END;
> >>> $BODY$
> >>>   LANGUAGE plpgsql VOLATILE
> >>>   COST 100;
> >>>
> >>> CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[],
> >> in_input_nr
> >>> numeric)
> >>>   RETURNS numeric AS
> >>> $BODY$
> >>> declare
> >>>   size int;
> >>> BEGIN
> >>>   size = array_upper(in_inputs, 1);
> >>> IF size IS NOT NULL THEN
> >>>
> >>> FOR i IN 1 .. size LOOP
> >>> if in_inputs[i][1] = in_input_nr then
> >>> return in_inputs[i][2];
> >>> end if;
> >>> END LOOP;
> >>> END IF;
> >>>
> >>> return null;
> >>> END;
> >>> $BODY$
> >>>   LANGUAGE plpgsql VOLATILE
> >>>   COST 100;
> >>>
> >>>
> >>> On 12/13/2011 04:02 PM, Pavel Stehule wrote:
> >>>> Hello
> >>>>
> >>>> do you know FOREACH IN ARRAY statement in 9.1
> >>>>
> >>>> this significantly accelerate iteration over array
> >>>>
> >>>>
> >>>> http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-
> foreach-
> >> in-array/
> >>>>
> >>>>
> >>>> 2011/12/13 Aleksej Trofimov:
> >>>>> Hello, I wanted to 

Re: [PERFORM] Postgres array parser

2011-12-14 Thread Marc Mamin
Hello,

For such cases (see below), it would be nice to have an unnest function that 
only affect the first array dimension.

Something like 

unnest(ARRAY[[1,2],[2,3]], SLICE=1)
=>
unnest
--
[1,2]
[2,3]


With this function, I imagine that following sql function
might beat the plpgsql FOREACH version. 


CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[], in_input_nr 
numeric)
   RETURNS numeric AS
$BODY$
   
  SELECT u[1][2]
  FROM unnest($1, SLICE =1) u
  WHERE u[1][1]=in_input_nr
  LIMIT 1;

$BODY$
   LANGUAGE sql IMMUTABLE;

   
   
best regards,

Marc Mamin
   

> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of Pavel Stehule
> Sent: Dienstag, 13. Dezember 2011 15:43
> To: Aleksej Trofimov
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Postgres array parser
> 
> Hello
> 
> 2011/12/13 Aleksej Trofimov :
> > We have tried foreach syntax, but we have noticed performance
> degradation:
> > Function with for: 203ms
> > Function with foreach: ~250ms:
> >
> > there is functions code:
> > CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[],
> in_input_nr
> > numeric)
> >  RETURNS numeric AS
> > $BODY$
> > declare i numeric[];
> > BEGIN
> >        FOREACH i SLICE 1 IN ARRAY in_inputs
> >            LOOP
> >                 if i[1] = in_input_nr then
> >                    return i[2];
> >                 end if;
> >            END LOOP;
> >
> >    return null;
> > END;
> > $BODY$
> >  LANGUAGE plpgsql VOLATILE
> >  COST 100;
> >
> > CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[],
> in_input_nr
> > numeric)
> >  RETURNS numeric AS
> > $BODY$
> > declare
> >  size int;
> > BEGIN
> >  size = array_upper(in_inputs, 1);
> >    IF size IS NOT NULL THEN
> >
> >        FOR i IN 1 .. size LOOP
> >            if in_inputs[i][1] = in_input_nr then
> >                return in_inputs[i][2];
> >            end if;
> >        END LOOP;
> >    END IF;
> >
> >    return null;
> > END;
> > $BODY$
> >  LANGUAGE plpgsql VOLATILE
> >  COST 100;
> >
> >
> > On 12/13/2011 04:02 PM, Pavel Stehule wrote:
> >>
> >> Hello
> >>
> >> do you know FOREACH IN ARRAY statement in 9.1
> >>
> >> this significantly accelerate iteration over array
> >>
> >>
> >> http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-foreach-
> in-array/
> >>
> >>
> >>
> >> 2011/12/13 Aleksej Trofimov:
> >>>
> >>> Hello, I wanted to ask according such a problem which we had faced
> with.
> >>> We are widely using postgres arrays like key->value array by doing
> like
> >>> this:
> >>>
> >>> {{1,5},{2,6},{3,7}}
> >>>
> >>> where 1,2,3 are keys, and 5,6,7 are values. In our pgSql functions
> we are
> >>> using self written array_input(array::numeric[], key::numeric)
> function
> >>> which makes a loop on whole array and searches for key like
> >>> FOR i IN 1 .. size LOOP
> >>>            if array[i][1] = key then
> >>>                return array[i][2];
> >>>            end if;
> >>> END LOOP;
> >>>
> >>> But this was a good solution until our arrays and database had
> grown. So
> >>> now
> >>> FOR loop takes a lot of time to find value of an array.
> >>>
> >>> And my question is, how this problem of performance could be
> solved? We
> >>> had
> >>> tried pgperl for string parsing, but it takes much more time than
> our
> >>> current solution. Also we are thinking about self-written C++
> function,
> >>> may
> >>> be someone had implemented this algorithm before?
> >>>
> >> you can use indexes or you can use hstore
> >>
> >> Regards
> >>
> >> Pavel Stehule
> >>
> >>> --
> >>> Best regards
> >>>
> >>> Aleksej Trofimov
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-performance mailing list
> >>> (pgsql-performance@postgresql.org)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-performance
> >
> >
> 
> It is strange - on my comp FOREACH is about 2x faster
> 
> postgres=# select inp

Re: [PERFORM] Postgres for a "data warehouse", 5-10 TB

2011-09-13 Thread Marc Mamin
Hi,

> (see attachment)

under high concurency you may expect that your data is already in.
In such a case you better do nothing at all:

begin
  
  select dat=a_dat from t where id=a_id into test:
  
  if test is null then
  
   begin
  
insert into t (id, dat) values (a_id, a_dat);
exception
when unique_violation then
  update t set dat = a_dat where id = a_id and dat <> a_dat;
  return 0;

   end;
  
  elsif not test then
  
update t set dat = a_dat where id = a_id;
  return 0;
  
  end if;

  return 1;


best regards,

Marc Mamin

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Robert Klemme
Gesendet: Di 9/13/2011 6:34
An: Marti Raudsepp
Cc: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Postgres for a "data warehouse", 5-10 TB
 
On Tue, Sep 13, 2011 at 5:13 PM, Marti Raudsepp  wrote:
> On Tue, Sep 13, 2011 at 00:26, Robert Klemme  
> wrote:
>> In the case of PG this particular example will work:
>> 1. TX inserts new PK row
>> 2. TX tries to insert same PK row => blocks
>> 1. TX commits
>> 2. TX fails with PK violation
>> 2. TX does the update (if the error is caught)
>
> That goes against the point I was making in my earlier comment. In
> order to implement this error-catching logic, you'll have to allocate
> a new subtransaction (transaction ID) for EVERY ROW you insert.

I don't think so.  You only need to catch the error (see attachment).
Or does this create a sub transaction?

> If
> you're going to be loading billions of rows this way, you will invoke
> the wrath of the "vacuum freeze" process, which will seq-scan all
> older tables and re-write every row that it hasn't touched yet. You'll
> survive it if your database is a few GB in size, but in the terabyte
> land that's unacceptable. Transaction IDs are a scarce resource there.

Certainly.  But it's not needed as far as I can see.

> In addition, such blocking will limit the parallelism you will get
> from multiple inserters.

Yes, I mentioned the speed issue.  But regardless of the solution for
MySQL's "INSERT..ON DUPLICATE KEY UPDATE" which Igor mentioned you
will have the locking problem anyhow if you plan to insert
concurrently into the same table and be robust.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Query improvement

2011-05-03 Thread Marc Mamin

> On Mon, May 2, 2011 at 10:54 PM, Mark  wrote:
> > but the result have been worst than before. By the way is there a
posibility
> > to create beeter query with same effect?
> > I have tried more queries, but this has got best performance yet.
> 
> Well, this seems to be the worst part:
> 
>(SELECT page_id FROM mediawiki.page WHERE page_id IN
>(SELECT page_id FROM mediawiki.page
> WHERE (titlevector @@ (to_tsquery('fotbal'
>OR page_id IN
>(SELECT p.page_id from mediawiki.page
p,mediawiki.revision r,
>(SELECT old_id FROM mediawiki.pagecontent
>WHERE (textvector @@ (to_tsquery('fotbal' ss
>WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))
> 


'OR' statements often generate complicated plans. You should try to
rewrite your Query with a n UNION clause.
Using explicit joins may also help the planner:

SELECT page_id 
FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal')))

UNION 

SELECT p.page_id 
FROM mediawiki.page p 
  JOIN mediawiki.revision r on (p.page_id=r.rev_page)
  JOIN mediawiki.pagecontent ss on (r.rev_id=ss.old_id)
WHERE (ss.textvector @@ (to_tsquery('fotbal')))

HTH,

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] UNION and bad performance

2010-12-11 Thread Marc Mamin
Hello,

 

UNION will remove all duplicates, so that the result additionally
requires to be sorted.

Anyway,  for performance issues, you should always start investigation
with explain analyze .

regards,

 

Marc Mamin

 

From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Olivier
Pala
Sent: Donnerstag, 9. Dezember 2010 11:52
To: pgsql-performance@postgresql.org
Cc: Olivier Pala
Subject: [PERFORM] UNION and bad performance

 

Hi, 

I have a performance trouble with UNION query


First I have this view :

SELECT a,b,c FROM table1, table2 WHERE jointure AND condition

Result : 150 000 records in 1~2 s



Then, I adding an UNION into the same view :

SELECT a,b,c FROM table1, table2 WHERE jointure AND condition
UNION
SELECT a,b,c FROM table3

Result : 150 200 records in 6~7 s


Why, do I have bad performance only for 200 adding records ?

Thanks

SGBD : Postgres 8.3 et 8.4 



Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

2010-12-08 Thread Marc Mamin

Another point: would a conditionl index help ?

on articles (context_key) where indexed

regards,

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Marc Mamin
Gesendet: Mi 12/8/2010 9:06
An: Shrirang Chitnis; Bryce Nesbitt; pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] hashed subplan 5000x slower than two sequential 
operations
 


Hello,

are the table freshly analyzed, with a sufficient default_statistics_target ?

You may try to get a better plan while rewriting the query as an UNION to get 
rid of the OR clause.
Something like (not tested):

SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE contexts.parent_key = 392210
AND articles.indexed


UNION
SELECT context_key
FROM
(
  SELECT contexts.context_key
  FROM contexts JOIN collection_data ON ( contexts.context_key = 
collection_data .context_key)
  WHERE collection_data.collection_context_key = 392210)
) foo JOIN articles ON (foo.context_key=contexts.context_key)
WHERE articles.indexed
;


I've had one similar problem where there was no way for the planner to notice 
that the query would systematically return very few rows. Here, my last resort 
was to disable some planner methods within the given transaction.

regards,

Marc Mamin

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Shrirang Chitnis
Gesendet: Mi 12/8/2010 8:05
An: Bryce Nesbitt; pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] hashed subplan 5000x slower than two sequential 
operations
 
Bryce,

The two queries are different:

You are looking for contexts.context_key in first query

WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)


but second query has context.parent_key

WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN
(392210,392210,395073,1304250))

Is the contexts.context_key an indexed field? contexts.parent_key certainly 
seems to be.


HTH,


Shrirang Chitnis
Sr. Manager, Applications Development
HOV Services


Office: (866) 808-0935 Ext: 39210
shrirang.chit...@hovservices.com
www.hovservices.com

The information contained in this message, including any attachments, is 
attorney privileged and/or confidential information intended only for the use 
of the individual or entity named as addressee.  The review, dissemination, 
distribution or copying of this communication by or to anyone other than the 
intended addressee is strictly prohibited.  If you have received this 
communication in error, please immediately notify the sender by replying to the 
message and destroy all copies of the original message.


-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Bryce Nesbitt
Sent: Thursday, December 09, 2010 12:24 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] hashed subplan 5000x slower than two sequential operations

Can you help me understand how to optimize the following.  There's a
subplan which in this case returns 3 rows,
but it is really expensive:



=
explain analyze SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)
)
AND articles.indexed
;

QUERY PLAN

  Hash Join  (cost=83054.41..443755.45 rows=261077 width=4) (actual
time=4362.143..6002.808 rows=28 loops=1)
Hash Cond: (articles.context_key = contexts.context_key)
->  Seq Scan on articles  (cost=0.00..345661.91 rows=522136 width=4)
(actual time=0.558..3953.002 rows=517356 loops=1)
  Filter: indexed
->  Hash  (cost=69921.25..69921.25 rows=800493 width=4) (actual
time=829.501..829.501 rows=31 loops=1)
  ->  Seq Scan on contexts  (cost=14.31..69921.25 rows=800493
width=4) (actual time=1.641..829.339 rows=31 loops=1)
Filter: ((parent_key = 392210) OR (hashed subplan))
SubPlan
  ->  Index Scan using collection_data_context_key_index
on collection_data  (cost=0.00..14.30 rows=6 width=4) (actual
time=0.018..0.023 rows=3 loops=1)
Index Cond: (collection_context_key = 392210)
  Total runtime: 6002.976 ms
(11 rows)


=

Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

2010-12-08 Thread Marc Mamin


Hello,

are the table freshly analyzed, with a sufficient default_statistics_target ?

You may try to get a better plan while rewriting the query as an UNION to get 
rid of the OR clause.
Something like (not tested):

SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE contexts.parent_key = 392210
AND articles.indexed


UNION
SELECT context_key
FROM
(
  SELECT contexts.context_key
  FROM contexts JOIN collection_data ON ( contexts.context_key = 
collection_data .context_key)
  WHERE collection_data.collection_context_key = 392210)
) foo JOIN articles ON (foo.context_key=contexts.context_key)
WHERE articles.indexed
;


I've had one similar problem where there was no way for the planner to notice 
that the query would systematically return very few rows. Here, my last resort 
was to disable some planner methods within the given transaction.

regards,

Marc Mamin

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Shrirang Chitnis
Gesendet: Mi 12/8/2010 8:05
An: Bryce Nesbitt; pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] hashed subplan 5000x slower than two sequential 
operations
 
Bryce,

The two queries are different:

You are looking for contexts.context_key in first query

WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)


but second query has context.parent_key

WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN
(392210,392210,395073,1304250))

Is the contexts.context_key an indexed field? contexts.parent_key certainly 
seems to be.


HTH,


Shrirang Chitnis
Sr. Manager, Applications Development
HOV Services


Office: (866) 808-0935 Ext: 39210
shrirang.chit...@hovservices.com
www.hovservices.com

The information contained in this message, including any attachments, is 
attorney privileged and/or confidential information intended only for the use 
of the individual or entity named as addressee.  The review, dissemination, 
distribution or copying of this communication by or to anyone other than the 
intended addressee is strictly prohibited.  If you have received this 
communication in error, please immediately notify the sender by replying to the 
message and destroy all copies of the original message.


-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Bryce Nesbitt
Sent: Thursday, December 09, 2010 12:24 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] hashed subplan 5000x slower than two sequential operations

Can you help me understand how to optimize the following.  There's a
subplan which in this case returns 3 rows,
but it is really expensive:



=
explain analyze SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)
)
AND articles.indexed
;

QUERY PLAN

  Hash Join  (cost=83054.41..443755.45 rows=261077 width=4) (actual
time=4362.143..6002.808 rows=28 loops=1)
Hash Cond: (articles.context_key = contexts.context_key)
->  Seq Scan on articles  (cost=0.00..345661.91 rows=522136 width=4)
(actual time=0.558..3953.002 rows=517356 loops=1)
  Filter: indexed
->  Hash  (cost=69921.25..69921.25 rows=800493 width=4) (actual
time=829.501..829.501 rows=31 loops=1)
  ->  Seq Scan on contexts  (cost=14.31..69921.25 rows=800493
width=4) (actual time=1.641..829.339 rows=31 loops=1)
Filter: ((parent_key = 392210) OR (hashed subplan))
SubPlan
  ->  Index Scan using collection_data_context_key_index
on collection_data  (cost=0.00..14.30 rows=6 width=4) (actual
time=0.018..0.023 rows=3 loops=1)
Index Cond: (collection_context_key = 392210)
  Total runtime: 6002.976 ms
(11 rows)


=
explain analyze SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN
(392210,392210,395073,1304250))
AND articles.indexed
;
   

Re: [PERFORM] anti-join chosen even when slower than old plan

2010-11-13 Thread Marc Mamin
Hello,

Just a short though:

Is it imaginable to compare the prognoses of the plans with the actual
results 
and somehow log the worst cases ? 

a) to help the DBA locate bad statistics and queries
b) as additional information source for the planner

This could possibly affect parameters of your formula on the fly.

best regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] CREATE INDEX as bottleneck

2010-11-11 Thread Marc Mamin
No, CONCURRENTLY is to improve table availability during index creation, but it 
degrades the performances.

best regards,

Marc Mamin


-Original Message-
From: Alex Hunsaker [mailto:bada...@gmail.com] 
Sent: Donnerstag, 11. November 2010 19:55
To: Marc Mamin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] CREATE INDEX as bottleneck

On Thu, Nov 11, 2010 at 06:41, Marc Mamin  wrote:
> There are a few places in our data flow where we have to wait for index
> creation before being able to distribute the process on multiple threads
> again.

Would CREATE INDEX CONCURRENTLY help here?

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] CREATE INDEX as bottleneck

2010-11-11 Thread Marc Mamin
Hello,

in the last years, we have successfully manage to cope with our data
growth 
using partitioning and splitting large aggregation tasks on multiple
threads.
The partitioning is done logically by our applicationn server, thus
avoiding trigger overhead.

There are a few places in our data flow where we have to wait for index
creation before being able to distribute the process on multiple threads
again.

With the expected growth, create index will probably become a severe
bottleneck for us.

Is there any chance to see major improvement on it in a middle future ?
I guess the question is naive, but why can't posgres use multiple
threads for large sort operation ?


best regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-26 Thread Marc Mamin
Hello,
 
I didn't try it, but following should be slightly faster:
 
COUNT( CASE WHEN field >= x AND field < y THEN true END)
intead of 
SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
 
HTH,
 
Marc Mamin




From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Nikolas
Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug Cole
Cc: pgsql-performance
Subject: Re: [PERFORM] optimizing query with multiple aggregates


So you've got a query like: 

SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as
zeroToTen,
  SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0
END) as tenToTwenty,
  SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0
END) as tenToTwenty,
...
FROM  bigtable




My guess is this forcing a whole bunch of if checks and your getting cpu
bound.  Could you try something like:


SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as
zeroToTen,
  SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0
END) as tenToTwenty,
  SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0
END) as tenToTwenty,
...
FROM  (SELECT field, count(*) FROM bigtable GROUP BY field)


which will allow a hash aggregate?  You'd do a hash aggregate on the
whole table which should be quick and then you'd summarize your bins.


This all supposes that you don't want to just query postgres's column
statistics.


On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole  wrote:


On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure
 wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole
 wrote:
> > I have a reporting query that is taking nearly all of it's
time in aggregate
> > functions and I'm trying to figure out how to optimize it.
The query takes
> > approximately 170ms when run with "select *", but when run
with all the
> > aggregate functions the query takes 18 seconds.  The
slowness comes from our
> > attempt to find distribution data using selects of the form:
> >
> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> >
> > repeated across many different x,y values and fields to
build out several
> > histograms of the data.  The main culprit appears to be the
CASE statement,
> > but I'm not sure what to use instead.  I'm sure other people
have had
> > similar queries and I was wondering what methods they used
to build out data
> > like this?
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin


Unless I'm misunderstanding you, that would require breaking
each bin
into a separate sql statement and since I'm trying to calculate
more
than 100 bins between the different fields any improvement in
the
aggregate functions would be overwhelmed by the cost of the
actual
query, which is about 170ms.
Thanks,
Doug


--
Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance





Re: [PERFORM] Very specialised query

2009-03-30 Thread Marc Mamin
Hello Matthew,

Another idea:

Are your objects limited to some smaller ranges of your whole interval ?
If yes you may possibly reduce the ranges to search for while using an 
additional table with the min(start) max(end) of each object...

Marc Mamin


Re: [PERFORM] Very specialised query

2009-03-30 Thread Marc Mamin

>> WHERE (l2.start BETWEEN  l1.start AND l1.end
>>  OR
>>  l1.start BETWEEN  l2.start AND l2.end
>>  )

>Yes, that's another way to calculate an overlap. However, it turns out to not 
>be that fast. 
>The problem is that OR there, which causes a bitmap index scan, as the leaf of 
>a nested loop join, 
>which can be rather slow.


Ok , than splitting these checks in 2 Queries with UNION  is better.   
But I often read that BETWEEN is faster than using 2 comparison operators.
Here I guess that a combined index on (start,end) makes sense:

..
WHERE l2.start BETWEEN  l1.start AND l1.end
..
UNION
..
WHERE l1.start BETWEEN  l2.start AND l2.end
..


The first clause being equivalent to

AND l1.start <= l2.end
AND l1.end   >= l2.start
AND l1.start <= l2.start

I don't know how you have to deal the limit conditions...


Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Very specialised query

2009-03-27 Thread Marc Mamin
Hello,

if your data are mostly static and you have a few mains objects,
maybe you can have some gain while defining conditional indexes for those plus 
one for the rest 
and then slicing the query:


create index o_1x on X (start,end,id) where object_id = 1
create index o_2x on X (start,end,id) where object_id = 2
create index o_3x on X (start,end,id) where object_id = 3
create index o_4x on X (start,end,id) where object_id = 4
...
create index o_4x on X (start,end,id) where object_id not in (1,2,3,4..)


I'm not sure that putting all in one index and using the BETWEEN clause 
as in my example is the best method though.

Marc Mamin


SELECT 
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE l1.objectid = 1
AND (l2.start BETWEEN  l1.start AND l1.end
 OR 
 l1.start BETWEEN  l2.start AND l2.end
 )
 l1.start
AND l2.start <> l2.start -- if required
AND l2.start <> l2.end   -- if required
AND l1.id <> l2.id


UNION ALL

...
WHERE l1.objectid = 2
... 

UNION ALL

...
WHERE l1.objectid not in (1,2,3,4..)


[PERFORM] temp_tablespaces and RAID

2008-12-22 Thread Marc Mamin

Hello,

To improve performances, I would like to try moving the temp_tablespaces
locations outside of our RAID system.
Is it a good practice ?


Thanks,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Delete performance again

2008-10-03 Thread Marc Mamin
Hi,
 
Maybe you can try this syntax. I'm not sure, but it eventually perform better:
 
 
delete from company_alias USING comprm
where company_alias.company_id =comprm.id


Cheers,

Marc

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] join tables vs. denormalization by trigger

2007-09-04 Thread Marc Mamin
Hello,

I had a similar issue and -atfer testing - decided to merge the tables
B and C into a single table.
In my case the resulting table contains a large proportion of nulls
which limits the size increase...
You'll have to do some testing with your data to evaluate the
performance gain.

Hope to help,

Marc  

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Walter
Mauritz
Sent: Tuesday, September 04, 2007 8:53 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] join tables vs. denormalization by trigger

Hi,

I wonder about differences in performance between two scenarios:

Background:
Table A, ~50,000 records
Table B, ~3,000,000 records (~20 cols)
Table C, ~30,000,000 records (~10 cols)

a query every 3sec. with limit 10

Table C depends on Table B wich depends on Table A, int8 foreign key,
btree index

* consider it a read only scenario (load data only in night, with time
for vacuum analyze daily)
* im required to show records from Table C, but also with some (~5cols)
info from Table B
* where clause always contains the foreign key to Table A
* where clause may contain further 1-10 search parameter


Scenario A)
simply inner join Table B + C

Scenario B)
with use of trigger on insert/update I could push the required
information from table B down to table C.
-> so i would only require to select from table C.


My question:
1) From your experience ... how much faster (approximately) in percent
do you regard Scenario B faster than A ?

2) any other tips for such a read only scenario

Thx for any attention :-)
Walter
-- 
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...

2007-07-24 Thread Marc Mamin
 
Hello,

thank you for all your comments and recommendations.

I'm aware that the conditions for this benchmark are not ideal, mostly
due to the lack of time to prepare it. We will also need an additional
benchmark on a less powerful - more realistic - server to better
understand the scability of our application.


Our application is based on java and is generating dynamic reports from
log files content. Dynamic means here that a repor will be calculated
from the postgres data the first time it is requested (it will  then be
cached). Java is used to drive the data preparation and to
handle/generate the reports requests.

This is much more an OLAP system then an OLTP, at least for our
performance concern.




Data preparation:

1) parsing the log files with a heavy use of perl (regular expressions)
to generate csv files. Prepared statements also maintain reference
tables in the DB. Postgres performance is not an issue for this first
step.

2) loading the csv files with COPY. As around 70% of the data to load
come in a single daily table, we don't allow concurrent jobs for this
step. We have between a few and a few hundreds files to load into a
single table; they are processed one after the other. A primary key is
always defined; for the case when the required indexes are alreay built
and when the new data are above a given size, we are using a "shadow" 
table  instead (without the indexes) , build the index after the import
and then replace the live table with the shadow one. 
For example, we a have a table of 13 GB + 11 GB indexes (5 pieces).

Performances :

a) is there an "ideal" size to consider for our csv files (100 x 10
MB or better 1 x 1GB ?)
b) maintenance_work_mem: I'll use around 1 GB as recommended by
Stefan

3) Data agggregation. This is the heaviest part for Postgres. On our
current system some queries need above one hour, with phases of around
100% cpu use, alterning with times of heavy i/o load when temporary
results are written/read to the plate (pgsql_tmp). During the
aggregation, other postgres activities are low (at least should be) as
this should take place at night. Currently we have a locking mechanism
to avoid having more than one of such queries running concurently. This
may be to strict for the benchmark server but better reflect our current
hardware capabilities.

Performances : Here we should favorise a single huge transaction and
consider a low probability to have another transaction requiring large
sort space. Considering this, is it reasonable to define work_mem being
3GB (I guess I should raise this parameter dynamically before running
the aggregation queries)

4) Queries (report generation)

We have only few requests which are not satisfying while requiring large
sort operations. The data are structured in different aggregation levels
(minutes, hours, days) with logical time based partitions in oder to
limit the data size to compute for a given report. Moreover we can scale
our infrastrucure while using different or dedicated Postgres servers
for different customers. Smaller customers may share a same instance,
each of them having its own schema (The lock mechanism for large
aggregations apply to a whole Postgres instance, not to a single
customer) . The benchmark will help us to plan such distribution.

During the benchmark, we will probably not have more than 50 not idle
connections simultaneously. It is a bit too early for us to fine tune
this part. The benchmark will mainly focus on the steps 1 to 3

During the benchmark, the Db will reach a size of about 400 GB,
simulating 3 different customers, also with data quite equally splitted
in 3 scheemas.



I will post our configuration(s) later on.



Thanks again for all your valuable input.

Marc Mamin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...

2007-07-17 Thread Marc Mamin

Postgres configuration for 64 CPUs, 128 GB RAM...

Hello,

We have the oppotunity to benchmark our application on a large server. I
have to prepare the Postgres configuration and I'd appreciate some
comments on it as I am not experienced with servers of such a scale.
Moreover the configuration should be fail-proof as I won't be able to
attend the tests. 

Our application (java + perl) and Postgres will run on the same server,
whereas the application activity is low when Postgres has large
transactions to process.

There is a large gap between our current produtcion server (Linux, 4GB
RAM, 4 cpus) and the benchmark server; one of the target of this
benchmark is to verify the scalability of our application. 


And you have no reason to be envious as the server doesn't belong us :-)


Thanks for your comments,

Marc Mamin





Posgres version: 8.2.1



Server Specifications:
--

Sun SPARC Enterprise M8000 Server:

http://www.sun.com/servers/highend/m8000/specs.xml

File system:

http://en.wikipedia.org/wiki/ZFS



Planned configuration:


# we don't expect more than 150 parallel connections, 
# but I suspect a leak in our application that let some idle connections
open

max_connections=2000

ssl = off 

#maximum allowed
shared_buffers= 262143

# on our current best production server with 4GB RAM (not dedicated to
Postgres), work_mem is set to 600 MB
# this limitation is probably the bottleneck for our application as the
files in pgsql_tmp grows up to 15 GB 
# during large aggregations (we have a locking mechanismus to avoid
parallel processing of such transactions)
work_mem = 31457280  # (30 GB)

# index creation time is also an issue for us; the process is locking
other large processes too.
# our largest table so far is 13 GB + 11 GB indexes
maintenance_work_mem = 31457280  # (30 GB)

# more than the max number of tables +indexes expected during the
benchmark
max_fsm_relations = 10

max_fsm_pages = 180

# don't know if I schoud modify this.
# seems to be sufficient on our production servers
max_stack_depth = 2MB

# vacuum will be done per hand between each test session
autovacuum = off 



# required to analyse the benchmark
log_min_duration_statement = 1000


max_prepared_transaction = 100


# seems to be required to drop schema/roles containing large number of
objects
max_locks_per_transaction = 128 




# I use the default for the bgwriter as I couldnt find recommendation on
those

#bgwriter_delay = 200ms # 10-1ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers
scanned/round
#bgwriter_lru_maxpages = 5  # 0-1000 buffers max
written/round
#bgwriter_all_percent = 0.333   # 0-100% of all buffers
scanned/round
#bgwriter_all_maxpages = 5  # 0-1000 buffers max
written/round


#WAL

fsync = on

#use default
#wal_sync_method

# we are using 32 on our production system
wal_buffers=64


# we didn't make any testing with this parameter until now, but this
should'nt be a relevant
# point as our performance focus is on large transactions
commit_delay = 0 

#CHECKPOINT

# xlog will be  on a separate disk
checkpoint_segments=256

checkpoint_timeout = 5min


[PERFORM] tuning a function to insert/retrieve values from a reference table

2007-07-10 Thread Marc Mamin

Hello,

I have a simple table id/value, and a function that returns the id of a
given value, inserting the later if not yet present. The probability
that a value already exist within the reference table is very high.

Different db users may have their own reference table with different
content, but as the table definition is identical, I've defined a public
function to maintain these tables. 

Can I optimize this function with:

a) remove the EXCEPTION clause (Is there an underlying lock that prevent
concurrent inserts ?)

b) declare the function being IMMUTABLE ?
   
   - although it may insert a new raw, the returned id is invariant for
a given user
 (I don't really understand the holdability ov immutable functions;
are the results cached only for the livetime of a prepared statement ?,
or can they be shared by different sessions ?)


Thanks,

Marc




--Table definition:

create table ref_table (
  id serial NOT NULL, 
  v varchar NOT NULL, 
  constraint ref_table_pk primary key  (id)
) without oids;

create unique index ref_table_uk on ref_table(v);


-- Function:

CREATE OR REPLACE FUNCTION public.get_or_insert_value("varchar") RETURNS
INT AS 
$BODY$

DECLARE
  id_value INT;

BEGIN

  SELECT INTO id_value id FROM ref_table WHERE v =  $1;

  IF FOUND THEN

RETURN id_value;

  ELSE  --new value to be inserted

DECLARE
  rec record;

BEGIN

 FOR rec in INSERT INTO ref_table (v) VALUES ($1) RETURNING id
 LOOP
  return rec.id;  
 END LOOP;

 EXCEPTION --concurrent access ?
   WHEN unique_violation THEN
 RETURN(SELECT id FROM ref_table WHERE v =  $1);

END;

  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;



[PERFORM] copy from performance on large tables with indexes

2007-06-07 Thread Marc Mamin

Hello,


Postgres: 8.2
os: Linux 4CPU, 4 GB RAM, Raid 1, 32 bit system
work_mem: 600 Mb


I have some tables which may become quite large (currently up to 6 Gb) .
I initially fill them using copy from (files) .

The import is fast enough as I only have a primary key on the table:
about 18 minutes
(over 300 Mb/minute)

Then I need 5 additional indexes on it. Creation time: 30 minutes


subsequently I compute some aggregations which need 4 hours and 30
minutes additional time


And now the problem:

If I get additional data for the table, the import become much more
slower due to the indexes (about 30 times slower !):

The performance degradation  is probably  due to the fact that all
indexs are too large to be kept in memory. 
Moreover I guess that the indexes fill factors are too high (90%)

During this second import, I have about 20% iowait time.



The usual solution is to drop the indexes before the second import and
rebuild them afterwards, but I feel unconfident doing this as I don't
know how the system will react if some SELECT statements occures when
the index are missing. I can hardly avoid this.


So my idea for the second import process:


1) make a copy of the table:

   create table B as select * from table A;
   alter table B add constraint B_pk primary key (id);


2) import the new data in table B

   copy B from file;

3) create the required indexes on B

   create index Bix_1 on B..
   create index Bix_2 on B..
   create index Bix_2 on B..
   create index Bix_2 on B..
   
4) replace table A with table B

   alter table A renam to A_trash;
   alter table B renam to A;
   drop table A_trash;

 (and rename the indexes to get the  original state)
 
 
 
 
 
 This seems to work but with side effects:
 
 The only objects that refer to the tables are functions and indexes.
 
If a function is called within a same session before and after the table
renaming, the second attempt fails (or use the table A_trash if it still
exists). So I should close the session and start a new one before
further processing. Errors in other live sessions are acceptable, but
maybe you know a way to avoid them?)



And now a few questions :-)

- do you see any issue that prevent this workflow to work?

- is there any other side effect to take care of ?

- what is the maximum acceptable value for the parameter work_mem for my
configuration 
  (see the complete configuration below)
  
- has anybody built a similar workflow ?  

- could this be a feature request to extend the capabilities of copy
from ?



Thanks for your time and attention,

Marc Mamin

 


Re: [PERFORM] Planner doing seqscan before indexed join

2007-03-29 Thread Marc Mamin
You may try to change the planner's opinion using sub queries. Something
like:


select * from 
   
   eventactivity,   

   (select * from 
  keyword_incidents, 
  eventmain, 
  eventgeo 
where 
  eventmain.incidentid = keyword_incidents.incidentid 
  and eventgeo.incidentid = keyword_incidents.incidentid 
  and (  recordtext like '%JOSE CHAVEZ%'   )
   )foo
   
 where eventactivity.incidentid = foo.incidentid 
 order by foo.entrydate limit 1;


HTH,

Marc

 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Dan Harris
Sent: Thursday, March 29, 2007 4:22 AM
To: PostgreSQL Performance
Subject: [PERFORM] Planner doing seqscan before indexed join

8.0.3 - Linux 2.6.18..
Freshly vacuumed and analyzed

This database has been humming along fine for a while now, but I've got
one of those sticky queries that is taking much too long to finish.

After some digging, I've found that the planner is choosing to apply a
necessary seq scan to the table.  Unfortunately, it's scanning the whole
table, when it seems that it could have joined it to a smaller table
first and reduce the amount of rows it would have to scan dramatically (
70 million to about 5,000 ).

The table "eventactivity" has about 70million rows in it, index on
"incidentid"
The table "keyword_incidents" is a temporary table and has incidentid as
its primary key.  It contains
5125 rows. eventmain and eventgeo both have 2.1 million. My hope is that
I can convince the planner to do the
  join to keyword_incidents *first* and then do the seq scan for the
LIKE condition.  Instead, it seems that it's seqscanning the whole 70
million rows first and then doing the join, which takes a lot longer
than I'd like to wait for it.  Or, maybe I'm misreading the explain
output?

Thanks again

-Dan
-
Here's the query:

explain analyze 

select 
   * from 

   keyword_incidents, 

   eventactivity, 

   eventmain, 

   eventgeo 

  where 

   eventmain.incidentid = keyword_incidents.incidentid and 

   eventgeo.incidentid = keyword_incidents.incidentid and 

   eventactivity.incidentid = keyword_incidents.incidentid 

   and (  recordtext like '%JOSE CHAVEZ%'   )
order by eventmain.entrydate limit 1;



---
  Limit  (cost=2388918.07..2388918.08 rows=1 width=455) (actual
time=81771.186..81771.292 rows=26 loops=1)
->  Sort  (cost=2388918.07..2388918.08 rows=1 width=455) (actual
time=81771.180..81771.215 rows=26 loops=1)
  Sort Key: eventmain.entrydate
  ->  Nested Loop  (cost=0.00..2388918.06 rows=1 width=455)
(actual time=357.389..81770.982 rows=26 loops=1)
->  Nested Loop  (cost=0.00..2388913.27 rows=1
width=230) (actual time=357.292..81767.385 rows=26 loops=1)
  ->  Nested Loop  (cost=0.00..2388909.33 rows=1
width=122) (actual time=357.226..81764.501 rows=26 loops=1)
->  Seq Scan on eventactivity
(cost=0.00..2388874.46 rows=7 width=84) (actual time=357.147..81762.582
rows=27 loops=1)
  Filter: ((recordtext)::text ~~ '%JOSE
CHAVEZ%'::text)
->  Index Scan using keyword_incidentid_pkey
on keyword_incidents  (cost=0.00..4.97 rows=1 width=38) (actual
time=0.034..0.036 rows=1 loops=27)
  Index Cond:
(("outer".incidentid)::text = (keyword_incidents.incidentid)::text)
  ->  Index Scan using eventgeo_incidentid_idx on
eventgeo  (cost=0.00..3.93 rows=1 width=108) (actual
time=0.076..0.081 rows=1 loops=26)
Index Cond: (("outer".incidentid)::text =
(eventgeo.incidentid)::text)
->  Index Scan using eventmain_incidentid_idx on
eventmain  (cost=0.00..4.78 rows=1 width=225) (actual
time=0.069..0.075 rows=1 loops=26)
  Index Cond: (("outer".incidentid)::text =
(eventmain.incidentid)::text)
  Total runtime: 81771.529 ms
(15 rows)

---(end of broadcast)---
TIP 6: explain analyze is your friend

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-03-29 Thread Marc Mamin
Hello Peter,

If you are dealing with timed data or similar, you may consider to
partition your table(s).

In order to deal with large data, I've built a "logical" partition
system, 
whereas the target partition is defined by the date of my data (the date
is part of the filenames that I import...).

Instead of using the Postgres partitioning framework, I keep the tables
boundaries within a refererence table.
Then I've built a function that takes the different query parameters as
argument (column list, where clause...). 
This functions retrieve the list of tables to query from my reference
table and build the final query, binding 
the different subqueries from each partition with "UNION ALL". 
It also requires an additional reference table that describes the table
columns (data type, behaviour , e.g. groupable,summable...)


This allowed me to replace many "delete" with "drop table" statements,
whis is probably the main advantage of the solution.


The biggest issue was the implementation time ;-) but I'm really happy
with the resulting performances.

HTH,

Marc



 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Peter
Schuller
Sent: Friday, March 30, 2007 7:17 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

Hello,

I am looking to use PostgreSQL for storing some very simple flat data
mostly in a single table. The amount of data will be in the hundreds of
gigabytes range. Each row is on the order of 100-300 bytes in size; in
other words, small enough that I am expecting disk I/O to be seek bound
(even if PostgreSQL reads a full pg page at a time, since a page is
significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion
performance, and the performance of trivial SELECT queries whose WHERE
clause tests equality on one of the columns.

Other than absolute performance, an important goal is to be able to
scale fairly linearly with the number of underlying disk drives. We are
fully willing to take a disk seek per item selected, as long as it
scales.

To this end I have been doing some benchmarking to see whether the plan
is going to be feasable. On a 12 disk hardware stripe, insertion
performance does scale somewhat with concurrent inserters. However, I am
seeing surprising effects with SELECT:s: a single selecter generates the
same amount of disk activity as two concurrent selecters (I was easily
expecting about twice).

The query is simple:

SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

No ordering, no joins, no nothing. Selecting concurrently with two
different values of 'xxx' yields the same amount of disk activity (never
any significant CPU activity). Note that the total amount of data is too
large to fit in RAM (> 500 million rows), and the number of distinct
values in the value column is 1. The column in the WHERE clause is
indexed.

So my first question is - why am I not seeing this scaling? The absolute
amount of disk activity with a single selecter is consistent with what I
would expect from a SINGLE disk, which is completely expected since I
never thought PostgreSQL would introduce disk I/O concurrency on its
own. But this means that adding additional readers doing random-access
reads *should* scale very well with 12 underlying disks in a stripe.

(Note that I have seen fairly similar results on other RAID variants
too, including software RAID5 (yes yes I know), in addition to the
hardware stripe.)

These tests have been done Linux 2.6.19.3 and PostgreSQL 8.1.

Secondly, I am seeing a query plan switch after a certain threshold.
Observe:

perftest=# explain select * from test where val='7433' limit 1000; 
   QUERY PLAN


-
 Limit  (cost=0.00..4016.50 rows=1000 width=143)
   ->  Index Scan using test_val_ix on test  (cost=0.00..206620.88
rows=51443 width=143)
 Index Cond: ((val)::text = '7433'::text)
(3 rows)

Now increasing to a limit of 1:

perftest=# explain select * from test where val='7433' limit 1;
  QUERY PLAN


--
 Limit  (cost=360.05..38393.36 rows=1 width=143)
   ->  Bitmap Heap Scan on test  (cost=360.05..196014.82 rows=51443
width=143)
 Recheck Cond: ((val)::text = '7433'::text)
 ->  Bitmap Index Scan on test_val_ix  (cost=0.00..360.05
rows=51443 width=0)
   Index Cond: ((val)::text = '7433'::text)
(5 rows)

The interesting part is that the latter query is entirely CPU bound (no
disk I/O at all) for an extended period of time before even beginning to
read data from disk. And when it *does* start performing disk I/O, the
performance is about the same as for the other case. In other words, the
change in query plan se

Re: [PERFORM] Looking for tips

2005-07-23 Thread Marc Mamin
Title: Re: [PERFORM] Looking for tips






 
Hi,
I have a similar application,
but instead of adding new items to the db once at time,
I retrieve new IDs from a sequence (actually only every 10'000 
times) and write a csv file from perl.
When finished, I load all new record in one run with Copy.
 
hth,
 
Marc Mamin



From: 
[EMAIL PROTECTED] on behalf of Oliver 
CrosbySent: Wed 7/20/2005 3:50 AMTo: PFCCc: 
Sven Willenberger; Dawid Kuroczko; Kevin Grittner; [EMAIL PROTECTED]; 
pgsql-performance@postgresql.orgSubject: Re: [PERFORM] Looking for 
tips

Sorry for the lack of specifics...We have a file 
generated as a list of events, one per line. Supposelines 1,2,3,5,7,11,etc 
were related, then the last one would specifythat it's the last event. 
Gradually this gets assembled by a perlscript and when the last event is 
encountered, it gets inserted intothe db. For a given table, let's say it's 
of the form (a,b,c) where'a' is a pkey, 'b' is indexed, and 'c' is other 
related information.The most common 'b' values are cached locally with the 
perl script tosave us having to query the db. So what we end up having 
is:if 'b' exists in cache, use cached 'a' value and continueelse if 
'b' exists in the db, use the associated 'a' value and continueelse add a 
new line with 'b', return the new 'a' and continueThe local cache was a 
huge time saver with mysql. I've tried making aplpgsql function that handles 
everything in one step on the db side,but it didn't show any improvement. 
Time permitting, I'll try some newapproaches with changing the scripts and 
queries, though right now Iwas just hoping to tune postgresql.conf to work 
better with thehardware available.Thanks to everyone for your help. 
Very much appreciated.---(end of 
broadcast)---TIP 5: don't forget to increase your 
free space map settings




[PERFORM] Query limitations (size, number of UNIONs ...)

2005-06-02 Thread Marc Mamin

Hello,


I've split my data in daily tables to keep them in an acceptable size.

Now I have quite complex queries which may be very long if I need to query a
large number of daily tables.


I've just made a first test wich resulted in a query being 15KB big annd
containing 63 UNION.

The Query plan in PGAdmin is about 100KB big with 800 lines :-)


The performance is not such bad, but I'm wondering if there are some
POSTGRES limitations I should take care of with this strategy.


Thanks,

Marc

-- 
Geschenkt: 3 Monate GMX ProMail gratis + 3 Ausgaben stern gratis
++ Jetzt anmelden & testen ++ http://www.gmx.net/de/go/promail ++

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[PERFORM] TIP 9: the planner will ignore... & datatypes

2005-06-01 Thread Marc Mamin
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


But INT2, INT4, INT8 and "SERIAL" are considered to be a unique datatype.
Am I Right?

Thanks,

Marc

-- 
Geschenkt: 3 Monate GMX ProMail gratis + 3 Ausgaben stern gratis
++ Jetzt anmelden & testen ++ http://www.gmx.net/de/go/promail ++

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Optimising queries involving unions

2005-05-27 Thread Marc Mamin
Hi,

I'm using a workaround for this kind of issues:


consider:

select A from 

   (select B from T1 where C 
  union
select B from T2 where C 
  union
select B from T3 where C 
) foo
where D


in your case:

SELECT u.txt
  FROM  (
SELECT id, txt FROM largetable1,smalltable t WHERE t.id = u.id   AND
t.foo = 'bar'
UNION ALL
SELECT id, txt FROM largetable2,smalltable t WHERE t.id = u.id   AND
t.foo = 'bar'
) u
  



and

select A from foo where C and D

(A, B, C, D being everything you want, C and D may also include "GROUP
BY,ORDER...)

The first version will be handled correctly by the optimiser, whereas in the
second version, 
Postgres will first build the UNION and then run the query on it.




I'm having large tables with identical structure, one per day.
Instead of defining a view on all tables, 
I' using functions that "distribute" my query on all tables.

The only issue if that I need to define a type that match the result
structure and each return type needs its own function.


Example:
(The first parameter is a schema name, the four next corresponds to A, B, C,
D





-
create type T_i2_vc1 as (int_1 int,int_2 int,vc_1 varchar);

CREATE OR REPLACE FUNCTION
vq_T_i2_vc1(varchar,varchar,varchar,varchar,varchar) RETURNS setof T_i2_vc1
AS $$


DECLARE
result T_i2_vc1%rowtype;
mviews RECORD;
sql varchar;
counter int;
BEGIN
select into counter 1;

-- loop on all daily tables
FOR mviews IN SELECT distinct this_day FROM daylist order by 
plainday
desc LOOP

IF counter =1 THEN
  select INTO  sql 'SELECT '||mviews.this_day||' AS plainday, 
'||$2||'
FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
ELSE
  select INTO  sql sql||' UNION ALL SELECT 
'||mviews.this_day||' AS
plainday, '||$2||' FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
END IF;

select into counter counter+1;
END LOOP;

select INTO  sql 'SELECT  '||$1||' FROM ('||sql||')foo '||$5;

   for result in   EXECUTE (sql) LOOP
 return  NEXT result;   
   end loop;
 return ;

END;
$$ LANGUAGE plpgsql;



Note: in your case the function shoud have a further parameter to join
largetable(n) to smalltable in the "sub queries"

HTH,

Marc





> I've got a query that I think the query optimiser should be able
> to work it's magic on but it doesn't!  I've had a look around and
> asked on the IRC channel and found that the current code doesn't
> attempt to optimise for what I'm asking it to do at the moment.
> Here's a bad example:
> 
>   SELECT u.txt
>   FROM smalltable t, (
> SELECT id, txt FROM largetable1
> UNION ALL
> SELECT id, txt FROM largetable2) u
>   WHERE t.id = u.id
> AND t.foo = 'bar';
> 
> I was hoping that "smalltable" would get moved up into the union,
> but it doesn't at the moment and the database does a LOT of extra
> work.  In this case, I can manually do quite a couple of transforms
> to move things around and it does the right thing:
> 
>   SELECT txt
>   FROM (
> SELECT l.id as lid, r.id as rid, r.foo, l.txt
>   FROM largetable1 l, smalltable r
> UNION ALL
> SELECT l.id as lid, r.id as rid, r.foo, l.txt
>   FROM largetable1 l, smalltable r)
>   WHERE foo = 'bar';
> AND lid = rid
> 
> The optimiser is intelligent enough to move the where clauses up
> into the union and end end up with a reasonably optimal query.
> Unfortunatly, in real life, the query is much larger and reorganising
> everything manually isn't really feasible!

-- 
Weitersagen: GMX DSL-Flatrates mit Tempo-Garantie!
Ab 4,99 Euro/Monat: http://www.gmx.net/de/go/dsl

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread Marc Mamin
Hello,


I'd like to tune Postgres for large data import (using Copy from).


here are a few steps already done:



1) use 3 different disks for:

-1: source data
-2: index tablespaces
-3: data tablespaces


2) define all foreign keys as initially deferred


3) tune some parameters:



max_connections =20
shared_buffers =3
work_mem = 8192  
maintenance_work_mem = 32768
checkpoint_segments = 12

(I also modified the kernel accordingly)




4) runs VACUUM regulary


The server runs RedHat and has 1GB RAM

In the production (which may run on a better server), I plan to: 

- import a few millions rows per day,
- keep up to ca 100 millions rows in the db
- delete older data




I've seen a few posting on hash/btree indexes, which say that hash index do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performances whole using
hash indexes as well ?

How does Postgres handle concurrent copy from on: same table / different
tables ?


I'd be glad on any further suggestion on how to further increase my
performances.




Marc




-- 
+++ Lassen Sie Ihren Gedanken freien Lauf... z.B. per FreeSMS +++
GMX bietet bis zu 100 FreeSMS/Monat: http://www.gmx.net/de/go/mail

---(end of broadcast)---
TIP 8: explain analyze is your friend