Re: Query became very slow after 9.6 -> 10 upgrade

2017-11-22 Thread Tomas Vondra
IMHO the problems here are due to poor cardinality estimates.

For example in the first query, the problem is here:

->  Nested Loop  (cost=0.42..2.46 rows=1 width=59)
 (actual time=2.431..91.330 rows=3173 loops=1)
->  CTE Scan on b  (cost=0.00..0.02 rows=1 width=40)
   (actual time=2.407..23.115 rows=3173 loops=1)
->  Index Scan using domains_pkey on domains d
(cost=0.42..2.44 rows=1 width=19)
(actual time=0.018..0.018 rows=1 loops=3173)

That is, the database expects the CTE to return 1 row, but it returns
3173 of them, which makes the nested loop very inefficient.

Similarly for the other query, where this happens:

 Nested Loop  (cost=88.63..25617.31 rows=491 width=16)
  (actual time=3.512..733248.271 rows=1442797 loops=1)
   ->  HashAggregate  (cost=88.06..88.07 rows=1 width=4)
  (actual time=3.380..13.561 rows=3043 loops=1)

That is, about 1:3000 difference in both cases.

Those estimation errors seem to be caused by a condition that is almost
impossible to estimate, because in both queries it does this:

groups->0->>'provider' ~ '^something'

That is, it's a regexp on an expression. You might try creating an index
on the expression (which is the only way to add expression statistics),
and reformulate the condition as LIKE (which I believe we can estimate
better than regular expressions, but I haven't tried).

So something like

CREATE INDEX ON adroom ((groups->0->>'provider'));

    WHERE groups->0->>'provider' LIKE 'something%';
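
If I remember correctly, the planner only has statistics for the indexed
expression after the next ANALYZE on the table, so after creating the
index run something like:

    ANALYZE adroom;

and then re-check the estimates.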

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: CPU 100% usage caused by unknown postgres process..

2017-12-13 Thread Tomas Vondra


On 12/13/2017 01:19 PM, Justin Pryzby wrote:
> On Wed, Dec 13, 2017 at 11:36:11AM +0100, Laurenz Albe wrote:
>> Dinesh Chandra 12108 wrote:
>>> My CPU utilization is going to 100% in PostgreSQL because one unknown 
>>> process /x330341 is running as the postgres user.
>>>  
>>> PID   USER      PR  NI    VIRT    RES   SHR  S  %CPU %MEM    TIME+  COMMAND
>>> 19885 postgres  20   0  192684   3916  1420  S  99.3  0.1  5689:04  x330341
>>>  
>>> The same file is automatically created in Postgres Cluster also. I am using 
>>> Postgresql-9.3.
>>>  
>>> Kindly suggest how can I resolve this issue.
>>
>> I don't know, but the same problem has been reported on Stackoverflow:
>> https://stackoverflow.com/q/46617329/6464308
>>
>> If your queries look similar, then you might indeed be the victim of an 
>> attack.
>>
>> Figure out where the function and the executable come from.
>>
>> In case of doubt, disconnect the server from the network.
> 
> Looks suspicious; I would look at (and save) things like these:
> 
> ls -l /proc/19885/exe
> ls -l /proc/19885/fd
> ls -l /proc/19885/cwd
> 
> sudo lsof -n -p 19885
> sudo netstat -anpe |grep 19885
> 
> Stacktrace with gcore/gdb is a good idea.
> Save a copy of your log/postgres logfiles and try to figure out where it came
> from.  Since an attacker seems to control the postgres process, your data may
> have been compromized (leaked or tampered with).
> 

Any details about the x330341 file (is it a shell script or some
kind of binary)?

FWIW the queries (listed in the stackoverflow post) are running under
postgres, which I assume is superuser. The backend has full access to
the data directory, of course, so it may create extra files (using
adminpack extension, for example).
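
For illustration only (the file name is hypothetical, just to show the
capability), with adminpack a superuser can write files relative to the
data directory straight from SQL:

    CREATE EXTENSION adminpack;
    -- pg_file_write(filename, data, append)
    SELECT pg_catalog.pg_file_write('x330341', '... some payload ...', false);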

If that's the case (and if it's indeed an attack), the attacker likely
already has access to all the data anyway. So presumably x330341 is doing
something else at the OS level.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Autoanalyze CPU usage

2017-12-19 Thread Tomas Vondra


On 12/19/2017 05:47 PM, Habib Nahas wrote:
> Hi,
> 
> We operate an RDS postgres 9.5 instance and have periodic CPU spikes to
> 100%. These spikes appear to be due to autoanalyze kicking on our larger
> tables.
> 
> Our largest table has 75 million rows and the autoanalyze scale factor
> is set to 0.05. 
> 
> The documentation I've read suggests that the analyze always operates on
> the entire table and is not incremental. Given that supposition are
> there ways to control cost(especially CPU) of the autoanalyze operation?
> Would a more aggressive autoanalyze scale factor (0.01) help? With the
> current scale factor we see an autoanalyze once a week, query
> performance has been acceptable so far, which could imply that scale
> factor could be increased if necessary. 
> 

No, reducing the scale factor to 0.01 will not help at all, it will
actually make the issue worse. The only thing autoanalyze does is
running ANALYZE, which *always* collects a fixed-size sample. Making it
more frequent will not reduce the amount of work done on each run.

So the first question is whether you are really not using the default
(0.1), i.e. whether you have actually reduced it to 0.05.

The other question is why it's so CPU-intensive. Are you using the
default statistics_target value (100), or have you increased that too?
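
For example (the table and column names here are made up), the relevant
knobs can be inspected and adjusted per table / per column:

    SHOW default_statistics_target;

    -- lower the statistics detail for one column of a large table
    ALTER TABLE big_table ALTER COLUMN some_column SET STATISTICS 50;

    -- or make autoanalyze less frequent for this one table
    ALTER TABLE big_table SET (autovacuum_analyze_scale_factor = 0.2);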

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Updating a large table

2017-12-23 Thread Tomas Vondra

On 12/22/2017 05:46 PM, Timokhin Maxim wrote:
> 
> Hello! We have a large table, 11GB (about 37 million records), and we
> need to alter the table - add a new column with the default value
> false. Also 'NOT NULL' is required.
>
> So, first I've done:
>
> ALTER TABLE clusters ALTER COLUMN "is_paid";
> 

That seems somewhat incomplete ... what exactly did the ALTER do?

> after that:
> 
> UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' 
> AND is_paid != FALSE;
> 
> Everything went ok. Then I tried to run it again for an interval of 1
> year. And I got what no one wants to see - there was no available space
> left on the disk. The reason was that WAL files ate everything.
> The master server couldn't send some WAL files to its replicas. The
> bandwidth wasn't enough.
> 

Well, then perhaps the best solution is to add more disk space and/or
make sure the network bandwidth is sufficient?

In any case, don't forget this may also need to update all indexes on
the table, because the new row versions will end up on different pages.
So while the table has 11GB, this update may need much more WAL space
than that.

> Well, I'm searching for a better idea to update the table.
> Solutions I found.
> 1. Separate my UPDATE by chunks.

If this is a one-time change, this is probably the best option.
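
A minimal sketch of the chunked approach (assuming the table has an
indexed id column - pick a batch size whose WAL volume your replicas and
archive can absorb):

    UPDATE clusters SET is_paid = DEFAULT
     WHERE id BETWEEN 1 AND 100000
       AND ctime <= now() - interval '720h'
       AND is_paid != FALSE;
    -- then the next id range, and so on, pausing between batches so the
    -- replicas can keep up with the generated WAL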

> 2. Alter a table using a new temporary table, but it's not convenient
> for me because there is a lot of foreign keys and indexes.
Right.

> 3. Hot-update. This is the most interesting case for me.
> Speaking of HOT-update 
> https://www.dbrnd.com/2016/03/postgresql-the-awesome-table-fillfactor-to-speedup-update-and-select-statement/
> The article says: it might be useful for tables that change often and 
> moreover It would be the best way to increase the speed of UPDATE.

First of all, to make HOT possible there would have to be enough free
space on the pages. As you need to update the whole table, that means
each page would have to be only 50% full. That's unlikely to be true,
and you can't fix that at this point.

> So, my question is: will it work for all tuples? This article says no -
> https://www.dbrnd.com/2016/03/postgresql-alter-table-to-change-fillfactor-value/
> - but I could not find a confirmation in the official PostgreSQL
> documentation.
Not sure I understand your question, but HOT can only happen when two
conditions are met:

1) the update does not change any indexed column

This is likely met, assuming you don't have an index on is_paid.

2) there's enough space on the same page for the new row version

This is unlikely to be true, because the default fillfactor for tables
is 90%. You may change fillfactor using ALTER TABLE, but that only
applies to new data.

Moreover, as the article says, this is useful for tables that change
often - which is not quite what a one-time table rewrite does.

So HOT is not the solution you're looking for.

> Why do I need to launch vacuum after updating?

You don't need to launch vacuum - autovacuum will take care of that
eventually. But you may do that, to do the cleanup when it's convenient
for you.

> How should I choose a better fillfactor?

For example to change fillfactor to 75% (i.e. 25% free space):

ALTER TABLE t SET (fillfactor = 75);

But as I said, it's not a solution for you.

> What will happen with WAL files in this case?

Not sure what you mean.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_xlog unbounded growth

2018-01-25 Thread Tomas Vondra
 so), to allow replicas to temporarily fall behind without having
to setup a WAL archive. But this is not the case here. Honestly, I doubt
setting this to 8 makes practical sense ... That value seems so low it
does not guarantee anything.

> Does wal_keep_segments = 0 turn off WAL recycling? Frankly, I would
> rather have WAL not be recycled/reused, and just deleted to keep
> pg_xlog below expected size.
> 

No, it doesn't. Why would it disable that? It simply means the segments
may need to be kept around for longer before getting recycled.

> Another question is, does wal_level = replica affect the size of
> pg_xlog in any way?  We have an archive_command that just exits with
> exit code 0, so I don't see any reason for the pg_xlog files to not be
> cleaned up.
> 

No, it shouldn't. The replica should stream the WAL (if it's close to the
current position), or fetch the older WAL segments if it falls behind.
But if it falls behind too much, it may not be able to catch up as the
WAL segments get removed.

wal_keep_segments is a protection against that, but it won't keep
segments indefinitely and you set that just to 8. So this is not the
root cause.

Another option is that you created a replication slot. That is actually
the only thing I can think of that could cause this issue. Perhaps there
is a replication slot that is not used anymore and is preventing removal
of WAL segments? Because that's the whole point of replication slots.

What does

SELECT * FROM pg_replication_slots;

say on the master node?
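
If that shows a slot that is inactive and no longer needed, dropping it is
what releases the retained WAL (the slot name below is obviously made up):

    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
    SELECT pg_drop_replication_slot('stale_slot_name');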

> 2) Cleaning up pg_xlog using a watchdog script
> 
> To get the import done I wrote a script that's actually inspired from
> a blog post where the pg_xlog out of disk space problem is
> addressed [1].  It periodically reads the last checkpoint's REDO WAL
> file, and deletes all WAL in pg_xlog before that one. 
> 
> ...

This makes no sense. The database should be able to remove unnecessary
WAL segments automatically. There's pretty small chance you'll get it
right in an external script - not removing WAL segments that are still
needed, etc.

Find the actual root cause, fix it. Don't invent custom scripts messing
with the critical part of the database.

> 
> stefan.pet...@tangoe.com 
> tangoe.com
> 
> This e-mail message, including any attachments, is for the sole use 
> of the intended recipient of this message, and may contain
> information that is confidential or legally protected. If you are not
> the intended recipient or have received this message in error, you
> are not authorized to copy, distribute, or otherwise use this message
> or its attachments. Please notify the sender immediately by return
> e-mail and permanently delete this message and any attachments.
> Tangoe makes no warranty that this e-mail or its attachments are
> error or virus free.
>

LOL


kind regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: postgresql 10.1 wrong plan in when using partitions bug

2018-02-04 Thread Tomas Vondra


On 02/04/2018 11:14 AM, Mariel Cherkassky wrote:
> 
> Hi,
> I configured range partitions on a date column of my main
> table (log_full). Each partition represents a day in the month. Every
> daily partition has a list partition of 4 tables on a text column.
> 
> log_full
>           log_full_01_11_2017  -->
>                                           log_full _01_11_2017_x1
>                                           log_full _01_11_2017_x2
>                                           log_full _01_11_2017_x3 
>                                           log_full _01_11_2017_x4 
>             log_full_02_11_2017
>                                           log_full _02_11_2017_x1 
>                                           log_full _02_11_2017_x2 
>                                           log_full _02_11_2017_x3 
>                                           log_full _02_11_2017_x4
> 
> and so on
>       
> 
> The date column consists of dates in the format YYYY-MM-DD HH24:MI:SS,
> for example: 2017-11-01 00:01:40
> 
> I wanted to check the plan that I'm getting for a query that is using
> the date column and it seems that the planner choose to do seq scans on
> all tables.
> 
> Each partition consists of 15M rows.
> I have about 120 partitions.
> 
> The query : 
> explain select count(*) from log_full where end_date between
> to_date('2017/12/03','YY/MM/DD') and to_date('2017/12/03','YY/MM/DD');
> 
> The output is too long but it does full scans on all partitions...
> any idea what can be the problem? Is it connected to the date format ?
> 

You haven't shown us how the partitions are defined, nor the query plan.
So it's rather hard to say. You mentioned text format, but then you use
to_date() to query the partitioned table. Which I guess might be the
cause, but it's hard to say for sure.
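
For reference, a sketch of what we'd need to see - with a made-up column
list, a declarative range/list setup in PostgreSQL 10 with end_date as an
actual timestamp column would look roughly like this:

    CREATE TABLE log_full (
        end_date  timestamp NOT NULL,
        source    text NOT NULL
        -- other columns ...
    ) PARTITION BY RANGE (end_date);

    CREATE TABLE log_full_01_11_2017 PARTITION OF log_full
        FOR VALUES FROM ('2017-11-01') TO ('2017-11-02')
        PARTITION BY LIST (source);

Whether the real end_date column is a timestamp or text is exactly the
kind of detail that decides whether pruning can work here.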

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: SV: bad plan using nested loops

2018-02-05 Thread Tomas Vondra


On 02/02/2018 10:02 AM, Johan Fredriksson wrote:
> tor 2018-02-01 klockan 20:34 + skrev Johan Fredriksson:
>>> Johan Fredriksson  writes:
>>>> Bad plan: https://explain.depesz.com/s/avtZ
>>>> Good plan: https://explain.depesz.com/s/SJSt
>>>> Any suggestions on how to make the planner make better decisions
>>>> for
>>>> this query?
>>>
>>> Core of the problem looks to be the misestimation here:
>>>
>>>    Index Only Scan using shredder_cgm1 on
>>> public.cachedgroupmembers cachedgroupmembers_4
>>> (cost=0.43..2.33 rows=79 width=8) (actual time=0.020..0.903
>>> rows=1492 loops=804)
>>>  Output: cachedgroupmembers_4.memberid,
>>> cachedgroupmembers_4.groupid,
>>> cachedgroupmembers_4.disabled
>>>  Index Cond: ((cachedgroupmembers_4.memberid =
>>> principals_1.id) AND
>>> (cachedgroupmembers_4.disabled = 0))
>>>  Heap Fetches: 5018
>>>
>>> Probably, memberid and disabled are correlated but the planner
>>> doesn't
>>> know that, so it thinks the index condition is way more selective
>>> than it
>>> actually is.  In PG 10, you could very possibly fix that by
>>> installing
>>> extended statistics on that pair of columns.  See
>>>
>>> https://www.postgresql.org/docs/current/static/planner-stats.html#P
>>> LANNER-STATS-EXTENDED
>>
>> I'm not sure what you mean by correlated, but there are only a
>> handful (164 when I check it) disabled groupmembers out of total 7.5
>> million.
>> I'll give CREATE STATISTICS on those columns a shot and see if it
>> gets any better.
> 
> It looks like you are right, Tom. There actually exists full
> correlation between memberid, groupid and disabled.
> 
> rt4=# SELECT stxname, stxkeys, stxdependencies FROM pg_statistic_ext;
>   stxname  | stxkeys | stxdependencies
> -----------+---------+------------------
>  cgm_stat2 | 2 6     | {"2 => 6": 1.00}
>  cgm_stat1 | 3 6     | {"3 => 6": 1.00}
> (2 rows)
> 
> However, this does not help the planner. It still picks the bad plan.
> 

Yeah :-( Unfortunately, we're not using the extended statistics to
improve join cardinality estimates yet. PostgreSQL 10 can only use them
to improve estimates on individual tables, and judging by the progress
on already submitted improvements, it doesn't seem very likely to change
in PostgreSQL 11.

regards
Tomas

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Please help

2018-03-06 Thread Tomas Vondra
Can you please use separate threads for your questions? That is, don't
start new thread by responding to an existing message (because the new
message then points using "References" header, which is what e-mail
clients use to group messages into threads). And use a proper subject
describing the issue ("Please help" tells people nothing).

That being said, how is this related to performance at all? It seems to
be about pgadmin, so please send it to pgadmin-support I guess.

regards

On 03/06/2018 09:01 AM, Daulat Ram wrote:
> Dear Experts,
> 
>  
> 
> Kindly help to resolve the issue reported during startup of pgadmin4 in
> server mode on Ubuntu 16.04.
> 
>  
> 
> dram@vipgadmin:~$ cd .pgadmin4
> 
> dram@vipgadmin:~/.pgadmin4$ chmod +x 
> lib/python2.7/site-packages/pgadmin4/pgAdmin4.py
> 
> dram@vipgadmin:~/.pgadmin4$ chmod 
> lib/python2.7/site-packages/pgadmin4/pgAdmin4.py
> 
> dram@vipgadmin:~/.pgadmin4$ chmod -R 
> lib/python2.7/site-packages/pgadmin4/pgAdmin4.py
> 
> dram@vipgadmin:~/.pgadmin4$ sudo systemctl daemon-reload
> 
> dram@vipgadmin:~/.pgadmin4$  sudo systemctl enable pgadmin4
> 
> dram@vipgadmin:~/.pgadmin4$ sudo systemctl start pgadmin4
> 
> dram@vipgadmin:~/.pgadmin4$ sudo systemctl status  pgadmin4
> 
> ● pgadmin4.service - Pgadmin4 Service
> 
>    Loaded: loaded (/etc/systemd/system/pgadmin4.service; enabled; vendor
> preset: enabled)
> 
>    Active: failed (Result: exit-code) since Mon 2018-03-05 23:57:24 PST;
> 10s ago
> 
>   Process: 14190
> ExecStart=/root/.pgadmin4/lib/python2.7/site-packages/pgadmin4/pgAdmin4.py
> (code=exited, status=200/CHDIR)
> 
> Main PID: 14190 (code=exited, status=200/CHDIR)
> 
>  
> 
> Mar 05 23:57:24 vipgadmin systemd[1]: Started Pgadmin4 Service.
> 
> Mar 05 23:57:24 vipgadmin systemd[1]: pgadmin4.service: Main process
> exited, code=exited, status=200/CHDIR
> 
> Mar 05 23:57:24 vipgadmin systemd[1]: pgadmin4.service: Unit entered
> failed state.
> 
> Mar 05 23:57:24 vipgadmin systemd[1]: pgadmin4.service: Failed with
> result 'exit-code'.
> 
>  
> 
> Regards,
> 
> Daulat
> 
>  
> 
>  
> 

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory size

2018-03-11 Thread Tomas Vondra
On 03/11/2018 06:33 PM, dangal wrote:
> Jeff, thank you very much for your time. They are the same queries with
> the same parameters; one takes 3 minutes, for example, but when I execute
> it again it takes seconds. That's why I suspect it is the shared buffers.
> The server had 16 GB and we increased it to 24, but I really do not know
> if we should continue to increase it, since these are not our own
> resources - we have to ask for them and justify them.
> 

It's not very clear if your question is about shared_buffers or amount
of RAM in general. In any case, it looks like the performance difference
is due to having to do I/O on the first execution, while the second
execution gets served from RAM. If that's the case, increasing shared
buffers is not going to help, in fact it's going to make matters worse
(due to double buffering etc.).

You should be able to confirm this by analyzing system metrics,
particularly I/O and CPU time. There should be a lot of I/O during the
first execution, and almost nothing during the second one.
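
For example, simply watching iostat during both executions should make
this obvious (look at device utilization and await during the first,
slow run):

    iostat -x 5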

So it seems you need to add more RAM, but it's unclear how much because
we don't know what part of the data is regularly accessed (I really
doubt it's the whole 750GB). That is something you have to determine by
analyzing your workload. All we know is data needed by this query likely
fit into RAM, but then get pushed out by other queries after a while.

An alternative would be to use better storage system, although that will
not give you the same performance, of course.

FWIW it's also possible something is going wrong at the hypervisor level
(e.g. contention for storage cache used by multiple VMs). It's hard to
say, considering you haven't even shared an explain analyze of the
queries. Try EXPLAIN (ANALYZE, BUFFERS) both for the slow and fast
executions, and show us the results.

FWIW you might also read this first:
https://wiki.postgresql.org/wiki/Slow_Query_Questions

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unexplainable execution time difference between two test functions...one using IF (SELECT COUNT(*) FROM...) and the other using IF EXISTS (SELECT 1 FROM...)

2018-04-17 Thread Tomas Vondra
ost=0.25..0.26
rows=1 width=32) (actual time=762.154..762.155 rows=1 loops=1)
    Buffers: shared hit=160 read=5246

As you can see, the execution time jumps up to about 762
milliseconds.  I can see in the sequential scan node that the LOWER()
function shows up on the right side of the equal operator, whereas
in the first 5 runs of this test query the plan did not show this. 
Why is this?




It doesn't really matter on which side it shows, it's more about a 
generic plan built without knowledge of the parameter value.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unexplainable execution time difference between two test functions...one using IF (SELECT COUNT(*) FROM...) and the other using IF EXISTS (SELECT 1 FROM...)

2018-04-17 Thread Tomas Vondra



On 04/17/2018 07:17 AM, Pavel Stehule wrote:

Hi

2018-04-16 22:42 GMT+02:00 Hackety Man <hackety...@gmail.com>:


...

>
Support for parallel query execution is not complete - it doesn't work
in PostgreSQL 11 either. So although the EXISTS variant can be faster
(the worst case of EXISTS is the same as COUNT), with parallel execution
disabled the COUNT(*) is faster now. It is unfortunate, but I believe
this issue will be fixed within a few years.




None of the issues seems to be particularly related to parallel query. 
It's much more likely a general issue with planning EXISTS / LIMIT and 
non-uniform data distribution.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unexplainable execution time difference between two test functions...one using IF (SELECT COUNT(*) FROM...) and the other using IF EXISTS (SELECT 1 FROM...)

2018-04-17 Thread Tomas Vondra



On 04/17/2018 04:01 PM, Hackety Man wrote:



On Tue, Apr 17, 2018 at 6:49 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:




On 04/16/2018 10:42 PM, Hackety Man wrote:

...
     The first thing I did was to run some baseline tests using
the basic
     queries inside of the IF() checks found in each of the
functions to
     see how the query planner handled them.  I ran the
following two
     queries.

         EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM
zz_noidx1 WHERE
         LOWER(text_distinct) = LOWER('Test501');
         EXPLAIN (ANALYZE, BUFFERS) SELECT 1 FROM zz_noidx1 WHERE
         LOWER(text_distinct) = LOWER('Test501');


Those are not the interesting plans, though. The EXISTS only cares
about the first row, so you should be looking at

     EXPLAIN (ANALYZE, BUFFERS) SELECT 1 FROM zz_noidx1 WHERE
     LOWER(text_distinct) = LOWER('Test501') LIMIT 1;



Okay.  I tested this query and it did return an execution time on par 
with my tests of the "zz_spx_ifexists_noidx" function.




     I moved on to test the other function with the following
query...

         EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
         zz_spx_ifcount_noidx('Test501');

     and I got the following "auto_explain" results...

         2018-04-16 14:58:34.134 EDT [12616] LOG:  duration:
426.279 ms         plan:
           Query Text: SELECT EXISTS (SELECT 1 FROM zz_noidx1 WHERE
         LOWER(zz_noidx1.text_distinct) = LOWER(p_findme))
           Result  (cost=4.08..4.09 rows=1 width=1) (actual
         time=426.274..426.274 rows=1 loops=1)
             Buffers: shared read=5406
             InitPlan 1 (returns $0)
            ->  Seq Scan on zz_noidx1  (cost=0.00..20406.00
rows=5000
         width=0) (actual time=426.273..426.273 rows=0 loops=1)
           Filter: (lower(text_distinct) = 'test501'::text)
           Rows Removed by Filter: 100
           Buffers: shared read=5406
         2018-04-16 14:58:34.134 EDT [12616] CONTEXT:  SQL statement
         "SELECT EXISTS (SELECT 1 FROM zz_noidx1 WHERE
         LOWER(zz_noidx1.text_distinct) = LOWER(p_findme))"
           PL/pgSQL function zz_spx_ifexists_noidx(text) line 4
at IF
         2018-04-16 14:58:34.134 EDT [12616] LOG:  duration:
428.077 ms         plan:
           Query Text: explain (analyze, buffers) select * from
         zz_spx_ifexists_noidx('Test501')
           Function Scan on zz_spx_ifexists_noidx  (cost=0.25..0.26
         rows=1 width=32) (actual time=428.076..428.076 rows=1
loops=1)
             Buffers: shared hit=30 read=5438

     Definitely not the execution time, or query plan, results I was
     expecting.  As we can see, no workers were employed here and my
     guess was that this was the reason or the large execution time
     difference between these 2 tests?  199 milliseconds versus 428
     milliseconds, which is a big difference.  Why are workers
not being
     employed here like they were when I tested the query found
inside of
     the IF() check in a standalone manner?  But then I ran
another test
     and the results made even less sense to me.


The plan difference is due to not realizing the EXISTS essentially
implies LIMIT 1. Secondly, it expects about 5000 rows matching the
condition,  uniformly spread through the table. But it apparently
takes much longer to find the first one, hence the increased duration.



Ah.  I did not know that.  So EXISTS inherently applies a LIMIT 1, even 
though it doesn't show in the query plan, correct? Is it not possible 
for parallel scans to be implemented while applying an implicit, or 
explicit, LIMIT 1?




It doesn't add a limit node to the plan, but it behaves similarly to 
that. The database only needs to fetch the first row to answer the 
EXISTS predicate.


I don't think this is inherently incompatible with parallel plans, but 
the planner simply thinks it's going to be very cheap - cheaper than 
setting up parallel workers etc. So it does not do that.




How did you generate the data?



I used generate_series() to create 1 million records in sequence at the 
same time that I created the table using the following script...


CREATE TABLE zz_noidx1 AS SELECT generate_series(0, 999999) AS
int_distinct, 'Test'::text || generate_series(0, 999999)::text AS
text_distinct;



Which means that there a

Re: Unexplainable execution time difference between two test functions...one using IF (SELECT COUNT(*) FROM...) and the other using IF EXISTS (SELECT 1 FROM...)

2018-04-17 Thread Tomas Vondra
On 04/17/2018 05:43 PM, Hackety Man wrote:
> 
> 
> On Tue, Apr 17, 2018 at 10:23 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
> 
> 
> 
> On 04/17/2018 04:01 PM, Hackety Man wrote:
> 
>  ...
> Right.  I was more wondering why it switched over to a generic
> plan, as you've stated, like clockwork starting with the 6th
> execution run.
> 
> 
> That's a hard-coded value. The first 5 executions are re-planned
> using the actual parameter values, and then we try generating a
> generic plan and see if it's cheaper than the non-generic one. You
> can disable that, though.
> 
> 
> 
> So on that note, in the planner's eyes, starting with the 6th execution,
> it looks like the planner still thinks that the generic plan will
> perform better than the non-generic one, which is why it keeps using the
> generic plan from that point forward?
> 

Yes. The point of prepared statements (which also applies to plpgsql, as
it uses prepared statements automatically) is to eliminate the planning
overhead. So we try planning it with actual parameter values for the
first 5 plans, and then compare it to the generic plan.
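
One way to opt out of that in plpgsql (a sketch reusing the names from
your test function, with a hypothetical boolean variable v_exists) is to
use EXECUTE, because dynamic SQL is re-planned with the actual parameter
value on every call:

    -- inside zz_spx_ifexists_noidx(p_findme text)
    EXECUTE 'SELECT EXISTS (SELECT 1 FROM zz_noidx1
                             WHERE lower(text_distinct) = lower($1))'
       INTO v_exists
      USING p_findme;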

> Similar to the parallel scans, any chance of the planner possibly being
> enhanced in the future to come to a better conclusion as to whether, or
> not, the generic plan will perform better than the non-generic plan?  :-)

There's always hope, but it's hard to say if/when an enhancement will
happen, unfortunately.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unexplainable execution time difference between two test functions...one using IF (SELECT COUNT(*) FROM...) and the other using IF EXISTS (SELECT 1 FROM...)

2018-04-17 Thread Tomas Vondra


On 04/17/2018 04:05 PM, Hackety Man wrote:
> Hi Pavel,
> 
> Thanks for sharing that information.  I was not aware that the parallel
> query functionality was not yet fully implemented.
> 

Nothing is ever "fully implemented". There are always gaps and possible
improvements ;-)

That being said, parallelism opens an entirely new dimension of possible
plans and planning issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Special bloom index of INT, BIGINT, BIT, VARBIT for bitwise operation

2018-07-17 Thread Tomas Vondra




On 07/11/2018 08:02 AM, Takao Magoori wrote:

Hello,

I have a table which has billions of rows and I want to select it by
bitwise operation like,

=# CREATE TABLE IF NOT EXISTS t_bitwise (
   id INTEGER NOT NULL
   ,status_bigint BIGINT NOT NULL
   ,status_bit BIT(32) NOT NULL
);

=# INSERT INTO t_bitwise (id, status_bigint, status_bit) SELECT
   id
   ,(random() * 4294967295)::BIGINT
   ,(random() * 4294967295)::BIGINT::BIT(32)
FROM generate_series(1, 30) as t(id);

=# SELECT * FROM t_bitwise WHERE
   status_bigint & 170 = 170
   OR status_bigint & 256 = 256;

=# SELECT * FROM t_bitwise WHERE
   status_bit & b'10101010'::BIT(32) =
b'10101010'::BIT(32)
   OR status_bit & b'0001'::BIT(32) =
b'0001'::BIT(32);

Yes, these SELECT statements scan all rows. I guess possible index types are

- Expression indexes ?
- Partial Indexes ?
- GIN ?
- GIST ?
- bloom index ?

I googled but I feel there is no good solution and it would be good if
I hava "bloom index specific for bitwise operation".

In case of normal bloom index, a value is hashed into a few bits which
is mapped to a signature (default 80 bits).
This is a lossy representation of the original value, and as such is
prone to reporting false positives which requires "Recheck" process at
SELECT. The more rows or the more complex condition, the more
execution time.

My idea is that, in the case of an index for bitwise operations, each bit
should be mapped to exactly the same bit in the signature (one-to-one
mapping). No false positives. No "Recheck" process is required. If the
target column is BIT(32), a signature length of just 32 bits is enough.



So you're not computing a hash at all, so calling this "bloom index" is 
rather misleading.


Another question is where do you expect the performance advantage to 
come from? If the table is as narrow as this, such index will have 
almost the same size, and it won't allow fast lookups.



Is there any index module like this ? Since I am not familiar with C
and Postgresql, I can not write my own module.



I doubt there's a suitable index available currently. Perhaps a simple
btree index could help, assuming (a) it's smaller than the table, (b) we
can use index-only scans (IOS) and (c) only a small fraction of the table
matches.


Another idea is to use partial indexes - one for each bit, i.e.

  CREATE INDEX ON t_bitwise (id)
   WHERE status_bit & b'1000'::BIT(32) = b'1000'::BIT(32);

And then build the query accordingly.

I wonder if it might work with BRIN indexes of some kind. If the range 
summary is defined as OR of the values, that might help, depending on 
variability within the page range. But that would probably require some 
development.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why the index is not used ?

2018-10-07 Thread Tomas Vondra
Hi,

On 10/07/2018 08:32 PM, ROS Didier wrote:
> Hi Francisco
> 
>   Thank you for your remark. 
> You're right, but it's the only procedure I found to make search on
> encrypted fields with good response times (using index) !
> 

Unfortunately, that kinda invalidates the whole purpose of in-database
encryption - you'll have encrypted on-disk data in one place, and then
plaintext right next to it. If you're dealing with credit card numbers,
then you presumably care about PCI DSS, and this is likely a direct
violation of that.

> Regarding access to the file system, our servers are in protected
> network areas. Few people can connect to them.
> 

Then why do you need encryption at all? If you assume access to the
filesystem / storage is protected, why do you bother with encryption?
What is your threat model?

> it's not the best solution, but we have data encryption needs and
> good performance needs too. I do not know how to do it except the
> specified procedure..
>
> if anyone has any proposals to put this in place, I'm interested.
> 

One thing you could do is hashing the value and then searching by the
hash. So aside from having the encrypted column you'll also have a short
hash, and you may use it in the query *together* with the original
condition. It does not need to be unique (in fact it should not be to
make it impossible to reverse the hash), but it needs to have enough
distinct values to make the index efficient. Say, 10k values should be
enough, because that means 0.01% selectivity.

So the function might look like this, for example:

  CREATE FUNCTION cchash(text) RETURNS int AS $$
    SELECT abs(hashtext($1)) % 10000;
  $$ LANGUAGE sql;

and then be used like this:

  CREATE INDEX idx_cartedecredit_cc02 ON cartedecredit(cchash(cc));

and in the query

  SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit
   WHERE pgp_sym_decrypt(cc, 'motdepasse')='test value 32'
 AND cchash(cc) = cchash('test value 32');

Obviously, this does not really solve the issues with having to pass the
password to the query, making it visible in pg_stat_activity, various
logs etc.

Which is why people generally use FDE for the whole disk, which is
transparent and provides the same level of protection.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why the index is not used ?

2018-10-08 Thread Tomas Vondra
Hi,

On 10/08/2018 04:10 PM, ROS Didier wrote:
> Hi Tomas
>    
>     Thank you for your answer and recommendation which is very
> interesting. I'm going to study the PCI DSS document right now.
> 
>   * Here is my answer to your question:
> 
> >> What is your threat model?
> 
> We want to prevent access to sensitive data for everyone except those
> who have the encryption key.
> In case of file theft, backup theft, or dump theft, we do not want
> anyone to access sensitive data.
> 

The thing is - encryption is not a panacea. The interesting question is
whether this improves security compared to simply using FDE and regular
access rights (which are grantable at the column level).

Using those two pieces properly may very well be a better defense than a
poorly designed encryption scheme - and based on this discussion, it does
not seem very polished / resilient.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fwd: Query with high planning time compared to execution time

2018-11-02 Thread Tomas Vondra
On 11/02/2018 10:36 AM, Richard Lee wrote:
> Hi,
> 
> I'm running a performance test for our application and encountered a
> particular query with high planning time compared to the execution.
> Please refer to attached explain.out for the explain analyze output.
> 
> Formatted explain: https://explain.depesz.com/s/R834
> 
> The test was performed with Jmeter sending requests to the database,
> query was generated by Hibernate which consists of a 133 table UNION.
> Also attached are some diagnostic info (database version, database
> settings, table definitions and maintenance related information).
> 
> Due to the extremely large query text, I'm choosing to provide
> information via attachments instead of pasting in the email body.
> 
> Below are some additional OS information on the database server:
> CPU: 8
> RAM: 24GB
> Disk: SSD
> OS: CentOS Linux release 7.4.1708 (Core)
> 
> [root@kvrh7os202 ~]# uname -a
> Linux kvrh7os202.comptel.com <http://kvrh7os202.comptel.com>
> 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64
> x86_64 x86_64 GNU/Linux
> [root@kvrh7os202 ~]#
> 
> Things I tried:
> 1. Setting random_page_cost = 1.1 and effective_io_concurrency = 200 -
> no effect on planning time
> 2. Create materialized view for big UNION query - planning time reduced
> significantly but not a viable solution
> 

Those changes likely affect the query costing and execution, but the
number of plans to consider is probably not going to change much. So
planning taking about the same time is kinda expected here.

> What are my other options to improve the query planning time?
> 

Can you do a bit of profiling, to determine which part of the query
planning process is slow here? That is:

1) make sure you have the debug symbols installed
2) do `perf record`
3) run the benchmark for a while (a minute or so)
4) stop the perf record using Ctrl-C
5) generate a profile using `perf report` and share the result

Possibly do the same thing with `perf record -g` to collect call-graph
information, but the output is probably going to be way larger.
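
Concretely, something along these lines (the PID is whatever backend runs
the test queries, and the debug symbols must match your PostgreSQL build):

    perf record -p <backend_pid> -- sleep 60
    perf report --stdio > profile.txt

    # optionally, with call-graph information
    perf record -g -p <backend_pid> -- sleep 60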

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:

On 4/14/2019 23:24, Tom Lane wrote:

   ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

2.2 GB is massive yes.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:


   HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
41360 used
   HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
   HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
   HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 
chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.

Yes.

  I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.


Strange though, that the vmstat tracking never showed that the cache 
allocated memory goes much below 6 GB. Even if this 2.2 GB memory leak 
is there, and even if I had 2 GB of shared_buffers, I would still have 
enough for the OS to give me.




Depends on how the kernel is configured. What are vm.overcommit_memory
and vm.overcommit_ratio set to, for example?

It may easily be the case that the kernel is only allowing 50% of RAM to
be committed to user space, and then refusing to allocate more despite
having free memory. That's fairly common issue on swapless systems.

Try running the query again, watch

   cat /proc/meminfo | grep Commit

and see if it crashes when Committed_AS hits the CommitLimit.
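
To see how the kernel is configured and how close you get to the limit,
something like this should do (inspection only, not a recommendation to
change anything yet):

    sysctl vm.overcommit_memory vm.overcommit_ratio
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo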

That doesn't explain where the memory leak is, though :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 05:19:50PM -0400, Jeff Janes wrote:

  On Sun, Apr 14, 2019 at 4:51 PM Gunther  wrote:

For weeks now, I am banging my head at an "out of memory" situation.
There is only one query I am running on an 8 GB system, whatever I try,
I get knocked out on this out of memory.

  Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
  OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't
really break this down. It shouldn't matter though. But just so you can
get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
   


  Maybe it is memory for trigger or constraint checking, although I don't
  know why that would appear instantly.  What triggers or constraints do you
  have on businessoperation? 


Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though :-(


  What if you just run the SELECT without the INSERT?  Or insert into a temp
  table rather than into businessoperation?  And if that doesn't crash, what
  if you then insert to businessoperation from the temp table?
   


Yeah. What's the schema of "businessoperation"? Anything special about
it? Triggers, expression indexes, check constraints, ...

Gunther, you mentioned you build postgres from sources. Would it be
possible to add some sort of extra debugging to see where the memory is
allocated from? It's a bit heavy-handed, though.

Or maybe splitting es_query_ctx into smaller contexts. That might be
easier to evaluate than sifting through god-knows-how-many GBs of log.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Block / Page Size Optimization

2019-04-15 Thread Tomas Vondra
 limitation did not seem very useful compared to
the added complexity. But it's probably a question for kernel hackers.

If you have a link that I can just read, I appreciate you sharing 
that. I think that should be on some Wiki or FAQ somewhere. If I get a 
quick and dirty explanation with some pointers, I can try to write it 
out into a more complete answer that might be added into some 
documentation or FAQ somewhere.




Maybe read this famous paper by Jim Gray & Franco Putzolu. It's not
exactly about the thing you're asking about, but it's related. It
essentially deals with sizing memory vs. disk I/O, and page size plays
an important role in that too.

[1] https://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-18 Thread Tomas Vondra

On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
Hi guys. I don't want to be pushy, but I found it strange that after 
so much lively back and forth getting to the bottom of this, suddenly 
my last night's follow-up remained completely without reply. I wonder 
if it even got received. For those who read their emails with modern 
readers (I know I too am from a time where I wrote everything in plain 
text) I marked some important questions in bold.




It was received (and it's visible in the archives). It's right before
Easter, so I guess some people may already be on vacation.

As for the issue - I think the current hypothesis is that the data
distribution is skewed in some strange way, triggering some unexpected
behavior in hash join. That seems plausible, but it's really hard to
investigate without knowing anything about the data distribution :-(

It would be possible to do at least one of these two things:

(a) export pg_stats info about distribution of the join keys

The number of tables involved in the query is not that high, and this
would allow us to generate a data set approximating your data (see the
sketch below). The one thing this can't do is show how it's affected by
the WHERE conditions.

(b) export data for join keys

This is similar to (a), but it would allow filtering data by the WHERE
conditions first. The amount of data would be higher, although we only
need data from the columns used as join keys.

Of course, if those key values contain sensitive data, it may not be
possible, but perhaps you could hash it in some way.
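
For option (a), a sketch of the kind of query that would give us that
information - the table and column names here (tmp_q, tmp_r,
documentinternalid) are just examples, substitute the actual tables and
join keys:

    SELECT tablename, attname, null_frac, n_distinct,
           most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename IN ('tmp_q', 'tmp_r')
       AND attname = 'documentinternalid';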


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Best Filesystem for PostgreSQL

2019-04-18 Thread Tomas Vondra

On Thu, Apr 18, 2019 at 02:19:28AM +, Stephan Schmidt wrote:

My question was meant for a Debian 9 environment with heavy read/write
load and very high requirements regarding performance and data consistency.



Well, that's like asking "which car is the best" unfortunately. There's no
good answer, as it very much depends on your expectations, hardware etc.
Everyone wants good performance, reliability and consistency.

Simply said, if you're on current Linux and you don't have any additional
requirements (like snapshotting), then ext4/xfs are likely your best bet.
There are differences between these two filesystems, but it depends on the
workload, hardware etc. Overall the behavior is pretty close, though. So
either you just go with either of those, or you do some testing with your
application on the actual hardware.

If you need something more advanced (like better snapshotting, etc.) then
maybe ZFS is the right choice for you. It also allows various advanced
configurations with ZIL, L2ARC, ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:


...

It would be so nice if there was a way to force a specific plan for 
purposes of the testing.  I tried giving false data in pg_class 
reltuples and relpages:


foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in 
('tmp_q', 'tmp_r');
 relname | relpages |  reltuples
---------+----------+-------------
 tmp_r   |  5505039 | 1.13467e+08
 tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where 
relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 
'tmp_r';
UPDATE 1

but that didn't help. Somehow the planner outsmarts every such trick, 
so I can't get it to follow my right outer join plan where the big 
table is hashed.  I am sure y'all know some way to force it.




That does not work, because the planner does not actually use these values
directly - it only computes the density from them, and then multiplies
that to the current number of pages in the file. That behaves much nicer
when the table grows/shrinks between refreshes of the pg_class values.

So what you need to do is tweak these values to skew the density in a way
that then results in the desired estimate when multiplied with the actual
number of pages. For me, this did the trick:

 update pg_class set (relpages, reltuples) = (100, 1)
 where relname = 'tmp_r';

 update pg_class set (relpages, reltuples) = (1, 100)
 where relname = 'tmp_q';

after which I get a plan like this:

  Hash Right Join
Hash Cond: (...)
->  Seq Scan on tmp_q q
->  Hash
  ->  Seq Scan on tmp_r r

As for the issue, I have a theory that I think would explain the issues.
It is related to the number of batches, as others speculated over here.
It's not a memory leak, though, it's just that each batch requires a lot
of extra memory and we don't account for that.

The trouble is - each batch is represented by BufFile, which is a whopping
8272 bytes, because it includes PGAlignedBlock. Now, with 131072 batches,
that's a nice 1GB of memory right there. And we don't actually account for
this memory in hashjoin code, so it's not counted against work_mem and we
just increase the number of batches.

Attached are two patches, that should help us to confirm that's actually
what's happening when running the query on actual data. The first patch
moves the BufFile stuff into a separate memory context, to make it more
obvious where the memory went. It also adds a bunch of logging into the
ExecHashIncreaseNumBatches() function.

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared
on this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.

With the patch, I see stuff like this with 256k batches:

 ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
   HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
 hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 
chunks); 4395614728 used

so it's conceivable it's the root cause.

As for a fix, I'm not sure. I'm pretty sure we need to consider the amount
of memory for BufFile(s) when increasing the number of batches. But we
can't just stop incrementing the batches, because that would mean the
current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch." So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close the files and reuse the
memory for the next 32k batches. We'd read the overflow batch, split it
into the 32k batches, and just process them as usual. Of course, there
might be multiple rounds of this, for example we might end up with 32k
concurrent batches but 128k virtual ones, which means we'd do 4 rounds of
this dance.

It's a bit inefficient, but situations like this should be rather rare,
and it's more graceful than just crashing with OOM.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 02:30:09PM -0500, Justin Pryzby wrote:

On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:

Gunther  writes:
>ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.


On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the


On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:

   ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual 
time=199792.446..199792.446 rows=113478127 loops=1)
 Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  
Memory Usage: 189207kB


Is it significant that there are ~2x as many ExecutorState blocks as there are
batches ?  266261/131072 => 2.03...



IMO that confirms this is the issue with BufFile I just described, because 
the struct is >8K, so it's allocated as a separate block (it exceeds the 
threshold in AllocSet). And we have two BufFile(s) for each batch, because

we need to batch both the inner and outer relations.


If there was 1 blocks leaked when batch=2, and 2 blocks leaked when batch=4,
and 4 blocks leaked when batch=131072, then when batch=16, there'd be 64k
leaked blocks, and 131072 total blocks.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376


Well, the BufFile is 8272 on my system, so that's pretty close ;-)


which is pretty close to the 8kB I/O buffer you were talking about (if the
number of same-sized buffers much greater than other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but out
of control increasing of nbatches, and failure to account for their size...then
this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very low
work_mem (or very larger number of tuples hashed), it'll try to use a large
number of batches.  At some point the memory used by BatchFiles structure
(increasing by powers of two) itself exceeds work_mem.

With larger work_mem, there's less need for more batches.  So the number of
batches used for small work_mem needs to be constrained, either based on
work_mem, or at all.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.



Yeah. The patch might be enough for debugging, but it's clearly not
something we could adopt as is, because we increase the number of batches
for a reason - we need to do that to keep the amount of memory needed for
the hash table contents (i.e. rows) below work_mem. If you just cap the
number of batches, you'll keep the amount of memory for BufFile under
control, but the hash table may exceed work_mem.

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.



I think it's really a matter of underestimate, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's
possible it actually works as expected. With work_mem = 4MB I've seen 32k
batches, and that's not that far off, I'd say. Maybe there are more common
values, but it does not seem like a very contrived data set.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I think it's really a matter of underestimate, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's
possible it actually works as expected. With work_mem = 4MB I've seen 32k
batches, and that's not that far off, I'd say. Maybe there are more common
values, but it does not seem like a very contrived data set.


Maybe we just need to account for the per-batch buffers while estimating
the amount of memory used during planning.  That would force this case
into a mergejoin instead, given that work_mem is set so small.



How would that solve the issue of underestimates like this one?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 10:36:50PM +0200, Tomas Vondra wrote:

On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.



I think it's really a matter of underestimate, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's
possible it actually works as expected. With work_mem = 4MB I've seen 32k
batches, and that's not that far off, I'd say. Maybe there are more common
values, but it does not seem like a very contrived data set.



Actually, I might have spoken too soon. I've done some stats on the sample
data. There are 113478127 rows in total, and while most UUIDs are unique,
there are UUIDs that represent ~10% of the data. So maybe there really is
something broken in disabling the growth.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

test=# select c1, count(*) as c2 from (select count(*) as c1 from tmp_r group 
by documentinternalid) foo group by c1;
c1|c2 
--+---
1 | 67884
2 | 33033
3 |  9478
4 | 11205
5 |  2725
6 |  3457
7 |   677
8 |  1992
9 |   961
   10 |   857
   11 |   199
   12 |  1397
   13 |81
   14 |   289
   15 |   630
   16 |   561
   17 |69
   18 |   360
   19 |42
   20 |   367
   21 |   232
   22 |   121
   23 |49
   24 |   380
   25 |   113
   26 |85
   27 |93
   28 |   117
   29 |21
   30 |   136
   31 |32
   32 |   102
   33 |24
   34 |17
   35 |24
   36 |   145
   37 |37
   38 |15
   39 |17
   40 |54
   41 |15
   42 |65
   43 | 8
   44 |48
   45 |26
   46 |12
   47 | 5
   48 |88
   49 |80
   50 |26
   51 |10
   52 |10
   53 | 1
   54 |68
   55 | 6
   56 |31
   57 | 6
   58 | 7
   59 | 3
   60 |43
   61 | 1
   62 | 5
   63 | 7
   64 |24
   65 |10
   66 |24
   68 | 8
   69 | 2
   70 | 9
   71 | 2
   72 |31
   76 | 7
   77 | 5
   78 | 7
   79 | 3
   80 |22
   81 | 7
   82 | 4
   84 |26
   85 | 2
   86 | 4
   88 | 6
   90 |17
   91 | 5
   94 | 5
   96 |23
   98 |86
   99 | 9
  100 | 5
  102 | 2
  104 | 5
  105 | 2
  106 | 1
  108 |12
  110 |17
  112 | 6
  113 | 1
  114 | 9
  120 | 6
  121 | 8
  122 | 1
  124 |10
  126 | 3
  128 | 6
  129 | 2
  132 |10
  135 | 2
  136 | 1
  138 | 2
  140 | 2
  142 | 1
  144 | 9
  146 | 1
  147 |15
  148 | 2
  150 | 5
  152 | 4
  154 | 2
  156 | 3
  158 | 1
  160 | 9
  162 | 3
  168 | 2
  174 | 3
  176 | 3
  177 | 1
  178 | 1
  180 | 4
  181 | 2
  183 | 1
  186 | 2
  188 | 5
  192 | 2
  194 | 4
  196 |35
  198 | 2
  200 | 3
  208 | 4
  213 | 1
  216 | 4
  220 | 2
  223 | 2
  224 | 9
  227 | 1
  228 | 4
  234 | 1
  240 | 4
  241 | 1
  245 | 9
  248 | 1
  252 | 5
  256 | 7
  268 | 1
  272 | 3
  280 | 2
  282 | 1
  283 | 3
  288 | 4
  294 |24
  300 | 1
  311 | 1
  315 | 1
  320 | 2
  336 | 4
  343 | 1
  352 | 1
  356 | 5
  357 | 2
  360 | 2
  372 | 2
  375 | 1
  378 | 1
  384 | 3
  392 | 9
  393 | 1
  396 | 2
  400 | 3
  403 | 1
  405 | 1
  418 | 1
  432 | 3
  441 |11
  442 | 1
  443 | 1
  448 | 2
  450 | 1
  454 | 1
  456 | 2
  462 | 2
  475 | 1
  476 | 1
  480 | 2
  488 |

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 06:20:15PM -0400, Tom Lane wrote:

Justin Pryzby  writes:

On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:

Maybe we just need to account for the per-batch buffers while estimating
the amount of memory used during planning.  That would force this case
into a mergejoin instead, given that work_mem is set so small.



Do you mean by adding disable_cost if work_mem is so small that it's estimated
to be exceeded ?


No, my point is that ExecChooseHashTableSize fails to consider the
I/O buffers at all while estimating hash table size.  It's not
immediately obvious how to factor that in, but we should.

If Tomas is right that there's also an underestimate of the number
of rows here, that might not solve Gunther's immediate problem; but
it seems like a clear costing oversight.

There's also the angle that the runtime code acts as though increasing
the number of batches is free, while it clearly isn't when you think
about the I/O buffers.  So at some point we should probably stop
increasing the number of batches on the grounds of needing too many
buffers.


Yes. I think it might be partially due to the cost being hidden elsewhere.
The hashjoin code only really deals with the array of pointers to BufFile,
not with the BufFiles themselves. That might have looked insignificant for
common cases, but clearly for these corner cases it matters quite a bit.

So yes, ExecChooseHashTableSize() needs to consider this memory and check
if doubling the number of batches has any chance of actually improving
things, because at some point the BufFile memory starts to dominate and
would just force us to do more and more batches.

But I think we should also consider this before even creating the hash
join path - see if the expected number of batches has any chance of
fitting into work_mem, and if not then just not create the path at all.
Just like we do for hash aggregate, for example. It's not going to solve
cases like this (with underestimates), but it seems reasonable. Although,
maybe we won't actually use such paths, because merge join will win thanks
to being automatically cheaper? Not sure.

Also, I wonder if we really need 8kB buffers here. Would it make sense to
allow smaller buffers in some cases? Say, 1kB. It's not going to save us,
but it's still 8x better than now.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 08:33:46PM -0400, Gunther wrote:

On 4/20/2019 16:01, Tomas Vondra wrote:

For me, this did the trick:
 update pg_class set (relpages, reltuples) = (100, 1) where 
relname = 'tmp_r';
 update pg_class set (relpages, reltuples) = (1, 100) where 
relname = 'tmp_q';



YES! For me too. My EXPLAIN ANALYZE actually succeeded.

Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual 
time=511580.110..1058354.140 rows=113478386 loops=1)
  Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND 
((q.actinternalid)::text = (r.actinternalid)::text))
  ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=2100 width=3417) 
(actual time=1.148..1.387 rows=236 loops=1)
  ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual 
time=511577.002..511577.002 rows=113478127 loops=1)
Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  
Memory Usage: 679961kB
->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) 
(actual time=4.077..344558.954 rows=113478127 loops=1)
Planning Time: 0.725 ms
Execution Time: 1064128.721 ms

But it used a lot of resident memory, and now it seems like I actually 
have a leak! Because after the command returned as shown above, the 
memory is still allocated:


 PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: 
postgres integrator [local] idle

and let's do the memory map dump:

2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: 
explain analyze
   SELECT *
 FROM tmp_q q
 RIGHT OUTER JOIN tmp_r r
   USING(documentInternalId, actInternalId);
TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
 HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
 Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 
used
 TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
 Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
 RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
 MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
 Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
 smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
 TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
 Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
 TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
 Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
 CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 
1002296 used
   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: 
pg_class_tblspc_relfilenode_index
   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_toast_2619_index
   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: 
entity_id_idx
   ...
   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
 WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 
used
 PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
 MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
 LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
 Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
 ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used

strange, it shows no leak here. Now I run this test again, to see if 
the memory grows further in top? This time I also leave the DISTINCT 
step in the query. I am trying to hit the out of memory situation. 
Well, I clearly saw memory growing now:




Unfortunately, interpreting RES is way more complicated. The trouble is
PostgreSQL does not get memory from kernel directly, it gets it through
glibc. So when we do free(), it's not guaranteed kernel gets it.

Also, I think glibc has multiple ways of getting memory from the kernel.
It can either to mmap or sbrk, and AFAIK it's easy to cause "islands" that
make it impossible to undo sbrk after freeing memory.

Memory leaks in PostgreSQL are usually about allocating memory in the
wrong context, and so are visible in MemoryContextStats. Permanent leaks
that don't show there are quite rare.



Also, I think while we might have focused in on a peculiar planning 
situation where a very unfortunate plan is chosen which st

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 03:08:22AM -0500, Justin Pryzby wrote:

On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:

On 4/20/2019 21:14, Tomas Vondra wrote:
>Maybe. But before wasting any more time on the memory leak investigation,
>I suggest you first try the patch moving the BufFile allocations to a
>separate context. That'll either confirm or disprove the theory.

OK, fair enough. So, first patch 0001-* applied, recompiled and

2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was 
terminated by signal 11: Segmentation fault

...

turns out the MemoryContext is NULL:

(gdb) p context
$1 = (MemoryContext) 0x0


I updated Tomas' patch to unconditionally set the context.
(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)



I don't follow - is there a typo confusing oldctx vs. oldcxt? I don't
think so, but I might have missed something. (I always struggle with which
spelling is the right one).

I think the bug is actually much simpler - the memory context was created
only in ExecHashIncreaseNumBatches() when starting with nbatch=1. But when
the initial nbatch value was higher (i.e. when starting with 2 or more
batches), it was left NULL. That was OK for testing with the contrived
data set, but it may easily break on other examples.

So here is an updated patch - hopefully this version works. I don't have
time to do much more testing now, though. But it compiles.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 3292e67135ef67f4bcf1ec0595fe79497c49d47c Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 21 Apr 2019 13:31:14 +0200
Subject: [PATCH 1/2] move BufFile stuff into separate context

---
 src/backend/executor/nodeHash.c | 54 +
 src/backend/executor/nodeHashjoin.c |  7 
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 62 insertions(+)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..f7c92e78e9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -498,6 +498,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, 
bool keepNulls)
hashtable->skewTuples = 0;
hashtable->innerBatchFile = NULL;
hashtable->outerBatchFile = NULL;
+   hashtable->fileCtx = NULL;
hashtable->spaceUsed = 0;
hashtable->spacePeak = 0;
hashtable->spaceAllowed = space_allowed;
@@ -527,6 +528,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, 
bool keepNulls)

"HashBatchContext",

ALLOCSET_DEFAULT_SIZES);
 
+   hashtable->fileCtx = AllocSetContextCreate(CurrentMemoryContext,
+
"HashBatchFiles",
+
ALLOCSET_DEFAULT_SIZES);
+
/* Allocate data that will live for the life of the hashjoin */
 
oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -559,10 +564,14 @@ ExecHashTableCreate(HashState *state, List 
*hashOperators, bool keepNulls)
 
if (nbatch > 1 && hashtable->parallel_state == NULL)
{
+   MemoryContext oldctx;
+
/*
 * allocate and initialize the file arrays in hashCxt (not 
needed for
 * parallel case which uses shared tuplestores instead of raw 
files)
 */
+   oldctx = MemoryContextSwitchTo(hashtable->fileCtx);
+
hashtable->innerBatchFile = (BufFile **)
palloc0(nbatch * sizeof(BufFile *));
hashtable->outerBatchFile = (BufFile **)
@@ -570,6 +579,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, 
bool keepNulls)
/* The files will not be opened until needed... */
/* ... but make sure we have temp tablespaces established for 
them */
PrepareTempTablespaces();
+
+   MemoryContextSwitchTo(oldctx);
}
 
MemoryContextSwitchTo(oldcxt);
@@ -900,6 +911,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
nbatch = oldnbatch * 2;
Assert(nbatch > 1);
 
+   elog(WARNING, "ExecHashIncreaseNumBatches: increasing number of batches 
from %d to %d", oldnbatch, nbatch);
+
+   elog(LOG, "ExecHashIncreaseNumBatches === context stats start 
===");
+   MemoryContextStats(TopMemoryContext);
+
 #ifdef HJDEBUG
printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
   hashtable, nbatch,

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:

Jeff Janes  writes:

The growEnabled stuff only prevents infinite loops.  It doesn't prevent
extreme silliness.



If a single 32 bit hash value has enough tuples by itself to not fit in
work_mem, then it will keep splitting until that value is in a batch by
itself before shutting off


Right, that's the code's intention.  If that's not good enough for this
case, we'll need to understand the details a bit better before we can
design a better(?) heuristic.



I think we only disable growing when there are no other values in the
batch, but that seems rather easy to defeat - all you need is a single
tuple with a hash that falls into the same batch, and it's over. Maybe
we should make this a bit less accurate - say, if less than 5% memory
gets freed, don't add more batches.
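
To make the failure mode concrete, a data set along these lines (all names
made up; gen_random_uuid() comes from pgcrypto on the releases discussed
here) has a single value covering ~10% of the rows, so once everything else
has been split away from that value's batch, further splitting can't free
any more memory:

   CREATE EXTENSION IF NOT EXISTS pgcrypto;

   -- ~10% of the rows share one UUID, the rest are unique
   CREATE TABLE skewed AS
   SELECT CASE WHEN i % 10 = 0
               THEN '00000000-0000-0000-0000-000000000000'::uuid
               ELSE gen_random_uuid()
          END AS id
   FROM generate_series(1, 1000000) AS s(i);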


I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".



I think the question the code needs to be asking is "If we double the
number of batches, does the amount of memory we need drop?" And the
memory needs to account both for the buffers and per-batch data.

I don't think we can just stop increasing the number of batches when the
memory for BufFile exceeds work_mem, because that entirely ignores the
fact that by doing that we force the system to keep the per-batch stuff
in memory (and that can be almost arbitrary amount).

What I think we should be doing instead is instead make the threshold
dynamic - instead of just checking at work_mem, we need to increment the
number of batches when the total amount of memory exceeds

   Max(work_mem, 3 * memory_for_buffiles)

This is based on the observation that by increasing the number of
batches we double memory_for_buffiles while only cutting the per-batch
data in half. So once the BufFile memory dominates, adding more batches
would actually increase the total amount of memory used.

Of course, this just stops enforcing work_mem at some point, but it at
least attempts to minimize the amount of memory used.

An alternative would be spilling the extra tuples into a special
overflow file, as I explained earlier. That would actually enforce
work_mem I think.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Backup and Restore (pg_dump & pg_restore)

2019-04-21 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 06:50:47PM +, Daulat Ram wrote:

  Hello Team,

   


  We are getting below error while migrating pg_dump from Postgresql 9.6 to
  Postgresql 11.2 via pg_restore in docker environment.

   


  90d4c9f363c8:~$ pg_restore -d kbcn "/var/lib/kbcn_backup19"

  pg_restore: [archiver (db)] Error while PROCESSING TOC:

  pg_restore: [archiver (db)] Error from TOC entry 3; 2615 2200 SCHEMA
  public postgres

  pg_restore: [archiver (db)] could not execute query: ERROR: schema
  "public" already exists

  Command was: CREATE SCHEMA public;

  


How is this related to performance? Please send it to pgsql-general, and
include information about how you created the dump.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 07:25:15PM -0400, Gunther wrote:

  After applying Tomas' corrected patch 0001, and routing HJDEBUG messages
  to stderr:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;

...
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 
to 262144
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 
to 524288
ERROR:  out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".

  Now 

TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 
used
...
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
  ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 
654896 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
...
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 
chunks); 2238549672 used
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
  HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 
chunks); 168157144 used
TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 
326632 used
  Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 
1026968 used
...
Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 
2416968160 used



IMO this pretty much proves that the memory allocated for BufFile really
is the root cause of the issues with this query. 524288 batches means
up to 1048576 BufFiles, which is a bit more than ~8GB of RAM. However
those for the inner relation were not allocated yet, so at this point
only about 4GB might be allocated. And it seems ~1/2 of them did not
receive any tuples, so only about 2GB got allocated so far.

The second batch will probably make it fail much sooner, because it
allocates the BufFile stuff eagerly (both for inner and outer side).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:

Jeff Janes  writes:
> The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> extreme silliness.

> If a single 32 bit hash value has enough tuples by itself to not fit in
> work_mem, then it will keep splitting until that value is in a batch by
> itself before shutting off

I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".


On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:

I think the question the code needs to be asking is "If we double the
number of batches, does the amount of memory we need drop?" And the
memory needs to account both for the buffers and per-batch data.

I don't think we can just stop increasing the number of batches when the
memory for BufFile exceeds work_mem, because that entirely ignores the
fact that by doing that we force the system to keep the per-batch stuff
in memory (and that can be almost arbitrary amount).

...

Of course, this just stops enforcing work_mem at some point, but it at
least attempts to minimize the amount of memory used.


This patch defines reasonable as "additional BatchFiles will not themselves
exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
going to save us RAM"...



OK.


I think the first condition is not very sensitive and not too important to get right,
it only allows work_mem to be exceeded by 2x, which maybe already happens for
multiple reasons, related to this thread and otherwise.  It'd be fine to slap
on a factor of /2 or /4 or /8 there too.



TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes sense to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

   if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
   {
   ExecHashIncreaseNumBatches(hashtable);
   }
   else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
   {
   /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches 
to increase */
   /* I think this branch would be hit almost same as below branch */
   ExecHashIncreaseNumBatches(hashtable);
   }
   ...

but the reasoning for the second branch seems wrong, because

   (spaceUsed/2 >= spaceAllowed)

is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that

  (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.

Consider an ideal example with uniform distribution:

   create table small (id bigint, val text);
   create table large (id bigint, val text);

   insert into large select 10 * random(), md5(i::text)
   from generate_series(1, 7) s(i);

   insert into small select 10 * random(), md5(i::text)
   from generate_series(1, 1) s(i);

   vacuum analyze large;
   vacuum analyze small;

   update pg_class set (relpages, reltuples) = (100, 1)
   where relname = 'large';

   update pg_class set (relpages, reltuples) = (1, 100)
   where relname = 'small';

   set work_mem = '1MB';

   explain analyze select * from small join large using (id);

A log after each call to ExecHashIncreaseNumBatches says this:

   nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
   nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
   nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
   nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
   nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
   nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
   nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
   nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
   nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
   nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
   nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
   nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
   nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
 

Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Tomas Vondra

On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:

  Batch splitting shouldn't be followed by a hash function change? 


What would be the value? That can help with hash collisions, but that's
not the issue with the data sets discussed in this thread. The issue
reported originally is about underestimates, and the sample data set has
a large number of duplicate values (a single value representing ~10% of
the data set). Neither of those issues is about hash collisions.

The data set I used to demonstrate how the algorithms work is pretty
perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a
couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N
old batches to (2*N) new ones. By using the same 32-bit hash value and
considering one extra bit, the tuples either stay in the same batch
(when the new bit is 0) or move to a single new batch (when it's 1). So
each batch is split in 1/2. By changing the hash function this would no
longer be true, and we'd redistribute pretty much the whole data set.

The other issue is even more significant - we don't redistribute the
tuples immediately. We only redistribute the current batch, but leave
the other batches alone and handle them when we actually get to them.
This is possible, because the tuples never move backwards - when
splitting batch K, the tuples either stay in K or move to 2K. Or
something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, I think we can increment the number of
batches while already performing the join, after some rows were already
processed. That would probably be no longer true if we just switched the
hash function, because it might move rows backwards (to the already
processed region).
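
Just to illustrate the "one extra bit" point at the SQL level (the real code
in ExecHashGetBucketAndBatch() takes the batch bits from above the bucket
bits, but the principle is the same), a toy query like this shows that
doubling nbatch from 4 to 8 only considers one more bit of the same hash
value, so a row either keeps its batch number or moves to (batch + 4):

   SELECT i,
          hashtext(i::text) & 3 AS batch_with_nbatch_4,
          hashtext(i::text) & 7 AS batch_with_nbatch_8
     FROM generate_series(1, 10) AS s(i);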


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 03:43:48PM -0500, Justin Pryzby wrote:

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:

On 4/21/2019 23:09, Tomas Vondra wrote:
>What I think might work better is the attached v2 of the patch, with a
Thanks for this, and I am trying this now.

...

Aand, it's a winner!

Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual 
time=6150303.060..6895451.210 rows=435274 loops=1)
-> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual 
time=6150303.058..6801372.192 rows=113478386 loops=1)
Sort Method: external merge Disk: 40726720kB

For the first time this query has succeeded now. Memory was bounded. The
time of nearly hours is crazy, but things sometimes take that long


It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.



That's unlikely to reduce the amount of data written to temporary files,
it just means there will be fewer larger files - in total it's still
going to be ~40GB. And it's not guaranteed it'll improve performance,
because work_mem=4MB might fit into CPU caches and larger values almost
certainly won't. I don't think there's much to gain, really.


We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.



Right. Chances are that with a better estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.


Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))


If they're completely redundant and you can get the same result after
dropping one or two of those conditions, then you should.

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns
(and ANALYZE).
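
For reference, that suggestion would look roughly like this (the statistics
name is arbitrary, the columns are taken from the merge condition quoted
above, and it needs PostgreSQL 10 or newer):

   CREATE STATISTICS documentinformationsubject_deps (dependencies)
       ON documentinternalid, documentid, actinternalid
       FROM documentinformationsubject;

   ANALYZE documentinformationsubject;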



That's not going to help, because we don't use functional dependencies
in join estimation yet.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:

  On 4/21/2019 23:09, Tomas Vondra wrote:

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.

  Thanks for this, and I am trying this now.

  So far it is promising.

  I see the memory footprint contained under 1 GB. I see it go up, but also
  down again. CPU, IO, all being live.

foo=# set enable_nestloop to off;
SET
foo=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736

  Aand, it's a winner!



Good ;-)


 Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual 
time=6150303.060..6895451.210 rows=435274 loops=1)
   ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual 
time=6150303.058..6801372.192 rows=113478386 loops=1)
 Sort Key: ...
 Sort Method: external merge  Disk: 40726720kB
 ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 
width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
   Hash Cond: ...
...
 Planning Time: 40.559 ms
 Execution Time: 6896581.566 ms
(70 rows)


  For the first time this query has succeeded now. Memory was bounded. The
  time of nearly hours is crazy, but things sometimes take that long. The
  important thing was not to get an out of memory error.



TBH I don't think there's much we can do to improve this further - it's
a rather desperate effort to keep the memory usage as low as possible,
without any real guarantees.

Also, the hash join only takes about 1000 seconds out of the 6900 total.
So even if we got it much faster, the query would still take almost two
hours, give or take.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra
Right. Chances are that with a better estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.

  The prospect of a merge join is interesting here to consider: with the
  Sort/Unique step taking so long, it seems the Merge Join might also take a
  lot of time? I see my disks are churning for the most time in this way:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.50    0.00    2.50   89.50    0.00    0.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00  253.00  131.00    30.15    32.20   332.50
    2.01    8.40    8.41    8.37   2.59  99.60
nvme1n1p24        0.00     0.00  253.00  131.00    30.15    32.20   332.50
    2.01    8.40    8.41    8.37   2.59  99.60

  I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the
  previous steps, the hash join presumably, throughput was a lot higher,
  like 2000 IOPS with 120 MB/s read or write.

  But even if the Merge Join would have taken about the same or a little
  more time than the Hash Join, I wonder, if one could not use that to
  collapse the Sort/Unique step into that? Like it seems that after the
  Sort/Merge has completed, one should be able to read unique records without
  any further sorting? In that case the Merge would be a great advantage.



Probably not, because there are far more columns in the Unique step. We
might have done something with "incremental sort" but we don't have that
capability yet.


  What I like about the situation now is that with that 4x bigger work_mem,
  the overall memory situation remains the same. I.e., we are scraping just
  below 1GB for this process and we see oscillation, growth and shrinkage
  occurring. So I consider this case closed for me. That doesn't mean I
  wouldn't be available if you guys want to try anything else about it.

  OK, now here is the result with the 16 MB work_mem:

 Unique  (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual 
time=6283539.282..7003311.451 rows=435274 loops=1)
   ->  Sort  (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual 
time=6283539.280..6908879.456 rows=113478386 loops=1)
 ...
 Planning Time: 2.953 ms
 Execution Time: 7004340.091 ms
(70 rows)

  There isn't really any big news here. But what matters is that it works.



Yeah. Once the hash join outgrows the work_mem, the fallback logic
starts ignoring that in the effort to keep the memory usage minimal.

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-28 Thread Tomas Vondra

On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:


...

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.



OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "ballancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for
outer and inner side, so 256 in total, and 256*8kB = 2MB.

And then, we just increase the number of batches but instead of adding
the BufFile entries, we split batches into slices that we can keep in
memory (say, the 128 batches). And we keep BufFiles for the current one
and an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. It does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file.  Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

   7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - avalanche-like forwarding of
data between slices.

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need a few of them (far fewer than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.


Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.


Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexpectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) 

Re: Trying to handle db corruption 9.6

2019-05-20 Thread Tomas Vondra

On Mon, May 20, 2019 at 04:20:45PM +, Bimal wrote:

  I had run into the same issue about a year back; luckily I had a standby to
  quickly promote.  But I wish there was better documentation on how to
  handle WAL filling up the disk and resetting it.


pg_resetxlog is not a tool to deal with "WAL fill up". It's a last
resort option to deal with corrupted WAL, and can easily make matters
worse when used without due consideration. That seems to be the case
here, unfortunately.

On a properly behaving system, running out of disk space for pg_xlog
results in database shutdown. If you also get corrupted WAL, you have
bigger problems, I'm afraid.

Also, data corruption issues are one-off events, mostly unique. That
makes it rather difficult (~impossible) to write docs about recovering
from them. And it's why there are no magic tools.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trying to handle db corruption 9.6

2019-05-20 Thread Tomas Vondra

On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:

  Hey Greg,
  Basically my backup was made after the first pg_resetxlog so I was wrong.


Bummer.


  However, the customer had a secondary machine that wasn't synced for a
  month. I have all the WALs since the moment the secondary went out of
  sync. Once I started it, I hoped that it would start recovering the WALs and
  fill the gap. However, I got an error on the secondary:
   2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
  2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
  2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in
  control file
  2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid
  checkpoint record
  2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
  terminated by signal 6: Aborted
  2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
  process failure
  2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.

  I checked my secondary archive dir and pg_xlog dir and it seems that the
  restore command doesn't work. My restore_command:
  restore_command = 'rsync -avzhe ssh
  postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
  gunzip < /var/lib/pgsql/archive/%f > %p'
  archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
  /var/lib/pgsql/archive %r'


Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Generic Plans for Prepared Statement are 158155 times slower than Custom Plans

2019-05-20 Thread Tomas Vondra

On Mon, May 20, 2019 at 09:37:34PM +, Deepak Somaiya wrote:

wow this is interesting! 
@Tom, Bruce, David - Experts
Any idea why changing the datatype would cause so much degradation - this
is even when the plan remains the same and the data is the same.
Deepak
   On Friday, May 17, 2019, 2:36:05 AM PDT, Naik, Sameer  
wrote:


Deepak,

I changed the datatype from citext to text and now everything works fine.

The data distribution is the same, the plan is the same, yet there is a huge performance
degradation when citext is used instead of text.

However the business case requires case insensitive string handling.

I am looking forward to some expert advice here when dealing with citext data 
type.




It's generally a good idea to share explain analyze output for both
versions of the query - both with citext and text.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trying to handle db corruption 9.6

2019-05-21 Thread Tomas Vondra

On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:

Tomas :

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?

Well, when I start the secondary in recovery mode (the primary is down,
auto failover is disabled...), it doesn't start recovering the archived WALs
from the primary. The logs of the secondary:
receiving incremental file list
rsync: link_stat "/var/lib/pgsql/archive/0002.history" failed: No such
file or directory (2)

sent 8 bytes  received 10 bytes  36.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors)
(code 23) at main.c(1505) [receiver=3.0.6]
sh: /var/lib/pgsql/archive/0002.history: No such file or directory
2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
control file
2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint
record
2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
terminated by signal 6: Aborted
2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
process failure
2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
recovery at 2019-05-01 09:40:02 PDT

As I said, the secondary was down for a month and I have all the archives
of the WALs on my primary. I was hoping that the secondary would use the
restore_command to restore them:
restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'

my archive_command on the primary was :
archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

Am I missing something ?



First of all, the way you quote messages is damn confusing - there's no
clear difference between your message and the message you quote. I don't
know which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from archive, but the rsync
command fails for some reason. You're in the position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because archive uses the same path.



Another question: if I run initdb to initiate a new cluster and then copy
the data files of my old cluster into the new one, is there any chance
that it will work?
I mean, right now my primary is down and can't start up because it is
missing an offsets file in the pg_multixact/offsets dir.



No, because you won't have contents of system catalogs, mapping the data
files to relations (tables, indexes) and containing information about the
structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manual recovery - say, you might create the same tables in the new
schema, and then guess which data files belong to them. But there are
various caveats e.g. due to dropped columns, etc.
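
(For context, on a working cluster the mapping between a relation and its
data file can be looked up as below - "mytable" is just a placeholder. It is
exactly this catalog information that is missing without the system
catalogs.)

   SELECT oid, relfilenode, pg_relation_filepath('mytable')
     FROM pg_class
    WHERE relname = 'mytable';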

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Trying to handle db corruption 9.6

2019-05-21 Thread Tomas Vondra

On Tue, May 21, 2019 at 04:03:52PM +, Greg Clough wrote:

 My restore command copies the WALs from the archive dir on the primary to an
 archive dir on the secondary (different from the pg_xlog dir on the
 secondary).


I think that your restore command puts them back into the archive, and
then uncompresses them into pg_xlog, which is what %p represents.



Should I run it manually and see if the archives are copied to the
archive dir in the secondary or should I just copy all of them to the
xlog dir in the secondary ?


That would be my first test, but as Tomas mentioned, you don't have any
hint of WAL archives being restored in the postgresql.log... so it's not
even trying.  It's not likely that archive_command is your problem at the
moment.



 I tried to start the secondary as a primary (I have a backup...) but I
 still got an error (invalid checkpoint record from primary/secondary).
 Does it mean that my backup is corrupted?


I think so, but Tomas could probably confirm if all hope is lost.  Also,
I'm not sure if there is a terminology difference but a "standby" is
never considered a "backup".  I realise it's late in the day, but even if
you have a correctly configured Standby you should also take backups with
pg_basebackup, Barman, pgBackRest, etc.



Well, I have no idea. We still got no information about how the standby
was created, if it was ever running fine, and so on. Considering it does
not seem to be getting data from the archive, it might be the case it was
created in some strange way and never really worked. And if there really
are no log messages about the restore_command, it probably fails before
the standby even tries to execute it.

So I don't know.


Restoring backups is where I would be heading now, as things seem
terribly broken.



Right. But my impression is there are no backups ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pg_restore takes more time on creation of rules

2019-05-22 Thread Tomas Vondra

On Wed, May 22, 2019 at 06:26:49PM +0300, Mariel Cherkassky wrote:

Hey,
I'm trying to restore a cluster (9.2) from 3 binary dumps (pg_dump -Fc).
Each dump contains only one database.
The sizes :
A-10GB
B-20GB
C-5GB.

For some unclear reason the restore of the third database is taking a lot of
time. It isn't stuck, but it keeps creating db rules. This database has
more than 400K rules.



What do you mean by "rules"?


I changed a few postgresql.conf parameters :
shared_buffers = 2GB
effective_cache_size = 65GB
checkpoint_segments =20
checkpoint_completion_target = 0.9
maintenance_work_mem = 10GB
checkpoint_timeout=30min
work_mem=64MB
autovacuum = off
full_page_writes=off
wal_buffers=50MB

my machine has 31 CPUs and 130GB of RAM.

Any idea why the restore of the first two dbs takes about 15 minutes while the
third db, which is the smallest, takes more than 1 hour? I restore the
dumps with pg_restore with 5 jobs (-j).



Well, presumably the third database has complexity in other places,
possibly spending a lot of time on CPU, while the other databases don't
have such an issue.

What would help is a CPU profile, e.g. from perf.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pg_restore takes more time on creation of rules

2019-05-22 Thread Tomas Vondra

On Wed, May 22, 2019 at 06:44:29PM +0300, Mariel Cherkassky wrote:

By rules I mean DB rules (similar to triggers but different)



I very much doubt such a high number of rules was expected during the
design (especially if it's on a single table), so perhaps there's an
O(N^2) piece of code somewhere. I suggest you do a bit of profiling, for
example using perf [1], which would show where the time is spent.

[1] https://wiki.postgresql.org/wiki/Profiling_with_perf

And please stop top-posting, it makes it much harder to follow the
discussion.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Use Postgres as a column store by creating one table per column

2019-05-24 Thread Tomas Vondra

On Thu, May 23, 2019 at 01:08:42AM -0400, George Neuner wrote:
On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov 
 wrote:


Is it efficient to use Postgres as a column store by creating one table per
column?

I would query it with something like `[...] UNION SELECT value AS <column>
FROM <table> WHERE value = <value> UNION [...]` to build a row.


I think you mean JOIN.

You'd need more than that: Postgresql uses MVCC for concurrency, so 
whenever you update any row in a table, the ordering of the rows 
within the table changes.  And the JOIN operation inherently is 
unordered - you need to sort the result deliberately to control 
ordering.


To emulate a column-store, at the very least you need a way to 
associate values from different "columns" that belong to the same 
"row" of the virtual table.  IOW, every value in every "column" needs 
an explicit "row" identifier.  E.g.,


   col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...

For performance you would need to have indexes on at least the rowid 
in each of the "column" tables.


This is a bare minimum and can only work if the columns of your 
virtual table and the queries against it are application controlled or 
statically known.  If you want to do something more flexible that will 
support ad hoc table modifications, elastically sized values (strings, 
bytes, arrays, JSON, XML), etc. this example is not suffice and the 
implementation can get very complicated very quickly



Justin Pryzby was not joking when he said the performance could be 
awful ... at least as compared to a more normal row-oriented 
structure.  Performance of a query that involves more than a handful 
of "columns", in general, will be horrible.  It is up to you to decide 
whether some (maybe little) increase in performance in processing 
*single* columns will offset likely MASSIVE loss of performance in 
processing multiple columns.




Maybe take a look at this paper:

   http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

which essentially compares this approach to a "real" column store.

It certainly won't give you performance comparable to column store, it
adds quite a bit of overhead (disk space because of row headers, CPU
because of extra joins, etc.).

And it can't give you the column-store benefits - compression and/or
more efficient execution.



I'm thinking since Postgres stores tables in continuous blocks of 16MB each
(I think that's the default page size?)


Default page size is 8 KB.  You'd have to recompile to change that, 
and it might break something - a whole lot of code depends on knowing
the size of storage pages.





Right. And the largest page size is 32kB. But 8kB is a pretty good
trade-off, in most cases.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: UUID v1 optimizations...

2019-05-25 Thread Tomas Vondra

On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:

Ancoron Luciferis  writes:

On 25/05/2019 16:57, Tom Lane wrote:

(4) it in fact *wouldn't* do anything useful, because we'd still have
to sort UUIDs in the same order as today, meaning that btree index behavior
would remain the same as before.  Plus UUID comparison would get a lot
more complicated and slower than it is now.



I get your first sentence, but not your second. I know that when
changing the internal byte order we'd have to completely re-compute
everything on-disk (from table to index data), but why would the sorting
in the index have to be the same?


Because we aren't going to change the existing sort order of UUIDs.
We have no idea what applications might be dependent on that.

As Vitalii correctly pointed out, your beef is not with the physical
storage of UUIDs anyway: you just wish they'd sort differently, since
that is what determines the behavior of a btree index.  But we aren't
going to change the sort ordering because that's an even bigger
compatibility break than changing the physical storage; it'd affect
application-visible semantics.

What you might want to think about is creating a function that maps
UUIDs into an ordering that makes sense to you, and then creating
a unique index over that function instead of the raw UUIDs.  That
would give the results you want without having to negotiate with the
rest of the world about whether it's okay to change the semantics
of type uuid.
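
A minimal sketch of that approach might look like this - note that
uuid_v1_ts() is a hypothetical IMMUTABLE function extracting the embedded
v1 timestamp (it is not shipped with PostgreSQL), and the raw uuid is kept
in the index to preserve uniqueness:

   CREATE UNIQUE INDEX my_table_uuid_ts_idx
       ON my_table (uuid_v1_ts(id), id);

   -- queries have to use the same expression to benefit from the index
   SELECT * FROM my_table
    WHERE uuid_v1_ts(id) >= now() - interval '1 day';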



FWIW that's essentially what I implemented as an extension some time
ago. See [1] for a more detailed explanation and some benchmarks.

The thing is - it's not really desirable to get a perfectly sequential
ordering, because that would mean we never get back to older parts of
the index (so if you delete data, we'd never fill that space).

[1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: UUID v1 optimizations...

2019-05-25 Thread Tomas Vondra

On Sat, May 25, 2019 at 06:38:08PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:

What you might want to think about is creating a function that maps
UUIDs into an ordering that makes sense to you, and then creating
a unique index over that function instead of the raw UUIDs.  That
would give the results you want without having to negotiate with the
rest of the world about whether it's okay to change the semantics
of type uuid.



FWIW that's essentially what I implemented as an extension some time
ago. See [1] for a more detailed explanation and some benchmarks.


Also, another way to attack this is to create a new set of ordering
operators for UUID and an associated non-default btree opclass.
Then you declare your index using that opclass and you're done.
The key advantage of this way is that the regular UUID equality
operator can still be a member of that opclass, meaning that searches
of the form "uuidcol = constant" can still use this index, so you
don't have to change your queries (at least not for that common case).
Look at the interrelationship of the regular text btree operators and
the "pattern_ops" btree operators for a precedent.



Perhaps. But it does not allow tuning how often the values "wrap",
which I think is a useful capability.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: UUID v1 optimizations...

2019-05-25 Thread Tomas Vondra

On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:

On 26/05/2019 00:14, Tomas Vondra wrote:

On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:

Ancoron Luciferis  writes:

On 25/05/2019 16:57, Tom Lane wrote:

(4) it in fact *wouldn't* do anything useful, because we'd still have
to sort UUIDs in the same order as today, meaning that btree index
behavior
would remain the same as before.  Plus UUID comparison would get a lot
more complicated and slower than it is now.



I get your first sentence, but not your second. I know that when
changing the internal byte order we'd have to completely re-compute
everything on-disk (from table to index data), but why would the sorting
in the index have to be the same?


Because we aren't going to change the existing sort order of UUIDs.
We have no idea what applications might be dependent on that.

As Vitalii correctly pointed out, your beef is not with the physical
storage of UUIDs anyway: you just wish they'd sort differently, since
that is what determines the behavior of a btree index.  But we aren't
going to change the sort ordering because that's an even bigger
compatibility break than changing the physical storage; it'd affect
application-visible semantics.

What you might want to think about is creating a function that maps
UUIDs into an ordering that makes sense to you, and then creating
a unique index over that function instead of the raw UUIDs.  That
would give the results you want without having to negotiate with the
rest of the world about whether it's okay to change the semantics
of type uuid.



FWIW that's essentially what I implemented as an extension some time
ago. See [1] for a more detailed explanation and some benchmarks.


Yes, I've seen that before. Pretty nice work you put together there, and
I'll surely have a look at it but we certainly need the node id in
compliance with v1 UUID's so that's why we've been generating UUID's at
the application side from day 1.



The thing is - it's not really desirable to get a perfectly sequential
ordering, because that would mean we never get back to older parts of
the index (so if you delete data, we'd never fill that space).


Wouldn't this apply also to any sequential-looking index (e.g. on
serial)?


Yes, it does apply to any index on sequential (ordered) data. If you
delete data from the "old" part (but not all, so the pages don't get
completely empty), that space is lost. It's available for new data, but
if we only insert to "new" part of the index, that's useless.


The main issue with the UUID's is that they almost instantly
consume a big part of the total value space (e.g. the first value is
'01...' and the second is coming as 'f3...'), which I would assume is not
very efficient with btrees (space reclaim? - bloat).



I don't understand what you mean here. Perhaps you misunderstand how
btree indexes grow? It's not like we allocate separate pages for
different values/prefixes - we insert the data until a page gets full,
then it's split in half. There is some dependency on the order in which
the values are inserted, but AFAIK random order is generally fine.


One of our major concerns is to keep index size small (VACUUM can't be
run every minute) to fit into memory next to a lot of others.



I don't think this has much to do with vacuum - I don't see how it's
related to the ordering of generated UUID values. And I don't see where
the "can't run vacuum every minute" comes from.


I've experimented with the rollover "prefix" myself but found that it
makes the index too big (same or larger index size than standard v1
UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
although INSERT performance wasn't that bad, our sequential UUID's were
way faster (at least pre-generated and imported with COPY to eliminate
any value generation impact).



I very much doubt that has anything to do with the prefix. You'll need
to share more details about how you did your tests.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: UUID v1 optimizations...

2019-05-26 Thread Tomas Vondra

On Sun, May 26, 2019 at 02:27:05PM +1000, Morris de Oryx wrote:

I'm not worthy to post here, but a bit of a random thought.

If I've followed the conversation correctly, the reason for a V1 UUID is
partly to order and partition rows by a timestamp value, but without the
cost of a timestamp column. As I was told as a boy, "Smart numbers aren't."
Is it _absolutely_ the case that you can't afford another column? I don't
know the ins and outs of the Postgres row format, but my impression is that
it's a fixed size, so you may be able to add the column without splitting
rows? Anyway, even if that's not true and the extra column costs you disk
space, is it the index that concerns you?  Have you considered a timestamp
column, or a numeric column with an epoch offset, and a BRIN index? If you
insert data in pretty much chronological order, that might work well for
you.

Best of luck, I've enjoyed following the commentary.



No, an extra column is not a solution, because it has no impact on the
index on the UUID column. One of the problems with indexes on random
data is that the entries go to random parts of the index. In the extreme
case, each index insert goes to a different index page (since the last
checkpoint) and therefore has to write the whole page into the WAL.
That's what full-page writes do. This inflates the amount of WAL, may
trigger more frequent checkpoints and (of course) reduces the cache hit
ratio for index pages (because we have to touch many of them).

The point of generating UUIDs in a more sequential way is to limit this
behavior by "concentrating" the index inserts into a smaller part of the
index. That's why indexes on sequential data (say, generated from a
SERIAL column) perform better.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Shortest offline window on database migration

2019-05-30 Thread Tomas Vondra

On Thu, May 30, 2019 at 12:08:04PM -0300, Haroldo Kerry wrote:

Hello,

We are migrating our PostgreSQL 9.6.10 database (with streaming replication
active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a
full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full
backup, and then about 1h to fully restore it with checksum enabled on the
new array, plus 2h to recreate the replica on the old array.

Although all synthetic tests (pgbench) indicate the new disk array is
faster, we will only be 100% confident once we see its performance on
production, so our backup plan is using our replica database on the older
array. If the new array performance is poor during production ramp up, we
can switch to the replica with little impact to our customers.

Problem is the offline window for backup, restore the full database with
checksum and recreate the replica is about 5h:30m.

One thing that occurred to us to shorten the offline window was restoring
the database to both the master and replica in parallel (of course we would
configure the replica as master to restore the database), that would shave
1h of the total time. Although this is not documented we thought that
restoring the same database to identical servers would result in binary
identical data files.

We tried this in lab. As this is not a kosher way to create a replica, we
ran a checksum comparison of all data files, and we ended up having a lot
of differences. Bummer. Both master and replica worked (no errors on logs),
but we ended up insecure about this path because of the binary differences
on data files.
But in principle it should work, right?


What should work? Backup using pg_dump and restore certainly won't give
you the same binary files - the commit timestamps will be different,
operations may happen in a different order (esp. with parallel restore),
and so on. And the instances don't start as a copy anyway, so there will
be different system IDs, etc.

So no, this is not a valid way to provision master/standby cluster.


Has anyone been through this type of problem?



Unfortunately, I don't think there's a much better solution that what you
initially described - dump/restore, and then creating a replica.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Strange query behaviour between 9.4 and 12beta1

2019-06-04 Thread Tomas Vondra

On Tue, Jun 04, 2019 at 05:34:07PM +0300, Vasilis Ventirozos wrote:

Hello everyone,
I started comparing performance between postgres 9.4 and 12beta1 more 
specifically comparing the new (materialized) CTE.


Are you saying the CTE is specified as MATERIALIZED in the query on 12?
Because I don't see it in the explain plan (it's mentioned in the 9.4
plan, though).


The statements I use are application statements that I have little control over,
Hardware is identical as both clusters are running on the same server, on the 
same disks, with the same data.
Also, cluster settings are almost identical and both clusters have been 
analyzed.
In all my tests 12 is faster , sometimes much faster, apart from one single 
query that takes ~12 seconds on 9.4 and nearly 300 seconds on 12.

Plans for both :
Plan for 12 <https://explain.depesz.com/s/wRXO>
plan for 9.4 <https://explain.depesz.com/s/njtH>

The plans are obfuscated, apologies for that, but what stands out is the 
following:



Meh.



Hash Left Join 
 (cost=200,673.150..203,301.940 rows=153,121 width=64) (actual 
time=1,485.883..284,536.440 rows=467,123 loops=1)
Hash Cond: (lima_sierra(six_lima_november2.victor_romeo, 1) = 
foxtrot_hotel.victor_romeo)
Join Filter: (whiskey_uniform1.six_sierra = foxtrot_hotel.uniform_juliet)
Rows Removed by Join Filter: 4549925366



You have two equality conditions for the join. The first one is used to
match rows by the hash join itself - it's used to compute the hash
value and lookups. But there may be multiple rows that match on either
side, generating additional "virtual rows". Those are then removed by
the second condition.

Consider for example simple cross-join on this table:

  a  |  b
 -----+-----
  1  |  a
  1  |  b
  2  |  a
  2  |  b

and the query is

 SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)

Now, in the first phase, the hash join might only do (t1.a = t2.a),
which will generate 8 rows

 a | t1.b | t2.b
---+------+------
 1 | a    | a
 1 | a    | b
 1 | b    | a
 1 | b    | b
 2 | a    | a
 2 | a    | b
 2 | b    | a
 2 | b    | b

And then it will apply the second condition (t1.b = t2.b) as a "filter"
which will remove some of the rows. In your case the first step
generates 4.5B rows the second step discards.

I'm not sure why we don't use both conditions as a hash condition.
Perhaps it's a data type that does not support hashing, and on 9.4 that
does not matter because we end up using merge join.




I really can't understand what these 4.5bil rows have been removed from, there 
is nothing suggesting that this dataset was ever created (e.g. temp),
and these numbers definitely don't match what I was expecting, which is more or 
less what I'm seeing in the 9.4 plan.

Obviously I've tested this more than once and this behaviour persists.


Best Regards,
Vasilis Ventirozos




regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Custom opclass for column statistics?

2019-07-06 Thread Tomas Vondra

On Sat, Jul 06, 2019 at 11:02:27AM +0200, Ancoron Luciferis wrote:

Hi,

I've been wondering whether it is possible somehow to have the standard
column statistics to respect a certain operator class?

The reason why I am asking for this is that I have a UUID column with a
unique index at it using a custom operator class which implies a
different sort order than for the default UUID operator class.

This results into planner mistakes when determining whether to use the
index for row selection or not. Too often it falls back into sequential
scan due to this.



Can you share an example demonstrating the issue?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Custom opclass for column statistics?

2019-07-06 Thread Tomas Vondra

On Sat, Jul 06, 2019 at 05:35:33PM +0200, Ancoron Luciferis wrote:

On 06/07/2019 15:38, Tomas Vondra wrote:

On Sat, Jul 06, 2019 at 11:02:27AM +0200, Ancoron Luciferis wrote:

Hi,

I've been wondering whether it is possible somehow to have the standard
column statistics to respect a certain operator class?

The reason why I am asking for this is that I have a UUID column with a
unique index at it using a custom operator class which implies a
different sort order than for the default UUID operator class.

This results into planner mistakes when determining whether to use the
index for row selection or not. Too often it falls back into sequential
scan due to this.



Can you share an example demonstrating the issue?


regards



Yes, I have an opclass as follows:

CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid
   USING btree AS
   OPERATOR1   <*,
   OPERATOR1   <~ (uuid, timestamp with time zone),
   OPERATOR2   <=*,
   OPERATOR2   <=~ (uuid, timestamp with time zone),
   OPERATOR3   =,
   OPERATOR3   =~ (uuid, timestamp with time zone),
   OPERATOR4   >=*,
   OPERATOR4   >=~ (uuid, timestamp with time zone),
   OPERATOR5   >*,
   OPERATOR5   >~ (uuid, timestamp with time zone),
   FUNCTION1   uuid_timestamp_cmp(uuid, uuid),
   FUNCTION1   uuid_timestamp_only_cmp(uuid, timestamp with time zone),
   FUNCTION2   uuid_timestamp_sortsupport(internal)
;

...and e.g. operator "<*" is defined as:

CREATE FUNCTION uuid_timestamp_lt(uuid, uuid)
RETURNS bool
AS 'MODULE_PATHNAME', 'uuid_timestamp_lt'
LANGUAGE C
IMMUTABLE
LEAKPROOF
STRICT
PARALLEL SAFE;

COMMENT ON FUNCTION uuid_timestamp_lt(uuid, uuid) IS 'lower than';

CREATE OPERATOR <* (
   LEFTARG = uuid,
   RIGHTARG = uuid,
   PROCEDURE = uuid_timestamp_lt,
   COMMUTATOR = '>*',
   NEGATOR = '>=*',
   RESTRICT = scalarltsel,
   JOIN = scalarltjoinsel
);


The function "uuid_timestamp_lt" is basically defined as follows:
1. if not version 1 UUID fallback to standard uuid compare
2. extract timestamp values and compare
3. if equal timestamps fallback to standard uuid compare

...so that a chronological order is established.


The test table is created as follows:

CREATE TABLE uuid_v1_ext (id uuid);
CREATE UNIQUE INDEX idx_uuid_v1_ext ON uuid_v1_ext (id uuid_timestamp_ops);


The values for "histogram_bounds" of the test table look like this (due
to the default sort order for standard type UUID):

3789-97bf-11e9-b6bb-e03f49f7f733
008b88f8-6deb-11e9-901a-e03f4947f477
010a8b22-586a-11e9-8258-e03f49ce78f3
...
6f682e68-978d-11e9-901a-e03f4947f477
6ff412ee-926f-11e9-901a-e03f4947f477
7079ffe2-642f-11e9-b0cc-e03f49e7fd3b
70ffaeca-4645-11e9-adf9-e03f494677fb
...
fef26b41-9b9d-11e9-b0cc-e03f49e7fd3b
ff779ce8-9e52-11e9-8258-e03f49ce78f3
6bfc-4de4-11e9-b0d4-e03f49d6f6bf

...and I think that's what the planner bases its decision on for a query
such as:

DELETE FROM uuid_v1_ext WHERE id <* '4bdf6f81-56ad-11e9-8258-e03f49ce78f3';

...which then gets executed as a sequential scan instead of an index scan.

I was also thinking about changing the selectivity function used by the
custom operator, but I didn't find any hints how to implement that
without duplicating a lot of internal code.



Not sure, I'm not very familiar with this code, so I'd have to play with
it and try things. But that's hard when I don't have any code. Would it
be possible to share a small self-contained test case?

I wonder what does uuid_timestamp_cmp do? I suppose it first compares by
a timestamp extracted from the UUID, right?

It'd be interesting to see

(a) statistics for the column from pg_stats, both for the table and
index (which should have been built using the custom opclass, I think).

(b) EXPLAIN ANALYZE for queries with your opclass, and perhaps with the
default one (that can't use the timestamp condition, but it should be
possible to generate the smallest/largest uuid for a timestamp).
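
For (a), something along these lines would do (assuming the table/index
names from your example):

   SELECT tablename, attname, n_distinct, correlation, histogram_bounds
     FROM pg_stats
    WHERE tablename IN ('uuid_v1_ext', 'idx_uuid_v1_ext')
      AND attname = 'id';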

BTW which PostgreSQL version is this?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Searching in varchar column having 100M records

2019-07-17 Thread Tomas Vondra

On Wed, Jul 17, 2019 at 02:53:20PM +0300, Sergei Kornilov wrote:

Hello

Please recheck with track_io_timing = on in configuration. explain
(analyze,buffers) with this option will report how many time we spend
during i/o


  Buffers: shared hit=2 read=31492


31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD

Your query reads table data from disks (well, or from OS cache). You need
more RAM for shared_buffers or disks with better performance.



Either that, or try creating a covering index, so that the query can do an
index-only scan. That might reduce the amount of IO against the table, and
in the index the data should be located close to each other (same page or
pages close to each other).

So try something like

   CREATE INDEX ios_idx ON table (field, user_id);

and make sure the table is vacuumed often enough (so that the visibility
map is up to date).
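
For example (the table name is just a placeholder, and pg_visibility is an
optional contrib extension):

   VACUUM (ANALYZE) my_table;

   -- check how much of the table is marked all-visible
   CREATE EXTENSION IF NOT EXISTS pg_visibility;
   SELECT * FROM pg_visibility_map_summary('my_table');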


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Searching in varchar column having 100M records

2019-07-18 Thread Tomas Vondra

On Thu, Jul 18, 2019 at 05:21:49PM +0530, mayank rupareliya wrote:

*Please recheck with track_io_timing = on in configuration. explain
(analyze,buffers) with this option will report how many time we spend
during i/o*

*>   Buffers: shared hit=2 read=31492*

*31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD*

*Your query reads table data from disks (well, or from OS cache). You need
more RAM for shared_buffers or disks with better performance.*


Thanks Sergei..
*track_io_timing = on helps.. Following is the result after changing that
config.*

Aggregate  (cost=10075.78..10075.79 rows=1 width=8) (actual
time=63088.198..63088.199 rows=1 loops=1)
 Buffers: shared read=31089
 I/O Timings: read=61334.252
 ->  Bitmap Heap Scan on fields  (cost=72.61..10069.32 rows=2586 width=0)
(actual time=69.509..63021.448 rows=31414 loops=1)
   Recheck Cond: ((field)::text = 'Klein'::text)
   Heap Blocks: exact=30999
   Buffers: shared read=31089
   I/O Timings: read=61334.252
   ->  Bitmap Index Scan on index_field  (cost=0.00..71.96 rows=2586
width=0) (actual time=58.671..58.671 rows=31414 loops=1)
 Index Cond: ((field)::text = 'Klein'::text)
 Buffers: shared read=90
 I/O Timings: read=45.316
Planning Time: 66.435 ms
Execution Time: 63088.774 ms



How did that help? It only gives you insight that it's really the I/O that
takes time. You need to reduce that, somehow.



*So try something like*

*CREATE INDEX ios_idx ON table (field, user_id);*

*and make sure the table is vacuumed often enough (so that the visibility*
*map is up to date).*

Thanks Tomas... I tried this and result improved but not much.



Well, you haven't shown us the execution plan, so it's hard to check why
it did not help much and give you further advice.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Searching in varchar column having 100M records

2019-07-19 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 07:43:26PM +0530, mayank rupareliya wrote:

Well, you haven't shown us the execution plan, so it's hard to check why
it did not help much and give you further advice.


This is the latest query execution with explain after adding indexing on
both columns.

Aggregate  (cost=174173.57..174173.58 rows=1 width=8) (actual
time=65087.657..65087.658 rows=1 loops=1)
 ->  Bitmap Heap Scan on fields  (cost=1382.56..174042.61 rows=52386
width=0) (actual time=160.340..65024.533 rows=31857 loops=1)
   Recheck Cond: ((field)::text = 'Champlin'::text)
   Heap Blocks: exact=31433
   ->  Bitmap Index Scan on index_field  (cost=0.00..1369.46
rows=52386 width=0) (actual time=125.078..125.079 rows=31857 loops=1)
 Index Cond: ((field)::text = 'Champlin'::text)
Planning Time: 8.595 ms
Execution Time: 65093.508 ms



That very clearly does not use the index-only scan, so it's not
surprising it's not any faster. You need to find out why the planner
makes that decision.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Standard uuid vs. custom data type uuid_v1

2019-07-27 Thread Tomas Vondra

On Thu, Jul 25, 2019 at 11:26:23AM +0200, Ancoron Luciferis wrote:

Hi,

I have finally found some time to implement a custom data type optimized
for version 1 UUID's (timestamp, clock sequence, node):
https://github.com/ancoron/pg-uuid-v1

Some tests (using a few millions of rows) have shown the following
results (when used as a primary key):

COPY ... FROM: ~7.8x faster (from file - SSD)
COPY ... TO  : ~1.5x faster (no where clause, sequential output)

The best thing is that for INSERT's there is a very high chance of
hitting the B-Tree "fastpath" because of the timestamp being the most
significant part of the data type, which tends to be increasing.

This also results in much lower "bloat", where the standard "uuid" type
easily goes beyond 30%, the "uuid_v1" should be between 10 and 20%.

Additionally, it also reveals the massive performance degrade I saw in
my tests for standard UUID's:

Initial 200 million rows: ~ 80k rows / second
Additional 17 million rows: ~26k rows / second

...and the new data type:
Initial 200 million rows: ~ 623k rows / second
Additional 17 million rows: ~618k rows / second



Presumably, the new data type is sorted in a way that eliminates/reduces
random I/O against the index. But maybe that's not the case - hard to
say, because the linked results don't say how the data files were
generated ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Query went slow all of sudden. ON V 11.3

2019-10-06 Thread Tomas Vondra

On Fri, Oct 04, 2019 at 07:28:54PM +0530, nikhil raj wrote:

Hi Justin,

It's been executing for 35+ mins; due to the statement timeout it's getting
canceled.



Well, without a query plan it's really hard to give you any advice. We
need to see at least EXPLAIN output (without analyze) to get an idea of
how the query will be executed. Even better, disable the statement
timeout in the session and use EXPLAIN ANALYZE. Of course, it's
unclear how long it'll run.

Earlier you mentioned the query started running fast after you recreated
one of the tables. That likely means the table (or the indexes on it)
are getting bloated over time. Try looking at the sizes of those objects
(and maybe use pgstattuple to get more detailed statistics before
rebuilding it next time).
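
For example, something like this (pgstattuple is a contrib extension, and
the table/index names are placeholders):

   CREATE EXTENSION IF NOT EXISTS pgstattuple;

   -- table-level free space / dead tuple figures
   SELECT * FROM pgstattuple('my_table');

   -- index-level statistics (avg_leaf_density, leaf_fragmentation, ...)
   SELECT * FROM pgstatindex('my_table_pkey');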


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Slow PostgreSQL 10.6 query

2019-10-06 Thread Tomas Vondra

On Tue, Oct 01, 2019 at 11:42:33PM +1000, Behrang Saeedzadeh wrote:

Thanks. That eliminated the bottleneck!

Any ideas why adding ORDER BY to the subquery also changes the plan in a
way that eliminates the bottleneck?



IIRC the ORDER BY clause makes it impossible to "collapse" the subquery
into the main (upper) one, and it probably happens to constrict the
choices so that the planner ends up picking a good plan. I guess adding
"OFFSET 0" to the subquery would have the same effect.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Delete huge Table under XFS

2019-10-06 Thread Tomas Vondra

On Thu, Sep 19, 2019 at 07:00:01PM +0200, Joao Junior wrote:

A table with 800 gb means 800 files of 1 gb. When I use truncate or drop
table, XFS, which is a log-based filesystem, will write lots of data in its
log and this is the problem. The problem is not postgres, it is the way
that xfs works with big files , or being more clear, the way that it
handles lots of files.



I'm a bit skeptical about this explanation. Yes, XFS has journalling,
but only for metadata - and I have a hard time believing deleting 800
files (or a small multiple of that) would write "lots of data" into the
journal, and cause noticeable performance issues. I wonder how you concluded
this is actually the problem.

That being said, TRUNCATE is unlikely to perform better than DROP,
because it also deletes all the files at once. What you might try is
dropping the indexes one by one, and then the table. That should delete
files in smaller chunks.
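
I.e. something like this (index and table names are placeholders):

   DROP INDEX big_table_idx_a;
   DROP INDEX big_table_idx_b;
   -- ... one by one, and only then
   DROP TABLE big_table;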


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-10-06 Thread Tomas Vondra

On Sat, Aug 24, 2019 at 11:40:09AM -0400, Gunther wrote:
Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call 
it "leak" as it doesn't grow during the 30 min or so that this query 
runs. It explodes to 4GB and then stays flat until done.




Well, the memory context stats you've shared however show this:

total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 
used

That's only ~1.3GB, and ~1.1GB of that is the expression context. So
when you say 4GB, when does that happen and can you share stats showing
state at that point?

Yes, and this time the query is super complicated with many joins and 
tables involved. The query plan has 100 lines. Not easy to share for 
reproduction, and I have my issue under control by adding some swap just 
in case. The swap space was never actually used.




Still, without the query plan we can hardly do any guesses about what
might be the issue.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-10-06 Thread Tomas Vondra

On Fri, Aug 23, 2019 at 09:17:38AM -0400, Gunther wrote:
Hi all, I am connecting to a discussion back from April this year. My 
data has grown and now I am running into new out of memory situations. 
Meanwhile the world turned from 11.2 to 11.5 which I just installed 
only to find the same out of memory error.




As Tom already said, this seems like a quite independent issue. Next
time it'd be better to share it in a new thread, not to mix it up with
the old discussion.

Have any of the things discussed and proposed, especially this last 
one by Tomas Vondra, been applied to the 11 releases? Should I try 
these older patches from April?




Unfortunately, no. We're still discussing what would be the right fix
(it's rather tricky and the patches I shared were way too experimental
for that). But I'm pretty sure whatever we end up doing it's going to be
way too invasive for backpatch. I.e. the older branches will likely have
this issue until EOL.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: pg_stat_bgwriter

2019-10-14 Thread Tomas Vondra

On Sun, Oct 13, 2019 at 06:27:35PM -0700, dangal wrote:

Dear all, I would like to share this with you to see what you think about the
statistics of pg_stat_bgwriter

postgres=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+-------------------------------
checkpoints_timed     | 338
checkpoints_req       | 6
checkpoint_write_time | 247061792
checkpoint_sync_time  | 89418
buffers_checkpoint    | 2939561
buffers_clean         | 19872289
maxwritten_clean      | 54876
buffers_backend       | 6015787
buffers_backend_fsync | 0
buffers_alloc         | 710682240
stats_reset           | 2019-10-06 19:25:30.688186-03

postgres=# show bgwriter_delay;
 bgwriter_delay
----------------
 200ms
(1 row)

postgres=# show bgwriter_lru_maxpages;
 bgwriter_lru_maxpages
-----------------------
 100
(1 row)

postgres=# show bgwriter_lru_multiplier;
 bgwriter_lru_multiplier
-------------------------
 2
(1 row)


Do you think I should increase bgwriter_lru_maxpages due to the value of
maxwritten_clean?
Do you think I should increase bgwriter_lru_maxpages,
bgwriter_lru_multiplier, and decrease bgwriter_delay due to the value of
buffers_backend compared to buffers_alloc?
Do you think a modification is necessary?
What values would you recommend?


buffers_alloc does not really matter, here, IMO. You need to compare
buffers_checkpoint, buffers_backend and buffers_clean, and ideally you'd
have (checkpoints > clean > backend). In your case it's quite different:

   buffers_checkpoint | buffers_clean | buffers_backend
  --------------------+---------------+-----------------
              2939561 |      19872289 |         6015787

You could make bgwriter even more aggressive, but that's unlikely to be
a huge improvement. You should investigate why buffers_checkpoint is so
low. This is usually a sign of shared_buffers being too small for the
active set, so perhaps you need to increase shared_buffers, or see which
queries are causing this and optimize them.

Note: FWIW, a single snapshot of pg_stats* may be misleading, because
it's cumulative, so it's not clear how accurately it reflects current
state. Next time take two snapshots and subtract them.
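
For example, a simple way to do that from psql (just a sketch):

   CREATE TEMP TABLE bgw_snap AS
       SELECT now() AS ts, * FROM pg_stat_bgwriter;

   -- ... wait for a representative interval, then:
   SELECT now() - s.ts AS elapsed,
          b.buffers_checkpoint - s.buffers_checkpoint AS buffers_checkpoint,
          b.buffers_clean      - s.buffers_clean      AS buffers_clean,
          b.buffers_backend    - s.buffers_backend    AS buffers_backend
     FROM pg_stat_bgwriter b, bgw_snap s;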

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_stat_bgwriter

2019-10-14 Thread Tomas Vondra
...it's not clear how that correlates to the activity. The fact that you've
removed names of tables and even queries does not really help either.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reading explain plans- row estimates/actuals on lower nodes vs next level up

2019-10-17 Thread Tomas Vondra

On Thu, Oct 17, 2019 at 03:15:21PM -0600, Michael Lewis wrote:

https://explain.depesz.com/s/Caa5

I am looking at this explain analyze output and seeing a nested loop
towards the lowest levels with pretty bad estimate vs actual (2.3k vs 99k),
but the things that feed that nested loop seem like the estimates are
rather close (index scans with 11 estimated vs 30 actual and 3350 vs
3320)... why does the higher node have such a different estimate vs actual
ratio?



This is usually a sign of non-uniform distribution for the join columns,
and/or correlation between sides of the join.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Barman

2019-10-31 Thread Tomas Vondra

On Thu, Oct 31, 2019 at 05:29:34PM +, Daulat Ram wrote:

Hi All,

We have configured postgres 11.2 in streaming replication (primary &
Standby)  on docker and I am looking to initiate the Postgres backup
using barman. As I know there are a few options for taking a backup using
barman.

RSYNC backup
Incremental Backups
Streaming Backup with continuous WAL streaming
Centralized and Catalogued Backups

Which is the best option for backup using barman? So that we can keep
the database safe in case of disaster? I feel the Incremental Backups
are most useful to perform the PITR but I want to know the experts
suggestions.



You're mixing a number of topics, here. Firstly, all backups done by
barman are centralized and catalogued, that's pretty much one of the
main purposes of barman.

When it comes to backup methods, there are two basic methods. rsync and
postgres (which means pg_basebackup). This is about creating the initial
base backup. Both methods then can replicate WAL by either streaming or
archive_command.

So first you need to decide whether to use rsync or pg_basebackup,
where rsync allows advanced features like incremental backup, parallel
backup and deduplication.

Then you need to decide whether to use archive_command or streaming
(i.e. pg_receivexlog).

The "right" backup method very much depends on the size of your
database, activity, and so on. By default you should probably go with
the default option, described as "scenario 1" in the barman docs, i.e.
pg_basebackup (backup_method = postgres) and WAL streaming.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: FPGA optimization ...

2019-11-04 Thread Tomas Vondra

On Mon, Nov 04, 2019 at 06:33:15PM -0500, Gunther wrote:

The time has come.

FPGA optimization is in the palm of our hands (literally a 2 TB 40 
GB/s IO PostgreSQL server fits into less than a shoe box), and on 
Amazon AWS F1 instances.


Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.

But a lot more could be done. How about linear sort performance at 
O(N)? https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. 
And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, 
the following four sorting accelerators are used:


* Tuple Sorter : Sorts an N-tuple using a sorting network.
* Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples
  in on-chip memory.
* Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
* Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in
  flash.

Order of magnitude speed improvements? Better than Hadoop clusters on 
a single chip? 40 GB/s I/O throughput massive full table scan, blazing 
fast sort-merge joins? Here it is. Anybody working more on that? 
Should be an ideal project for a student or a group of students.




For the record, this is not exactly a new thing. Netezza (a PostgreSQL fork
started in 1999, later acquired by IBM) used FPGAs. Now there's swarm64 [1],
another PostgreSQL fork, also using FPGAs with newer PostgreSQL releases.

Those are proprietary forks, though. The main reason why the community
itself is not working on this directly (at least not on pgsql-hackers)
is exactly that it requires specialized hardware, which the devs
probably don't have, making development impossible, and the regular
customers are not asking for it either (one of the reasons being limited
availability of such hardware, especially for customers running in the
cloud and not being even able to deploy custom appliances).

I don't think this will change, unless the access to systems with FPGAs
becomes much easier (e.g. if AWS introduces such instance type).

Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue 
University who might take this on. Interest?




I don't think there's any such non-profit, managing/funding development.
At least I'm not aware of it. There are various non-profits around the
world, but those are organizing events and local communities.

I'd say the best way to do something like this is to either talk to one
of the companies participating in PostgreSQL devopment (pgsql-hackers is
probably a good starting point), or - if you absolutely need to go
through a non-profit - approach a university (which does not mean people
from pgsql-hackers can't be involved, of course). I've been involved in
a couple of such research projects in Europe, not sure what exactly is
the situation/rules in US.

regards

[1] https://swarm64.com/netezza-replacement/

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: FPGA optimization ...

2019-11-06 Thread Tomas Vondra

On Wed, Nov 06, 2019 at 11:01:37AM -0700, AJG wrote:

From what I have read and benchmarks seen..

FPGA shines for writes (and up to 3x (as opposed to 10x claim) real world
for queries from memory)

GPU shines/outperforms FPGA for reads. There is a very recent and
interesting academic paper[1] on High Performance GPU B-Tree (vs lsm) and
the incredible performance it gets, but I 'think' it requires NVIDIA (so no
easy/super epyc+gpu+hbm on-chip combo solution then ;) ).

Aren't both FPGA and GPU going to require changes to the executor from pull to
push to get real benefits from them? Isn't that something Andres is working on
(pull to push)?



I think it very much depends on how the FPGA/GPU/... is used.

If we're only talking about FPGA I/O acceleration, essentially FPGA
between the database and storage, it's likely possible to get that
working without any extensive executor changes. Essentially create an
FPGA-aware variant of SeqScan and you're done. Or an FPGA-aware
tuplesort, or something like that. Neither of this should require
significant planner/executor changes, except for costing.


What really is exciting is UPMEM (little 500MHz processors on the memory);
the cost will be little more than the memory cost itself, and it shows up to 20x
performance improvement on things like index search (from memory). It's a C
library which they claim only needs a few hundred lines of code to integrate
(from memory), but it's not clear to me what use cases it can be used for other
than the ones they show benchmarks for.



Interesting, and perhaps interesting for in-memory databases.



[1] https://escholarship.org/content/qt1ph2x5td/qt1ph2x5td.pdf?t=pkvkdm


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: FPGA optimization ...

2019-11-06 Thread Tomas Vondra

On Wed, Nov 06, 2019 at 03:15:53PM -0800, Andres Freund wrote:

Hi,

On 2019-11-06 22:54:48 +0100, Tomas Vondra wrote:

If we're only talking about FPGA I/O acceleration, essentially FPGA
between the database and storage, it's likely possible to get that
working without any extensive executor changes. Essentially create an
FPGA-aware variant of SeqScan and you're done. Or an FPGA-aware
tuplesort, or something like that. Neither of this should require
significant planner/executor changes, except for costing.


I doubt that that is true.  For one, you either need to teach the FPGA
to understand at least enough about the intricacies of postgres storage
format, to be able to make enough sense of visibility information to
know when it safe to look at a tuple (you can't evaluate qual's before
visibility information). It also needs to be fed a lot of information
about the layout of the table, involved operators etc.  And even if you
define those away somehow, you still need to make sure that the on-disk
state is coherent with the in-memory state - which definitely requires
reaching outside of just a replacement seqscan node.



That's true, of course - the new node would have to know a lot of
details about the on-disk format, meaning of operators, etc. Not
trivial, that's for sure. (I think PGStrom does this)

What I had in mind were extensive changes to how the executor works in
general, because the OP mentioned changing the executor from pull to
push, or abandoning the iterative executor design. And I think that
would not be necessary ...


I've a hard time believing that, even though some storage vendors are
pushing this model heavily, the approach of performing qual evaluation
on the storage level is actually useful for anything close to a general
purpose database, especially a row store.



I agree with this too - it's unlikely to be a huge win for "regular"
workloads, it's usually aimed at (some) analytical workloads.

And yes, row store is not the most efficient format for this type of
accelerators (I don't have much experience with FPGA, but for GPUs it's
very inefficient).


It's more realistic to have a model where the fpga is fed pre-processed
data, and it streams out the processed results. That way there are no
problems with coherency, one can transparently handle parts of
reading the data that the FPGA can't, etc.



Well, the whole idea is that the FPGA does a lot of "simple" filtering
before the data even get into RAM / CPU, etc. So I don't think this
model would perform well - I assume the "processing" necessary could
easily be more expensive than the gains.



But I admit I'm sceptical even the above model is relevant for
postgres. The potential market seems likely to stay small, and there's
so much more performance work that's applicable to everyone using PG,
even without access to special purpose hardware.



Not sure. It certainly is irrelevant for everyone who does not have
access to systems with FPGAs, and useful only for some workloads. How
large the market is, I don't know.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel Query

2019-11-13 Thread Tomas Vondra

On Wed, Nov 13, 2019 at 05:16:44PM -0300, Luís Roberto Weck wrote:

Hi!

Is there a reason query 3 can't use parallel workers? Using q1 and q2 
they seem very similar but can use up to 4 workers to run faster:


q1: https://pastebin.com/ufkbSmfB
q2: https://pastebin.com/Yt32zRNX
q3: https://pastebin.com/dqh7yKPb

The sort node on q3 takes almost 12 seconds, making the query run in 
68 seconds, even if I had set enough work_mem to make it all in memory.




Most likely because it'd be actually slower. The trouble is the
aggregation does not actually reduce the cardinality, or at least the
planner does not expect that - the Sort and GroupAggregate are expected
to produce 3454539 rows. The last step of the aggregation has to receive
and merge data from all workers, which is not exactly free, and if there
is no reduction of cardinality it's likely cheaper to just do everything
in a single process serially.

How does the explain analyze output look like without the HAVING clause?

Try setting parallel_setup_cost and parallel_tuple_cost to 0. That might
trigger parallel query.
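
For example, in the session running the query:

   SET parallel_setup_cost = 0;
   SET parallel_tuple_cost = 0;
   -- then re-run q3 under EXPLAIN (ANALYZE, BUFFERS) and compare
   -- "Workers Planned/Launched" against the previous runs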

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of memory error on automatic vacuum

2019-11-18 Thread Tomas Vondra

On Mon, Nov 18, 2019 at 12:41:27PM +0100, Ekaterina Amez wrote:

Hi all,

This morning I was checking postgres servers logs, looking for errors 
(we've recently upgraded them and changed default config) and long 
running queries when I found one of the servers had really big logs 
since yesterday.  It was giving the error of this mail's subject: out 
of memory, failed on request of size XXX on automatic vacuum of table 
YYY. A quick search revealed me some postgresql-lists messages talking 
about work_mem and shared_buffers configuration options, some kernel 
config options too. Although all of them were messages from several 
years ago, I decided to cut my shared_buffers configured value and 
restart server: now it looks like error is gone. But I'd like to 
understand what's beyond the logged error (it's really long and refers 
to things about inner functionalities that I'm missing), how to detect 
what config options are possibly conflicting and, most important, I 
want to know if I've solved it right.




Unfortunately that's hard to say, without further data. The "out of
memory" errors simply mean we called malloc() and it returned NULL,
because the kernel was unable to provide the memory.

There probably were other processes using all the available RAM (the
limit depends on various config values, e.g. overcommit). What were
these processes doing we don't know :-(

For example, there might be multiple complex queries, allocating
several work_mem each, using quite a bit of memory. Or there might be a
runaway query doing HashAgg allocating much more memory than predicted.
Or maybe there was running a completely separate job (say, backup)
allocating a lot of memory or dirtying data in page cache.

There are countless options what might have happened. The memory context
stats are nice, but it's just a snapshot from one particular process,
and it does not seem very interesting (the total is just ~1MB, so
nothing extreme). We still don't know what else was running.

Lowering shared_buffers certainly does reduce the memory pressure in
general, i.e. there is 1GB more RAM for use by processes. It may be
sufficient, hard to guess.

I don't know if work_mem 64MB is too low, because it depends on what
queries you're running etc. But you probably don't want to increase that,
as it allows processes to use more memory when executing queries, i.e.
it increases memory pressure and makes OOM more likely.

So you need to watch system monitoring, see how much memory is being
used (excluding page cache) and consider reducing work_mem and/or
max_connections if it's too close.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Out of memory error on automatic vacuum

2019-11-18 Thread Tomas Vondra

On Mon, Nov 18, 2019 at 03:02:16PM +0100, Ekaterina Amez wrote:


El 18/11/19 a las 13:25, Tomas Vondra escribió:

On Mon, Nov 18, 2019 at 12:41:27PM +0100, Ekaterina Amez wrote:

Hi all,

This morning I was checking postgres servers logs, looking for 
errors (we've recently upgraded them and changed default config) 
and long running queries when I found one of the servers had 
really big logs since yesterday.  It was giving the error of this 
mail's subject: out of memory, failed on request of size XXX on 
automatic vacuum of table YYY. A quick search revealed me some 
postgresql-lists messages talking about work_mem and 
shared_buffers configuration options, some kernel config options 
too. Although all of them were messages from several years ago, I 
decided to cut my shared_buffers configured value and restart 
server: now it looks like error is gone. But I'd like to 
understand what's beyond the logged error (it's really long and 
refers to things about inner functionalities that I'm missing), 
how to detect what config options are possibly conflicting and, 
most important, I want to know if I've solved it right.




Unfortunately that's hard to say, without further data. The "out of
memory" errors simply mean we called malloc() and it returned NULL,
because the kernel was unable to provide the memory.
Was this (kernel unable to provide memory) because no more RAM was 
available to allocate? Was it because the PG process did not have more 
memory assigned ready to use? Or is it something unknown because it 
depends on the situation where the error is thrown?


Not sure I understand. Whenever PostgreSQL process needs memory it
requests it from the kernel by calling malloc(), and the amount of
available RAM is limited.  So when the kernel can't provide more memory,
it returns NULL.



There probably were other processes using all the available RAM (the
limit depends on various config values, e.g. overcommit). What were
these processes doing we don't know :-(

For example, there might be multiple complex queries, allocating
several work_mem each, using quite a bit of memory. Or there might be a
runaway query doing HashAgg allocating much more memory than predicted.
Or maybe there was running a completely separate job (say, backup)
allocating a lot of memory or dirtying data in page cache.


I've looked at cron and I've seen a scheduled process that finished a 
bit before the error began to log (a couple of minutes or so). Errors 
began on Sunday morning and this machine doesn't have much workload on 
work days, and less on weekend. I'll keep an eye on this log and if 
the problem appears again I'll try to track database activity and 
machine activity.




If it finished a couple of minutes before, it's unlikely to be
related. But hard to say, without knowing the details.



There are countless options what might have happened. The memory context
stats are nice, but it's just a snapshot from one particular process,
and it does not seem very interesting (the total is just ~1MB, so
nothing extreme). We still don't know what else was running.


When you talk about ~1MB are you getting this size from log lines like this?

: *1024* total in 1 blocks; 476 free (0 chunks); 548 used



No, that's just one of the memory contexts (they form a tree), using
only 1kB of memory. What matters is the "grand total"

Grand total: 1009356 bytes in 130 blocks; 436888 free (72 chunks);
572468 used

which is ~1MB.





Lowering shared_buffers certainly does reduce the memory pressure in
general, i.e. there is 1GB more RAM for use by processes. It may be
sufficient, hard to guess.

I don't know if work_mem 64MB is too low, because it depends on what
queries you're running etc. But you probably don't want to increase that,
as it allows processes to use more memory when executing queries, i.e.
it increases memory pressure and makes OOM more likely.

So you need to watch system monitoring, see how much memory is being
used (excluding page cache) and consider reducing work_mem and/or
max_connections if it's too close.


I'll do, thanks for your suggestions.



Another thing you might do is adding a swap (if you don't have one
already), as a safety.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Out of memory error on automatic vacuum

2019-11-18 Thread Tomas Vondra

On Mon, Nov 18, 2019 at 03:46:03PM +0100, Ekaterina Amez wrote:


El 18/11/19 a las 15:16, Tomas Vondra escribió:


Not sure I understand. Whenever PostgreSQL process needs memory it
requests it from the kernel by calling malloc(), and the amount of
availabe RAM is limited.  So when kernel can't provide more memory,
it returns NULL.


Understood.



If it finished a couple of minutes before, it's unlikely to be the
related. But hard to say, without knowing the details.


Yeah, I thought the same, but to me it's too much of a coincidence and is 
suspicious (or at least a thing to keep in mind).




No, that's just one of the memory contexts (they form a tree), using
only 1kB of memory. What matters is the "grand total"

Grand total: 1009356 bytes in 130 blocks; 436888 free (72 chunks);
572468 used

which is ~1MB.

OK, in my lack of knowledge I was understanding "memory context" as 
the whole message.




Another thing you might do is adding a swap (if you don't have one
already), as a safety.


Excuse my ignorance but... swap? You mean some mechanism that prevents the 
server from being unavailable by having a second instance running but not 
accessible, and switching from one to the other when the main one fails? 
(It's the best way I can find to describe it, as I don't usually 
speak/write English).




swap = space on disk, so that OS can page out unused data from RAM when
there's memory pressure

It's basic sysadmin knowledge, I think. Search for mkswap.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Postgresql planning time too high

2019-11-22 Thread Tomas Vondra

On Fri, Nov 22, 2019 at 11:44:51AM +, Sterpu Victor wrote:

No rows should be returned, DB is empty.
I'm testing now on a empty DB trying to find out how to improve this.



I'm a bit puzzled why you're doing tests on an empty database, when in
production it'll certainly contain data. I guess you're assuming that
this way you isolate planning time, which should remain about the same
even with data loaded, but I'm not entirely sure that's true - all this
planning is done with no statistics (histograms, MCV lists, ...) and
maybe it's forcing the planner to do more work? I wouldn't be surprised
if having those stats would allow the planner to take some shortcuts,
cutting the plannnig time down.

Not to mention that we don't know if the plan is actually any good, for
all what we know it might take 10 years on real data, making the
planning duration irrelevant.


Let's put that aside, though. Let's assume it's because of expensive
join order planning. I don't think you have a lot of options, here,
unfortunately.

One option is to try reducing the planner options that determine how
much effort should be spent on join planning, e.g. join_collapse_limit
and geqo_threshold. If this is the root cause, you might even rewrite
the query to use optimal join order and set join_collapse_limit=1.
You'll have to play with it.
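
For example (these values are just a starting point for experimentation):

   SET join_collapse_limit = 1;   -- keep the join order as written in the query
   SET geqo_threshold = 8;        -- let GEQO take over for smaller join problems
   -- then re-run EXPLAIN and compare the planning time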

The other option is using CTEs with materialization, with the same
effect, i.e. prevention of optimization across CTEs, reducing the
total effort.
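
A sketch of the CTE variant, using the t1/t2 names from the query quoted
below (the column names are assumed for illustration; on PostgreSQL 12+ the
MATERIALIZED keyword is needed, older releases always materialize CTEs):

   WITH b AS MATERIALIZED (
       SELECT id_t1, valid_from
         FROM t2
   )
   SELECT t1.id, b.valid_from
     FROM t1
     LEFT JOIN b ON (b.id_t1 = t1.id);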


In this query I have 3 joins like this:

SELECT t1.id, t2.valid_from
FROM t1
JOIN t2 ON (t1.id_t1 = t1.id)
LEFT JOIN t3 ON (t3.id_t1 = t1.id AND t3.valid_from < ...)

If I delete these 3 joins then the planning time goes down from 5.482 
ms to 754.708 ms, but I'm not sure why this context is so demanding on 
the planner.
I'm trying now to make a materialized view that will allow me to stop 
using the syntax above.


I reattached the same files, they should be fine like this.



It'd be useful to have something others can use to reproduce the issue,
and investigate locally. SQL script that creates the whole schema and
runs the query, for example.

What I'd like to see is a perf profile from the planning, so that we can
see where exactly is the bottleneck. Maybe there actually is a bug that
makes it muych more expensive than it should be, in some corner case?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Seeking reason behind performance gain in 12 with HashAggregate

2020-01-13 Thread Tomas Vondra

On Mon, Jan 13, 2020 at 12:44:14PM -0800, Shira Bezalel wrote:

Hi Michael,

I appreciate your question. I ran a vacuum analyze on the 9.6 table and it
yielded no difference. Same number of buffers were read, same query plan.



VACUUM ANALYZE won't shrink the table - the number of buffers will be
exactly the same. You need to do VACUUM FULL, but be careful as that
acquires exclusive lock on the table.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Bad query plan when you add many OR conditions

2020-01-13 Thread Tomas Vondra

On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:

@Justin Pryzby I have tried this as you suggested:

CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
project_id, tags FROM subscriptions;
VACUUM ANALYZE subscriptions;

Unfortunately nothing changes and Postgresql continues to use the wrong
plan (maybe stats don't work well on array fields like tags??).



We don't support this type of clause for extended statistics (yet).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: DB running out of memory issues after upgrade

2020-02-18 Thread Tomas Vondra

On Tue, Feb 18, 2020 at 05:46:28PM +, Nagaraj Raj wrote:

after upgrading Postgres to v9.6.11 from v9.6.9 the DB is running into out-of-memory issues,
and no workload has changed before and after the upgrade.

spec: RAM 16gb,4vCore
Any bug reported like this or suggestions on how to fix this issue? I 
appreciate the response..!! 



This bug report (in fact, we don't know if it's a bug, but OK) is
woefully incomplete :-(

The server log is mostly useless, unfortunately - it just says a bunch
of processes were killed (by OOM killer, most likely) so the server has
to restart. It tells us nothing about why the backends consumed so much
memory etc.

What would help us is knowing how much memory was the backend (killed by
OOM) consuming, which should be in dmesg.

And then MemoryContextStats output - you need to connect to a backend
consuming a lot of memory using gdb (before it gets killed) and do

 (gdb) p MemoryContextStats(TopMemoryContext)
 (gdb) q

and show us the output printed into server log. If it's a backend
running a query, it'd help knowing the execution plan.

It would also help knowing the non-default configuration, i.e. stuff
tweaked in postgresql.conf.
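
For example, something like this should list everything changed from the
built-in defaults:

    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE source NOT IN ('default', 'override')
     ORDER BY name;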

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: increased max_parallel_workers_per_gather results in fewer workers?

2020-06-04 Thread Tomas Vondra

On Wed, Jun 03, 2020 at 06:23:57PM -0400, Philip Semanchuk wrote:


...

I then ran the EXPLAIN ANALYZE and got the same slow runtime (1473s) and 1 
worker in the EXPLAIN ANALYZE output.



I guess you should show an explain analyze, specifically "Workers
Planned/Launched", maybe by linking to explain.depesz.com


Out of an abundance of caution, our company has a policy of not pasting our plans 
to public servers. However, I can confirm that when I set 
max_parallel_workers_per_gather > 4 and the runtime increases, this is what’s 
in the EXPLAIN ANALYZE output:

Workers Planned: 1
Workers Launched: 1

FWIW, the Planning Time reported in EXPLAIN ANALYZE output doesn’t vary 
significantly, only from 411-443ms, and the variation within that range 
correlates only very weakly with max_parallel_workers_per_gather.



Well, that policy is stupid and it makes it unnecessarily harder to
answer your questions. We really need to see the plans, it's much harder
to give you any advice without them. We can only speculate about what's
going on.

It's understandable there may be sensitive information in the plan
(parameter values, ...) but that can be sanitized before posting.

We need to see plans for the good and bad case, so that we can compare
them, look at the plan general shapes, costs, etc.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Query performance issue

2020-09-04 Thread Tomas Vondra

On Fri, Sep 04, 2020 at 09:18:41PM +, Nagaraj Raj wrote:

I have a query which runs on the DB quite often and is very slow, and it is doing 
a 'seqscan'. I was trying to optimize it by adding indexes in different ways but 
nothing helps.
Any suggestions?



1) It's rather difficult to read the query plan as it's mangled by your
e-mail client. I recommend to check how to prevent the client from doing
that, or attaching the plan as a file.

2) The whole query takes ~3500ms, and the seqscan only accounts for
~200ms, so it's very clearly not the main issue.

3) Most of the time is spent in the sort, so the one thing you can do is
either increase work_mem, or add an index providing that ordering.
Even better if you include all necessary columns to allow an index-only scan (IOS).
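
A rough sketch of both options - the table and column names are made up,
as I don't know your schema (INCLUDE needs PG 11+, on older releases just
append the columns to the index key):

    -- option 1: give the sort more memory, just for this session
    SET work_mem = '256MB';

    -- option 2: an index matching the ORDER BY, with the remaining output
    -- columns included so the query can use an index-only scan
    CREATE INDEX ON some_table (sort_col_1, sort_col_2)
        INCLUDE (other_col_1, other_col_2);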


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Performance Issue (Not using Index when joining two tables).

2020-09-13 Thread Tomas Vondra

On Sun, Sep 13, 2020 at 02:58:15PM +, Gopisetty, Ramesh wrote:

Hi,

Good Morning!

Postgres Version :  11.6  (AWS Native Postgres/AWS Aurora  tried on both 
flavours).

When I'm joining two tables the primary index is not being used, while if I use an IN
clause with values then the index is being used. I have reindexed all the
tables and run autovacuum as well.


pgwfc01q=> select count(*) from chr_simple_val;
count
---
13158
(1 row)

pgwfc01q=> select count(*) from chr_emp_position;
count
---
  228
(1 row)


The primary key for the table chr_Simple_val  contains OID.   Still not using 
the index.

I'm sharing the explain plan over here..

pgwfc01q=> explain analyze select 
cep.HOME_DEPT_OID,ctc.oid,ctc.category,ctc.code from chr_emp_position cep inner 
join chr_Simple_Val ctc on ctc.oid=cep.HOME_DEPT_OID;

QUERY P
LAN


Hash Join  (cost=49299.91..51848.83 rows=651 width=42) (actual 
time=3512.692..3797.583 rows=228 loops=1)
  Hash Cond: ((cep.home_dept_oid)::text = (ctc.oid)::text)
  ->  Seq Scan on chr_emp_position cep  (cost=0.00..2437.77 rows=436 width=11) 
(actual time=44.713..329.435 rows=22
8 loops=1)
Filter: ((("current_user"())::text <> ANY 
('{wfnadmin,skipvpd}'::text[])) AND f_sel_policy_all(vpd_key, 'CH
R_EMP_POSITION'::character varying) AND 
f_sel_policy_prod_locale((cep.*)::character varying, prod_locale_code))
Rows Removed by Filter: 3695
  ->  Hash  (cost=49176.40..49176.40 rows=9881 width=31) (actual 
time=3467.907..3467.908 rows=13158 loops=1)
Buckets: 16384  Batches: 1  Memory Usage: 1031kB
->  Seq Scan on chr_simple_val ctc  (cost=0.00..49176.40 rows=9881 
width=31) (actual time=2.191..3460.929 r
ows=13158 loops=1)
  Filter: ((("current_user"())::text <> ANY 
('{wfnadmin,skipvpd}'::text[])) AND f_sel_policy_ty_static(
vpd_key) AND f_sel_policy_prod_locale((ctc.*)::character varying, 
prod_locale_code))
  Rows Removed by Filter: 75771
Planning Time: 0.297 ms
Execution Time: 3797.768 ms
(12 rows)



Most of the time (3460ms) is spent in the sequential scan on
chr_simple_val, and the seqscan on chr_emp_position takes another ~330ms.
Combined that's 3790ms out of 3797ms, so the join is pretty much
irrelevant.

Either the seqscans are causing a lot of I/O, or maybe the f_sel_*
functions in the filter are expensive. Judging by how few rows are in
the tables (not sure how large the tables are), I'd guess it's the
latter ... Hard to say without knowing what the functions do etc.
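
If you want to check the function theory, you can measure the functions
directly - a sketch, assuming you have the privileges to change
track_functions:

    -- track per-function execution stats for this session
    SET track_functions = 'all';

    -- ... run the query ...

    -- then see how much time was spent in each function
    SELECT funcname, calls, total_time, self_time
      FROM pg_stat_user_functions
     ORDER BY total_time DESC;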

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Understanding bad estimate (related to FKs?)

2020-10-29 Thread Tomas Vondra

On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:




On Oct 28, 2020, at 9:13 PM, Justin Pryzby 
wrote:

On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:

On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
 wrote:


The item I'm focused on is node 23. The estimate is for 7 rows,
actual

is 896 (multiplied by 1062 loops). I'm confused about two things in
this node.


The first is Postgres' estimate. The condition for this index
scan

contains three expressions --


(five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
juliet_india.whiskey_mike) AND (bravo = 'mike'::text)




Are the columns correlated?


I guess it shouldn't matter, since the FKs should remove all but one
of the conditions.


Yes, I had the same expectation. I thought Postgres would calculate the
selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
of ‘mike’, but since the frequency estimate is very accurate but the
planner’s estimate is not, there’s something else going on.



Well, this is quite a bit more complicated, I'm afraid :-( The clauses
include parameters passed from the nodes above the index scan. So even
if we had extended stats on the table, we couldn't use them as that
requires (Var op Const) conditions. So this likely ends up with a
product of estimates for each clause, and even then we can't use any
particular value, so we probably end up with something like 1/ndistinct
per clause. So if the values actually passed to the index
scan are more common and/or if the columns are somehow correlated, it's
not surprising we end up with an overestimate.
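
You can at least check what the per-column statistics look like, e.g.
something like this (the column names are taken from the clauses above):

    SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE attname IN ('five_uniform', 'whiskey_mike', 'bravo');

That won't fix anything, but it tells you what the planner has to work with.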


Maybe you saw this other thread, which I tentatively think also
affects your case (equijoin with nonjoin condition)
https://www.postgresql.org/message-id/AM6PR02MB5287A0ADD936C1FA80973E72AB190%40AM6PR02MB5287.eurprd02.prod.outlook.com


Yes, thank you, I read that thread with interest. I tried your clever
trick using BETWEEN, but it didn’t change the plan. Does that suggest
there’s some other cause for the planner’s poor estimate?



I don't think that's related - to hit that bug, there would have to be
implied conditions pushed-down to the scan level. And there's nothing
like that in this case.

FWIW I don't think this has anything to do with join cardinality
estimation - at least not for the node 23.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Understanding bad estimate (related to FKs?)

2020-11-02 Thread Tomas Vondra

On Mon, Nov 02, 2020 at 03:08:12PM -0500, Tom Lane wrote:

Philip Semanchuk  writes:

The query I asked about in the original post of this thread has 13 relations in 
it. IIUC, that's 13! or > 6 billion possible plans. How did the planner pick 
one plan out of 6 billion? I'm curious, both for practical purposes (I want my 
query to run well) and also because it's fascinating.


The twenty-thousand-foot overview is

https://www.postgresql.org/docs/devel/planner-optimizer.html

and then ten-thousand-foot level is the planner README file,

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/backend/optimizer/README;hb=HEAD

and then you pretty much gotta start reading code.  You could also dig
into various planner expository talks that people have given at PG
conferences.  I don't have links at hand, but there have been several.



Yeah. The jump from high-level overviews to reading source code is a bit
brutal, though ...


FWIW a short list of relevant talks I'm aware of & would recommend:

* Explaining the Postgres Query Optimizer [Bruce Momjian]
  https://www.postgresql.org/files/developer/tour.pdf

* Intro to Postgres Planner Hacking [Melanie Plageman]
  https://www.pgcon.org/2019/schedule/events/1379.en.html

* Learning to Hack on Postgres Planner [Melanie Plageman]
  
https://www.pgcon.org/2019/schedule/attachments/540_debugging_planner_pgcon2019_v4.pdf

* What’s in a Plan? [Robert Haas]
  
https://www.postgresql.eu/events/pgconfeu2019/schedule/session/2741-whats-in-a-plan/
  
* A Tour of PostgreSQL Internals [Tom Lane]

  https://www.postgresql.org/files/developer/tour.pdf

* Inside the PostgreSQL Query Optimizer [Neil Conway]
  http://www.neilconway.org/talks/optimizer/optimizer.pdf

Some are a bit dated, but the overall principles don't change much.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Autovacuum not functioning for large tables but it is working for few other small tables.

2020-12-16 Thread Tomas Vondra

On 12/16/20 12:55 PM, M Tarkeshwar Rao wrote:

Hi all,

We have facing some discrepancy in Postgresql database related to the 
autovacuum functionality.


By default autovacuum was enabled on Postgres, which is used to remove the 
dead tuples from the database.


We have observed autovacuum cleaning dead rows from *table_A*, but the same 
was not functioning correctly for *table_B*, which has a large 
size (100+GB) in comparison to table_A.


All the threshold level requirements for autovacuum were met and there 
are millions of dead tuples, but autovacuum was unable to clear 
them, which causes a performance issue on the production server.


Is autovacuum not working against large tables, or are there any 
parameters which need to be set to make autovacuum function?




No, autovacuum should work for tables with any size. The most likely 
explanation is that the rows in the large table were deleted more 
recently and there is a long-running transaction blocking the cleanup. 
Or maybe not, hard to say with the info you provided.


A couple suggestions:

1) enable logging for autovacuum by setting

   log_autovacuum_min_duration = 10ms (or similar low value)

2) check that the autovacuum is actually executed on the large table 
(there's last_autovacuum in pg_stat_all_tables - see the queries below)


3) try running VACUUM VERBOSE on the large table, it may tell you that 
the rows can't be cleaned up yet.
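
For (2), and for checking the long-running transaction theory, something
like these queries may help (the table name is hypothetical):

    -- when was the table last (auto)vacuumed, and how many dead tuples
    SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
      FROM pg_stat_all_tables
     WHERE relname = 'table_b';

    -- long-running transactions possibly blocking the cleanup
    SELECT pid, state, xact_start, query
      FROM pg_stat_activity
     WHERE xact_start IS NOT NULL
     ORDER BY xact_start
     LIMIT 10;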



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Query performance issue

2021-02-14 Thread Tomas Vondra




On 1/22/21 3:35 AM, Justin Pryzby wrote:

On Fri, Jan 22, 2021 at 01:53:26AM +, Nagaraj Raj wrote:

Tables DDL are attached in dbfiddle -- Postgres 11 | db<>fiddle
Server configuration is: Version: 10.11, RAM - 320GB, vCPU - 32,
"maintenance_work_mem" 256MB, "work_mem" 1GB, "shared_buffers" 64GB



Aggregate  (cost=31.54..31.55 rows=1 width=8) (actual time=0.010..0.012 rows=1 
loops=1)
   ->  Nested Loop  (cost=0.00..31.54 rows=1 width=8) (actual time=0.007..0.008 
rows=0 loops=1)
 Join Filter: (a.household_entity_proxy_id = 
c.household_entity_proxy_id)
 ->  Nested Loop  (cost=0.00..21.36 rows=1 width=16) (actual 
time=0.006..0.007 rows=0 loops=1)
   Join Filter: (a.individual_entity_proxy_id = 
b.individual_entity_proxy_id)
   ->  Seq Scan on prospect a  (cost=0.00..10.82 rows=1 width=16) 
(actual time=0.006..0.006 rows=0 loops=1)
 Filter: (((last_contacted_anychannel_dttm IS NULL) OR 
(last_contacted_anychannel_dttm < '2020-11-23 00:00:00'::timestamp without time 
zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND (profane_wrd_ind = 
'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 
'Y'::bpchar) AND (has_last_name = 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))
   ->  Seq Scan on individual_demographic b  (cost=0.00..10.53 
rows=1 width=8) (never executed)
 Filter: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND 
(govt_prison_ind = 'N'::bpchar) AND ((cstmr_prspct_ind)::text = 
'Prospect'::text))
 ->  Seq Scan on household_demographic c  (cost=0.00..10.14 rows=3 
width=8) (never executed)
   Filter: (((hspnc_lang_prfrnc_cval)::text = ANY 
('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS NULL))
Planning Time: 1.384 ms
Execution Time: 0.206 ms
13 rows


It's doing nested loops with estimated rowcount=1, which indicates a bad
underestimate, and suggests that the conditions are redundant or correlated.



No, it's not. The dbfiddle does that because it's using empty tables, 
but the plan shared by Nagaraj does not contain any nested loops.


Nagaraj, if the EXPLAIN ANALYZE does not complete, there are two things 
you can do to determine which part of the plan is causing trouble.


Firstly, you can profile the backend using perf or some other profiler, 
and if we're lucky the function names will give us some hints about which 
node type is using the CPU.


Secondly, you can "cut" the query into smaller parts, to run only parts 
of the plan - essentially start from inner-most join, and incrementally 
add more and more tables until it gets too long.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Disabling options lowers the estimated cost of a query

2021-04-15 Thread Tomas Vondra

On 2/26/21 4:00 AM, Tom Lane wrote:
> Arne Roland  writes:
>> I want to examine the exhaustive search and not the geqo here. I'd
>> expect the exhaustive search to give the plan with the lowest cost,
>> but apparently it doesn't. I have found a few dozen different
>> querys where that isn't the case. I attached one straight forward
>> example. For the join of two partitions a row first approach would
>> have been reasonable.
> 
> Hmm.  While the search should be exhaustive, there are pretty
> aggressive pruning heuristics (mostly in and around add_path()) that
> can cause us to drop paths that don't seem to be enough better than
> other alternatives. I suspect that the seqscan plan may have beaten
> out the other one at some earlier stage that didn't think that the
> startup-cost advantage was sufficient reason to keep it.
> 
> It's also possible that you've found a bug.  I notice that both plans
> are using incremental sort, which has been, um, rather buggy. Hard to
> tell without a concrete test case to poke at.
> 

Well, it's true incremental sort was not exactly bug free. But I very
much doubt it's causing this issue, for several reasons:

(a) It's trivial to simplify the reproducer further, so that there are
no incremental sort nodes. See the attached script, which has just a
single partition.

(b) The incremental sort patch does not really tweak the costing or
add_path in ways that would break this.

(c) PostgreSQL 12 has the same issue.


It seems the whole problem is in generate_orderedappend_paths(), which
simply considers two cases - paths with minimal startup cost and paths
with minimal total costs. But with LIMIT that does not work, of course.

With the simplified reproducer, I get these two plans:


   QUERY PLAN
  ---
   Limit  (cost=9748.11..10228.11 rows=1 width=8)
 ->  Merge Left Join (cost=9748.11..14548.11 rows=10 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan Backward using a0_pkey on a0 a
   (cost=0.29..3050.29 rows=10 width=8)
   ->  Sort  (cost=9747.82..9997.82 rows=10 width=8)
 Sort Key: b.id DESC
 ->  Seq Scan on b0 b
 (cost=0.00..1443.00 rows=10 width=8)
  (7 rows)

   QUERY PLAN
  ---
   Limit  (cost=0.58..3793.16 rows=1 width=8)
 ->  Nested Loop Left Join (cost=0.58..37926.29 rows=10 ...)
   ->  Index Only Scan Backward using a0_pkey on a0 a
   (cost=0.29..3050.29 rows=10 width=8)
   ->  Index Only Scan using b0_pkey on b0 b
   (cost=0.29..0.34 rows=1 width=8)
 Index Cond: (id = a.id)
  (5 rows)


The reason is quite simple - we get multiple join paths for each child
(not visible in the plans, because there's just a single partition),
with these costs:

A: nestloop_path   startup 0.585000total 35708.292500
B: nestloop_path   startup 0.292500total 150004297.292500
C: mergejoin_path  startup 9748.112737 total 14102.092737

The one we'd like is the nestloop (A), and with disabled partition-wise
join that's what we pick. But generate_orderedappend_paths calls
get_cheapest_path_for_pathkeys for startup/total cost, and gets the two
other paths. Clearly, nestlop (B) is pretty terrible for LIMIT, because
of the high total cost, and mergejoin (C) is what we end up with.

Not sure how to fix this without making generate_orderedappend_paths way
more complicated ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


optimizer_first_rows_simple.sql
Description: application/sql


Re: Disabling options lowers the estimated cost of a query

2021-04-16 Thread Tomas Vondra
Hi,

On 4/16/21 3:09 PM, Tom Lane wrote:
> I wrote:
>> ... The code to select the
>> right child path would be approximately like get_cheapest_fractional_path,
>> except that you need to restrict it to paths with the right sort order.
> 
> Duh, I forgot about get_cheapest_fractional_path_for_pathkeys().
> 
>   regards, tom lane
> 

The attached patch does fix the issue for me, producing the same plans
with and without partition-wise joins.

It probably needs a bit more work, though:

1) If get_cheapest_fractional_path_for_pathkeys returns NULL, it's not
clear whether to use cheapest_startup or cheapest_total with Sort on
top. Or maybe consider an incremental sort?

2) Same for the cheapest_total - maybe there's a partially sorted path,
and using it with incremental sort on top would be better than using
cheapest_total_path + sort.

3) Not sure if get_cheapest_fractional_path_for_pathkeys should worry
about require_parallel_safe too.


Doesn't seem like an urgent issue (has been there for a while, not sure
we even want to backpatch it). I'll add this to the next CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index edba5e49a8..0284162034 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1716,6 +1716,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *pathkeys = (List *) lfirst(lcp);
 		List	   *startup_subpaths = NIL;
 		List	   *total_subpaths = NIL;
+		List	   *fractional_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
 		bool		match_partition_order;
@@ -1745,7 +1746,8 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		{
 			RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
 			Path	   *cheapest_startup,
-	   *cheapest_total;
+	   *cheapest_total,
+	   *cheapest_fractional = NULL;
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
@@ -1761,6 +1763,21 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			   TOTAL_COST,
 			   false);
 
+			/*
+			 * XXX strange that get_cheapest_fractional_path_for_pathkeys
+			 * does not have require_parallel_safe.
+			 */
+			if (root->tuple_fraction > 0)
+			{
+double	path_fraction = 1.0 / root->tuple_fraction;
+
+cheapest_fractional =
+	get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
+			  pathkeys,
+			  NULL,
+			  path_fraction);
+			}
+
 			/*
 			 * If we can't find any paths with the right order just use the
 			 * cheapest-total path; we'll have to sort it later.
@@ -1773,6 +1790,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 Assert(cheapest_total->param_info == NULL);
 			}
 
+			/*
+			 * XXX Do we need to do something about cheapest_fractional?
+			 * It could be NULL if there are no properly sorted paths,
+			 * but then maybe just doing the sort is good enough.
+			 *
+			 * XXX Well, maybe we should not just grab cheapest_total_path
+			 * here, because we might use incremental sort on a path that
+			 * is not fully sorted.
+			 */
+			if ((root->tuple_fraction > 0) && !cheapest_fractional)
+cheapest_fractional = cheapest_total;
+
 			/*
 			 * Notice whether we actually have different paths for the
 			 * "cheapest" and "total" cases; frequently there will be no point
@@ -1799,6 +1828,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 startup_subpaths = lappend(startup_subpaths, cheapest_startup);
 total_subpaths = lappend(total_subpaths, cheapest_total);
+
+if (cheapest_fractional)
+{
+	cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
+	fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional);
+}
 			}
 			else if (match_partition_order_desc)
 			{
@@ -1812,6 +1847,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 startup_subpaths = lcons(cheapest_startup, startup_subpaths);
 total_subpaths = lcons(cheapest_total, total_subpaths);
+
+if (cheapest_fractional)
+{
+	cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
+	fractional_subpaths = lcons(cheapest_fractional, fractional_subpaths);
+}
 			}
 			else
 			{
@@ -1823,6 +1864,10 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		  &startup_subpaths, NULL);
 accumulate_append_subpath(cheapest_total,
 		  &total_subpaths, NULL);
+
+if (cheapest_fractional)
+	accumulate_append_subpath(cheapest_fractional,
+			  &fractional_subpaths, NULL);
 			}
 		}
 
@@ -1849,6 +1894,18 @@ generate_orderedappend_pat

Re: Does btrfs on Linux have a negative performance impact for PostgreSQL 13?

2021-04-24 Thread Tomas Vondra



On 4/24/21 9:02 PM, Christophe Pettus wrote:




On Apr 24, 2021, at 11:27, Simon Connah  wrote:

I'm curious, really. I use btrfs as my filesystem on my home systems and am 
setting up a server as I near releasing my project. I planned to use btrfs on 
the server, but it got me thinking about PostgreSQL 13. Does anyone know if it 
would have a major performance impact?


This is a few years old, but Tomas Vondra did a presentation comparing major 
Linux file systems for PostgreSQL:

https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs



That talk was ages ago, though. The general conclusions may still be 
valid, but maybe btrfs improved a lot - I haven't done any testing since 
then. Not sure about durability either, but there are companies using 
btrfs, so perhaps it's fine.


Arguably, a lot of this also depends on the exact workload - the issues 
I saw with btrfs were with OLTP stress test, it could have performed 
much better with other workloads.



regards
Tomas




Re: slow performance with cursor

2021-07-02 Thread Tomas Vondra
On 7/1/21 10:25 PM, Ayub Khan wrote:
> Justin,
> 
> Below is the stored procedure, is there any scope for improvement?
> 

Hard to say, based on just the stored procedure source code. The queries
are not too complex, but we don't know which of them gets selected for
each cursor, and which of them is the slow one.

I suggest you identify which of the cursors is the most problematic one,
and focus on investigating it alone. Show us the explain analyze for
that query with different cursor_tuple_fraction values and without the
cursort, and so on.

As Tom said, for a cursor the optimizer may be picking a plan with low
startup cost, on the basis that that's good for a cursor. But if you're
always consuming all the tuples, that may be inefficient. It's often an
issue for queries with LIMIT, but none of the queries you posted includes
that clause, so who knows ...

Try identifying which of the cursors is causing the issues, show us the
explain analyze for that query (with and without the cursor), and that
should tell us more.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: temporary file log lines

2021-07-13 Thread Tomas Vondra




On 7/13/21 9:29 AM, Guillaume Lelarge wrote:

Hi,

On Mon, Jul 12, 2021 at 14:13, MichaelDBA <michael...@sqlexec.com> wrote:


hmmm, I think spilling over to disk for temporary tables is handled by
an entirely different branch in the PG source code.  In fact, some other
folks have chimed in and said log_temp_files doesn't relate at all to temp
files used by temporary tables, just queries as you mentioned
below elsewhere.  This seems to be a dark area of PG that is not
covered well.


As far as I know, log_temp_files only relates to sort/hash going to 
disks, not to temporary objects (tables and indexes).




Right. log_temp_files does not cover temporary tables.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Postgres chooses slow query plan from time to time

2021-09-13 Thread Tomas Vondra
On 9/13/21 3:24 PM, Kristjan Mustkivi wrote:
> Dear community,
> 
> I have a query that most of the time gets executed in a few
> milliseconds yet occasionally takes ~20+ seconds. The difference, as
> far as I am able to tell, comes whether it uses the table Primary Key
> (fast) or an additional index with smaller size. The table in question
> is INSERT ONLY - no updates or deletes done there.
> 

It'd be really useful to have explain analyze for the slow execution.

My guess is there's a poor estimate, affecting some of the parameter
values, and it probably resolves itself after autoanalyze run.

I see you mentioned SET STATISTICS, so you tried increasing the
statistics target for some of the columns? Have you tried lowering
autovacuum_analyze_scale_factor to make autoanalyze more frequent?
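
For example, just for that one table (the table and column names are
made up):

    -- analyze after ~1% of rows change, instead of the default 10%
    ALTER TABLE some_table SET (autovacuum_analyze_scale_factor = 0.01);

    -- and/or increase the statistics target for the skewed column
    ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 1000;
    ANALYZE some_table;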

It's also possible most values are independent, but some values have a
rather strong dependency, skewing the estimates. The MCV would help with
that, but those are in PG12 :-(


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQLv14 TPC-H performance GCC vs Clang

2021-11-05 Thread Tomas Vondra

Hi,

IMO this thread provides so little information it's almost impossible to 
answer the question. There's almost no information about the hardware, 
scale of the test, configuration of the Postgres instance, the exact 
build flags, differences in generated asm code, etc.


I find it hard to believe merely switching from clang to gcc yields 22% 
speedup - that's way higher than any differences we've seen in the past.


In my experience, the speedup is unlikely to be "across the board". 
There will be a handful of affected queries, while most remaining 
queries will be about the same. In that case you need to focus on those 
queries, see if the plans are the same, do some profiling, etc.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Postgres process count GCC vs Clang is Different on autovaccum=on

2021-11-24 Thread Tomas Vondra




On 11/24/21 13:05, hpc researcher_mspk wrote:

Software/Hardware used:
===
PostgresV14.v
OS: RHELv8.4
Benchmark:HammerDB v4.3
Hardware used: Apple/AMD Ryzen.
RAM size: 256 GB
SSD/HDD: 1TB
CPU(s): 256(0-255)
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        8

Command used to count process: ps -eaf | grep postgres

Case1: AutoVaccum=on
vu  GCC Clang
32  43  42
64  76  74
192 203 202
250 262 262


Case2:AutoVaccum=off
vu  GCC Clang
32  40  40
64  72  72
192 200 200
250 261 263


In Case1, why is the process count different for Clang vs GCC?
Is the PostgreSQL process count dependent on the compiler (GCC/Clang)?


No, it's not. The most likely explanation is that you're seeing a 
different number of autovacuum workers. Those are dynamic, i.e. they may 
appear/disappear. Or maybe there are more connections to the DB.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Query chooses Bad Index Path

2022-02-09 Thread Tomas Vondra
It's a bit annoying that you post the same query over and over again, 
starting a new thread every time. Don't do that, please, it's just 
confusing, people lose track of information you already provided in 
other threads etc.


Now, to the question ...

On 2/9/22 06:37, Valli Annamalai wrote:

Postgres version: 11.4

Problem:
     Query choosing Bad Index Path. Details are provided below

>

...


Doubt
    1. Why is this query choosing Index Scan Backward using the table1_pkey 
index, though its cost is high? It can rather choose

             BITMAP OR
                   (Index on RECORDID) i.e; table1_idx6
                   (Index on RELATEDID) i.e; table1_idx7

       Below is the selectivity details from pg_stats table
         - Recordid has 51969 distinct values. And selectivity 
(most_common_freqs) for recordid = 15842006928391817 is 0.00376667
         - Relatedid has 82128 distinct values. And selectivity 
(most_common_freqs) for recordid = 15842006928391817 is 0.0050666


Since the selectivity is low, this should logically choose this index, 
which would have improved my query performance here.


Well, the filter condition is much more complex - it's not just 
conditions on recordid, but various conditions on other columns, with 
both AND and OR. So it's possible the estimate is off, and the optimizer 
picks the wrong plan. Try running explain analyze without the LIMIT, 
that'll tell you how accurate the estimates are (LIMIT terminates early, 
so the actual rowcount is incomplete).


The other option is a data distribution issue, as pointed out by Monika 
Yadav in the other thread. The optimizer assumes matching rows are 
distributed uniformly in the input relation, but chances are they're 
close to the beginning or the end, depending on how you sort it.


Imagine you have 1,000,000 rows, 1000 of them match the filter, and you 
have LIMIT 10. If the matching rows are distributed uniformly, it's 
enough to scan 1% of the input, i.e. 10,000 rows (because there's one 
matching row for every 1000 rows, on average).


But let's assume the matching rows are not distributed uniformly, but at 
the end, when you sort it. Well, you'll have to go through 100% of the 
input. But the optimizer won't realize that.


This is a known / common issue with LIMIT, unfortunately. The estimated 
cost is much lower than it should be, and it's hard to fix.


I cross-checked the same by removing PrimaryKey to this table and query 
now chooses these indexes and response is in 100ms. Please refer the 
plan below (after removing primary key):





Well, yeah. That's mostly consistent with the data distribution theory.

I'd try two things:

1) define a covering index, so that the query can do Index Only Scan

2) define a partial index, moving some of the filter conditions to the index 
predicate (not sure if that's possible, it depends on which parameters of 
the condition are static) - see the sketch below
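
A sketch of both, with partly made-up column names (I don't know which
parts of your filter are static):

    -- 1) covering index, allowing an index-only scan
    CREATE INDEX ON table1 (recordid, relatedid) INCLUDE (some_output_col);

    -- 2) partial index, with the static part of the filter in the predicate
    CREATE INDEX ON table1 (recordid, relatedid)
        WHERE some_flag = 'N';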



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Never Ending query in PostgreSQL

2022-03-01 Thread Tomas Vondra
On 2/27/22 18:20, Jeff Janes wrote:
> 
> On Sun, Feb 27, 2022 at 7:09 AM Kumar, Mukesh  <mailto:mku...@peabodyenergy.com>> wrote:
> 
> Hi Team, 
> 
> Can you please help in tuning the attached query, as I am trying
> to run this query and it runs for several hours and it did not give
> any output.
> 
> 
> Several hours is not all that long.  Without an EXPLAIN ANALYZE, we
> could easily spend several hours scratching our heads and still get
> nowhere.  So unless having this running cripples the rest of your
> system, please queue up another one and let it go longer.  But first, do
> an ANALYZE (and preferably a VACUUM ANALYZE) on all the tables.  If you
> have a test db which is a recent clone of production, you could do it
> there so as not to slow down production.  The problem is that the row
> estimates must be way off (otherwise, it shouldn't take long) and if
> that is the case, we can't use the plan to decide much of anything,
> since we don't trust it.
> 

I'd bet Jeff is right and poor estimates are the root cause. The pattern
with a cascade of "nested loop" in the explain is fairly typical. This
is likely due to the complex join conditions and correlation.


> In parallel you could start evicting table joins from the query to
> simplify it until it gets to the point where it will run, so you can
> then see the actual row counts.  To do that it does help if you know
> what the intent of the query is (or for that matter, the text of the
> query--you attached the plan twice).
> 

Right, simplify the query. Or maybe do it the other way around - start
with the simplest query (the inner-most part of the explain) and add
joins one by one (by following the explains) until it suddenly starts
being much slower.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Never Ending query in PostgreSQL

2022-03-01 Thread Tomas Vondra
On 3/1/22 16:01, Kumar, Mukesh wrote:
> Hi Tomas ,
> 
> Thanks for replying , We have identified a Join condition which is
> creating a problem for that query.
> 
> Accept my apologies for pasting the plan twice. I am attaching the
> query again in this mail
> 

Queries without explain (or even better "explain analyze") are useless.
We don't have the data, we don't know what the executed plan is, we
don't know what plan might be a better one.

There's a wiki page about reporting slow queries (what info to include,
etc):

https://wiki.postgresql.org/wiki/Slow_Query_Questions

> We have found that by evicting the view paymenttransdetails_view from
> the attached query, it runs in approx. 10 secs; the view contains
> multiple conditions and 1 join as well.
> 

You need to add individual tables, not a view which is itself a join of
10+ tables. The idea is that you start with a fast query, add tables one
by one (in the join order from the explain). You'll be able to do
EXPLAIN ANALYZE and watch estimate accuracy, and then at some point it
gets much slower, which is the join that causes trouble. But you might
still be able to do explain analyze.

So looking at the explain plan you shared before, you'd start with a
join of so_vendor_address_base + so_vendor_base, and then you'd add

- sapecc_lfa1_assoc
- lms_payment_item_vendor_base
- lms_payment_line_item_base
- lms_payment_check_request
- lms_pay_line_item_acct_base
- ...

(in this order). I'd bet "lms_payment_check_request" is where things
start to go south.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Optimal configuration for server

2022-03-07 Thread Tomas Vondra



On 3/7/22 12:51, Luiz Felipph wrote:
> Hi everybody!
> 
> I have a big application running on premise. One of my main database
> servers has the following configuration:
> 
> 72 CPUs(2 chips, 18 physical cores per chip, 2 threads) Xeon Gold 6240
> 1TB of ram or 786GB (5 servers at all)
> A huge storage( I don't know for sure what kind is, but is very powerful)
> 
> A consulting company recommended the following configuration for theses
> main servers(let me know if something important was left behind):
> 
> max_connections = 2000
> shared_buffers = 32GB
> temp_buffers = 1024
> max_prepared_transactions = 3000
> work_mem = 32MB
> effective_io_concurrency = 200
> max_worker_processes = 24
> checkpoint_timeout = 15min
> max_wal_size = 64GB
> min_wal_size = 2GB
> effective_cache_size = 96GB
> (...)
> 
> I think this is too low a memory setting for the size of the server... The
> number of connections I'm still measuring, to reduce this value (I think
> it's too high for the needs of the application, but until it hits a value
> high enough to cause any memory issue, I think it is not a problem)
> 

Hard to judge, not knowing your workload. We don't know what information
was provided to the consulting company, you'll have to ask them for
justification of the values they recommended.

I'd say it looks OK, but max_connections/max_prepared_transactions are
rather high, considering you only have 72 threads. But it depends ...

> My current problem:
> 
> under heavy load, I'm getting "connection closed" errors at the application
> level (java-jdbc, jboss ds)
> 

Most likely a java/jboss connection pool config. The database won't just
arbitrarily close connections (unless there are timeouts set, but you
haven't included any such info).

> The server never spikes above 200GB of used RAM (that's why I think
> the configuration is too low)
> 

Unlikely. If needed, the system would use memory for page cache, to
cache filesystem data. So most likely this is due to the database not
being large enough to need more memory.

You're optimizing the wrong thing - the goal is not to use as much
memory as possible. The goal is to give good performance given the
available amount of memory.

You need to monitor shared buffers cache hit rate (from pg_stat_database
view) - if that's low, increase shared buffers. Then monitor and tune
slow queries - if a slow query benefits from higher work_mem values, do
increase that value. It's nonsense to just increase the parameters to
consume more memory.
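
E.g. something like this gives a rough cache hit ratio per database,
from pg_stat_database:

    SELECT datname, blks_read, blks_hit,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
      FROM pg_stat_database
     ORDER BY blks_read DESC;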


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join

2022-03-26 Thread Tomas Vondra



On 3/22/22 13:57, Prajna Shetty wrote:
> ++ pgsql-performance@postgresql.org
>  
> Hello Team,
>  
> There is change in query plan in 12.4 version and Version 13 resulting
> in performance slowness post upgrade.
>  
> 
>   * In 12.4 version, Sort Operation Group Aggregate is selected which
> results to Merge Join. Query takes ~5 seconds.
>   * In 13.5 version, optimizer wrongly estimates and due to new Disk
> Based Hash Aggregate feature, it prefers Hash Aggregate instead of
> Sort Operation which finally blocks merge-join and chooses Nested
> Loop Left Join. Query takes ~5 minutes.
> 
>  
> NOTE: Disabling Hash Aggregate at the instance level forces the optimizer to
> choose the merge operation, but such an instance-level modification is not
> possible in terms of Application Functionality.
>  
> This performance issue affects most of the queries. Attached is one of
> the queries and its plan in both versions for reference, in case that helps
> with recreating the issue.
>  

It's impossible to comment on those other queries, but chances are the root
cause is the same.

> Version 13 query plan has lower estimated cost than that of 12.4 which
> implies 13.5 planner thought it found a better plan, but it is running
> slower and actual cost show more.
>  
> 12.4 Version:
> "Merge Right Join  (cost=*202198.78..295729.10* rows=1 width=8) (actual
> time=1399.727..*5224.574* rows=296 loops=1)"
>  
> 13.5 version:-
> "Nested Loop Left Join  (cost=*196360.90..287890.45* rows=1 width=8)
> (actual time=3209.577..*371300.693* rows=296 loops=1)"
>  

This is not a costing issue, the problem is that we expect 1 row and
calculate the cost for that, but then get 296. And unfortunately a
nested loop degrades much faster than a merge join.

I'm not sure why exactly 12.4 picked a merge join, chances are the
costing formula changed a bit somewhere. But as I said, the problem is
in bogus row cardinality estimates - 12.4 is simply lucky.

The problem most likely stems from this part:

  ->  GroupAggregate  (cost=0.43..85743.24 rows=1830 width=72) (actual
time=1.621..3452.034 rows=282179 loops=3)
  Group Key: student_class_detail.aamc_id
  Filter: (max((student_class_detail.class_level_cd)::text) = '4'::text)
  Rows Removed by Filter: 76060
  ->  Index Scan using uk_student_class_detail_aamcid_classlevelcd on
student_class_detail  (cost=0.43..74747.61 rows=1284079 width=6) (actual
time=1.570..2723.014 rows=1272390 loops=3)
Filter: (class_level_start_dt IS NOT NULL)
Rows Removed by Filter: 160402

The filter is bound to be misestimated, and the error then snowballs.
Try replacing this part with a temporary table (with pre-aggregated
results) - you can run analyze on it, etc. I'd bet that'll make the
issue go away.
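
Roughly something like this - the column names come from the plan above,
the rest is guesswork:

    -- pre-aggregate the problematic part into a temp table
    CREATE TEMP TABLE student_class_agg AS
    SELECT aamc_id
      FROM student_class_detail
     WHERE class_level_start_dt IS NOT NULL
     GROUP BY aamc_id
    HAVING max(class_level_cd) = '4';

    -- collect statistics, so the planner knows the real row count
    ANALYZE student_class_agg;

    -- and then join student_class_agg instead of the aggregate subquery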

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



