Re: When you really want to force a certain join type?

2022-12-28 Thread Gunther Schadow

On 12/28/2022 10:48 AM, Justin Pryzby wrote:

Maybe the new parameter in v15 would help.

https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-RECURSIVE-WORKTABLE-FACTOR
recursive_worktable_factor (floating point)

 Sets the planner's estimate of the average size of the working table
 of a recursive query, as a multiple of the estimated size of the
 initial non-recursive term of the query. This helps the planner
 choose the most appropriate method for joining the working table to
 the query's other tables. The default value is 10.0. A smaller value
 such as 1.0 can be helpful when the recursion has low “fan-out” from
 one step to the next, as for example in shortest-path queries. Graph
 analytics queries may benefit from larger-than-default values.


Thanks, that's something I will try after I upgrade.
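For reference, once on v15 the parameter can be tried per session before running the recursive query, e.g. (just a sketch; the right value depends on the actual fan-out, per the documentation quoted above):

set recursive_worktable_factor = 1.0;  -- e.g. for low fan-out, shortest-path style recursions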

But speaking of other uses for recursive queries, I have quite a bit of 
experience turning graph-related traversal, search, optimization, and 
classification problems into SQL, in short, computing the transitive 
closure. Usually I have stayed away from the recursive WITH query and 
instead set up a start table and then performed the iterative step 
myself. There are two ways to go about it. Say you have a graph of 
simple nodes and arcs, and you want to find all paths through the graph.


One way is to set up the start nodes and then, in each step, extend every 
path at its end by joining the recursive table to the plain arc table. 
This is what WITH RECURSIVE supports. Such a query iterates linearly, 
as many times as the length of the longest path.


with recursive arcs as (
  select source, target, 1 as distance, ...
    from ...
), paths as (
  select * from arcs
  union all
  select a.source, b.target, a.distance + 1 as distance, ...
    from paths a inner join arcs b
      on (b.source = a.target)
)
select * from paths

But another way is to join paths with paths. That would look like this, 
which I think I have seen PostgreSQL unable to deal with:


with recursive arcs as (
  select source, target, 1 as distance, ...
    from ...
), paths as (
  select * from arcs
  union all
  select a.source, b.target, a.distance + 1 as distance, ...
    from paths a inner join paths b
      on (b.source = a.target)
)
select * from paths

So, instead of the recursive union joining back to the fixed arc table, 
this joins the recursive table to itself, and the benefit is that the 
iteration converges much faster. Instead of needing 10 iterations to find 
a path of length 10, one iteration finds all paths of length 2 (if the 
base table of all arcs is distance 1), the next finds paths of up to 4, 
then up to 8, then 16, 32, and so on; the number of iterations is 
logarithmic in the length of the longest path. I usually do that as follows:


create table paths as
select source, target, 1 as distance, ...
  from arcs;

prepare rep as
insert into paths(source, target, distance, ...)
select a.source, b.target, a.distance + b.distance as distance, ...
  from paths a inner join paths b on(b.source = a.target)
except
select * from paths;

execute rep;
execute rep;
...

or, instead of the except, in order to keep only minimal distances, a 
filter on the rows to be inserted:

where not exists (select 1 from paths x
                   where x.source = a.source
                     and x.target = b.target
                     and x.distance <= a.distance + b.distance)

I have even done a GROUP BY in the recursive step, replacing the paths 
relation at every iteration (e.g. keeping only the minimal-distance paths).
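A sketch of that variant (hypothetical table names, reduced to the three 
columns shown above; each round replaces paths with the minimal-distance 
path per (source, target) pair):

create table paths_next as
select source, target, min(distance) as distance
  from (select source, target, distance from paths
        union all
        select a.source, b.target, a.distance + b.distance
          from paths a inner join paths b on (b.source = a.target)) s
 group by source, target;

drop table paths;
alter table paths_next rename to paths;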


Since this converges so rapidly I often prefer that approach over a 
recursive union query.
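For what it is worth, the "execute rep; execute rep; ..." repetition can 
be driven to a fixed point automatically; a minimal sketch (same assumed 
schema reduced to three columns, acyclic graph, using the EXCEPT variant):

do $$
declare n bigint;
begin
  loop
    insert into paths(source, target, distance)
    select a.source, b.target, a.distance + b.distance as distance
      from paths a inner join paths b on (b.source = a.target)
    except
    select source, target, distance from paths;

    get diagnostics n = row_count;
    exit when n = 0;   -- no new paths were added: the closure is complete
  end loop;
end $$;

With cycles in the graph the EXCEPT alone never terminates, so there the 
minimal-distance filter (or the GROUP BY variant above) is the one to use.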


I think IBM DB2 allowed joining the recursive table with itself. Is this 
something you would want to support at some point?


Also, why even require the RECURSIVE keyword? DB2 didn't need it, and the 
query analyzer should immediately see the recursion, so there seems to be 
no need for that keyword.


regards,
-Gunther


When you really want to force a certain join type?

2022-12-28 Thread Gunther Schadow
60)
 ->  Materialize  (cost=120554.35..120966.44 rows=82419 width=228)
       ->  Sort  (cost=120554.35..120760.39 rows=82419 width=228)
             Sort Key: f.token, p.pathid, p.end
             ->  Nested Loop  (cost=0.43..104808.55 rows=82419 width=228)
                   ->  WorkTable Scan on path p  (cost=0.00..68.60 rows=3430 width=212)
                   ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..30.30 rows=24 width=28)

Once this merge join kicks in, the query essentially stalls. Each of the 
limited components runs in seconds, and I can run them iteratively so that 
my initial set of tokens never grows past 100,000; that way I can complete 
everything in about linear time, each iteration taking time roughly 
proportional to the number of tokens. But with the merge join it doesn't 
complete in less than several times that amount of time.


I doubt that I can find any trick to give the planner better data from 
which it could figure out that the merge join is a bad proposition.


I wish I could just force it. I probably had this discussion here some 
years ago. I think that while the PostgreSQL optimizer is pretty good, 
there are situations such as this where its predictions do not work.


Note, for my immediate relief I have forced it by simply setting 
enable_mergejoin = off. This works fine, except that it converts both 
merge joins into nested loops, even though the upper merge join was not a 
problem, and most of the time a nested loop is a bad choice for bulk data. 
It is only for this recursive query that it sometimes makes sense.
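One way to at least contain the workaround, as a sketch using standard GUC 
scoping (nothing version-specific): wrap only the recursive query in a 
transaction and use SET LOCAL, so the rest of the session keeps the 
default planner settings.

begin;
set local enable_mergejoin = off;
-- run the recursive query here
commit;   -- the setting reverts automatically at commit or rollback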


regards,
-Gunther


Faster, more low-level methods of having hot standby / secondary read-only servers?

2022-09-13 Thread Gunther Schadow
ealistic numbers). And my 3 stand-bys rotate doing that, so 2 
of the 3 are always up while 1 of them might be briefly down.


This could be improved even with some OS and SAN support, all the way 
down to RAID mirrors. But practically, I could mount the same file 
system (e.g., BSD UFS) on the master with -o rw and on the stand-by with 
-o ro. Even without UFS having any "cluster" support, I can do that. I 
will have a few inconsistencies, but PostgreSQL is very tolerant of 
small inconsistencies and can fix them anyway. (I would not dare try 
anything like this with an Oracle system.)


Another tool I could think of using is BSD UFS snapshot support. I could 
make a snapshot that is consistent from the file system perspective. 
PostgreSQL writers could interface with that: issue a sync, trigger a UFS 
snapshot, then a file system sync. Now any standby that reads its data 
files from this snapshot would not only have file-system-level 
consistency, but even database-level consistency. So with this 
replication using a file system mounted from multiple PgSQL servers, 
replication should work well while consuming a minimal amount of server 
resources and also not leading to too much actual disk IO contention (seeks).


And even the disk contention could possibly be resolved by letting a 
RAID mirror sync to the same snapshot point and then splitting it off, 
directing the read activity of the stand-by server (which is querying 
like crazy) only at that mirror, while batching changes to the RAID 
master so that they can be applied with very low overhead.


Anyone thinking about these things?

regards,
-Gunther


PostgreSQL hot standby reading WAL from multi-attached volume?

2021-04-05 Thread Gunther Schadow

Hi,

I wonder whether we can set up a hot standby in such a way that we need 
neither log streaming nor shipping, and instead every hot standby just 
mounts, in read-only mode, the same disk which the master uses to write 
its WAL files?


Even without a clustered file system, e.g. with UFS on FreeBSD, one can 
have the master mount the volume in read-write mode while all the hot 
standbys mount it read-only. Given that WAL segments are written out at 
a certain rate, one can at regular intervals issue


mount -u /pg_wal

and it should refresh the metadata, I assume. I am re-reading about 
hot standby, and it strikes me that this method is essentially the "log 
shipping" method, only that there is no actual "shipping" involved; the 
new log files simply appear all of a sudden on the disk.


I suppose there is a question of how we know when a new WAL file has 
finished appearing? And as I read it, the log-shipping method may not be 
suitable for hot standby use?


Is this something that has been written about already?

regards,
-Gunther







Re: Postgres performance comparing GCP and AWS

2021-02-24 Thread Gunther Schadow

Hi Maurici,

in my experience the key factor for speed in big queries is sequential 
scan throughput. There is a huge variance depending on how the system is 
tuned. In some cases I cannot read more than 10 MB/s, in others I have 
come to expect 20-40 MB/s. But when things are tuned well and the parallel 
workers kick in, I see the throughput spike to 100-200 MB/s.


You may have to enable the parallel workers in your postgresql.conf.
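A quick way to check from SQL, using your table below just as an example 
(the worker count here is illustrative, not a recommendation):

show max_parallel_workers_per_gather;
show max_worker_processes;

set max_parallel_workers_per_gather = 4;   -- per-session experiment

explain (analyze, buffers) select count(*) from "SignalRecordsBlobs";
-- look for Gather / Parallel Seq Scan nodes and "Workers Launched" in the output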

So, to me, this is what you want to check first. While the query runs, 
have both iostat and top running, with top -j or -c or -a or whatever it 
is on that particular OS to see the detail info about the process. 
Perhaps even -H to see threads.


Then you should see good flow with high read speed and reasonable CPU 
load %. If you get low read speed and low CPU, that is a sign of IO 
blockage somewhere. If you get high CPU and low IO, that's a planning 
mistake (the nested loop trap). You don't appear to have that here. But 
I have seen index scans with much worse IO throughput than sequential 
table scans. Not sure.


Also, on AWS you need to be sure you have enough IOPS provisioned on 
your EBS volume (I use gp3 now, where you can have up to 10k IOPS) and 
also check the EBS bandwidth of the EC2 instance type. Needless to say 
you don't want a t* instance, where you have only limited burst CPU capacity.


regards,
-Gunther

On 2/23/2021 1:12 PM, Maurici Meneghetti wrote:

Hi everyone,

I have 2 postgres instances created from the same dump (backup), one 
on a GCP VM and the other on AWS RDS. The first instance takes 18 
minutes and the second one takes less than 20s to run this simple query:

SELECT "Id", "DateTime", "SignalRegisterId", "Raw" FROM "SignalRecordsBlobs" WHERE "SignalSettingId" = 103 AND "DateTime" BETWEEN '2019-11-28T14:00:12.54020' AND '2020-07-23T21:12:32.24900';

I've run this query a few times to make sure both should be reading 
data from cache.
I expect my postgres on GCP to be at least similar to the one managed 
by AWS RDS so that I can work on improvements in parallel and compare.


DETAILS:

Query explain for Postgres on GCP VM:

Bitmap Heap Scan on SignalRecordsBlobs SignalRecordsBlobs  (cost=18.80..2480.65 rows=799 width=70) (actual time=216.766..776.032 rows=5122 loops=1)
    Filter: (("DateTime" >= '2019-11-28 14:00:12.5402'::timestamp without time zone) AND ("DateTime" <= '2020-07-23 21:12:32.249'::timestamp without time zone))
    Heap Blocks: exact=5223
    Buffers: shared hit=423 read=4821
    ->  Bitmap Index Scan on IDX_SignalRecordsBlobs_SignalSettingId  (cost=0.00..18.61 rows=824 width=0) (actual time=109.000..109.001 rows=5228 loops=1)
          Index Cond: ("SignalSettingId" = 103)
          Buffers: shared hit=3 read=18
Planning time: 456.315 ms
Execution time: 776.976 ms

Query explain for Postgres on AWS RDS:

Bitmap Heap Scan on SignalRecordsBlobs SignalRecordsBlobs  (cost=190.02..13204.28 rows=6213 width=69) (actual time=2.215..14.505 rows=5122 loops=1)
    Filter: (("DateTime" >= '2019-11-28 14:00:12.5402'::timestamp without time zone) AND ("DateTime" <= '2020-07-23 21:12:32.249'::timestamp without time zone))
    Heap Blocks: exact=5209
    Buffers: shared hit=3290 read=1948
    ->  Bitmap Index Scan on IDX_SignalRecordsBlobs_SignalSettingId  (cost=0.00..188.46 rows=6405 width=0) (actual time=1.159..1.159 rows=5228 loops=1)
          Index Cond: ("SignalSettingId" = 103)
          Buffers: shared hit=3 read=26
Planning time: 0.407 ms
Execution time: 14.87 ms

PostgreSQL version number running:
• VM on GCP: PostgreSQL 11.10 (Debian 11.10-0+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
• Managed by RDS on AWS: PostgreSQL 11.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit


How PostgreSQL was installed:
• VM on GCP: Already installed when the VM running Debian was created from the Google Console.
• Managed by RDS on AWS: RDS managed the installation.

Changes made to the settings in the postgresql.conf file:
Here are some postgres parameters that might be useful:

Instance on VM on GCP (2 vCPUs, 2 GB memory, 800 GB disk):
• effective_cache_size: 1496MB
• maintenance_work_mem: 255462kB (close to 249MB)
• max_wal_size: 1GB
• min_wal_size: 512MB
• shared_buffers: 510920kB (close to 499MB)
• max_locks_per_transaction 1000
• wal_buffers: 15320kB (close to 15MB)
• work_mem: 2554kB
• effective_io_concurrency: 200
• dynamic_shared_memory_type: posix
On this instance we installed a postgres extension called timescaledb 
to gain performance on other tables. Some of these parameters were set 
using recommendations from that extension.


Instance managed by RDS (2 vCPUs, 2 GiB RAM, 250 GB disk, 750 IOPS):
• effective_cache_size: 1887792kB (close to 184

Conundrum with scaling out of bottleneck with hot standby, PgPool-II, etc.

2020-12-23 Thread Gunther Schadow
aster and slave, or I can spend the same 
money to buy a single system with twice the CPU cores and a twice as 
wide IO path and disks. Why would I do anything other than just 
increase that master db server?


regards,
-Gunther






Is there a known bug with SKIP LOCKED and "tuple to be locked was already moved to another partition due to concurrent update"?

2020-06-30 Thread Gunther Schadow

Hi all,

a long time ago I devised, with your help, a task queuing system which 
uses SELECT ... FOR UPDATE SKIP LOCKED for many parallel workers to find 
tasks in the queue. It uses a partitioned table where the hot part of the 
queue is short, so the query for a job is quick, and the SKIP LOCKED 
locking makes sure that each job is assigned to only one worker. 
This works pretty well for me, except that when we run many workers 
we find a lot of these failures occurring:


"tuple to be locked was already moved to another partition due to 
concurrent update"


This would not exactly look like a bug, because the message says "to be 
locked", so at least it's not allowing two workers to lock the same 
tuple. But it seems that the skip-locked mode should not turn this into 
an error, but treat it as if the tuple were already locked. Why would it 
want to lock the tuple (representing the job) if another worker has 
already finished its UPDATE of the job to mark it as "done" (which is 
what makes the tuple move to the "completed" partition)?


Either the SELECT for jobs to do returned a wrong tuple, one which was 
already updated, or there is some lapse in the locking.


Either way it would seem to be a waste of time throwing all these errors 
when the tuple should not even have been selected for update and locking.


I wonder if anybody knows anything about that issue? Of course you'll 
want to see the DDL and SQL queries, etc., but you can't really try it 
out unless you do some massively parallel magic. So I figured I would 
just ask; a rough sketch of the pattern follows.
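Roughly, each worker does something like the following (hypothetical 
table and column names, since the real DDL isn't shown; in the real 
system the actual work happens between claiming a job and marking it done):

-- claim one pending job; rows locked by other workers are skipped
update job_queue
   set status = 'done'          -- this status change is what moves the row
                                -- to the "completed" partition
 where job_id = (select job_id
                   from job_queue
                  where status = 'pending'
                  limit 1
                  for update skip locked)
returning job_id;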


regards,
-Gunther





Re: FPGA optimization ...

2019-11-04 Thread Gunther

Hi Thomas, you said:


For the record, this is not exactly a new thing. Netezza (a PostgreSQL
fork started in 1999, later acquired by IBM) used FPGAs. Now there's swarm64 [1], another
PostgreSQL fork, also using FPGAs with newer PostgreSQL releases.


yes, I found the swarm thing on Google, and heard about Netezza years 
ago from the Indian consulting contractor that had worked on it (their 
price point was way out of the range that made sense for the academic 
place where I worked then).


But there is good news, better than you thought when you wrote:


Those are proprietary forks, though. The main reason why the community
itself is not working on this directly (at least not on pgsql-hackers)
is exactly that it requires specialized hardware, which the devs
probably don't have, making development impossible, and the regular
customers are not asking for it either (one of the reasons being limited
availability of such hardware, especially for customers running in the
cloud and not being even able to deploy custom appliances).

I don't think this will change, unless the access to systems with FPGAs
becomes much easier (e.g. if AWS introduces such instance type).


It already has changed! Amazon F1 instances. And Xilinx has already 
packaged a demo https://aws.amazon.com/marketplace/pp/B07BVSZL51. This 
demo appears very limited though (only for TPC-H query 6 and 12 or so).


Even the hardware to hold in your hand is now much cheaper. I know a guy 
who's marketing a board with 40 GB/s throughput. I don't have a price, but 
I can't imagine the board plus a 1 TB disk to be much above US$ 2k. I 
could sponsor that if someone wants to have a serious shot at it.


Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue 
University who might take this on. Interest?




I don't think there's any such non-profit managing/funding development.
At least I'm not aware of it. There are various non-profits around the
world, but those are organizing events and local communities.

I'd say the best way to do something like this is to either talk to one
of the companies participating in PostgreSQL development (pgsql-hackers is
probably a good starting point), or - if you absolutely need to go
through a non-profit - approach a university (which does not mean people
from pgsql-hackers can't be involved, of course). I've been involved in
a couple of such research projects in Europe, not sure what exactly is
the situation/rules in US.


Yes, it might work with a university directly, although I will also 
contact the PostgreSQL foundation in the US.


regards,
-Gunther





FPGA optimization ...

2019-11-04 Thread Gunther

The time has come.

FPGA optimization is in the palm of our hands (literally a 2 TB 40 GB/s 
IO PostgreSQL server fits into less than a shoe box), and on Amazon AWS 
F1 instances.


Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.

But a lot more could be done. How about linear sort performance at O(N)? 
https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. 
And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, 
where the following four sorting accelerators are used:


 * Tuple Sorter : Sorts an N-tuple using a sorting network.
 * Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples
   in on-chip memory.
 * Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
 * Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in
   flash.

Order of magnitude speed improvements? Better than Hadoop clusters on a 
single chip? 40 GB/s I/O throughput massive full table scan, blazing 
fast sort-merge joins? Here it is. Anybody working more on that? Should 
be an ideal project for a student or a group of students.


Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue University 
who might take this on. Interest?


regards,
-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-08-24 Thread Gunther
Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call it 
"leak" as it doesn't grow during the 30 min or so that this query runs. 
It explodes to 4GB and then stays flat until done.


Yes, and this time the query is super complicated, with many joins and 
tables involved. The query plan has 100 lines. Not easy to share for a 
reproducer, and I have my issue under control by adding some swap just in 
case. The swap space was never actually used.


thanks,
-Gunther

On 8/23/2019 10:20, Tom Lane wrote:

Gunther  writes:

Hi all, I am connecting to a discussion back from April this year. My
data has grown and now I am running into new out of memory situations.

It doesn't look like this has much of anything to do with the hash-table
discussion.  The big hog is an ExprContext:


ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks);
1107289928 used

So there's something leaking in there, but this isn't enough info
to guess what.

regards, tom lane





Re: Out of Memory errors are frustrating as heck!

2019-08-23 Thread Gunther
OK, I went back through that old thread, and I noticed an early opinion 
by a certain Peter who said that I should provision 
some swap space. Since I had plenty of disk and no other option, I tried 
that. And it worked some magic. Here is a steady state now:


top - 14:07:32 up 103 days,  9:57,  5 users,  load average: 1.33, 1.05, 0.54
Tasks: 329 total,   2 running, 117 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.0 us, 11.4 sy,  0.0 ni, 35.3 id, 22.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7910376 total,   120524 free,  2174940 used,  5614912 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used.  3239724 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5068 postgres  20   0 4352496   4.0g   2.0g R  76.4 52.6   3:01.39 postgres: postgres integrator [local] INSERT
  435 root      20   0       0      0      0 S   4.0  0.0  10:52.38 [kswapd0]

and the nice thing is, the backend server process appears to be bounded 
at 4GB, so there isn't really a "memory leak". And also, the swap space 
isn't really being used. This may have to do with the vm.* sysctl 
settings, overcommit, etc.

 * vm.overcommit_memory = 2 -- values are
     o 0 -- estimate free memory
     o 1 -- always assume there is enough memory
     o 2 -- no over-commit, allocate only within the following two parameters
 * vm.overcommit_kbytes = 0 -- how many kB above swap can be over-committed, EITHER this OR
 * vm.overcommit_ratio = 50 -- percent of main memory that can be committed over swap,
     o with 0 swap, only that percent can be committed
     o i.e., of 8 GB, 4 GB are then reserved for buffer cache
     o not a good idea probably
     o at least we should allow 75% committed, i.e., 6 GB of 8 GB, leaving
         + 2 GB of buffer cache
         + 2 GB of shared buffers
         + 4 GB of all other memory

I have vm.overcommit_memory = 2, _kbytes = 0, _ratio = 50. So this means 
with _ratio = 50 I can commit 50% of memory, i.e. 4 GB, and this is 
exactly what the server process wants. So with little impact on the 
available buffer cache I am in a fairly good position now. The swap 
(which in my case I set at 2 x main memory = 16 GB) serves as a buffer to 
smooth out this peak usage without ever actually paging.


I suppose even without swap I could have set vm.overcommit_ratio = 75, 
and I notice now that I had already noted as much (the above bullet 
points are my own notes).


Anyway, for now, I am good. Thank you very much.

regards,
-Gunther



On 8/23/2019 9:17, Gunther wrote:


Hi all, I am connecting to a discussion back from April this year. My 
data has grown and now I am running into new out of memory situations. 
Meanwhile the world turned from 11.2 to 11.5 which I just installed 
only to find the same out of memory error.


Have any of the things discussed and proposed, especially this last 
one by Tomas Vondra, been applied to the 11 releases? Should I try 
these older patches from April?


regards,
-Gunther

For what it is worth, this is what I am getting:

TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 
used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 
416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 
blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 
total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 
8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type 
information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 
21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 
chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 
free (4 chunks); 5293736 used JoinRelHashTable: 16384 total in 2 
blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 
total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 
32768 total in 3 blocks; 12720 free (8 chunks); 20048 used 
TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 
chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 
chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free 
(0 chunks); 528 used PortalContext: 1024 total in 1 blocks; 624 free 
(0 chunks); 400 used: ExecutorState: 202528536 total in 19 blocks; 
433464 free (12 chunks); 202095072 used HashTableContext: 8192 total 
in 1 blocks; 7656 free (0 chunks); 536 used HashBatchContext: 10615104 
total in 261 blocks; 7936 free (0 chunks); 10607168 used 
HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 
used HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 
chunks); 13071368 used TupleSort main: 49208 total in 3 blocks; 8552 
free (7 chunks); 40656 used Caller tuples: 8192 total in 1 blocks; 
7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 
7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 
7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 
7936 free 

Re: Out of Memory errors are frustrating as heck!

2019-08-23 Thread Gunther
Hi all, I am connecting to a discussion back from April this year. My 
data has grown and now I am running into new out of memory situations. 
Meanwhile the world turned from 11.2 to 11.5 which I just installed only 
to find the same out of memory error.


Have any of the things discussed and proposed, especially this last one 
by Tomas Vondra, been applied to the 11 releases? Should I try these 
older patches from April?


regards,
-Gunther

For what it is worth, this is what I am getting:

TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 
used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 
416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 
blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 
total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 
8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information 
cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used 
RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 
1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 
chunks); 5293736 used JoinRelHashTable: 16384 total in 2 blocks; 5576 
free (1 chunks); 10808 used Operator class cache: 8192 total in 1 
blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total 
in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 
32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 
8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 
8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalContext: 
1024 total in 1 blocks; 624 free (0 chunks); 400 used: ExecutorState: 
202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used 
HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used 
HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 
10607168 used HashTableContext: 8192 total in 1 blocks; 7688 free (1 
chunks); 504 used HashBatchContext: 13079304 total in 336 blocks; 7936 
free (0 chunks); 13071368 used TupleSort main: 49208 total in 3 blocks; 
8552 free (7 chunks); 40656 used Caller tuples: 8192 total in 1 blocks; 
7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 
free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free 
(0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 
chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 
blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 
total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 
total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp 
Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan 
HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Gunther
..1310249.63 
rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1) 
Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq 
Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual 
time=1.609..7687.986 rows=14676871 loops=1) -> Hash 
(cost=829388.19..829388.19 rows=1 width=136) (actual 
time=30106.123..30106.123 rows=236 loops=1) Buckets: 1024 Batches: 1 
Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1 
width=136) (actual time=24786.510..30105.983 rows=236 loops=1) ... -> 
Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual 
time=252946.367..252946.367 rows=113478127 loops=1) Buckets: 32768 
(originally 32768) Batches: 65536 (originally 4) Memory Usage: 1204250kB 
-> Gather Merge (cost=2807073.90..2823846.37 rows=34619 width=930) 
(actual time=83891.069..153380.040 rows=113478127 loops=1) Workers 
Planned: 2 Workers Launched: 2 -> Merge Left Join 
(cost=2806073.87..2818850.46 rows=14425 width=930) (actual 
time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond: 
(((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid):: text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text)) -> Sort 
(cost=1295969.26..1296005.32 rows=14425 width=882) (actual 
time=44814.114..45535.398 rows=231207 loops=3) Sort Key: 
documentinformationsubject.documentinternalid, 
documentinformationsubject.docum... Workers Planned: 2 Workers Launched: 
2 -> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) 
(actual time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond: 
(((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid):: text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text)) -> Sort 
(cost=1295969.26..1296005.32 rows=14425 width=882) (actual 
time=44814.114..45535.398rows=231207 loops=3) ... Planning Time: 2.953 
ms Execution Time: 7004340.091 ms (70 rows)


There isn't really any big news here. But what matters is that it works.

thanks & regards,
-Gunther Schadow






Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Gunther
uckets: 65536 Batches: 1 Memory Usage: 4680kB -> 
Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 
width=89) (actual time=0.376..46.875 rows=11393 loops=3) -> Hash 
(cost=1310249.63..1310249.63 rows=13 width=233) (actual 
time=39172.682..39172.682 rows=236 loops=1) Buckets: 1024 Batches: 1 
Memory Usage: 70kB -> Hash Right Join (cost=829388.20..1310249.63 
rows=13 width=233) (actual time=35084.850..39172.545 rows=236 loops=1) 
Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq 
Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual 
time=0.908..7583.123 rows=14676871 loops=1) -> Hash 
(cost=829388.19..829388.19 rows=1 width=136) (actual 
time=29347.614..29347.614 rows=236 loops=1) Buckets: 1024 Batches: 1 
Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1 
width=136) (actual time=23902.428..29347.481 rows=236 loops=1) Workers 
Planned: 2 Workers Launched: 2 -> Parallel Hash Join 
(cost=380928.46..828388.09 rows=1 width=136) (actual 
time=23915.790..29336.452 rows=79 loops=3) Hash Cond: 
((q.actinternalid)::text = (r.sourceinternalid)::text) -> Parallel Seq 
Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=99) 
(actual time=10055.238..15484.478 rows=87921 loops=3) Filter: 
(((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 
'EVN'::text) AND ((code_codesystem)::text = 
'2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 1540625 -> 
Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual 
time=13825.726..13825.726 rows=79 loops=3) Buckets: 1024 Batches: 1 
Memory Usage: 112kB -> Parallel Seq Scan on actrelationship r 
(cost=0.00..380928.44 rows=1 width=74) (actual time=5289.948..13825.576 
rows=79 loops=3) Filter: ((typecode)::text = 'SUBJ'::text) Rows Removed 
by Filter: 3433326 -> Hash (cost=2908913.87..2908913.87 rows=34619 
width=930) (actual time=244145.322..244145.322 rows=113478127 loops=1) 
Buckets: 8192 (originally 8192) Batches: 65536 (originally 16) Memory 
Usage: 1204250kB -> Gather Merge (cost=2892141.40..2908913.87 rows=34619 
width=930) (actual time=75215.333..145622.427 rows=113478127 loops=1) 
Workers Planned: 2 Workers Launched: 2 -> Merge Left Join 
(cost=2891141.37..2903917.96 rows=14425 width=930) (actual 
time=75132.988..99411.448 rows=37826042 loops=3) Merge Cond: 
(((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid)::text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text)) -> Sort 
(cost=1301590.26..1301626.32 rows=14425 width=882) (actual 
time=39801.337..40975.780 rows=231207 loops=3) Sort Key: 
documentinformationsubject.documentinternalid, 
documentinformationsubject.documentid, 
documentinformationsubject.actinternalidct_1.documentid, 
documentinformationsubject_1.actinternalid Sort Method: external merge 
Disk: 169768kB Worker 0: Sort Method: external merge Disk: 169768kB 
Worker 1: Sort Method: external merge Disk: 169768kB -> Seq Scan on 
documentinformationsubject documentinformationsubject_1 
(cost=0.00..1329868.64 rows=1010585 width=159) (actual 
time=23401.537..31758.042 rows=1031106 loops=3) Filter: 
((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter: 
7415579 Planning Time: 40.559 ms Execution Time: 6896581.566 ms (70 rows)


For the first time this query has succeeded now. Memory was bounded. The 
run time of nearly two hours is crazy, but things sometimes take that long. 
The important thing was not to get an out of memory error.


Thank you. Anything else you want to try, I can do it.

regards,
-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther
(
hjstate=) at nodeHashjoin.c:1127
1127{
(

I don't understand why all of them are at the same location. Doesn't 
make any sense to me.


But I'll leave it at that right now.

-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther

After applying Tomas' patch 0002 as corrected, over 0001, same thing:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 
128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 
256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 
512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 
1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 
2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 
4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 
8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 
16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 
to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 
to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 
to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 
to 262144
ERROR:  out of memory
DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".

And from the log:

2019-04-21 23:29:33.497 UTC [8890] LOG:  database system was shut down at 
2019-04-21 23:29:33 UTC
2019-04-21 23:29:33.507 UTC [] LOG:  database system is ready to accept 
connections
Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192
2019-04-21 23:31:54.447 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: 
increasing number of batches from 16 to 32
2019-04-21 23:31:54.447 UTC [8896] LOG:  ExecHashIncreaseNumBatches === 
context stats start ===
2019-04-21 23:31:54.447 UTC [8896] STATEMENT:  explain analyze select * from 
reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 10016 free (6 chunks); 110528 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 
chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
  ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 
used
TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 
3973120 used
TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 
3973120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 
used
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
  HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 
4271416 used
TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 
used
  Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 
812656 used
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_toast_2619_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther
s; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 
812656 used
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_toast_2619_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 
used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 
2416968160 used
2019-04-21 19:50:21.338 UTC [6974] ERROR:  out of memory
2019-04-21 19:50:21.338 UTC [6974] DETAIL:  Failed on request of size 32800 in memory 
context "HashBatchContext".
2019-04-21 19:50:21.338 UTC [6974] STATEMENT:  explain analyze select * from 
reports.v_BusinessOperation;

Next I'll apply Tomas' corrected 0002 patch on top of this and see ...

-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther

On 4/20/2019 21:14, Tomas Vondra wrote:

Maybe. But before wasting any more time on the memory leak investigation,
I suggest you first try the patch moving the BufFile allocations to a
separate context. That'll either confirm or disprove the theory.


OK, fair enough. So, first patch 0001-* applied, recompiled and

2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was 
terminated by signal 11: Segmentation fault
2019-04-21 04:08:04.364 UTC [11304] DETAIL:  Failed process was running: 
explain analyze select * from reports.v_BusinessOperation;
2019-04-21 04:08:04.364 UTC [11304] LOG:  terminating any other active server 
processes
2019-04-21 04:08:04.368 UTC [11319] FATAL:  the database system is in recovery 
mode
2019-04-21 04:08:04.368 UTC [11315] WARNING:  terminating connection because of 
crash of another server process

SIGSEGV ... and with the core dump that I have I can tell you where:

Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x009f300c in palloc (size=8272) at mcxt.c:936
936 context->isReset = false;
(gdb) bt
#0  0x009f300c in palloc (size=8272) at mcxt.c:936
#1  0x0082b068 in makeBufFileCommon (nfiles=1) at buffile.c:116
#2  0x0082b0f8 in makeBufFile (firstfile=73) at buffile.c:138
#3  0x0082b283 in BufFileCreateTemp (interXact=false) at buffile.c:201
#4  0x006bdc15 in ExecHashJoinSaveTuple (tuple=0x1c5a468, 
hashvalue=3834163156, fileptr=0x18a3730) at nodeHashjoin.c:1227
#5  0x006b9568 in ExecHashTableInsert (hashtable=0x188fb88, 
slot=0x1877a18, hashvalue=3834163156) at nodeHash.c:1701
#6  0x006b6c39 in MultiExecPrivateHash (node=0x1862168) at 
nodeHash.c:186
#7  0x006b6aec in MultiExecHash (node=0x1862168) at nodeHash.c:114
#8  0x006a19cc in MultiExecProcNode (node=0x1862168) at 
execProcnode.c:501
#9  0x006bc5d2 in ExecHashJoinImpl (pstate=0x17b90e0, parallel=false) 
at nodeHashjoin.c:290
...
(gdb) info frame
Stack level 0, frame at 0x7fffd5d4dc80:
 rip = 0x9f300c in palloc (mcxt.c:936); saved rip = 0x82b068
 called by frame at 0x7fffd5d4dcb0
 source language c.
 Arglist at 0x7fffd5d4dc70, args: size=8272
 Locals at 0x7fffd5d4dc70, Previous frame's sp is 0x7fffd5d4dc80
 Saved registers:
  rbx at 0x7fffd5d4dc60, rbp at 0x7fffd5d4dc70, r12 at 0x7fffd5d4dc68, rip at 
0x7fffd5d4dc78

and I have confirmed that this is while working the main T_HashJoin with 
jointype JOIN_RIGHT.


So now I am assuming that perhaps you want both of these patches 
applied. So I applied both, retried, and boom, same thing, same place.


turns out the MemoryContext is NULL:

(gdb) p context
$1 = (MemoryContext) 0x0

all patches applied cleanly (with the -p1 option). I see no .rej file, 
but also no .orig file, not sure why that version of patch didn't create 
them. But I paid attention and know that there was no error.


-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther
1 chunks); 1368 used: 
pg_toast_2619_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 
used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6176 free (1 chunks); 2016 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2538218304 bytes in 77339 blocks; 3075256 free (3372 chunks); 
2535143048 used
2019-04-21 05:27:07.731 UTC [968] ERROR:  out of memory
2019-04-21 05:27:07.731 UTC [968] DETAIL:  Failed on request of size 32800 in memory 
context "HashBatchContext".
2019-04-21 05:27:07.731 UTC [968] STATEMENT:  explain analyze select * from 
reports.v_BusinessOperation;

so we're ending up with the same problem.

No cigar. But lots of admiration and gratitude for all your attempts to 
pinpoint this.


Also, again, if anyone (of the trusted people) wants access to hack 
directly, I can provide.


regards,
-Gunther




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther
d. But to me, 
Valgrind would be a huge toolbox for looking after just one thing. I wonder 
if we could not make a much simpler memory leak debugger tool. One that 
is fast, yet doesn't produce so much output that it overwhelms the log 
destination file system (and wastes too much time). There are already 
debug macros there which, if enabled, create an absolutely crazy 
amount of undecipherable log file content, because ALL backend processes 
would spit out this blabber. So I already stopped that by adding a 
variable that must be set to 1 (using the debugger on exactly one 
process for exactly the time desired):


int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

But now I am thinking that this should be the hook for a limited cache in 
which we can cancel out AllocSetAlloc calls against the AllocSetFree calls 
that follow them relatively rapidly, which apparently is the majority of 
the log chatter created.

The memory debugger would allocate a single fixed memory chunk, like 8 or 
16 kB, as a cache per each memory context that is actually traced. In it 
we would record each memory allocation in the shortest possible way, with 
everything compressed. Instead of a pointer to the memory we would store 
some chunk index, and a very brief backtrace would be stored in compressed 
form: the instruction pointer ($rip) of the caller and then variable-length 
differences to the $rip of the next caller up. These could even be stored 
in an index of the 100 most common caller chains to compress this data 
some more, while minimizing the cost of searching. Each allocated chunk 
would be kept in this table, and when it fills up, the oldest allocated 
chunk is evicted and written to the log file. When a chunk is freed before 
being evicted from the cache, it gets silently removed. When a chunk is 
freed that is no longer in the cache, the free event is recorded in the 
log. That way only those chunks that have endured beyond the capacity of 
the cache get written to the log files, and hopefully those are the chunks 
most likely involved in the memory leak. Once the log has been created, it 
can be loaded into a PostgreSQL table itself and analyzed to find the 
chunks that never get freed, and from the compressed backtrace one can 
figure out where they were allocated.


BTW, my explain analyze is still running. That Sort - Unique step is 
taking forever on this data.


OK, now I will try the various patches that people sent.

-Gunther





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther
ND ((i % 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND 
((i % 19) < 19) AND ((i % 20) < 20) AND ((i % 21) < 21))
->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 
width=4) (actual rows=99 loops=1)
->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=99 loops=1)
  Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory 
Usage: 101kB
  ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual 
rows=99 loops=1)


Yes, I thought that with CTEs and functions one might be able to generate 
a test case, but I still don't see how you can trick the planner into this 
peculiar jointype JOIN_RIGHT, and whether that is a prerequisite for 
triggering the problem.


Finally, I have tried to make a pstate pretty printer in explain.c:

void
DumpPlanState(PlanState *pstate) {
    ExplainState *es = NewExplainState();
    ExplainNode(pstate, NIL, NULL, NULL, es);
    puts(es->str->data);
    pfree(es->str->data);
}

but that didn't work, because unfortunately that ExplainNode function is 
destructive. It would be so nice to refactor this explain code so that 
there is a completely conservative function that simply dumps the 
present pstate with all the information about its estimated and actual 
situation: how many iterations it has already accomplished, how many it 
estimates still to have to do, whether its original estimate panned out 
or not, etc. This would be tremendously useful for runtime debugging 
of queries. I think the utility of this can hardly be overstated. I mean, 
even for end-user applications of some data warehouse, you could 
probe a long running query every 5 seconds as to where its execution is. 
I can hardly think of a more low-hanging-fruit feature. I am 
sure that if PostgreSQL had originally been written in Java, this feature 
would naturally exist already.


regards and Happy Easter,
-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-19 Thread Gunther
 pg_class 
reltuples and relpages:


foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in 
('tmp_q', 'tmp_r');
 relname | relpages |  reltuples
-+--+-
 tmp_r   |  5505039 | 1.13467e+08
 tmp_q   |7 | 236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where 
relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 
'tmp_r';
UPDATE 1

but that didn't help. Somehow the planner outsmarts every such trick, so 
I can't get it to follow my right outer join plan where the big table is 
hashed.  I am sure y'all know some way to force it.


regards,
-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-17 Thread Gunther
e)::text = 'DEF'::text) AND ((code_codesystem)::text = 
'2.16.840.1.113883.3.26.1.1'::text))
 Rows Removed by Filter: 2584355
 ->  Materialize  (cost=1548104.12..1553157.04 rows=1010585 
width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)
   ->  Sort  (cost=1548104.12..1550630.58 rows=1010585 
width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
 Sort Key: 
documentinformationsubject_1.documentinternalid, 
documentinformationsubject_1.documentid, 
documentinformationsubject_1.actinternalid
 Sort Method: external merge  Disk: 122192kB
 Worker 0:  Sort Method: external merge  Disk: 
122192kB
 Worker 1:  Sort Method: external merge  Disk: 
122192kB
 ->  Seq Scan on documentinformationsubject 
documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) 
(actual time=20366.166..47751.267 rows=1031106 loops=3)
   Filter: ((participationtypecode)::text = 
'PRD'::text)
   Rows Removed by Filter: 7415579
 Planning Time: 2.523 ms
 Execution Time: 464825.391 ms
(66 rows)

By the way, let me ask: do you have pretty-print functions I can call 
with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if 
there were, then we could determine where exactly in the current plan we 
are. And can I call the plan printer for the entire plan we are 
currently executing? Might it even give us preliminary counts of where 
in the process it is? (I ask the latter not only because it would be 
really useful for our present debugging, but also because it would be an 
awesome tool for monitoring of long running queries! Something I am sure 
tons of people would just love to have!)


I also read the other responses. I agree that having a swap space 
available just in case is better than these annoying out of memory 
errors. And yes, I can add that memory profiler thing, if you think it 
would actually work. I've done it with java heap dumps, even upgrading 
the VM to a 32 GB VM just to crunch the heap dump. But can you tell me 
just a little more as to how I need to configure this thing to get the 
data you want without blowing up the memory and disk during this huge query?


regards,
-Gunther



Re: Best Filesystem for PostgreSQL

2019-04-17 Thread Gunther Schadow

On 4/17/2019 18:03, Imre Samu wrote:
test:    PostgreSQL v10.3 + Linux 5.0 File-System Benchmarks: Btrfs 
vs. EXT4 vs. F2FS vs. XFS

https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems&num=3

So looks like XFS won. I like XFS for its ease of use especially when 
growing.


Any ideas on how ZFS might do? ZFS is of course so much more flexible.

-Gunther






Re: Pg10 : Client Configuration for Parallelism ?

2019-04-17 Thread Gunther Schadow

On 4/17/2019 4:33, Thomas Kellerer wrote:

A CTE would prevent parallelism.


You mean like always? His

SELECT count(1) FROM BigTable

would be parallel if run alone but as

WITH Data AS (SELECT count(1) FROM BigTable) SELECT * FROM Data

nothing would be parallel any more? How about:

SELECT * FROM (SELECT count(1) FROM BigTable) x

Parallel or not?
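A quick way to check on one's own instance, sketched here with the 
hypothetical BigTable from above, is to just look for a Gather node:

EXPLAIN (ANALYZE)
WITH Data AS (SELECT count(1) FROM BigTable)
SELECT * FROM Data;
-- a parallel plan contains a "Gather" node and reports "Workers Launched";
-- if the CTE suppresses parallelism, no Gather node shows up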

-Gunther



Re: Pg10 : Client Configuration for Parallelism ?

2019-04-17 Thread Gunther Schadow

By the way

On 4/17/2019 7:26, laurent.decha...@orange.com wrote:

I can see whether there is parallelism with pg_top or plain top on the server.


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
38584 postgres  20   0 8863828 8.153g 8.151g R 100.0  3.2   1:23.01 postgres
   10 root      20   0       0      0      0 S   0.3  0.0  88:07.26 rcu_sched


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
46687 postgres  20   0 8864620 0.978g 0.977g S  38.5  0.4   0:01.16 postgres
46689 postgres  20   0 8864348 996.4m 995.1m R  38.5  0.4   0:01.16 postgres
46690 postgres  20   0 8864348 987.2m 985.8m S  38.5  0.4   0:01.16 postgres
46691 postgres  20   0 8864348 998436 997084 R  38.5  0.4   0:01.16 postgres
  ...
46682 postgres  20   0  157996   2596   1548 R   0.7  0.0   0:00.05 top


If you just use top with the -c option, you will see each postgres 
process identify itself as to its role, e.g.


postgres: parallel worker for PID 46687

or

postgres: SELECT ...

or

postgres: wal writer

Extremely useful, this.
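The same role information is also visible from inside the database; a rough 
SQL equivalent (pg_stat_activity has a backend_type column since v10, and 
parallel workers show up there too):

SELECT pid, backend_type, state, query
  FROM pg_stat_activity
 ORDER BY backend_type;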

-Gunther





Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Gunther
104.12..1550630.58 rows=1010585 
width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
 Sort Key: 
documentinformationsubject_1.documentinternalid, 
documentinformationsubject_1.documentid, 
documentinformationsubject_1.actinternalid
 Sort Method: external merge  Disk: 122192kB
 Worker 0:  Sort Method: external merge  Disk: 
122192kB
 Worker 1:  Sort Method: external merge  Disk: 
122192kB
 ->  Seq Scan on documentinformationsubject 
documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) 
(actual time=20366.166..47751.267 rows=1031106 loops=3)
   Filter: ((participationtypecode)::text = 
'PRD'::text)
   Rows Removed by Filter: 7415579
 Planning Time: 2.523 ms
 Execution Time: 464825.391 ms
(66 rows)

By the way, let me ask, do you have pretty-print functions I can call 
with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if 
there was, then we could determine where exactly in the current plan we 
are? And can I call the plan printer for the entire plan we are 
currently executing? Might it even give us preliminary counts of where 
in the process it is? (I ask the latter not only because it would be 
really useful for our present debugging, but also because it would be an 
awesome tool for monitoring of long running queries! Something I am sure 
tons of people would just love to have!)


BTW, I also read the other responses. I agree that having a swap space 
available just in case is better than these annoying out of memory 
errors. And yes, I can add that memory profiler thing, if you think it 
would actually work. I've done it with java heap dumps, even upgrading 
the VM to a 32 GB VM just to crunch the heap dump. But can you tell me 
just a little more as to how I need to configure this thing to get the 
data you want without blowing up the memory and disk during this huge query?


regards,
-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Gunther
(8 chunks); 1471536 
used:
ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 
1369088328 used

exact same as before:

ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 
1369088328 used

but now we get an increase to:

ExecutorState: 1369345496 total in 163398 blocks; 248840 free (36 chunks); 
1369096656 used

after I did this:

(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
718 {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
#1  0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at 
buffile.c:119
#3  0x00702e4c in makeBufFile (firstfile=163423) at buffile.c:138
(More stack frames follow...)
(gdb)  call MemoryContextStats(TopPortalContext)

So now we have it confirmed, don't we? No! No, we have not! We stop at the 
entrance of the allocate method. So when I interrupted, there was no 
call yet. Then at the next stop the increase was from the previous one.

Continuing ... this now is from a stop at the makeBufFileCommon

ExecutorState: 1369345496 total in 163398 blocks; 248816 free (36 chunks); 
1369096680 used

And again, now stopped before

ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 
1369096704 used

ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 
1369096704 used

I don't see a growth between individual invocations. Anyway, these are 
the two ways to get there:


(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=4) at aset.c:718
#1  0x0084e7dd in palloc (size=size@entry=4) at mcxt.c:938
#2  0x00702e59 in makeBufFile (firstfile=163423) at buffile.c:140
#3  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 3, AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
718 {
(gdb) bt 3
#0  AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
#1  0x0084e7dd in palloc (size=394) at mcxt.c:938
#2  0x006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, 
hashvalue=hashvalue@entry=0x7ffcbf92fe5c,
tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
(More stack frames follow...)

But now it increased

ExecutorState: 1369353824 total in 163399 blocks; 248792 free (36 chunks); 
1369105032 used

It increases every third time I stop at the breakpoint.

-Gunther


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther
b in ExecHashJoinSaveTuple (tuple=0x2ba1018, 
hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x0060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, 
slot=, hashvalue=)
at nodeHash.c:1663
#7  0x00610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at 
nodeHashjoin.c:1051
(More stack frames follow...)

and on

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fd90:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8ad
 called by frame at 0x7ffcbf92fdb0
 source language c.
 Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8
 Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90
 Saved registers:
  rip at 0x7ffcbf92fd88
(gdb) b AllocSetAlloc if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int 
*)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
Note: breakpoint 6 also set at pc 0x849030.
Breakpoint 7 at 0x849030: file aset.c, line 718.
(gdb) delete 6

Now if I continue I don't seem to be stopping any more.

Does this help now?

-Gunther



Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther

On 4/15/2019 21:49, Gunther wrote:


I'm going to try without that DISTINCT step, or perhaps by dismantling 
this query until it works without this excessive memory growth.


It also failed. Out of memory. The resident memory size of the backend 
was 1.5 GB before it crashed.


TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used
HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
MessageContext: 524288 total in 7 blocks; 186848 free (7 chunks); 337440 used
Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1105920 total in 138 blocks; 10368 free (8 chunks); 1095552 used:
ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used
HashTableContext: 16384 total in 2 blocks; 4032 free (5 chunks); 12352 used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
CacheMemoryContext: 1101328 total in 14 blocks; 383480 free (0 chunks); 717848 used
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
index info: 2048 total in 2 blocks; 696 free (1 c

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther
-join specific code that is doing it.
What about it being in the DISTINCT? I noticed, while peeking up the call 
chain, that it was already in the UNIQUE sort step also. I guess it's 
streaming the results from the hash join right into the unique sort step.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
I have just done that and it creates an insane amount of output from all 
the processes, I'm afraid there will be no way to keep that stuff 
separated. If there was a way of turning that on and off for one 
process only, then we could probably get more info...


Everything is also extremely slow that way. Like in a half hour the 
memory didn't even reach 100 MB.


and then look for allocations sizes which are getting allocated but 
not freed, and then you can go back to gdb to look for allocations of 
those specific sizes.

I guess I should look for both, address and size to match it better.
This generates a massive amount of output, and it bypasses the logging 
configuration and goes directly to stderr--so it might not end up 
where you expect.
Yes, massive, like I said. Impossible to use. File system fills up 
rapidly. I made it so that it can be turned on and off, with the debugger.


int _alloc_info = 0;    /* flag I can flip from gdb to switch tracing on/off */
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

so with this I do

(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont

then I wait, ... until it crashes again ... no, it's too much. It fills 
up my filesystem in no time with the logs.  It produced 3 GB in just a 
minute of run time.


And also, I doubt we can find anything specifically by allocation size. 
It's just going to be 512 or whatever.


Isn't there some other way?

I'm going to try without that DISTINCT step, or perhaps by dismantling 
this query until it works without this excessive memory growth.


-Gunther





Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther
ProcNode (node=0x11687d0) at 
../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x005fde68 in ExecProcNodeInstr (node=0x11685e0) at 
execProcnode.c:461
#13 0x005f75ba in ExecProcNode (node=0x11685e0) at 
../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=, dest=0xcc60e0 , 
direction=, numberTuples=0,
sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x11685e0, estate=0x1168340)
at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=, count=0, 
execute_once=) at execMain.c:364
#16 0x0059c6f8 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, 
es=es@entry=0x1141d48,
queryString=, params=0x0, queryEnv=queryEnv@entry=0x0, 
planduration=0x7fff2e4ca990) at explain.c:535
#17 0x0059c9ef in ExplainOneQuery (query=, 
cursorOptions=, into=0x0, es=0x1141d48,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n
documentInternalId, 
is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
 params=0x0, queryEnv=0x0)
at explain.c:371
#18 0x0059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, 
stmt=stmt@entry=0x11ef240,
queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n   
 documentInternalId, 
is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, 
dest=dest@entry=0xf74578) at explain.c:254
#19 0x0072ca5d in standard_ProcessUtility (pstmt=0x11ef390,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n
documentInternalId, 
is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, 
completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x0072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, 
isTopLevel=,
setHoldSnapshot=, dest=, completionTag=0x7fff2e4cab20 
"") at pquery.c:1178
#21 0x0072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, 
isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x0072b855 in PortalRun (portal=portal@entry=0xfb06b0, 
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, 
dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,
completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x007276e8 in exec_simple_query (
query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n
documentInternalId, 
is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...)
 at postgres.c:1145
#24 0x00729534 in PostgresMain (argc=, 
argv=argv@entry=0xf76ce8, dbname=,
username=) at postgres.c:4182
#25 0x006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x006bf122 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else 
you might want to see.


We are now close to the edge of the cliff.

-Gunther




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther

On 4/14/2019 23:24, Tom Lane wrote:

ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

2.2 GB is massive yes.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:


HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
41360 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 
chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.

Yes.

   I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.


Strange, though, that the vmstat tracking never showed the cache-allocated 
memory going much below 6 GB. Even if this 2.2 GB memory leak 
is there, and even if I had 2 GB of shared_buffers, I would still have 
enough for the OS to give me.


Is there any doubt that this might be a problem with Linux? Because if 
you want, I can whip out a FreeBSD machine, compile pgsql, and attach 
the same disk, and try it there. I am longing to have a reason to move 
back to FreeBSD anyway. But I have tons of stuff to do, so if you do not 
have reason to suspect Linux of doing wrong here, I prefer to skip that 
futile attempt.



The memory map shows that there were three sorts and four hashes going
on, so I'm not sure I believe that this corresponds to the query plan
you showed us before.
Like I said, the first explain was not using the same constraints (no 
NL). Now what I sent last should all be consistent. Memory dump and 
explain plan and gdb backtrace.

Any chance of extracting a self-contained test case that reproduces this?


With 18 million rows involved in the base tables, hardly.

But I am ready to try some other things with the debugger that you want 
me to try. If we have a memory leak issue, we might just as well try to  
plug it!


I could even to give someone of you access to the system that runs this.

thanks,
-Gunther




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther
ter varying(555) |
 agencyname| text   |
 productitemcode   | character varying(512) |
 productinternalid | character varying(255) |


Could you rerun the query with \set VERBOSITY verbose to show the file/line
that's failing ?


Here goes:

integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM 
reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

You notice that I set ENABLE_NESTLOOP to off; that is because the 
planner goes off thinking the NL plan is marginally more efficient, but 
in fact it will take 5 hours to get to the same out of memory crash, 
while the no-NL plan gets there in half an hour. That verbose setting 
didn't help much, I guess.



If you wanted to show a stack trace, you could attach gdb to PID from SELECT
pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it
fails.


gdb -p 27930
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
...
Attaching to process 27930
Reading symbols from /usr/local/pgsql/bin/postgres...done.
...
(gdb) b errdetail
Breakpoint 1 at 0x82b210: file elog.c, line 872.
(gdb) cont
Continuing.
Breakpoint 1, errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory 
context \"%s\".") at elog.c:872
872 {
(gdb) bt
#0  errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context 
\"%s\".") at elog.c:872
#1  0x0084e320 in MemoryContextAlloc (context=0x600, 
size=size@entry=32800) at mcxt.c:794
#2  0x0060ce7a in dense_alloc (size=384, size@entry=381, hashtable=, hashtable=)
at nodeHash.c:2696
#3  0x0060d788 in ExecHashTableInsert (hashtable=hashtable@entry=0x10ead08, 
slot=, hashvalue=194758122)
at nodeHash.c:1614
#4  0x00610c6f in ExecHashJoinNewBatch (hjstate=0x10806b0) at 
nodeHashjoin.c:1051
#5  ExecHashJoinImpl (parallel=false, pstate=0x10806b0) at nodeHashjoin.c:539
#6  ExecHashJoin (pstate=0x10806b0) at nodeHashjoin.c:565
#7  0x0061ce4e in ExecProcNode (node=0x10806b0) at 
../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x1080490) at nodeSort.c:107
#9  0x0061d2c4 in ExecProcNode (node=0x1080490) at 
../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x107ff60) at nodeUnique.c:73
#11 0x00619732 in ExecProcNode (node=0x107ff60) at 
../../../src/include/executor/executor.h:247
#12 ExecModifyTable (pstate=0x107fd20) at nodeModifyTable.c:2025
#13 0x005f75ba in ExecProcNode (node=0x107fd20) at 
../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=, dest=0x7f0442721998, 
direction=, numberTuples=0,
sendTuples=, operation=CMD_INSERT, use_parallel_mode=, planstate=0x107fd20, estate=0x107f830)
at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1086880, direction=, count=0, 
execute_once=) at execMain.c:364
#16 0x0072a972 in ProcessQuery (plan=,
sourceText=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM 
reports.v_BusinessOperation;", params=0x0,
queryEnv=0x0, dest=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at 
pquery.c:161
#17 0x0072abb0 in PortalRunMulti (portal=portal@entry=0xfb06b0, 
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998,
completionTag=completionTag@entry=0x7fff2e4cad30 "") at pquery.c:1286
#18 0x0072b661 in PortalRun (portal=portal@entry=0xfb06b0, 
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, 
dest=dest@entry=0x7f0442721998,
altdest=altdest@entry=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at 
pquery.c:799
#19 0x007276e8 in exec_simple_query (
query_string=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM 
reports.v_BusinessOperation;") at postgres.c:1145
#20 0x00729534 in PostgresMain (argc=, 
argv=argv@entry=0xf76ce8, dbname=,
username=) at postgres.c:4182
#21 0x006be215 in BackendRun (port=0xf6dfe0) at postmaster.c:4361
#22 BackendStartup (port=0xf6dfe0) at postmaster.c:4033
#23 ServerLoop () at postmaster.c:1706
#24 0x006bf122 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0xf45320) at postmaster.c:1379
#25 0x004822dc in main (argc=3, argv=0xf45320) at main.c:228

That's it.

Thank you all very much for your interest in this case.

-Gunther




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther
ndex
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_database_oid_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 
used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 
1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory 
context "ExecutorState".

I am delighted that Tom Lane is hinting that my query plan doesn't look 
so crazy, and it isn't. And delighted there may be a known bug involved.


I wonder if this is a data issue. Maybe a few rows have excessively 
long text fields? But even checking for that is rather difficult because 
there are many tables and columns involved. It would really be nice if the 
error said exactly what plan step that ExecutorState referred to, 
so one could narrow it down.


regards,
-Gunther

On 4/14/2019 17:19, Tom Lane wrote:

Gunther  writes:

For weeks now, I am banging my head at an "out of memory" situation.
There is only one query I am running on an 8 GB system, whatever I try,
I get knocked out on this out of memory. It is extremely impenetrable to
understand and fix this error. I guess I could add a swap file, and then
I would have to take the penalty of swapping. But how can I actually
address an out of memory condition if the system doesn't tell me where
it is happening?
You might want to see the query, but it is a huge plan, and I can't
really break this down. It shouldn't matter though. But just so you can
get a glimpse here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.


Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so you
haven't provided nearly enough info for anyone else to reproduce it.
You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane


Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther
0  26873160  430997600 0 0 
 181  188   0   2  98   0   0 2019-04-14 16:38:48
 0  00  27069200  430011600 0   105 
 137  263   0   0 100   0   0 2019-04-14 16:38:50
 0  00  27066720  430023200 0 0 
 142  204   0   0 100   0   0 2019-04-14 16:38:52
 0  00  28151160  419192800 0 0 
 116  242   0   0 100   0   0 2019-04-14 16:38:54
 0  00  28153640  419200800 0 0 
 116  239   0   0 100   0   0 2019-04-14 16:38:56
 0  00  28151160  419216400 0 0 
 159  236   0   0 100   0   0 2019-04-14 16:38:58

ending after the out of memory crash, that occurred exactly at the 
marked point 16:38:26.355 UTC.


We can't really see anything too worrisome. There is always lots of 
memory used by cache, which could have been mobilized. The only possible 
explanation I can think of is that in that moment of the crash the 
memory utilization suddenly skyrocketed in less than a second, so that 
the 2 second vmstat interval wouldn't show it??? Nah.


I have already much reduced work_mem, which has helped in some other 
cases before. Now I am going to reduce shared_buffers too, but that 
seems counter-intuitive because we are sitting on all that cache memory 
unused!


Might this be a bug? It feels like a bug. It feels like those out of 
memory issues should be handled more gracefully (garbage collection 
attempt?) and that somehow there should be more information so the 
person can do anything about it.


Any ideas?

-Gunther



Block / Page Size Optimization

2019-04-08 Thread Gunther
Hi all, I am sure this should be a FAQ, but I can't see a definitive 
answer, only chatter on various lists and forums.


Default page size of PostgreSQL is 8192 bytes.

Default IO block size in Linux is 4096 bytes.

I can set an XFS file system with 8192 bytes block size, but then it 
does not mount on Linux, because the VM page size is the limit, 4096 again.


There seems to be no way to change that in (most, common) Linux 
variants. In FreeBSD there appears to be a way to change that.


But then, there is a hardware limit also, as far as the VM memory page 
allocation is concerned. Apparently on most i386 / amd64 architectures the 
VM page sizes are 4k, 2M, and 1G. The latter, I believe, are called 
"hugepages" and I only ever see that discussed in the PostgreSQL manuals 
for Linux, not for FreeBSD.


People have asked: does it matter? And then there is all that chatter 
about "why don't you run a benchmark and report back to us" -- "OK, will 
do" -- and then it's crickets.


But why is this such a secret?

On Amazon AWS there is the following very simple situation: IO is capped 
on IO operations per second (IOPS). Let's say, on a smallish volume, I 
get 300 IOPS (once my burst balance is used up.)


Now my simple theoretical reasoning is this: one IO call transfers 1 
block of 4k size. That means, with a cap of 300 IOPS, I get to send 1.17 
MB per second. That would be the absolute limit. BUT, if I could double 
the transfer size to 8k, I should be able to move 2.34 MB per second. 
Shouldn't I?


That might well depend on whether AWS' virtual device paths would 
support these 8k block sizes.


But something tells me that my reasoning here is totally off. Because I 
get better IO throughput than that. Even on 3000 IOPS I would only get 
11 MB per second, and I am sure I am getting rather 50-100 MB/s, no? So 
my simplistic logic is false.
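Just to have the arithmetic in one place (a back-of-the-envelope 
calculation, nothing more):

SELECT 300  * 4096 / 1048576.0 AS mb_per_s_300_iops_4k,    -- ~1.17
       300  * 8192 / 1048576.0 AS mb_per_s_300_iops_8k,    -- ~2.34
       3000 * 4096 / 1048576.0 AS mb_per_s_3000_iops_4k;   -- ~11.7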


What really is the theoretical issue with the file system block size? 
Where does -- in theory -- the benefit come from of using an XFS block 
size of 8 kB, or even increasing the PostgreSQL page size to 16 kB and 
then the XFS block size also to 16 kB? I remember having seen standard 
UFS block sizes of 16 kB. But then why is Linux so tough on refusing to 
mount an 8 kB XFS just because its VM page size is only 4 kB?


Doesn't this all have one straight explanation?

If you have a link that I can just read, I appreciate you sharing that. 
I think that should be on some Wiki or FAQ somewhere. If I get a quick 
and dirty explanation with some pointers, I can try to write it out into 
a more complete answer that might be added into some documentation or 
FAQ somewhere.


thanks & regards,
-Gunther





Re: EXPLAIN PLAN for DELETE CASCADE or DELETE not using pkey index despite EXPLAINing that it would?

2019-03-22 Thread Gunther



On 3/21/2019 17:16, Tom Lane wrote:

Gunther  writes:

I have 250 rows to delete, but they are a target to a bunch of child
tables with foreign key on delete cascade.
EXPLAIN DELETE FROM Foo WHERE id = (SELECT fooId FROM Garbage);
shows me that it uses the nested loop by Foo_pkey index to find the 250
items from Garbage to be deleted.
But once that starts, I see HUGE amount of read activity from the
tablespace Foo_main that contains the Foo table, and only the Foo table,
not the Foo_pkey, not any other index, not any other child table, not
even the toast table for Foo is contained in that tablespace (I have the
toast table diverted with symlinks to another volume).

I'm betting you neglected to index the referencing column for one
or more of the foreign keys involved.  You can get away with that
as long as you're not concerned with the speed of DELETE ...

regards, tom lane


I had the same suspicion. But firstly my schema is generated 
automatically and all foreign keys have the indexes.


But what is even more stunning is that the table where this massive read 
activity happens is the Foo heap table. I verified that by using strace 
where all the massive amounts of reads are on those files for the main 
Foo table. And this doesn't make sense, since any foreign key targets 
its primary key. The foreign keys of the child tables are also indexed 
and there is no io on the volumes that hold these child tables, nor is 
the io on the volume that holds the Foo_pkey.






EXPLAIN PLAN for DELETE CASCADE or DELETE not using pkey index despite EXPLAINing that it would?

2019-03-21 Thread Gunther

Hi,

I have 250 rows to delete, but they are a target to a bunch of child 
tables with foreign key on delete cascade.


EXPLAIN DELETE FROM Foo WHERE id = (SELECT fooId FROM Garbage);

shows me that it uses the nested loop by Foo_pkey index to find the 250 
items from Garbage to be deleted.


But once that starts, I see HUGE amount of read activity from the 
tablespace Foo_main that contains the Foo table, and only the Foo table, 
not the Foo_pkey, not any other index, not any other child table, not 
even the toast table for Foo is contained in that tablespace (I have the 
toast table diverted with symlinks to another volume).


I see the read activity with iostat, reading heavily at 130 MB/s for a 
long time until my burst balance is used up, then continuing to churn 
with 32 MB/s.


I also see the read activity with iotop, that tells me that it is that 
postgres backend running the DELETE query that is doing this, not some 
autovacuum nor anything else.


It looks to me that in actuality it is doing a sequential scan for each 
of the 250 rows, despite it EXPLAINing to me that it was going to use 
that index.


It would really be good to know what it is churning so heavily.

I have seen some ways of using dtrace or things like that to do some 
measurement points. But I haven't seen how this is done to inspect the 
effective execution plan and where in that plan it is, i.e., which 
iteration. It would be nice if there was some way of doing a "deep 
explain plan" or even better, having an idea of the execution plan which 
the executor is actually following, and a way to report on the current 
status of work according to this plan.


How else do I figure out what causes this heavy read activity on the 
main Foo table?
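One crude thing that can at least be checked from SQL while the DELETE is 
running is where the block reads are being counted; just a sketch, with 'foo' 
standing in for the real table name:

SELECT relname, heap_blks_read, idx_blks_read, toast_blks_read
  FROM pg_statio_user_tables
 WHERE relname = 'foo';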


This is something I might even want to contribute. For many years I have 
been annoyed by this waiting on a long running statement without any idea 
where it is and how much there is still to go. If I have a plan 
structure and an executor that follows the plan structure, there must be 
a way to dump it out.


The pg_stat_activity table might contain a current statement id, and 
then a superuser might ask EXPLAIN STATEMENT :statementId. Or just a 
pg_plantrace command which would dump the current plan with an 
indication of completion % of each step.
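What exists today is of course only the query text and its runtime, not the 
position within the plan; a minimal sketch of that much:

SELECT pid, now() - query_start AS runtime, state, query
  FROM pg_stat_activity
 WHERE state = 'active';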


But delete cascades and triggers should also be viewable from such a 
facility; they should be traced. I am sure that somewhere inside there is some 
data structure representing this activity, and all it would take is to dump it.


regards,
-Gunther




Re: Poor man's partitioned index .... not being used?

2019-03-21 Thread Gunther

Thanks David Rowley and Andrew Gierth.

On 3/20/2019 23:46, Andrew Gierth wrote:

If you did  WHERE mod(id,2) = mod(8934,2) AND id = 8934

then the index would likely be used - because the prover can then treat
mod(id,2) as an atom (call it X), constant-fold mod(8934,2) to 0 because
mod() is immutable, and then observe that (X = 0) proves that (X = 0).


foo=# EXPLAIN SELECT * FROM Test WHERE mod(id,2) = mod(8934,2) AND id = 8934;
QUERY PLAN
--
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

Yes indeed! It's being used that way! Interesting. Only that we can't 
use it if id was a variable? Hmm ...


foo=# PREPARE testplan(int) AS
foo-#SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1;
PREPARE
foo=# EXPLAIN EXECUTE testplan(8934);
QUERY PLAN
--
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

That's quite alright actually. Now the question is, could we use this 
in a nested loop query plan? That's where I think it can't work:


foo=# CREATE TABLE Test2 AS SELECT * FROM Test WHERE random() < 0.01 ORDER BY 
id DESC;
SELECT 3730
integrator=# ANALYZE Test2;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
mod(a.id,2) AND b.id = a.id) LIMIT 10;
 QUERY PLAN
-
 Limit  (cost=110.25..135.67 rows=10 width=8)
   ->  Hash Right Join  (cost=110.25..9591.02 rows=3730 width=8)
 Hash Cond: ((mod(b.id, 2) = mod(a.id, 2)) AND (b.id = a.id))
 ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
 ->  Hash  (cost=54.30..54.30 rows=3730 width=4)
   ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
(6 rows)

foo=# SET enable_hashjoin TO off;
SET
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
mod(a.id,2) AND b.id = a.id) LIMIT 10;
   QUERY PLAN

 Limit  (cost=47214.73..47227.86 rows=10 width=8)
   ->  Merge Right Join  (cost=47214.73..52113.16 rows=3730 width=8)
 Merge Cond: (((mod(b.id, 2)) = (mod(a.id, 2))) AND (b.id = a.id))
 ->  Sort  (cost=46939.15..47909.47 rows=388129 width=4)
   Sort Key: (mod(b.id, 2)), b.id
   ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
 ->  Sort  (cost=275.58..284.91 rows=3730 width=4)
   Sort Key: (mod(a.id, 2)), a.id
   ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
(9 rows)

foo=# SET enable_mergejoin TO off;
SET
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
mod(a.id,2) AND b.id = a.id) LIMIT 10;
   QUERY PLAN

 Limit  (cost=0.00..102516.78 rows=10 width=8)
   ->  Nested Loop Left Join  (cost=0.00..38238760.24 rows=3730 width=8)
 Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2)))
 ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
 ->  Materialize  (cost=0.00..9056.93 rows=388129 width=4)
   ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
(6 rows)

It looks like it doesn't want to evaluate the mod(a.id, 2) before it 
moves to the index query for the nested loop.


Notably the partitioned table approach should do that, but it has a 
different expression for the partition. No mod function but MODULUS and 
REMAINDER.
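For reference, a minimal sketch of that partitioned-table variant (hash 
partitioning as in PostgreSQL 11; the table name is just illustrative):

CREATE TABLE Test3 (id int NOT NULL) PARTITION BY HASH (id);
CREATE TABLE Test3_p0 PARTITION OF Test3 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE Test3_p1 PARTITION OF Test3 FOR VALUES WITH (MODULUS 2, REMAINDER 1);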


I wonder if there was a way of marking such expressions as safe in the 
query, like suggesting a certain evaluation order, i.e.,


SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = EVAL(mod(a.id,2)) 
AND b.id = a.id) LIMIT 10;

It's OK though. It just goes to show that in a case like this, it is 
best to just go with the partitioned table anyway.


regards,
-Gunther



Poor man's partitioned index .... not being used?

2019-03-20 Thread Gunther

Hi all, look at this short story please:

foo=# CREATE TABLE Test(id int NOT NULL PRIMARY KEY);
CREATE TABLE
foo=# INSERT INTO test SELECT row_number() OVER() FROM pg_class a CROSS JOIN 
pg_class b;
INSERT 0 388129
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
QUERY PLAN
---
 Index Only Scan using test_pkey on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

foo=# ALTER TABLE Test DROP CONSTRAINT Test_pkey;
ALTER TABLE
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
   QUERY PLAN
---
 Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
   Filter: (id = 8934)
(2 rows)

foo=# SELECT max(id)/2 FROM Test;
 ?column?
--
   194064
(1 row)

foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE id < 194064;
CREATE INDEX
foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE id >= 194064;
CREATE INDEX
foo=# ANALYZE Test;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test WHERE id = 8934;
QUERY PLAN
--
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)


foo=# DROP INDEX Test_pk0;
DROP INDEX
foo=# DROP INDEX Test_pk1;
DROP INDEX

foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0;
CREATE INDEX
foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1;
CREATE INDEX
foo=# ANALYZE Test;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
  QUERY PLAN
---
 Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
   Filter: (id = 8934)
(2 rows)

Why is that index never used?

PS: there is a performance question behind this: big table, heavily used index; 
the hope was that with this simple scheme of partitioning just the index one might 
distribute the load better. I know, if the load really is so big, why not partition 
the entire table. But just for the heck of it, why not this way?

regards,
-Gunther



Re: Distributing data over "spindles" even on AWS EBS, (followup to the work queue saga)

2019-03-17 Thread Gunther

On 3/14/2019 11:11, Jeremy Schneider wrote:

On 3/14/19 07:53, Gunther wrote:

  2. build a low level "spreading" scheme which is to take the partial
 files 4653828 and 4653828.1, .2, _fsm, etc. and move each to another
 device and then symlink it back to that directory (I come back to this!)

...

To 2. I find that it would be a nice feature of PostgreSQL if we could
just use symlinks and a symlink rule, for example, when PostgreSQL finds
that 4653828 is in fact a symlink to /otherdisk/PG/16284/4653828, then
it would

  * by default also create  4653828.1 as a symlink and place the actual
data file on /otherdisk/PG/16284/4653828.1

How about if we could just specify multiple tablespaces for an object,
and then PostgreSQL would round-robin new segments across the presently
configured tablespaces?  This seems like a simple and elegant solution
to me.


Very good idea! I agree.

Very important also would be to pick up the existing patch someone had 
contributed to allow toast tables to be assigned to different tablespaces.


  4. maybe I can configure in AWS EBS to reserve more IOPS -- but why
 would I pay for more IOPS if my cost is by volume size? I can just
 make another volume? or does AWS play a similar trick on us with
 IOPS being limited on some "credit" system???

Not credits, but if you're using gp2 volumes then pay close attention to
how burst balance works. A single large volume is the same price as two
striped volumes at half size -- but the striped volumes will have double
the burst speed and take twice as long to refill the burst balance.


Yes, I learned that too. It seems a very interesting "bug" of the Amazon 
GP2 IOPS allocation scheme.  They say it's like 3 IOPS per GiB, so if I 
have 100 GiB I get 300 IOPS. But it also says minimum 100. So that means 
if I have 10 volumes of 10 GiB each, I get 1000 IOPS minimum between 
them all. But if I have it all on one 100 GiB volume I only get 300 IOPS.
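Spelled out (assuming the 3-IOPS-per-GiB rule with the floor of 100 as 
described):

SELECT 10 * greatest(100, 3 * 10) AS iops_ten_10gib_volumes,   -- 1000
       greatest(100, 3 * 100)     AS iops_one_100gib_volume;   -- 300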


I wonder if Amazon is aware of this. I hope they are and think that's 
just fine. Because I like it.


It also is a clear sign to me that I want to use page sizes > 4k for the 
file system. I have tried on Amazon Linux to use 8k block sizes for the 
XFS volume, but I cannot mount those, since Linux says it can 
currently only deal with 4k blocks. This is another reason I consider 
switching the database server(s) to FreeBSD.  OTOH, who knows, maybe 
this 4k is a limit of the AWS EBS infrastructure. After all, if I am 
scraping the 300 or 1000 IOPS limit already and I can suddenly 
upgrade my block size per IO, I double my IO throughput.


regards,
-Gunther




Re: Facing issue in using special characters

2019-03-15 Thread Gunther
This is not an issue for "hackers" nor "performance"; in fact, even for 
"general" it isn't really an issue.


"Special characters" is actually nonsense.

When people complain about "special characters" they haven't thought 
things through.


If you are unwilling to think things through and go step by step to make 
sure you know what you are doing, then you will not get it and really 
nobody can help you.


In my professional experience, people who complain about "special 
characters" need to be cut loose or be given a chance (if they are 
established employees who carry some weight). If a contractor complains 
about "special characters" they need to be fired.


Understand charsets -- character set, code point, and encoding. Then 
understand how encoding and string literals and "escape sequences" in 
string literals might work.


Know that UNICODE today is the one standard, and there is no more need 
to do code table switching. There is nothing special about a Hebrew alef or 
a Greek lower case alpha or a Latin A. Nor a hyphen, an en-dash, or an 
em-dash. All these characters are in UNICODE. Yes, there are some 
Japanese who claim that they don't like that their Chinese character 
variants are lumped together with the simplified reform Chinese forms. But 
that's a font issue, not a character code issue.


7 bit ASCII is the first page of UNICODE, even in the UTF-8 encoding.

ISO Latin 1, or the Windoze-1252 variant of ISO Latin 1, 
has the same code points as UNICODE pages 0 and 1, but it is not compatible 
with UTF-8 encoding because of the way UTF-8 uses the 8th bit.


But none of this is likely your problem.

Your problem is about string literals in SQL, for example. And about the 
configuration of your database (I always use initdb with --locale C and 
--encoding UTF-8). Use UTF-8 in the database. Then all your issues are 
about string literals in SQL and in JAVA and JSON and XML or whatever 
you are using.


You have to do the right thing. If you produce any representation, 
whether that is XML or JSON or SQL or URL query parameters, or a CSV 
file, or anything at all, you need to escape your string values properly.
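To make that concrete for the SQL case only, a minimal sketch (the table 
notes(body text) is made up, and the database is assumed to be UTF-8 encoded 
as described above):

-- safest: pass the value as a parameter, never splice it into the SQL text
PREPARE ins(text) AS INSERT INTO notes(body) VALUES ($1);
EXECUTE ins('O''Reilly – Grüße – א');
-- if SQL text must be generated, let the server do the quoting
SELECT format('INSERT INTO notes(body) VALUES (%L);', 'it''s – ß');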


This question with no detail didn't deserve such a thorough answer, but 
it's my soap box. I do not accept people complaining about "special 
characters". My own people get that same sermon from me when they make 
that mistake.


-Gunther

On 3/15/2019 1:19, M Tarkeshwar Rao wrote:


Hi all,

Facing issue in using special characters. We are trying to insert 
records to a remote Postgres Server and our application not able to 
perform this because of errors.


It seems that issue is because of the special characters that has been 
used in one of the field of a row.


Regards

Tarkeshwar



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-25 Thread Gunther Schadow
Wow, yes, partition instead of index, that is interesting. Thanks Corey 
and Justin.


The index isn't required at all if all my pending jobs are in a 
partition of only pending jobs. In that case the plan can just be a 
sequential scan.


And Jeff James, sorry, I failed to show the LIMIT 1 clause on my dequeue 
query. That was an omission. My query is actually somewhat more complex 
and I just translated it down to the essentials but forgot the LIMIT 1 
clause.


SELECT seqNo, action
  FROM Queue
 WHERE pending
   AND/... other criteria .../
 LIMIT 1
 FOR UPDATE SKIP LOCKED;

And sorry I didn't capture the stats for vacuum verbose. And they would 
be confusing because there are other things involved.


Anyway, I think the partitioned table is the right and brilliant 
solution, because an index really isn't required. The actual pending 
partition will always remain quite small, and being a queue, it doesn't 
even matter how big it might grow, as long as new rows are inserted at 
the end and not in the middle of the data file, and as long as there is 
some way to quickly skip over the dead rows at the beginning that 
have already been processed and moved away.


Good thing is, I don't worry about maintenance window.  I have the 
leisure to simply tear down my design now and make a better design. 
What's 2 million transactions if I can re-process them at a rate of 
80/s? 7 hours max. I am still in development. So, no need to worry about 
migration / transition acrobatics. So if I take Corey's steps and 
envision the final result, not worrying about the transition steps, then 
I understand this:


1. Create the Queue table partitioned on the pending column, this 
creates the partition with the pending jobs (on which I set the 
fillfactor kind of low, maybe 50) and the default partition with all the 
rest. Of course that allows people with a constant transaction volume to 
also partition on jobId or completionTime and move chunks out to cold 
archive storage. But that's beside the current point.


2. Add all needed indexes on the partitioned table, except the main 
partial index that I used before and that required all that reindexing 
maintenance. Actually I won't need any other indexes really, why invite 
another similar problem again.


That's really simple.
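For concreteness, a minimal sketch of what steps 1 and 2 could look like on 
PostgreSQL 11 (the column names are just the ones from my dequeue query, 
everything else is illustrative):

CREATE TABLE Queue (
  jobId   bigint  NOT NULL,
  action  text    NOT NULL,
  pending boolean NOT NULL DEFAULT true
  -- ... other columns ...
) PARTITION BY LIST (pending);

-- the small, hot partition with the pending jobs, fillfactor kept low
CREATE TABLE Queue_pending PARTITION OF Queue
  FOR VALUES IN (true) WITH (fillfactor = 50);

-- everything else (completed jobs) falls into the default partition
CREATE TABLE Queue_done PARTITION OF Queue DEFAULT;

The dequeue query from above then needs no index at all; a sequential scan 
over the pending partition is the whole plan.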

One question I have though: I imagine our pending partition heap file to 
now be essentially sequentially organized as a queue. New jobs are 
appended at the end, old jobs are at the beginning. As pending jobs 
become completed (pending = false) these initial rows will be marked as 
dead. So, while the number of live rows will remain small in that 
pending partition, sequential scans will have to skip over the dead rows 
in the beginning.


Does PostgreSQL structure its files such that skipping over dead rows is 
fast? Or do the dead rows have to be read and discarded during a table 
scan?


Of course vacuum eliminates dead rows, but unless I do vacuum full, it 
will not re-pack the live rows, and that requires an exclusive table 
lock. So, what is the benefit of vacuuming that pending partition? What 
I do not want is insertion of new jobs to go into open slots at the 
beginning of the file. I want them to be appended (in Oracle there is an 
INSERT /*+APPEND*/ hint for that). How does that work in PostgreSQL?


Ultimately that partition will amass too many dead rows, then what do I 
do? I don't think that the OS has a way to truncate files physically 
from the head, does it? I guess it could set the file pointer from the 
first block to a later block. But I don't know of an IOCTL/FCNTL command 
for that. On some OS there is a way of making blocks sparse again, is 
that how PostgreSQL might do it? Just knock out blocks as sparse from 
the front of the file?


If not, the next thing I can think of is to partition the table further 
by time, may be alternating even and odd days, such that on any given 
day one of the two pending partitions are quiet? Is that how it's done?


regards,
-Gunther








Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Gunther
Wow, yes, partition instead of index, that is interesting. Thanks Corey 
and Justin.


The index isn't required at all if all my pending jobs are in a 
partition of only pending jobs. In that case the plan can just be a 
sequential scan.


And Jeff James, sorry, I failed to show the LIMIT 1 clause on my dequeue 
query. That was an omission. My query is actually somewhat more complex 
and I just translated it down to the essentials but forgot the LIMIT 1 
clause.


SELECT seqNo, action
  FROM Queue
 WHERE pending
   AND/... other criteria .../
 LIMIT 1
 FOR UPDATE SKIP LOCKED;

And sorry I didn't capture the stats for vacuum verbose. And they would 
be confusing because there are other things involved.


Anyway, I think the partitioned table is the right and brilliant 
solution, because an index really isn't required. The actual pending 
partition will always remain quite small, and being a queue, it doesn't 
even matter how big it might grow, as long as new rows are inserted at 
the end and not in the middle of the data file and still there be some 
way of fast skip over the part of the dead rows at the beginning that 
have already been processed and moved away.


Good thing is, I don't worry about maintenance window.  I have the 
leisure to simply tear down my design now and make a better design. 
What's 2 million transactions if I can re-process them at a rate of 
80/s? 7 hours max. I am still in development. So, no need to worry about 
migration / transition acrobatics. So if I take Corey's steps and 
envision the final result, not worrying about the transition steps, then 
I understand this:


1. Create the Queue table partitioned on the pending column, this 
creates the partition with the pending jobs (on which I set the 
fillfactor kind of low, maybe 50) and the default partition with all the 
rest. Of course that allows people with a constant transaction volume to 
also partition on jobId or completionTime and move chunks out to cold 
archive storage. But that's beside the current point.


2. Add all needed indexes on the partitioned table, except the main 
partial index that I used before and that required all that reindexing 
maintenance. Actually I won't need any other indexes really, why invite 
another similar problem again.


That's really simple.

One question I have though: I imagine our pending partition heap file to 
now be essentially sequentially organized as a queue. New jobs are 
appended at the end, old jobs are at the beginning. As pending jobs 
become completed (pending = false) these initial rows will be marked as 
dead. So, while the number of live rows will remain small in that 
pending partition, sequential scans will have to skip over the dead rows 
in the beginning.


Does PostgreSQL structure its files such that skipping over dead rows is 
fast? Or do the dead rows have to be read and discarded during a table 
scan?


Of course vacuum eliminates dead rows, but unless I do vacuum full, it 
will not re-pack the live rows, and that requires an exclusive table 
lock. So, what is the benefit of vacuuming that pending partition? What 
I _/don't/_ want is insertion of new jobs to go into open slots at the 
beginning of the file. I want them to be appended (in Oracle there is an 
INSERT /*+APPEND*/ hint for that. How does that work in PostgreSQL?


Ultimately that partition will amass too many dead rows; then what do I 
do? I don't think the OS has a way to truncate files physically from the 
head, does it? I guess it could set the file pointer from the first 
block to a later block, but I don't know of an IOCTL/FCNTL command for 
that. On some OSes there is a way of making blocks sparse again; is that 
how PostgreSQL might do it? Just knock out blocks as sparse from the 
front of the file?


If not, the next thing I can think of is to partition the table further 
by time, maybe alternating even and odd days, such that on any given 
day one of the two pending partitions is quiet. Is that how it's done?
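
Just to make the even/odd idea concrete, a purely hypothetical sketch (it 
assumes a createdOn date column and the partitioned Queue parent from the 
sketch further above; whether this is a good idea at all is exactly my 
question):

CREATE TABLE Queue_pending PARTITION OF Queue
  FOR VALUES IN (true)
  PARTITION BY LIST ((extract(day FROM createdOn)::int % 2));

CREATE TABLE Queue_pending_even PARTITION OF Queue_pending FOR VALUES IN (0);
CREATE TABLE Queue_pending_odd  PARTITION OF Queue_pending FOR VALUES IN (1);

On any given day only one of the two sub-partitions would receive new rows, 
so the other one could be vacuumed, or perhaps even truncated, while it is 
quiet.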


regards,
-Gunther








Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Gunther

Thank you all for responding so far.

David Rowley and Justin Pryzby suggested things about autovacuum. But I 
don't think autovacuum has any helpful role here. I am explicitly doing 
a vacuum on that table, and it helps hardly at all.


I want to believe that

VACUUM FREEZE Queue;

will push the database CPU% down again once it is climbing, and I can 
do this maybe 3 or 4 times, but ultimately I will always have to 
rebuild the index. Also, none of these vacuum operations takes very 
long at all; they are just not efficacious.
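
To check whether it really is the index that bloats (rather than the table), 
its size could be watched over time with something like this (a sketch; the 
name pattern matches the indexes listed below):

SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid::regclass)) AS index_size
  FROM pg_class
 WHERE relname LIKE 'queue_idx_pending%';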


Rebuilding the index by building a new index, removing the old one, 
renaming, and vacuuming again is prone to getting stuck.


I tried to do the swap in a transaction, but that doesn't work: I need 
CREATE INDEX CONCURRENTLY, which can't be run inside a transaction 
block, and I can't even do it in a procedure.


If I do it manually, issuing first CREATE INDEX CONCURRENTLY new and 
then DROP INDEX CONCURRENTLY old, it might work once, but usually it 
just gets stuck with two indexes. Although I noticed that it does put 
the CPU back down and improve transaction throughput.


I also noticed that after I quit from DROP INDEX CONCURRENTLY old, that 
index is shown as INVALID


\d Queue
 ...
Indexes:
  "queue_idx_pending" UNIQUE, btree (jobId, action) WHERE pending INVALID
  "queue_idx_pending2" UNIQUE, btree (jobId, action) WHERE pending INVALID
  "queue_idx_pending3" UNIQUE, btree (jobId, action) WHERE pending INVALID
  "queue_idx_pending4" UNIQUE, btree (jobId, action) WHERE pending INVALID
  "queue_idx_pending5" UNIQUE, btree (jobId, action) WHERE pending INVALID
  "queue_idx_pending6" UNIQUE, btree (jobId, action) WHERE pending
...

and so I keep doing that same routine by hand: every time the CPU% 
creeps above 50% I do


CREATE UNIQUE INDEX CONCURRENTLY Queue_idx_pending6 ON Queue(jobId, action) 
WHERE pending;
DROP INDEX CONCURRENTLY Queue_idx_pending5;

at which point it hangs; I interrupt the DROP command, which leaves the 
old index behind as "INVALID".


VACUUM FREEZE ANALYZE Queue;

At this point the db's CPU% drops below 20%, after the new index has 
been built.


Unfortunately this is a totally hands-on approach; I have to do this 
every 5 minutes or so, and possibly the time between these interventions 
decreases. It also leads to inefficiency over time, even though the CPU 
seemingly recovers.


So this isn't sustainable as it is (made worse because my Internet 
connection constantly drops).


What I am most puzzled by is that no matter how long I wait, the DROP 
INDEX CONCURRENTLY never completes. Why is that?
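
When it hangs like that, something along these lines might at least show 
what the DROP is waiting on (a sketch using the standard pg_stat_activity 
and pg_locks views; the filter is only illustrative):

SELECT a.pid, a.state, a.wait_event_type, a.wait_event, a.query
  FROM pg_stat_activity a
 WHERE a.query ILIKE 'drop index%'
    OR a.pid IN (SELECT pid FROM pg_locks WHERE NOT granted);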


Also, the REINDEX command always fails with a deadlock because there is 
a row lock and a complete table lock involved.


I consider this ultimately a bug, or at the very least there is room for 
improvement. And I am on version 11.1.


regards,
-Gunther


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread Gunther

On 2/23/2019 16:13, Peter Geoghegan wrote:

On Sat, Feb 23, 2019 at 1:06 PM Gunther  wrote:

I thought to keep my index tight, I would define it like this:

CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;

so that only pending jobs are in that index.

When a job is done, follow up work is often inserted into the Queue as pending, 
thus adding to that index.

How many distinct jobIds are there in play, roughly? Would you say
that there are many fewer distinct Jobs than distinct entries in the
index/table? Is the number of jobs fixed at a fairly low number, that
doesn't really grow as the workload needs to scale up?


Jobs start on another, external queue; there were about 200,000 of them 
waiting when I started the run.


When the SQL Queue is empty, the workers pick one job from the external 
queue and add it to the SQL queue.


When that happens, 2 more jobs are immediately created on that queue. 
Let's call them phase 1a and 1b.


When phase 1a has been worked off, another follow-up job is created. 
Let's call it phase 2.


When phase 2 has been worked off, a final phase 3 job is created.

When that is worked off, nothing new is created, and the next item is 
pulled from the external queue and added to the SQL queue.


So this means each of the 200,000 items adds (up to) 4 jobs onto the 
queue during its processing.


But since these 200,000 items are on an external queue, the SQL queue 
itself is not stuffed full at all. It only grows slowly, and on the main 
index, which holds only the pending jobs, there are probably fewer than 
20 entries at any given point in time. When I said 7 jobs per second, 
that meant 7/s for each of these 3+1 phases simultaneously, i.e., 28 
jobs per second. And at that rate it takes a little less than 30 min for 
the index to deteriorate. I.e., once about 50,000 queue entries have 
been processed through that index, it has deteriorated to the point of 
being nearly unusable until it is rebuilt.


thanks,
-Gunther





Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread Gunther

Hi,

I am using an SQL queue for distributing work to massively parallel 
workers. Workers come in servers with 12 parallel threads each. One of 
those worker sets handles 7 transactions per second. If I add a second 
one, for 24 parallel workers, it scales to 14/s. I can even add a third, 
for 36 parallel workers, to reach 21/s. If I try a fourth set, 48 
workers, I end up in trouble. But that isn't even my main problem; the 
real problem is that in a short time the performance deteriorates, and 
it looks like that is because the queue index deteriorates and needs a 
REINDEX.


The queue table is essentially this:

CREATE TABLE Queue (
  jobId bigint,
  action text,
  pending boolean,
  result text
);

the dequeue operation is essentially this:

BEGIN

SELECT jobId, action
  FROM Queue
  WHERE pending
  FOR UPDATE SKIP LOCKED

which is a wonderful concept with the SKIP LOCKED.

Then I perform the action and finally:

UPDATE Queue
   SET pending = false,
   result = ?
 WHERE jobId = ?

COMMIT

I thought to keep my index tight, I would define it like this:

CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;

so that only pending jobs are in that index.

When a job is done, follow up work is often inserted into the Queue as 
pending, thus adding to that index.


Below is the performance chart.

The blue line at the bottom is the db server.


You can see the orange line is the first worker server with 12 threads. 
It settled into a steady state of 7/s, ran with 90% CPU for some 30 min, 
and then the database CPU% started climbing. I tried to rebuild the 
indexes on the queue and got stuck waiting for the exclusive lock; no 
jobs were processing, yet the lock still was not obtained after far too 
long. So I shut down the worker server. With the database quiet I could 
resolve the messed-up indexes and restart again. Soon I added a second 
worker server (green line), starting around 19:15. Once settled in, they 
were pulling 14/s together. But you can see that in just 15 min the db 
server CPU% climbed again to over 40% and the performance of the workers 
dropped, their load falling to 30%. Then at around 19:30 I stopped them 
all, REINDEXed the queue table, and started 3 worker servers 
simultaneously. They settled in at 21/s, but in just 10 min the 
deterioration happened again. Again I stopped them all, REINDEXed, and 
now started 4 worker servers (48 threads). This time 5 min was not 
enough to see them ever settle into a decent 28/s transaction rate, but 
I guess they might have reached that for a minute or two, only for the 
index to deteriorate again. I did another stop, now started only 2 
servers, and again the index soon deteriorated.


Clearly that index is deteriorating quickly, in about 10,000 transactions.

BTW: when I said 7/s, it is in reality about 4 times as many 
transactions, because of the follow-up jobs that also get added on this 
queue. So those 10,000 transactions may really be 30 or 40 k 
transactions before the index deteriorates.


Do you have any suggestions for how I can avoid that index deterioration 
problem smoothly?


I figured I might just pause all workers briefly to run the REINDEX 
Queue command, but the problem with this is that while the transaction 
volume is large, some jobs may take minutes to process. In that case we 
need to wait minutes to quiet the database, with 47 workers sitting idle 
waiting for the 48th to finish, just so that the index can be rebuilt!


Of course I tried to resolve the issue with vacuumdb --analyze (just in 
case the autovacuum doesn't act in time) and that doesn't help. 
vacuumdb --full --analyze would probably help but can't be used, because 
it requires an exclusive table lock.


I tried to just create a new index with the same definition:

CREATE UNIQUE INDEX Queue_idx2_pending ON Queue(jobId) WHERE pending;
DROP INDEX Queue_idx_pending;
ANALYZE Queue;

but with that I got completely stuck with two indexes, unable to remove 
either of them because of those locking issues. And REINDEX gives me a 
deadlock error outright.


I am looking for a way to manage that index so that it does not deteriorate.

Maybe if I was not defining it with

... WHERE pending;

then it would only grow, but never shrink. Maybe that helps somehow? I 
doubt it, though. Adding to an index also causes deterioration, and most 
of the rows would be irrelevant because they would be past work. It 
would be nicer if there were another smooth way.


regards,
-Gunther





Re: neither CPU nor IO bound, but throttled performance

2019-02-21 Thread Gunther Schadow

Thank you Magnus. 68% steal. Indeed. You probably hit the target. Yes.

That explains the discrepancy. I need to watch and understand that CPU 
credits issue.


regards,
-Gunther

On 2/21/2019 4:08, Magnus Hagander wrote:
On Thu, Feb 21, 2019 at 12:34 AM Gunther <r...@gusw.net> wrote:


Hi, I have an Amazon Linux based Postgresql 11 server here on a
t2.medium EC2 instance.

It is serving 24 worker processes that read jobs from a queue
(thanks to SELECT ... FOR UPDATE SKIP LOCKED!) and do jobs some of
which are reading and writing business data to the database,
others are only reading, and some don't hit the business data at
all, only the queue.

Everything flows quite nicely. Except, I don't understand why I
can't max out the CPU or the IO, instead, IO is almost negligible
yet the CPU is at 30% hardly hitting 50%.

Here I give you a view of top:

top - 23:17:09 up 45 days,  2:07,  4 users,  load average: 20.32, 18.92, 13.80
Tasks: 338 total,  24 running, 111 sleeping,   0 stopped,   0 zombie
%Cpu(s): 28.7 us,  2.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 68.7 st


If I read that right, it's about 70% "steal". The description for this 
is "the percentage of time spent in involuntary wait by the virtual 
CPU or CPUs while the hypervisor was servicing another virtual 
processor.".


So basically, the CPU is spent dealing with other people's VMs on the 
same hardware. Welcome to the multi-tenant cloud.


In particular, I believe T series instances get a limited number of 
CPU "credits" per hour. My guess is you've hit this limit and are 
thus being throttled. T series are not intended for persistent 
workloads. Either way, this is probably a question better asked at the 
Amazon EC2 forums rather than PostgreSQL as you'll find more people 
who know the EC2 interactions there.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


neither CPU nor IO bound, but throttled performance

2019-02-20 Thread Gunther
[... beginning of iostat -x output truncated ...]
xvdm   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdn   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdo   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdp   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdq   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdr   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvds   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdt   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdu   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdv   0.00  0.00  0.00  8.65  0.00  76.92  17.78  0.03  4.00  0.00  4.00  0.44  0.38
xvdw   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdx   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdy   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdz   0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdaa  0.00  0.00  0.00  0.00  0.00   0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00
xvdab  0.00  0.00  0.00  2.88  0.00  30.77  21.33  0.01  4.00  0.00  4.00  1.33  0.38


Previously I had hit 100 %util here; that was when I didn't have the 
tables spread out over so many tablespaces. Now I have them spread out 
like in the olden days, where you spread your tables over many 
"spindles", and I did this here so I could see which tables or indexes 
would be bottlenecks.


So how can it be that queries take quite a long time without the 
processes running at a higher CPU%?


Or is there something wrong with the total CPU% estimated by both top 
and iostat? From top it looks like I have 24 worker processes using 8% 
each, most of them in R(unning) state, so that would be 192%, which 
divided over the 2 CPUs of the t2.medium instance is really 96%. So am I 
CPU bound after all?


regards,
-Gunther



Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade

2017-12-07 Thread Gunther
I confess I don't do dumps or any backups much, other than file system 
snapshots.


But when I do, I don't like how long it takes.

I confess my database is big: about 200 GB. But still, dumping it 
should not take 48 hours (and running) while the system is 75% idle and 
reads are at 4.5 MB/s, when the same system sustains over 100 MB/s 
during table scan and hash join queries.


Something is wrong with the dump thing. And no, it's not SSL or 
whatever, I am doing it on a local system with local connections. 
Version 9.5 something.


regards,
-Gunther


On 11/23/2017 4:26, Henrik Cednert (Filmlance) wrote:



On 22 Nov 2017, at 22:07, Patrick KUI-LI <pku...@hipay.com> wrote:


Hello,

I had this behavior when the upgraded pg 9.5 had ssl mode on by default.

So I deactivated ssl mode in postgresql.conf. That's all.

Regards,

Patrick




Hello

And you just uncommented the 'ssl = off' line in the config for this?

Is this default behaviour different from 8.4? Is there a 'show running 
config' for pgsql?


I tried that in the test VM and it didn't really give me a significant 
difference.


COMPRESSION LEVEL: 6, SSL ON
real    82m33.744s
user    60m55.069s
sys     3m3.375s

COMPRESSION LEVEL: 6, SSL OFF
real    76m31.083s
user    61m23.282s
sys     1m23.341s