Re: Fwd: temp_file_limit?
Hi Justin, As our queries are generated I decided to create a peephole optimizer kind of thing to scan the generated SQL AST to find multiple conditions on the same table reference. I can then use our metadata to see if these references are expected to be correlated. This creates about 20 statistics sets, including the one you have indicated. This at least makes the problematic query have a stable and very fast plan (so far). I had hoped for some more improvement with other queries but that has not yet been evident ;) Thanks a lot for the tips and your help! Cordially, Frits On Tue, Dec 20, 2022 at 10:11 PM Justin Pryzby wrote: > On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote: > > @justin > > > > I tried the create statistics variant and that definitely improves the > > estimate, and with that one of the "bad" cases (the one with the 82 > minute > > plan) now creates a good plan using only a few seconds. > > That is a worthwhile path to follow. A bit hard to do, because those > > conditions can be anything, but I can probably calculate the ones used > per > > customer and create those correlation statistics from that... It is > > definitely better than tweaking the "poor man's query hints" enable_ > > 8-/ which is really not helping with plan stability either. > > > > That will be a lot of work, but I'll let you know the results ;) > > Yeah, if the conditions are arbitrary, then it's going to be more > difficult. Hopefully you don't have too many columns. :) > > I suggest enabling autoexplain and monitoring for queries which were > slow, and retroactively adding statistics to those columns which are > most-commonly queried, and which have correlations (which the planner > doesn't otherwise know about). > > You won't want to have more than a handful of columns in a stats object > (since it requires factorial(N) complexity), but you can have multiple > stats objects with different combinations of columns (and, in v14, > expressions). You can also set a lower stats target to make the cost a > bit lower. > > You could try to check which columns are correlated, either by running: > | SELECT COUNT(1),col1,col2 FROM tbl GROUP BY 2,3 ORDER BY 1; > for different combinations of columns. > > Or by creating a tentative/experimental stats object on a handful of > columns at a time for which you have an intuition about their > correlation, and then checking the calculated dependencies FROM > pg_stats_ext. You may need to to something clever to use that for > arbitrarily columns. Maybe this is a start. > | SELECT dep.value::float, tablename, attnames, dep.key, exprs FROM > (SELECT (json_each_text(dependencies::text::json)).* AS dep, * FROM > pg_stats_ext)dep WHERE dependencies IS NOT NULL ORDER BY 1 DESC ; -- AND > regexp_count(key, ',') < 2 > > -- > Justin >
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote: > @justin > > I tried the create statistics variant and that definitely improves the > estimate, and with that one of the "bad" cases (the one with the 82 minute > plan) now creates a good plan using only a few seconds. > That is a worthwhile path to follow. A bit hard to do, because those > conditions can be anything, but I can probably calculate the ones used per > customer and create those correlation statistics from that... It is > definitely better than tweaking the "poor man's query hints" enable_ > 8-/ which is really not helping with plan stability either. > > That will be a lot of work, but I'll let you know the results ;) Yeah, if the conditions are arbitrary, then it's going to be more difficult. Hopefully you don't have too many columns. :) I suggest enabling autoexplain and monitoring for queries which were slow, and retroactively adding statistics to those columns which are most-commonly queried, and which have correlations (which the planner doesn't otherwise know about). You won't want to have more than a handful of columns in a stats object (since it requires factorial(N) complexity), but you can have multiple stats objects with different combinations of columns (and, in v14, expressions). You can also set a lower stats target to make the cost a bit lower. You could try to check which columns are correlated, either by running: | SELECT COUNT(1),col1,col2 FROM tbl GROUP BY 2,3 ORDER BY 1; for different combinations of columns. Or by creating a tentative/experimental stats object on a handful of columns at a time for which you have an intuition about their correlation, and then checking the calculated dependencies FROM pg_stats_ext. You may need to to something clever to use that for arbitrarily columns. Maybe this is a start. | SELECT dep.value::float, tablename, attnames, dep.key, exprs FROM (SELECT (json_each_text(dependencies::text::json)).* AS dep, * FROM pg_stats_ext)dep WHERE dependencies IS NOT NULL ORDER BY 1 DESC ; -- AND regexp_count(key, ',') < 2 -- Justin
Re: Fwd: temp_file_limit?
@Thomas Thanks for helping identifying the issue. I think it would be nice to clean up those obsoleted files during the run because running out of disk is reality not a good thing to have ;) Of course the bad estimates leading to the resize are the real issue but this at least makes it less bad. Thanks for the tip about disabling parallel_hash but I also found it in the source. As mentioned before I disable (on production systems) all parallelism, not just for this issue but to prevent the OOM killer from killing Postgres - which happens way more often with parallelism on...
Re: Fwd: temp_file_limit?
@justin I tried the create statistics variant and that definitely improves the estimate, and with that one of the "bad" cases (the one with the 82 minute plan) now creates a good plan using only a few seconds. That is a worthwhile path to follow. A bit hard to do, because those conditions can be anything, but I can probably calculate the ones used per customer and create those correlation statistics from that... It is definitely better than tweaking the "poor man's query hints" enable_ 8-/ which is really not helping with plan stability either. That will be a lot of work, but I'll let you know the results ;)
Re: Fwd: temp_file_limit?
On Tue, Dec 20, 2022 at 8:59 AM Frits Jalvingh wrote: > @ranier > These files ONLY exist during the query. They get deleted as soon as the > query terminates, by Postgres itself. Once the query terminates pgsql_tmp is > completely empty. Considering what Thomas said (and the actual occurrence of > the files he mentioned) this does seem to be the more likely cause to me. I'm working on some bug fixes near this area at the moment, so I'll also see if I can figure out how to implement the missing eager cleanup of earlier generations. It's still a pretty bad scenario once you reach it (repartitioning repeatedly, that is) and the solution to that it probably much harder, but it's obviously not great to waste temporary disk space like that. BTW you can disable just parallel hash with enable_parallel_hash=false.
Re: Fwd: temp_file_limit?
@ranier These files ONLY exist during the query. They get deleted as soon as the query terminates, by Postgres itself. Once the query terminates pgsql_tmp is completely empty. Considering what Thomas said (and the actual occurrence of the files he mentioned) this does seem to be the more likely cause to me.
Re: Fwd: temp_file_limit?
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote: > By itself I'm used to bad query performance in Postgresql; our application > only does bulk queries and Postgres quite often makes terrible plans for > those, but with set enable_nestloop=false set always most of them at least > execute. The remaining failing queries are almost 100% caused by bad join > sequences; I plan to work around those by forcing the join order from our > application. For instance, the exact same query above can also generate the > following plan (this one was created by manually setting > join_collapse_limit = 1, but fast variants also occur quite often when > disabling parallelism): I, too, ended up setting enable_nestloop=false for our report queries, to avoid the worst-case plans. But you should also try to address the rowcount misestimates. This underestimates the rowcount by a factor of 69 (or 138 in the plan you sent today): | (soort = 'MSL'::text) AND (code = 'DAE'::text) If those conditions are correlated, you can improve the estimate by adding extended stats object. | CREATE STATISTICS s_h_sturingslabel_ssm_stats soort,code FROM s_h_sturingslabel_ssm; ANALYZE s_h_sturingslabel_ssm; Unfortunately, stats objects currently only improve scans, and not joins, so that might *improve* some queries, but it won't resolve the worst problems: | Hash Join (cost=22,832.23..44,190.21 rows=185 width=47) (actual time=159.725..2,645,634.918 rows=28,086,472,886 loops=1) Maybe you can improve that by adjusting the stats target or ndistinct... -- Justin
Re: Fwd: temp_file_limit?
Em seg., 19 de dez. de 2022 às 16:29, Frits Jalvingh escreveu: > Ok, just to make sure that I understand correctly: > The parallel hash implementation needs to resize its table because of a > mismatch in expected tuple count. I do expect this to be true: Postgres > often grossly underestimates the expected row counts in our queries. > This is not fully implemented yet: removing the "old "files is not yet > done, so every time the table resizes it creates a new set of files and the > old ones remain. > I assume that the "used file size" only includes the "current" set of > files, and that the old ones are not counted towards that amount? That > would explain why it overallocates, of course. > It is not necessary what is happening. Could you try manually deleting (rm) these files, using the postgres user? It's an ugly and dirty test, but it could indicate that files are really being left behind, without being deleted by Postgres. Alternatively, you could compile a version with CHECK_WRITE_VS_EXTEND set, and try to fetch as much information from the logs as possible, as has been indicated by others here. > By itself I now know what to do: I just need to disable all parallelism ( > •̀ᴗ•́ )و ̑̑ > > I usually do that anyway because it makes queries die randomly. This is > just another reason. > > I restarted that query with max_parallel_workers_per_gather=0, and this > does not seem to use tempspace at all. It was not exactly fast, it took 82 > minutes of a single process running at 100% cpu. > https://explain.depesz.com/s/HedE > Anyway, see the hint page (https://explain.depesz.com/s/HedE#hints), maybe it will be useful. regards, Ranier Vilela
Re: Fwd: temp_file_limit?
Ok, just to make sure that I understand correctly: The parallel hash implementation needs to resize its table because of a mismatch in expected tuple count. I do expect this to be true: Postgres often grossly underestimates the expected row counts in our queries. This is not fully implemented yet: removing the "old "files is not yet done, so every time the table resizes it creates a new set of files and the old ones remain. I assume that the "used file size" only includes the "current" set of files, and that the old ones are not counted towards that amount? That would explain why it overallocates, of course. By itself I now know what to do: I just need to disable all parallelism ( •̀ᴗ•́ )و ̑̑ I usually do that anyway because it makes queries die randomly. This is just another reason. I restarted that query with max_parallel_workers_per_gather=0, and this does not seem to use tempspace at all. It was not exactly fast, it took 82 minutes of a single process running at 100% cpu. https://explain.depesz.com/s/HedE Thanks a lot for your help Justin, and all others that chimed in too. Next round is to try to get that query to run in the 5 seconds that we know it can do, reliably. On Mon, Dec 19, 2022 at 6:46 PM Justin Pryzby wrote: > On Mon, Dec 19, 2022 at 06:27:57PM +0100, Frits Jalvingh wrote: > > I have listed the files during that run, > > > 213M -rw--- 1 postgres postgres 213M dec 19 17:46 i100of128.p0.0 > > 207M -rw--- 1 postgres postgres 207M dec 19 17:46 i100of128.p1.0 > > 210M -rw--- 1 postgres postgres 210M dec 19 17:49 i100of256.p0.0 > > 211M -rw--- 1 postgres postgres 211M dec 19 17:49 i100of256.p1.0 > > 188M -rw--- 1 postgres postgres 188M dec 19 17:53 i100of512.p0.0 > [...] > > I think that proves Thomas' theory. I'm not sure how that helps you, > though... > > On Mon, Dec 19, 2022 at 01:51:33PM +1300, Thomas Munro wrote: > > One possibility is that you've hit a case that needs several rounds of > > repartitioning (because of a failure to estimate the number of tuples > > well), but we can't see that because you didn't show EXPLAIN (ANALYZE) > > output (understandably if it runs out of disk space before > > completing...). The parallel hash code doesn't free up the previous > > generations' temporary files; it really only needs two generations' > > worth concurrently (the one it's reading from and the one it's writing > > to). In rare cases where more generations are needed it could unlink > > the older ones -- that hasn't been implemented yet. If you set > > log_temp_files = 0 to log temporary file names, it should be clear if > > it's going through multiple rounds of repartitioning, from the names > > (...of32..., ...of64..., ...of128..., ...of256..., ...). > > -- > Justin Pryzby > System Administrator > Telsasoft > +1-952-707-8581 >
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 06:27:57PM +0100, Frits Jalvingh wrote: > I have listed the files during that run, > 213M -rw--- 1 postgres postgres 213M dec 19 17:46 i100of128.p0.0 > 207M -rw--- 1 postgres postgres 207M dec 19 17:46 i100of128.p1.0 > 210M -rw--- 1 postgres postgres 210M dec 19 17:49 i100of256.p0.0 > 211M -rw--- 1 postgres postgres 211M dec 19 17:49 i100of256.p1.0 > 188M -rw--- 1 postgres postgres 188M dec 19 17:53 i100of512.p0.0 [...] I think that proves Thomas' theory. I'm not sure how that helps you, though... On Mon, Dec 19, 2022 at 01:51:33PM +1300, Thomas Munro wrote: > One possibility is that you've hit a case that needs several rounds of > repartitioning (because of a failure to estimate the number of tuples > well), but we can't see that because you didn't show EXPLAIN (ANALYZE) > output (understandably if it runs out of disk space before > completing...). The parallel hash code doesn't free up the previous > generations' temporary files; it really only needs two generations' > worth concurrently (the one it's reading from and the one it's writing > to). In rare cases where more generations are needed it could unlink > the older ones -- that hasn't been implemented yet. If you set > log_temp_files = 0 to log temporary file names, it should be clear if > it's going through multiple rounds of repartitioning, from the names > (...of32..., ...of64..., ...of128..., ...of256..., ...). -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 05:57:42PM +0100, Frits Jalvingh wrote: > @justin: > > Ran the query again. Top shows the following processes: >PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ Thanks > root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx > ./pgsql_tmp |sort -nr > 412021 ./pgsql_tmp/pgsql_tmp650830.3.fileset > 412021 ./pgsql_tmp > ^^^ a few seconds after this last try the query aborted: > ERROR: temporary file size exceeds temp_file_limit (104857600kB) > > One possibility is that there are files in the tmpdir, which have been > > unlinked, but are still opened, so their space hasn't been reclaimed. > > You could check for that by running lsof -nn |grep pgsql_tmp Any deleted > > files would say things like 'DEL|deleted|inode|no such' > > I do not really understand what you would like me to do, and when. The disk > space is growing, and it is actual files under pgsql_tmp? Run this during the query as either postgres or root: | lsof -nn |grep pgsql_tmp |grep -E 'DEL|deleted|inode|no such' Any files it lists would be interesting to know about. > Hope this tells you something, please let me know if you would like more > info, and again - thanks! I think Thomas' idea is more likely. We'd want to know the names of files being written, either as logged by log_temp_files or from | find pgsql_tmp -ls during the query. -- Justin
Re: Fwd: temp_file_limit?
@justin: Ran the query again. Top shows the following processes: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 650830 postgres 20 0 7503,2m 2,6g 2,6g R 100,0 4,2 12:46.34 postgres: jal datavault_317_prd [local] EXPLAIN 666141 postgres 20 0 7486,3m 2,6g 2,6g R 100,0 4,1 2:10.24 postgres: parallel worker for PID 650830 Your commands shows, during execution: root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr 68629 ./pgsql_tmp/pgsql_tmp650830.3.fileset 68629 ./pgsql_tmp root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr 194494 ./pgsql_tmp 194493 ./pgsql_tmp/pgsql_tmp650830.3.fileset root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr 335289 ./pgsql_tmp/pgsql_tmp650830.3.fileset 335289 ./pgsql_tmp root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx ./pgsql_tmp |sort -nr 412021 ./pgsql_tmp/pgsql_tmp650830.3.fileset 412021 ./pgsql_tmp ^^^ a few seconds after this last try the query aborted: ERROR: temporary file size exceeds temp_file_limit (104857600kB) One possibility is that there are files in the tmpdir, which have been > unlinked, but are still opened, so their space hasn't been reclaimed. > You could check for that by running lsof -nn |grep pgsql_tmp Any deleted > files would say things like 'DEL|deleted|inode|no such' > I do not really understand what you would like me to do, and when. The disk space is growing, and it is actual files under pgsql_tmp? Hope this tells you something, please let me know if you would like more info, and again - thanks!
Re: Fwd: temp_file_limit?
Hehehe, that is not the worst plan ;) I did that once to debug a deadlock in the JDBC driver when talking with Postgres, but it's not an adventure I'd like to repeat right now ;) >
Re: Fwd: temp_file_limit?
Em seg., 19 de dez. de 2022 às 11:45, Frits Jalvingh escreveu: > Hi Ranier, thanks for your help. > > I do not have more disks lying around, and I fear that if it does not > complete with 1.3TB of disk space it might not be that likely that adding > 750GB would work... > Postgres version: the original (prd) issue was on 10.x. I also tested it > on 14.x with the same issue. I then upgraded my machine to 15.1 to make > sure to report on the latest version, and all information mentioned in this > thread is from that version. > You can run with a Postgres debug compiled version? https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD Maybe, some light appears. regards, Ranier Vilela >
Re: Fwd: temp_file_limit?
Hi Ranier, thanks for your help. I do not have more disks lying around, and I fear that if it does not complete with 1.3TB of disk space it might not be that likely that adding 750GB would work... Postgres version: the original (prd) issue was on 10.x. I also tested it on 14.x with the same issue. I then upgraded my machine to 15.1 to make sure to report on the latest version, and all information mentioned in this thread is from that version. btw, this query generates quite different plans when tweaking things like nested_loop=false/true, and the "fast" plan requires nested_loops=true and join_collapse_limit=1 (5 seconds response). An odd thing is that both plans contain only one nested loop (a cross join, I think it cannot do that without one) but the general plan changes a lot.. I am trying to get output from that second plan because this one just loops using CPU, not disk... Perhaps that one will finish with some statistics...
Re: Fwd: temp_file_limit?
Em seg., 19 de dez. de 2022 às 06:47, Frits Jalvingh escreveu: > > The test is running on Ubuntu 22.04.1, x86_64, the disk is an NVMe 2TB > WD850X with ext4 as a file system. > It's probably not a production environment. Any chance of adding another 2TB NVMe, just for the temp files? To see if Postgres can finish the queries and provide more information? What exactly is the version of Postgres (14.???) are you using it? regards, Ranier Vilela
Re: Fwd: temp_file_limit?
Hi Tom and Thomas, thanks for your help. @Tom: If it really is per-process then I would have expected it to die after 200GB was used? As far as "valid bug" is concerned: I had hoped this would be per session, as this at least delivers a reasonable and usable limit; it is easy to control the number of sessions/statements in execution. If it really is per process then the limit is not really useful, just like work_mem: the execution plan of a query determines the number of processes (up to the max, at least that is way better than work_mem) and that can change whenever Postgres feels a new plan is in order. I can understand that solving this might be harder (although to me it looks like just a little bit of shared memory per session to keep a number). To me it does not sound like an invalid bug, just one you do not want to solve now ;) And the real problem, for me, is actually that both work_mem and temp_file_limit should be for the entire instance/cluster ;) I know that that is even harder. For us it means we really cannot use Postgres parallelism: it is infinitely better to have a query that runs longer but which finishes than to have the database die and recover randomly with OOM or with disk space filling up killing random queries. Which is a bit of a pity, ofc. @Justin The test is running on Ubuntu 22.04.1, x86_64, the disk is an NVMe 2TB WD850X with ext4 as a file system. I will collect the other data around tmpfiles hopefully later today. I have already set max_parallel_workers_per_gather to 1. I will probably disable all parallelism for the next runs to see whether that makes the size limit more workable..
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 1:51 PM Thomas Munro wrote: > It's really the limit for a single file Oops, sorry I take that back. It should be per process.
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 9:11 AM Justin Pryzby wrote: > On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote: > > Just to be clear: my real question is: why is temp_file_limit not > > working at the specified size? Because this is my real problem: when a > > query is dying like this it will also kill other queries because these > > are also running out of space. Even when the limit is per-process it > > should not have exceeded 200GB imo. It's really the limit for a single file (or virtual file because we split them on 1GB boundaries, probably well past time we stopped doing that), but we create many temporary files for various reasons. One possibility is that you've hit a case that needs several rounds of repartitioning (because of a failure to estimate the number of tuples well), but we can't see that because you didn't show EXPLAIN (ANALYZE) output (understandably if it runs out of disk space before completing...). The parallel hash code doesn't free up the previous generations' temporary files; it really only needs two generations' worth concurrently (the one it's reading from and the one it's writing to). In rare cases where more generations are needed it could unlink the older ones -- that hasn't been implemented yet. If you set log_temp_files = 0 to log temporary file names, it should be clear if it's going through multiple rounds of repartitioning, from the names (...of32..., ...of64..., ...of128..., ...of256..., ...).
Re: Fwd: temp_file_limit?
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote: > Just to be clear: my real question is: why is temp_file_limit not > working at the specified size? Because this is my real problem: when a > query is dying like this it will also kill other queries because these > are also running out of space. Even when the limit is per-process it > should not have exceeded 200GB imo. What OS and filesystem are in use ? Could you list the tmpdir when it's getting huge? The filenames include the PID, which would indicate if there's another procecss involved, or a bug allowed it to get huge. sudo du --max=2 -mx ./pgsql_tmp |sort -nr BTW, pg_ls_tmpdir() hides directories, so you shouldn't rely on it for listing temporary directories... One possibility is that there are files in the tmpdir, which have been unlinked, but are still opened, so their space hasn't been reclaimed. You could check for that by running lsof -nn |grep pgsql_tmp Any deleted files would say things like 'DEL|deleted|inode|no such' > BTW, if that limit is really per process instead of per > session/query then that is a Very Bad Thing(tm), because this makes the > limit effectively worthless - if a query can spawn 8 parallel processes > then you can suddenly, without any form of control, again fill up that disk. 8 is the default value of max_worker_processes and max_parallel_workers, but 2 is the default value of max_parallel_workers_per_gather. You're free the change the default value to balance it with the temp_file_limit (as suggested by the earlier-mentioned commit). -- Justin
Re: Fwd: temp_file_limit?
Frits Jalvingh writes: > Just to be clear: my real question is: why is temp_file_limit not working > at the specified size? I've not looked at that code lately, but I strongly suspect that it's implemented in such a way that it's a per-process limit, not a per-session limit. So each parallel worker could use up that much space. It's also possible that you've found an actual bug, but without a reproducer case nobody's going to take that possibility too seriously. We're unlikely to accept "the limit should work across multiple processes" as a valid bug though. That would require a vastly more complicated implementation. regards, tom lane