Re: Parallel Seq Scan vs kernel read ahead
On Wed, 15 Jul 2020 at 12:24, David Rowley wrote:
> If we've not seen any performance regressions within 1 week, then I
> propose that we (pending final review) push this to allow wider
> testing. It seems we're early enough in the PG14 cycle that there's a
> large window of time for us to do something about any reported
> performance regressions that come in.

I did that final review, which ended up in quite a few cosmetic changes. Functionality-wise, it's basically that of the v2 patch with PARALLEL_SEQSCAN_MAX_CHUNK_SIZE set to 8192. I mentioned that we might want to revisit giving users some influence on the chunk size, but we'll only do so once we see some conclusive evidence that it's worthwhile.

David
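For anyone following along who wants to see roughly what capping the chunk size means in practice, here is a minimal standalone sketch. Only the 8192-block cap (PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) comes from the patch; the divisor used to scale the chunk size with relation size and the sample sizes below are illustrative assumptions, not the committed logic.

    #include <stdint.h>
    #include <stdio.h>

    #define PARALLEL_SEQSCAN_MAX_CHUNK_SIZE 8192    /* blocks; 64MB with 8kB pages (from the patch) */
    #define ASSUMED_CHUNK_DIVISOR           1024    /* assumption: scale chunk size with table size */

    /* Pick how many blocks a worker grabs at once for a table of nblocks blocks. */
    static uint32_t
    choose_chunk_size(uint32_t nblocks)
    {
        uint32_t    chunk = nblocks / ASSUMED_CHUNK_DIVISOR;

        if (chunk < 1)
            chunk = 1;                                  /* tiny tables: one block at a time */
        if (chunk > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE)
            chunk = PARALLEL_SEQSCAN_MAX_CHUNK_SIZE;    /* big tables: stop growing at 8192 blocks */
        return chunk;
    }

    int
    main(void)
    {
        uint32_t    sizes[] = {1000, 131072, 1310720, 13107200};    /* ~8MB, 1GB, 10GB, 100GB */

        for (int i = 0; i < 4; i++)
            printf("nblocks=%u -> chunk=%u blocks\n", sizes[i], choose_chunk_size(sizes[i]));
        return 0;
    }

The point of handing out contiguous runs of blocks like this is that each worker's reads stay sequential for long enough for the kernel's readahead to kick in, while the cap keeps any one worker from being left with a disproportionate amount of work near the end of the scan.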
Re: Parallel Seq Scan vs kernel read ahead
Hi Soumyadeep,

Thanks for re-running the tests.

On Thu, 23 Jul 2020 at 06:01, Soumyadeep Chakraborty wrote:
> On Tue, Jul 14, 2020 at 8:52 PM David Rowley wrote:
> > It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > track_io_timing = on; for each value of max_parallel_workers.
>
> As for running EXPLAIN ANALYZE, running that on this system incurs a
> non-trivial amount of overhead. The overhead is simply staggering.

I didn't really intend for that to be used to get an accurate overall timing for the query. It was more to get an indication of whether the reads are actually hitting the disk or not. I mentioned to Kirk in [1] that his read speed might be a bit higher than what his disk can actually cope with. I'm not too sure about the HDD he mentions, but if it's a single HDD then reading at an average speed of over 3000 MB/sec seems quite a bit too fast. It seems more likely, in his case, that those reads are mostly coming from the kernel's cache.

David

[1] https://www.postgresql.org/message-id/CAApHDvoDzAzXEp+Ay2CfT3U=zcd5nld7k9_y936bnhjzs5j...@mail.gmail.com
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 22, 2020 at 10:03 AM Thomas Munro wrote: > > On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila wrote: > > Yeah, that is true but every time before the test the same amount of > > data should be present in shared buffers (or OS cache) if any which > > will help in getting consistent results. However, it is fine to > > reboot the machine as well if that is a convenient way. > > We really should have an extension (pg_prewarm?) that knows how to > kick stuff out PostgreSQL's shared buffers and the page cache > (POSIX_FADV_DONTNEED). > +1. Such an extension would be quite helpful for performance benchmarks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jul 21, 2020 at 9:33 PM Thomas Munro wrote:
>
> On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila wrote:
> > Yeah, that is true but every time before the test the same amount of
> > data should be present in shared buffers (or OS cache) if any which
> > will help in getting consistent results. However, it is fine to
> > reboot the machine as well if that is a convenient way.
>
> We really should have an extension (pg_prewarm?) that knows how to
> kick stuff out PostgreSQL's shared buffers and the page cache
> (POSIX_FADV_DONTNEED).

+1. Clearing the OS page cache on FreeBSD is non-trivial during testing. You can't do this on FreeBSD:

    sync; echo 3 > /proc/sys/vm/drop_caches

Also, it would be nice to evict only those pages from the OS page cache that are Postgres pages instead of having to drop everything.

Regards,
Soumyadeep (VMware)
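A per-file eviction tool along those lines is straightforward to sketch outside of PostgreSQL. The program below is only an illustration of the idea, not an existing extension: it asks the kernel to drop cached pages for the files named on the command line (for example, the segment files of one relation). POSIX_FADV_DONTNEED is only advisory, and how much it actually evicts differs between Linux and FreeBSD.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Ask the kernel to forget any cached pages belonging to one file. */
    static void
    evict_file_from_page_cache(const char *path)
    {
        int         fd = open(path, O_RDONLY);
        int         rc;

        if (fd < 0)
        {
            perror(path);
            return;
        }

        /* Write back dirty pages first; DONTNEED won't drop pages awaiting writeback. */
        if (fsync(fd) != 0)
            perror("fsync");

        /* offset 0, length 0 means "the whole file" */
        rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise(%s): %s\n", path, strerror(rc));

        close(fd);
    }

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            evict_file_from_page_cache(argv[i]);
        return 0;
    }

An in-core version would still need something extra to kick the same blocks out of shared_buffers, which is the part pg_prewarm doesn't cover today.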
Re: Parallel Seq Scan vs kernel read ahead
Hi David,

Apologies for the delay, I had missed these emails.

On Tue, Jul 14, 2020 at 8:52 PM David Rowley wrote:
> It would be good to know if the regression is repeatable or if it was
> affected by some other process.

These are the latest results on the same setup as [1]. (TL;DR: the FreeBSD VM with Google Persistent Disk is too unstable - there is too much variability in performance to conclude that there is a regression)

With 1 rows:

master (606c384598):
max_parallel_workers_per_gather    Time(seconds)
0 20.09s 19.77s 29.92s 69.55s

v2 patch (applied on 606c384598):
max_parallel_workers_per_gather    Time(seconds)
0 18.34s 19.68s 29.15s 69.11s

The above results were averaged across 3 runs with little or no deviation between runs. The absolute values are very different from the results reported in [1].

So, I tried to repro the regression as I had reported in [1] with 15000 rows:

master (449e14a561):
max_parallel_workers_per_gather    Time(seconds)
0    42s, 42s
1    395s, 393s
2    404s, 403s
6    403s, 403s

Thomas' patch (applied on 449e14a561):
max_parallel_workers_per_gather    Time(seconds)
0    43s, 43s
1    203s, 42s
2    42s, 42s
6    44s, 43s

v2 patch (applied on 449e14a561):
max_parallel_workers_per_gather    Time(seconds)
0    274s, 403s
1    419s, 419s
2    448s, 448s
6    137s, 419s

As you can see, I got wildly different results with 15000 rows (even between runs of the same experiment). I don't think that the environment is stable enough to tell if there is any regression. What I can say is that there are no processes apart from Postgres running on the system. Also, the environment is pretty constrained - just 1G of free hard drive space before the start of every run, when I have 15000 rows, apart from the mere 32M of shared buffers and only 4G of RAM.

I don't know much about Google Persistent Disk to be very honest. Basically, I just provisioned one when I provisioned a GCP VM for testing on FreeBSD, as Thomas had mentioned that FreeBSD UFS is a bad case for parallel seq scan.

> It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> track_io_timing = on; for each value of max_parallel_workers.

As for running EXPLAIN ANALYZE, running that on this system incurs a non-trivial amount of overhead. The overhead is simply staggering. This is the result of pg_test_timing in the FreeBSD GCP VM:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 4329.80 ns
Histogram of timing durations:
  < us   % of total      count
     1      0.00000          0
     2      0.00000          0
     4      3.08896     356710
     8     95.97096   11082616
    16      0.37748      43591
    32      0.55502      64093
    64      0.00638        737
   128      0.00118        136
   256      0.00002          2

As a point of comparison, on my local Ubuntu workstation:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 22.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.73691 2157634382
     2      2.26246   49945854
     4      0.00039       8711
     8      0.00016       3492
    16      0.00008       1689
    32      0.00000         63
    64      0.00000          1

This is why I opted to simply use \timing on.

Regards,
Soumyadeep (VMware)

[1] https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com
RE: Parallel Seq Scan vs kernel read ahead
On Wednesday, July 22, 2020 2:21 PM (GMT+9), David Rowley wrote:
> On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com wrote:
> > I used the default max_parallel_workers & max_worker_proceses which is 8 by default in postgresql.conf.
> > IOW, I ran all those tests with maximum of 8 processes set. But my
> > query planner capped both the Workers Planned and Launched at 6 for
> > some reason when increasing the value for max_parallel_workers_per_gather.
>
> max_parallel_workers_per_gather just imposes a limit on the planner as to the
> maximum number of parallel workers it may choose for a given parallel portion of
> a plan. The actual number of workers the planner will decide is best to use is
> based on the size of the relation. More pages = more workers. It sounds like in
> this case the planner didn't think it was worth using more than 6 workers.
>
> The parallel_workers reloption, when not set to -1 overwrites the planner's
> decision on how many workers to use. It'll just always try to use
> "parallel_workers".
>
> > However, when I used the ALTER TABLE SET (parallel_workers = N) based
> > from your suggestion, the query planner acquired that set value only for "Workers Planned", but not for "Workers Launched".
> > The behavior of query planner is also different when I also set the
> > value of max_worker_processes and max_parallel_workers to parallel_workers + 1.
>
> When it comes to execution, the executor is limited to how many parallel worker
> processes are available to execute the plan. If all workers happen to be busy with
> other tasks then it may find itself having to process the entire query in itself
> without any help from workers. Or there may be workers available, just not as
> many as the planner picked to execute the query.

Even though I read the documentation [1][2] on parallel query, I might not have understood it clearly yet. So thank you very much for explaining in simpler terms how the relation size, GUCs, and reloption affect the query planner's behavior. So in this test case, I shouldn't force workers planned and workers launched to be the same values, is that correct? I should just let the optimizer make its own decision.

> The number of available workers is configured with the
> "max_parallel_workers". That's set in postgresql.conf. PostgreSQL
> won't complain if you try to set a relation's parallel_workers reloption to a number
> higher than the "max_parallel_workers" GUC.
> "max_parallel_workers" is further limited by "max_worker_processes".
> Likely you'll want to set both those to at least 32 for this test, then just adjust the
> relation's parallel_workers setting for each test.

Thank you for the advice. For the same test case [3], I will use the following configuration:

shared_buffers = 32MB
max_parallel_workers = 32
max_worker_processes = 32

Maybe the relation size is also small, as you mentioned, which is why the query optimizer decided to use only 6 workers in my previous test. So let me first see whether the results vary with the above configuration while testing different values for parallel_workers.

Kind regards,
Kirk Jamison

[1] https://www.postgresql.org/docs/13/how-parallel-query-works.html
[2] https://www.postgresql.org/docs/current/runtime-config-resource.html
[3] https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com
RE: Parallel Seq Scan vs kernel read ahead
On Tuesday, July 21, 2020 7:33 PM, Amit Kapila wrote: > On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com > wrote: > > > > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote: > > > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com > > > > > > wrote: > > > > > > > > I am definitely missing something. Perhaps I think I could not > > > > understand why there's no I/O difference between the Master and > > > > Patched (V2). Or has it been already improved even without this patch? > > > > > > > > > > I don't think it is strange that you are not seeing much difference > > > because as per the initial email by Thomas this patch is not > > > supposed to give benefits on all systems. I think we wanted to > > > check that the patch should not regress performance in cases where > > > it doesn't give benefits. I think it might be okay to run with a > > > higher number of workers than you have CPUs in the system as we > > > wanted to check if such cases regress as shown by Soumyadeep above > > > [1]. Can you once try with > > > 8 and or 10 workers as well? > > > > > > > You are right. Kindly excuse me on that part, which only means it may > > or may not have any benefits on the filesystem I am using. But for > > other fs, as we can see from David's benchmarks significant > > results/benefits. > > > > Following your advice on regression test case, I increased the number > > of workers, but the query planner still capped the workers at 6, so > > the results from 6 workers onwards are almost the same. > > > > I am slightly confused if the number of workers are capped at 6, then what > exactly > the data at 32 worker count means? If you want query planner to choose more > number of workers, then I think either you need to increase the data or use > Alter > Table Set (parallel_workers = ); Oops I'm sorry, the "workers" labelled in those tables actually mean max_parallel_workers_per_gather and not parallel_workers. In the query planner, I thought the _per_gather corresponds or controls the workers planned/launched values, and those are the numbers that I used in the tables. I used the default max_parallel_workers & max_worker_proceses which is 8 by default in postgresql.conf. IOW, I ran all those tests with maximum of 8 processes set. But my query planner capped both the Workers Planned and Launched at 6 for some reason when increasing the value for max_parallel_workers_per_gather. However, when I used the ALTER TABLE SET (parallel_workers = N) based from your suggestion, the query planner acquired that set value only for "Workers Planned", but not for "Workers Launched". The behavior of query planner is also different when I also set the value of max_worker_processes and max_parallel_workers to parallel_workers + 1. For example (ran on Master), 1. Set same value as parallel_workers, but "Workers Launched" and "Workers Planned" do not match. 
max_worker_processes = 8
max_parallel_workers = 8

ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual time=16316.295..16316.295 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=44
   ->  Gather  (cost=619777.83..619778.64 rows=8 width=8) (actual time=16315.528..16316.668 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared read=442478 dirtied=442478 written=44
         ->  Partial Aggregate  (cost=618777.83..618777.84 rows=1 width=8) (actual time=16305.092..16305.092 rows=1 loops=8)
               Buffers: shared read=442478 dirtied=442478 written=44
               ->  Parallel Seq Scan on t_heap  (cost=0.00..583517.86 rows=14103986 width=0) (actual time=0.725..14290.117 rows=1250 loops=8)
                     Buffers: shared read=442478 dirtied=442478 written=44
 Planning Time: 5.327 ms
   Buffers: shared hit=17 read=10
 Execution Time: 16316.915 ms
(13 rows)

2. Match the workers launched and workers planned values (parallel_workers + 1)

max_worker_processes = 9
max_parallel_workers = 9

ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual time=16783.944..16783.944 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=442190
   ->  Ga
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 22 Jul 2020 at 18:17, k.jami...@fujitsu.com wrote: > Even though I read the documentation [1][2] on parallel query, I might not > have > understood it clearly yet. So thank you very much for explaining simpler how > the > relation size, GUCs, and reloption affect the query planner's behavior > So in this test case, I shouldn't force the workers to have same values for > workers > planned and workers launched, is it correct? To just let the optimizer do its > own decision. What you want to do is force the planner's hand with each test as to how many parallel workers it uses by setting the reloption parallel_workers to the number of workers you want to test. Just make sure the executor has enough workers to launch by setting max_parallel_workers and max_worker_processes to something high enough to conduct the tests without the executor failing to launch any workers. > Maybe the relation size is also small as you mentioned, that the query > optimizer decided > to only use 6 workers in my previous test. So let me see first if the results > would vary > again with the above configuration and testing different values for > parallel_workers. The parallel_worker reloption will overwrite the planner's choice of how many parallel workers to use. Just make sure the executor has enough to use. You'll be able to determine that from the Workers Planned matching Workers Launched. David
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com wrote: > I used the default max_parallel_workers & max_worker_proceses which is 8 by > default in postgresql.conf. > IOW, I ran all those tests with maximum of 8 processes set. But my query > planner capped both the > Workers Planned and Launched at 6 for some reason when increasing the value > for > max_parallel_workers_per_gather. max_parallel_workers_per_gather just imposes a limit on the planner as to the maximum number of parallel workers it may choose for a given parallel portion of a plan. The actual number of workers the planner will decide is best to use is based on the size of the relation. More pages = more workers. It sounds like in this case the planner didn't think it was worth using more than 6 workers. The parallel_workers reloption, when not set to -1 overwrites the planner's decision on how many workers to use. It'll just always try to use "parallel_workers". > However, when I used the ALTER TABLE SET (parallel_workers = N) based from > your suggestion, > the query planner acquired that set value only for "Workers Planned", but not > for "Workers Launched". > The behavior of query planner is also different when I also set the value of > max_worker_processes > and max_parallel_workers to parallel_workers + 1. When it comes to execution, the executor is limited to how many parallel worker processes are available to execute the plan. If all workers happen to be busy with other tasks then it may find itself having to process the entire query in itself without any help from workers. Or there may be workers available, just not as many as the planner picked to execute the query. The number of available workers is configured with the "max_parallel_workers". That's set in postgresql.conf. PostgreSQL won't complain if you try to set a relation's parallel_workers reloption to a number higher than the "max_parallel_workers" GUC. "max_parallel_workers" is further limited by "max_worker_processes". Likely you'll want to set both those to at least 32 for this test, then just adjust the relation's parallel_workers setting for each test. David
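To make that concrete, a per-test harness along these lines should do it (t_heap is the table name used in Kirk's tests; adjust the name and the worker count for each run):

    -- Give the executor plenty of headroom once, in postgresql.conf or via ALTER SYSTEM:
    --   max_worker_processes = 32
    --   max_parallel_workers = 32

    -- Then, for each test run, pin the planner to the worker count under test:
    ALTER TABLE t_heap SET (parallel_workers = 4);

    SET max_parallel_workers_per_gather = 32;  -- high enough not to be the limiting factor
    SET track_io_timing = on;

    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_heap;

    -- In the output, "Workers Planned" should now equal the parallel_workers setting,
    -- and it should also match "Workers Launched"; if it doesn't, the executor ran out
    -- of free worker processes.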
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila wrote: > Yeah, that is true but every time before the test the same amount of > data should be present in shared buffers (or OS cache) if any which > will help in getting consistent results. However, it is fine to > reboot the machine as well if that is a convenient way. We really should have an extension (pg_prewarm?) that knows how to kick stuff out PostgreSQL's shared buffers and the page cache (POSIX_FADV_DONTNEED).
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 22, 2020 at 5:25 AM David Rowley wrote: > > I understand that Amit wrote: > > On Fri, 17 Jul 2020 at 21:18, Amit Kapila wrote: > > I think recreating the database and restarting the server after each > > run might help in getting consistent results. Also, you might want to > > take median of three runs. > > Please also remember, if you're recreating the database after having > restarted the machine that creating the table will likely end up > caching some of the pages either in shared buffers or the Kernel's > cache. > Yeah, that is true but every time before the test the same amount of data should be present in shared buffers (or OS cache) if any which will help in getting consistent results. However, it is fine to reboot the machine as well if that is a convenient way. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some samples from other hardware / filesystem / os combinations.

On Tue, 21 Jul 2020 at 21:38, k.jami...@fujitsu.com wrote:
> Query Planner I/O Timings (ms):
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | "1,130.78"        | "1,250.82"       | "1,698.05"         | "1,733.44"        |
>
> | Worker | Buffers                                          |
> |--------|--------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446 |

I'm thinking the scale of this test might be a bit too small for the machine you're using to test. When you see "shared read" in the EXPLAIN (ANALYZE, BUFFERS) output, it does not necessarily mean that the page had to be read from disk. We use buffered I/O, so the page could just have been fetched from the kernel's cache.

If we do some maths on the timing: it took 1130.78 milliseconds to read 442478 pages which, assuming the standard page size of 8192 bytes, is 3457 MB in 1130.78 milliseconds, or 3057 MB/sec. Is that a realistic throughput for this machine in terms of I/O? Or do you think that some of these pages might be coming from the kernel's cache?

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results. Also, you might want to
> take median of three runs.

Please also remember, if you're recreating the database after having restarted the machine, that creating the table will likely end up caching some of the pages either in shared buffers or the kernel's cache. It would be better to leave the database intact and just reboot the machine.

I didn't really like that option with my tests, so I just increased the size of the table beyond any size that my machines could have cached. With the 16GB RAM Windows laptop, I used a 100GB table, and with the 64GB workstation, I used an 800GB table.

I think my test using SELECT * FROM t WHERE a < 0; with a table that has a padding column is likely going to be a more accurate test. Provided there are no rows with a < 0 in the table, the executor will spend almost all of the time in nodeSeqscan.c trying to find a row with a < 0. There's no additional overhead of aggregation doing the count(*). Having the additional padding column means that we read more data per evaluation of the a < 0 expression. Also, having a single-column table is not that realistic.

I'm pretty keen to see this machine running something closer to the test I mentioned in [1], but with the benchmark query I mentioned in [2] and the "t" table being at least twice the size of RAM in the machine. Larger would be better though. With such a scaled test, I don't think there's much need to reboot the machine in between. Just run a single query first to warm up the cache before timing anything. Having the table a few times larger than RAM will mean that we can be certain that the disk was actually used during the test. The more data we can be certain came from disk, the more we can trust that the results are meaningful.

Thanks again for testing this.

David

[1] https://www.postgresql.org/message-id/caaphdvrfjfyh51_wy-iqqpw8ygr4fdotxaqkqn+sa7ntkev...@mail.gmail.com
[2] https://www.postgresql.org/message-id/caaphdvo+legkmcavoipyk8nebgp-lrxns2tj1n_xnrjve9x...@mail.gmail.com
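For anyone wanting to reproduce that style of test, a sketch of the setup is below. The exact schema from [1] isn't quoted in this thread, so the column names, padding width and row count here are assumptions; the key points are the padding column and sizing the table to at least twice the machine's RAM, as suggested above.

    -- A padding column so each heap page holds fewer rows, and a predicate that
    -- matches nothing so the runtime is dominated by the scan itself.
    CREATE TABLE t (a int, padding text);

    INSERT INTO t
    SELECT i, repeat('x', 100)
    FROM generate_series(1, 100000000) AS i;   -- adjust so the table is >= 2x RAM

    VACUUM ANALYZE t;

    SET track_io_timing = on;

    -- Run once to warm the cache, then time subsequent runs:
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a < 0;   -- returns no rows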
RE: Parallel Seq Scan vs kernel read ahead
On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote: > On Fri, Jul 17, 2020 at 11:35 AM k.jami...@fujitsu.com > wrote: > > > > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote: > > > > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila wrote: > > >> > > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley > wrote: > > >>> If we've not seen any performance regressions within 1 week, then > > >>> I propose that we (pending final review) push this to allow wider > > >>> testing. > > >> > > >> I think Soumyadeep has reported a regression case [1] with the > > >> earlier version of the patch. I am not sure if we have verified > > >> that the situation improves with the latest version of the patch. > > >> I request Soumyadeep to please try once with the latest patch. > > >... > > >Yeah, it would be good to see some more data points on that test. > > >Jumping from 2 up to 6 workers just leaves us to guess where the > > >performance started to become bad. >It would be good to know if the > > >regression is repeatable or if it was affected by some other process. > > >... > > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET > > >track_io_timing = on; for each value of >max_parallel_workers. > > > > Hi, > > > > If I'm following the thread correctly, we may have gains on this patch > > of Thomas and David, but we need to test its effects on different > > filesystems. It's also been clarified by David through benchmark tests > > that synchronize_seqscans is not affected as long as the set cap per > > chunk size of parallel scan is at 8192. > > > > I also agree that having a control on this through GUC can be > > beneficial for users, however, that can be discussed in another thread > > or development in the future. > > > > David Rowley wrote: > > >I'd like to propose that if anyone wants to do further testing on > > >other operating systems with SSDs or HDDs then it would be good if > > >that could be done within a 1 week from this email. There are various > > >benchmarking ideas on this thread for inspiration. > > > > I'd like to join on testing it, this one using HDD, and at the bottom > > are the results. Due to my machine limitations, I only tested > > 0~6 workers, that even if I increase max_parallel_workers_per_gather > > more than that, the query planner would still cap the workers at 6. > > I also set the track_io_timing to on as per David's recommendation. > > > > Tested on: > > XFS filesystem, HDD virtual machine > > RHEL4, 64-bit, > > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on > > x86_64-pc-linux-gnu > > > > > > Test Case (Soumyadeep's) [1] > > > > shared_buffers = 32MB (to use OS page cache) > > > > create table t_heap as select generate_series(1, 1) i; --about > 3.4GB size > > > > SET track_io_timing = on; > > > > \timing > > > > set max_parallel_workers_per_gather = 0; --0 to 6 > > > > SELECT count(*) from t_heap; > > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; > > > > [Summary] > > I used the same query from the thread. However, the sql query > > execution time and query planner execution time results between the > > master and patched do not vary much. > > OTOH, in terms of I/O stats, I observed similar regression in both > > master and patched as we increase max_parallel_workers_per_gather. > > > > It could also be possible that each benchmark setting for > > max_parallel_workers_per_gather is affected by previous result . IOW, > > later benchmark runs benefit from the data cached by previous runs on OS > > level. 
> > > > Yeah, I think to some extent that is visible in results because, after patch, > at 0 > workers, the execution time is reduced significantly whereas there is not much > difference at other worker counts. I think for non-parallel case (0 workers), > there shouldn't be any difference. > Also, I am not sure if there is any reason why after patch the number of > shared hits > is improved, probably due to caching effects? > > > Any advice? > > I think recreating the database and restarting the server after each run > might help > in getting consistent results. Also, you might want to take median of three > runs. Thank you for the advice. I repeated the test as per your advice and average of 3 runs per worker/s planned. It still shows the following similar performance results between Master and Patch V2. I wonder why there's no difference though. The test on my machine is roughly like this: createdb test psql -d test create table t_heap as select generate_series(1, 1) i; \q pg_ctl restart psql -d test SET track_io_timing = on; SET max_parallel_workers_per_gather = 0; SHOW max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; \timing SELECT count(*) from t_heap; drop table t_heap; \q dropdb test pg_ctl restart Below are the results. Again, almost no discernible difference between the master and patch. Also, the results when max_parallel_workers_per_g
RE: Parallel Seq Scan vs kernel read ahead
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote: > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com > wrote: > > > > Thank you for the advice. I repeated the test as per your advice and > > average of 3 runs per worker/s planned. > > It still shows the following similar performance results between Master and > Patch V2. > > I wonder why there's no difference though. > > > > The test on my machine is roughly like this: > > > > createdb test > > psql -d test > > create table t_heap as select generate_series(1, 1) i; \q > > > > pg_ctl restart > > psql -d test > > SET track_io_timing = on; > > SET max_parallel_workers_per_gather = 0; SHOW > > max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT > > count(*) from t_heap; \timing SELECT count(*) from t_heap; > > > > drop table t_heap; > > \q > > dropdb test > > pg_ctl restart > > > > Below are the results. Again, almost no discernible difference between the > master and patch. > > Also, the results when max_parallel_workers_per_gather is more than 4 > > could be inaccurate due to my machine's limitation of only having v4 > > CPUs. Even so, query planner capped it at > > 6 workers. > > > > Query Planner I/O Timings (track_io_timing = on) in ms : > > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | > > | I/O WRITE (Patch) | > > > ||---|--||- > --| > > | 0 | "1,130.777" | "1,250.821" | "01,698.051" | > "01,733.439" | > > | 1 | "1,603.016" | "1,660.767" | "02,312.248" | > "02,291.661" | > > | 2 | "2,036.269" | "2,107.066" | "02,698.216" | > "02,796.893" | > > | 3 | "2,298.811" | "2,307.254" | "05,695.991" | > "05,894.183" | > > | 4 | "2,098.642" | "2,135.960" | "23,837.088" | > "26,537.158" | > > | 5 | "1,956.536" | "1,997.464" | "45,891.851" | > "48,049.338" | > > | 6 | "2,201.816" | "2,219.001" | "61,937.828" | > "67,809.486" | > > > > Query Planner Execution Time (ms): > > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) | > > ||---|--| > > | 0.000 | "40,454.252" | "40,521.578" | > > | 1.000 | "21,332.067" | "21,205.068" | > > | 2.000 | "14,266.756" | "14,385.539" | > > | 3.000 | "11,597.936" | "11,722.055" | > > | 4.000 | "12,937.468" | "13,439.247" | > > | 5.000 | "14,383.083" | "14,782.866" | > > | 6.000 | "14,671.336" | "15,507.581" | > > > > Based from the results above, the I/O latency increases as number of > > workers also increase. Despite that, the query planner execution time > > is almost closely same when 2 or more workers are used (14~11s). Same > results between Master and Patch V2. > > > > As for buffers, same results are shown per worker (both Master and Patch). 
> > | Worker | Buffers | > > ||--| > > | 0 | shared read=442478 dirtied=442478 written=442446 | > > | 1 | shared read=442478 dirtied=442478 written=442414 | > > | 2 | shared read=442478 dirtied=442478 written=442382 | > > | 3 | shared read=442478 dirtied=442478 written=442350 | > > | 4 | shared read=442478 dirtied=442478 written=442318 | > > | 5 | shared read=442478 dirtied=442478 written=442286 | > > | 6 | shared read=442478 dirtied=442478 written=442254 | > > > > > > SQL Query Execution Time (ms) : > > | Worker | SQL (Master) | SQL (Patch) | > > ||--|--| > > | 0 | "10,418.606" | "10,377.377" | > > | 1 | "05,427.460" | "05,402.727" | > > | 2 | "03,662.998" | "03,650.277" | > > | 3 | "02,718.837" | "02,692.871" | > > | 4 | "02,759.802" | "02,693.370" | > > | 5 | "02,761.834" | "02,682.590" | > > | 6 | "02,711.434" | "02,726.332" | > > > > The SQL query execution time definitely benefitted from previous run > > of query planner, so the results are faster. But again, both Master and > > Patched > have almost the same results. > > Nonetheless, the execution time is almost consistent when > > max_parallel_workers_per_gather is 2 (default) and above. > > > > I am definitely missing something. Perhaps I think I could not > > understand why there's no I/O difference between the Master and > > Patched (V2). Or has it been already improved even without this patch? > > > > I don't think it is strange that you are not seeing much difference because > as per > the initial email by Thomas this patch is not supposed to give benefits on all > systems. I think we wanted to check that the patch should not regress > performance in cases where it doesn't give benefits. I think it might be > okay to > run wi
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com wrote: > > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote: > > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com > > > > wrote: > > > > > > I am definitely missing something. Perhaps I think I could not > > > understand why there's no I/O difference between the Master and > > > Patched (V2). Or has it been already improved even without this patch? > > > > > > > I don't think it is strange that you are not seeing much difference because > > as per > > the initial email by Thomas this patch is not supposed to give benefits on > > all > > systems. I think we wanted to check that the patch should not regress > > performance in cases where it doesn't give benefits. I think it might be > > okay to > > run with a higher number of workers than you have CPUs in the system as we > > wanted to check if such cases regress as shown by Soumyadeep above [1]. Can > > you once try with > > 8 and or 10 workers as well? > > > > You are right. Kindly excuse me on that part, which only means it may or may > not have any > benefits on the filesystem I am using. But for other fs, as we can see from > David's benchmarks > significant results/benefits. > > Following your advice on regression test case, I increased the number of > workers, > but the query planner still capped the workers at 6, so the results from 6 > workers onwards > are almost the same. > I am slightly confused if the number of workers are capped at 6, then what exactly the data at 32 worker count means? If you want query planner to choose more number of workers, then I think either you need to increase the data or use Alter Table Set (parallel_workers = ); > I don't see significant difference between master and patched on my machine > as per my test results below. (Just for reconfirmation) > I see the difference of about 7-8% at higher (32) client-count. Am, I missing something? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com wrote: > > Thank you for the advice. I repeated the test as per your advice and average > of 3 runs > per worker/s planned. > It still shows the following similar performance results between Master and > Patch V2. > I wonder why there's no difference though. > > The test on my machine is roughly like this: > > createdb test > psql -d test > create table t_heap as select generate_series(1, 1) i; > \q > > pg_ctl restart > psql -d test > SET track_io_timing = on; > SET max_parallel_workers_per_gather = 0; > SHOW max_parallel_workers_per_gather; > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; > \timing > SELECT count(*) from t_heap; > > drop table t_heap; > \q > dropdb test > pg_ctl restart > > Below are the results. Again, almost no discernible difference between the > master and patch. > Also, the results when max_parallel_workers_per_gather is more than 4 could > be inaccurate > due to my machine's limitation of only having v4 CPUs. Even so, query planner > capped it at > 6 workers. > > Query Planner I/O Timings (track_io_timing = on) in ms : > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O > WRITE (Patch) | > ||---|--||---| > | 0 | "1,130.777" | "1,250.821" | "01,698.051" | > "01,733.439" | > | 1 | "1,603.016" | "1,660.767" | "02,312.248" | > "02,291.661" | > | 2 | "2,036.269" | "2,107.066" | "02,698.216" | > "02,796.893" | > | 3 | "2,298.811" | "2,307.254" | "05,695.991" | > "05,894.183" | > | 4 | "2,098.642" | "2,135.960" | "23,837.088" | > "26,537.158" | > | 5 | "1,956.536" | "1,997.464" | "45,891.851" | > "48,049.338" | > | 6 | "2,201.816" | "2,219.001" | "61,937.828" | > "67,809.486" | > > Query Planner Execution Time (ms): > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) | > ||---|--| > | 0.000 | "40,454.252" | "40,521.578" | > | 1.000 | "21,332.067" | "21,205.068" | > | 2.000 | "14,266.756" | "14,385.539" | > | 3.000 | "11,597.936" | "11,722.055" | > | 4.000 | "12,937.468" | "13,439.247" | > | 5.000 | "14,383.083" | "14,782.866" | > | 6.000 | "14,671.336" | "15,507.581" | > > Based from the results above, the I/O latency increases as number of workers > also increase. Despite that, the query planner execution time is almost > closely same > when 2 or more workers are used (14~11s). Same results between Master and > Patch V2. > > As for buffers, same results are shown per worker (both Master and Patch). > | Worker | Buffers | > ||--| > | 0 | shared read=442478 dirtied=442478 written=442446 | > | 1 | shared read=442478 dirtied=442478 written=442414 | > | 2 | shared read=442478 dirtied=442478 written=442382 | > | 3 | shared read=442478 dirtied=442478 written=442350 | > | 4 | shared read=442478 dirtied=442478 written=442318 | > | 5 | shared read=442478 dirtied=442478 written=442286 | > | 6 | shared read=442478 dirtied=442478 written=442254 | > > > SQL Query Execution Time (ms) : > | Worker | SQL (Master) | SQL (Patch) | > ||--|--| > | 0 | "10,418.606" | "10,377.377" | > | 1 | "05,427.460" | "05,402.727" | > | 2 | "03,662.998" | "03,650.277" | > | 3 | "02,718.837" | "02,692.871" | > | 4 | "02,759.802" | "02,693.370" | > | 5 | "02,761.834" | "02,682.590" | > | 6 | "02,711.434" | "02,726.332" | > > The SQL query execution time definitely benefitted from previous run of query > planner, > so the results are faster. But again, both Master and Patched have almost the > same results. 
> Nonetheless, the execution time is almost consistent when > max_parallel_workers_per_gather is 2 (default) and above. > > I am definitely missing something. Perhaps I think I could not understand why > there's no > I/O difference between the Master and Patched (V2). Or has it been already > improved > even without this patch? > I don't think it is strange that you are not seeing much difference because as per the initial email by Thomas this patch is not supposed to give benefits on all systems. I think we wanted to check that the patch should not regress performance in cases where it doesn't give benefits. I think it might be okay to run with a higher number of workers than you have CPUs in the system as we wanted to check if such cases regress as shown by Soumyadeep above [1]. Can you once try with 8 and or 10 workers as well? [1] - https://www.postgresql.org/me
RE: Parallel Seq Scan vs kernel read ahead
On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote: >On Wed, 15 Jul 2020 at 14:51, Amit Kapila wrote: >> >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley wrote: >>> If we've not seen any performance regressions within 1 week, then I >>> propose that we (pending final review) push this to allow wider >>> testing. >> >> I think Soumyadeep has reported a regression case [1] with the earlier >> version of the patch. I am not sure if we have verified that the >> situation improves with the latest version of the patch. I request >> Soumyadeep to please try once with the latest patch. >... >Yeah, it would be good to see some more data points on that test. >Jumping from 2 up to 6 workers just leaves us to guess where the performance >started to become bad. >It would be good to know if the regression is >repeatable or if it was affected by some other process. >... >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = >on; >for each value of >max_parallel_workers. Hi, If I'm following the thread correctly, we may have gains on this patch of Thomas and David, but we need to test its effects on different filesystems. It's also been clarified by David through benchmark tests that synchronize_seqscans is not affected as long as the set cap per chunk size of parallel scan is at 8192. I also agree that having a control on this through GUC can be beneficial for users, however, that can be discussed in another thread or development in the future. David Rowley wrote: >I'd like to propose that if anyone wants to do further testing on >other operating systems with SSDs or HDDs then it would be good if >that could be done within a 1 week from this email. There are various >benchmarking ideas on this thread for inspiration. I'd like to join on testing it, this one using HDD, and at the bottom are the results. Due to my machine limitations, I only tested 0~6 workers, that even if I increase max_parallel_workers_per_gather more than that, the query planner would still cap the workers at 6. I also set the track_io_timing to on as per David's recommendation. Tested on: XFS filesystem, HDD virtual machine RHEL4, 64-bit, 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on x86_64-pc-linux-gnu Test Case (Soumyadeep's) [1] shared_buffers = 32MB (to use OS page cache) create table t_heap as select generate_series(1, 1) i; --about 3.4GB size SET track_io_timing = on; \timing set max_parallel_workers_per_gather = 0; --0 to 6 SELECT count(*) from t_heap; EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; [Summary] I used the same query from the thread. However, the sql query execution time and query planner execution time results between the master and patched do not vary much. OTOH, in terms of I/O stats, I observed similar regression in both master and patched as we increase max_parallel_workers_per_gather. It could also be possible that each benchmark setting for max_parallel_workers_per_gather is affected by previous result . IOW, later benchmark runs benefit from the data cached by previous runs on OS level. Any advice? Please refer to tables below for results. 
(MASTER/UNPATCHED)
| Parallel Workers | SQLExecTime  | PlannerExecTime | Buffers                     |
|------------------|--------------|-----------------|-----------------------------|
| 0                | 12942.606 ms | 37031.786 ms    | shared hit=32 read=442446   |
| 1                | 4959.567 ms  | 17601.813 ms    | shared hit=128 read=442350  |
| 2                | 3273.610 ms  | 11766.441 ms    | shared hit=288 read=442190  |
| 3                | 2449.342 ms  | 9057.236 ms     | shared hit=512 read=441966  |
| 4                | 2482.404 ms  | 8853.702 ms     | shared hit=800 read=441678  |
| 5                | 2430.944 ms  | 8777.630 ms     | shared hit=1152 read=441326 |
| 6                | 2493.416 ms  | 8798.200 ms     | shared hit=1568 read=440910 |

(PATCHED V2)
| Parallel Workers | SQLExecTime  | PlannerExecTime | Buffers                     |
|------------------|--------------|-----------------|-----------------------------|
| 0                | 9283.193 ms  | 34471.050 ms    | shared hit=2624 read=439854 |
| 1                | 4872.728 ms  | 17449.725 ms    | shared hit=2528 read=439950 |
| 2                | 3240.301 ms  | 11556.243 ms    | shared hit=2368 read=440110 |
| 3                | 2419.512 ms  | 8709.572 ms     | shared hit=2144 read=440334 |
| 4                | 2746.820 ms  | 8768.812 ms     | shared hit=1856 read=440622 |
| 5                | 2424.687 ms  | 8699.762 ms     | shared hit=1504 read=440974 |
| 6                | 2581.999 ms  | 8627.627 ms     | shared hit=1440 read=441038 |

(I/O Read Stat)
| Parallel Workers | I/O (Master)  | I/O (Patched) |
|------------------|---------------|---------------|
| 0                | read=1850.233 | read=1071.209 |
| 1                | read=1246.939
Re: Parallel Seq Scan vs kernel read ahead
On Fri, Jul 17, 2020 at 11:35 AM k.jami...@fujitsu.com wrote: > > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote: > > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila wrote: > >> > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley wrote: > >>> If we've not seen any performance regressions within 1 week, then I > >>> propose that we (pending final review) push this to allow wider > >>> testing. > >> > >> I think Soumyadeep has reported a regression case [1] with the earlier > >> version of the patch. I am not sure if we have verified that the > >> situation improves with the latest version of the patch. I request > >> Soumyadeep to please try once with the latest patch. > >... > >Yeah, it would be good to see some more data points on that test. > >Jumping from 2 up to 6 workers just leaves us to guess where the performance > >started to become bad. >It would be good to know if the regression is > >repeatable or if it was affected by some other process. > >... > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing > >= on; > >for each value of >max_parallel_workers. > > Hi, > > If I'm following the thread correctly, we may have gains on this patch > of Thomas and David, but we need to test its effects on different > filesystems. It's also been clarified by David through benchmark tests > that synchronize_seqscans is not affected as long as the set cap per > chunk size of parallel scan is at 8192. > > I also agree that having a control on this through GUC can be > beneficial for users, however, that can be discussed in another > thread or development in the future. > > David Rowley wrote: > >I'd like to propose that if anyone wants to do further testing on > >other operating systems with SSDs or HDDs then it would be good if > >that could be done within a 1 week from this email. There are various > >benchmarking ideas on this thread for inspiration. > > I'd like to join on testing it, this one using HDD, and at the bottom > are the results. Due to my machine limitations, I only tested > 0~6 workers, that even if I increase max_parallel_workers_per_gather > more than that, the query planner would still cap the workers at 6. > I also set the track_io_timing to on as per David's recommendation. > > Tested on: > XFS filesystem, HDD virtual machine > RHEL4, 64-bit, > 4 CPUs, Intel Core Processor (Haswell, IBRS) > PostgreSQL 14devel on x86_64-pc-linux-gnu > > > Test Case (Soumyadeep's) [1] > > shared_buffers = 32MB (to use OS page cache) > > create table t_heap as select generate_series(1, 1) i; --about > 3.4GB size > > SET track_io_timing = on; > > \timing > > set max_parallel_workers_per_gather = 0; --0 to 6 > > SELECT count(*) from t_heap; > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; > > [Summary] > I used the same query from the thread. However, the sql query execution time > and query planner execution time results between the master and patched do > not vary much. > OTOH, in terms of I/O stats, I observed similar regression in both master > and patched as we increase max_parallel_workers_per_gather. > > It could also be possible that each benchmark setting for > max_parallel_workers_per_gather > is affected by previous result . IOW, later benchmark runs benefit from the > data cached by > previous runs on OS level. > Yeah, I think to some extent that is visible in results because, after patch, at 0 workers, the execution time is reduced significantly whereas there is not much difference at other worker counts. 
I think for the non-parallel case (0 workers), there shouldn't be any difference. Also, I am not sure if there is any reason why the number of shared hits is improved after the patch; probably it is due to caching effects?

> Any advice?

I think recreating the database and restarting the server after each run might help in getting consistent results. Also, you might want to take the median of three runs.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 15 Jul 2020 at 14:51, Amit Kapila wrote: > > On Wed, Jul 15, 2020 at 5:55 AM David Rowley wrote: > > If we've not seen any performance regressions within 1 week, then I > > propose that we (pending final review) push this to allow wider > > testing. > > I think Soumyadeep has reported a regression case [1] with the earlier > version of the patch. I am not sure if we have verified that the > situation improves with the latest version of the patch. I request > Soumyadeep to please try once with the latest patch. Yeah, it would be good to see some more data points on that test. Jumping from 2 up to 6 workers just leaves us to guess where the performance started to become bad. It would be good to know if the regression is repeatable or if it was affected by some other process. I see the disk type on that report was Google PersistentDisk. I don't pretend to be any sort of expert on network filesystems, but I guess a regression would be possible in that test case if say there was an additional layer of caching of very limited size between the kernel cache and the disks, maybe on a remote machine. If it were doing some sort of prefetching to try to reduce latency and requests to the actual disks then perhaps going up to 6 workers with 64 chunk size (as Thomas' patch used at that time) caused more cache misses on that cache due to the requests exceeding what had already been prefetched. That's just a stab in the dark. Maybe someone with knowledge of these network file systems can come up with a better theory. It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = on; for each value of max_parallel_workers. David > [1] - > https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 15, 2020 at 5:55 AM David Rowley wrote: > > On Tue, 14 Jul 2020 at 19:13, Thomas Munro wrote: > > > > On Fri, Jun 26, 2020 at 3:33 AM Robert Haas wrote: > > > On Tue, Jun 23, 2020 at 11:53 PM David Rowley > > > wrote: > > > > In summary, based on these tests, I don't think we're making anything > > > > worse in regards to synchronize_seqscans if we cap the maximum number > > > > of blocks to allocate to each worker at once to 8192. Perhaps there's > > > > some argument for using something smaller than that for servers with > > > > very little RAM, but I don't personally think so as it still depends > > > > on the table size and It's hard to imagine tables in the hundreds of > > > > GBs on servers that struggle with chunk allocations of 16MB. The > > > > table needs to be at least ~70GB to get a 8192 chunk size with the > > > > current v2 patch settings. > > > > > > Nice research. That makes me happy. I had a feeling the maximum useful > > > chunk size ought to be more in this range than the larger values we > > > were discussing before, but I didn't even think about the effect on > > > synchronized scans. > > > > +1. This seems about right to me. We can always reopen the > > discussion if someone shows up with evidence in favour of a tweak to > > the formula, but this seems to address the basic problem pretty well, > > and also fits nicely with future plans for AIO and DIO. > > Thank you both of you for having a look at the results. > > I'm now pretty happy with this too. I do understand that we've not > exactly exhaustively tested all our supported operating systems. > However, we've seen some great speedups with Windows 10 and Linux with > SSDs. Thomas saw great speedups with FreeBSD with the original patch > using chunk sizes of 64 blocks. (I wonder if it's worth verifying that > it increases further with the latest patch with the same test you did > in the original email on this thread?) > > I'd like to propose that if anyone wants to do further testing on > other operating systems with SSDs or HDDs then it would be good if > that could be done within a 1 week from this email. There are various > benchmarking ideas on this thread for inspiration. > Yeah, I agree it would be good if we could do what you said. > If we've not seen any performance regressions within 1 week, then I > propose that we (pending final review) push this to allow wider > testing. I think Soumyadeep has reported a regression case [1] with the earlier version of the patch. I am not sure if we have verified that the situation improves with the latest version of the patch. I request Soumyadeep to please try once with the latest patch. [1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Tue, 14 Jul 2020 at 19:13, Thomas Munro wrote: > > On Fri, Jun 26, 2020 at 3:33 AM Robert Haas wrote: > > On Tue, Jun 23, 2020 at 11:53 PM David Rowley wrote: > > > In summary, based on these tests, I don't think we're making anything > > > worse in regards to synchronize_seqscans if we cap the maximum number > > > of blocks to allocate to each worker at once to 8192. Perhaps there's > > > some argument for using something smaller than that for servers with > > > very little RAM, but I don't personally think so as it still depends > > > on the table size and It's hard to imagine tables in the hundreds of > > > GBs on servers that struggle with chunk allocations of 16MB. The > > > table needs to be at least ~70GB to get a 8192 chunk size with the > > > current v2 patch settings. > > > > Nice research. That makes me happy. I had a feeling the maximum useful > > chunk size ought to be more in this range than the larger values we > > were discussing before, but I didn't even think about the effect on > > synchronized scans. > > +1. This seems about right to me. We can always reopen the > discussion if someone shows up with evidence in favour of a tweak to > the formula, but this seems to address the basic problem pretty well, > and also fits nicely with future plans for AIO and DIO. Thank you both of you for having a look at the results. I'm now pretty happy with this too. I do understand that we've not exactly exhaustively tested all our supported operating systems. However, we've seen some great speedups with Windows 10 and Linux with SSDs. Thomas saw great speedups with FreeBSD with the original patch using chunk sizes of 64 blocks. (I wonder if it's worth verifying that it increases further with the latest patch with the same test you did in the original email on this thread?) I'd like to propose that if anyone wants to do further testing on other operating systems with SSDs or HDDs then it would be good if that could be done within a 1 week from this email. There are various benchmarking ideas on this thread for inspiration. If we've not seen any performance regressions within 1 week, then I propose that we (pending final review) push this to allow wider testing. It seems we're early enough in the PG14 cycle that there's a large window of time for us to do something about any reported performance regressions that come in. I also have in mind that Amit was keen to see a GUC or reloption to allow users to control this. My thoughts on that are still that it would be possible to craft a case where we scan an entire heap to get a very small number of rows that are all located in the same area in the table and then call some expensive function on those rows. The chunk size ramp down code will help reduce the chances of one worker running on much longer than its co-workers, but not eliminate the chances. Even the code as it stands today could suffer from this to a lesser extent if all the matching rows are on a single page. My current thoughts are that this just seems unlikely and that the granularity of 1 block for cases like this was never that great anyway. I suppose a more ideal plan shape would "Distribute" matching rows to allow another set of workers to pick these rows up one-by-one and process them. Our to-date lack of such an operator probably counts a little towards the fact that one parallel worker being tied up with a large amount of work is not that common. Based on those thoughts, I'd like to avoid any GUC/reloption until we see evidence that it's really needed. 
Any objections to any of the above? David
Re: Parallel Seq Scan vs kernel read ahead
On Fri, Jun 26, 2020 at 3:33 AM Robert Haas wrote: > On Tue, Jun 23, 2020 at 11:53 PM David Rowley wrote: > > In summary, based on these tests, I don't think we're making anything > > worse in regards to synchronize_seqscans if we cap the maximum number > > of blocks to allocate to each worker at once to 8192. Perhaps there's > > some argument for using something smaller than that for servers with > > very little RAM, but I don't personally think so as it still depends > > on the table size and It's hard to imagine tables in the hundreds of > > GBs on servers that struggle with chunk allocations of 16MB. The > > table needs to be at least ~70GB to get a 8192 chunk size with the > > current v2 patch settings. > > Nice research. That makes me happy. I had a feeling the maximum useful > chunk size ought to be more in this range than the larger values we > were discussing before, but I didn't even think about the effect on > synchronized scans. +1. This seems about right to me. We can always reopen the discussion if someone shows up with evidence in favour of a tweak to the formula, but this seems to address the basic problem pretty well, and also fits nicely with future plans for AIO and DIO.
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jun 23, 2020 at 11:53 PM David Rowley wrote: > In summary, based on these tests, I don't think we're making anything > worse in regards to synchronize_seqscans if we cap the maximum number > of blocks to allocate to each worker at once to 8192. Perhaps there's > some argument for using something smaller than that for servers with > very little RAM, but I don't personally think so as it still depends > on the table size and It's hard to imagine tables in the hundreds of > GBs on servers that struggle with chunk allocations of 16MB. The > table needs to be at least ~70GB to get a 8192 chunk size with the > current v2 patch settings. Nice research. That makes me happy. I had a feeling the maximum useful chunk size ought to be more in this range than the larger values we were discussing before, but I didn't even think about the effect on synchronized scans. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Mon, Jun 22, 2020 at 11:29 PM David Rowley wrote: > On Tue, 23 Jun 2020 at 07:33, Robert Haas wrote: > > On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela > wrote: > > > Questions: > > > 1. Why acquire and release lock in retry: loop. > > > > This is a super-bad idea. Note the coding rule mentioned in spin.h. > > There are many discussion on this mailing list about the importance of > > keeping the critical section for a spinlock to a few instructions. > > Calling another function that *itself acquires an LWLock* is > > definitely not OK. > > Just a short history lesson for Ranier to help clear up any confusion: > > Back before 3cda10f41 there was some merit in improving the > performance of these functions. Before that, we used to dish out pages > under a lock. With that old method, if given enough workers and a > simple enough query, we could start to see workers waiting on the lock > just to obtain the next block number they're to work on. After the > atomics were added in that commit, we didn't really see that again. > It is a good explanation. I already imagined it could be to help other processes, but I still wasn't sure. However, I did a test with this modification (lock before retry), and it worked. > > What we're trying to fix here is the I/O pattern that these functions > induce and that's all we should be doing here. Changing this is > tricky to get right as we need to consider so many operating systems > and how they deal with I/O readahead. > Yes, I understand that the focus here is I/O. Sorry for the noise. regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Tue, 23 Jun 2020 at 07:33, Robert Haas wrote: > On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela wrote: > > Questions: > > 1. Why acquire and release lock in retry: loop. > > This is a super-bad idea. Note the coding rule mentioned in spin.h. > There are many discussion on this mailing list about the importance of > keeping the critical section for a spinlock to a few instructions. > Calling another function that *itself acquires an LWLock* is > definitely not OK. Just a short history lesson for Ranier to help clear up any confusion: Back before 3cda10f41 there was some merit in improving the performance of these functions. Before that, we used to dish out pages under a lock. With that old method, if given enough workers and a simple enough query, we could start to see workers waiting on the lock just to obtain the next block number they're to work on. After the atomics were added in that commit, we didn't really see that again. What we're trying to fix here is the I/O pattern that these functions induce and that's all we should be doing here. Changing this is tricky to get right as we need to consider so many operating systems and how they deal with I/O readahead. David
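To make the history lesson above concrete, here is a toy single-file model of handing out one block at a time with an atomic fetch-and-add instead of under a lock, which is roughly what the post-3cda10f41 code does (the real code uses pg_atomic_fetch_add_u64 on the shared scan descriptor). This is a simplified illustration using C11 atomics, not PostgreSQL's actual implementation.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Shared counter: the next block number to hand out to any worker. */
static _Atomic uint64_t next_block;

/* Each worker claims its next block with a single fetch-and-add; nobody
 * ever waits on a lock just to learn which page to read next. */
static int64_t
claim_next_block(uint64_t nblocks)
{
	uint64_t	n = atomic_fetch_add(&next_block, 1);

	return (n < nblocks) ? (int64_t) n : -1;	/* -1 means scan finished */
}

int
main(void)
{
	int64_t		b;

	while ((b = claim_next_block(8)) >= 0)
		printf("claimed block %lld\n", (long long) b);
	return 0;
}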
Re: Parallel Seq Scan vs kernel read ahead
On Tue, 23 Jun 2020 at 09:52, Thomas Munro wrote: > > On Fri, Jun 19, 2020 at 2:10 PM David Rowley wrote: > > Here's a patch which caps the maximum chunk size to 131072. If > > someone doubles the page size then that'll be 2GB instead of 1GB. I'm > > not personally worried about that. > > I wonder how this interacts with the sync scan feature. It has a > conflicting goal... Of course, syncscan relies on subsequent scanners finding buffers cached, either in (ideally) shared buffers or the kernel cache. The scans need to be roughly synchronised for that to work. If we go and make the chunk size too big, then that'll reduce the chances of useful buffers being found by subsequent scans. It sounds like a good reason to try to find the smallest chunk size that allows readahead to work well. The benchmarks I did on Windows certainly show that there are diminishing returns when the chunk size gets larger, so capping it at some number of megabytes would probably be a good idea. It would just take a bit of work to figure out how many megabytes that should be. Likely it's going to depend on the size of shared buffers and how much memory the machine has got, but also what other work is going on that might be evicting buffers at the same time. Perhaps something in the range of 2-16MB would be ok. I can do some tests with that and see if I can get the same performance as with the larger chunks. David
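For reference, the block counts discussed in this thread map to the following data volumes at the default 8kB BLCKSZ; a trivial sketch just to relate the 2-16MB range mentioned above to the chunk sizes quoted elsewhere. The list of block counts is an arbitrary sample.

#include <stdio.h>

int
main(void)
{
	/* Chunk size in blocks vs. data volume at the default 8kB block size. */
	const unsigned kb_per_block = 8;
	const unsigned chunks[] = {64, 256, 1024, 2048, 8192, 131072};

	for (int i = 0; i < 6; i++)
		printf("%6u blocks = %u kB\n", chunks[i], chunks[i] * kb_per_block);
	return 0;
}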
Re: Parallel Seq Scan vs kernel read ahead
On Fri, Jun 19, 2020 at 2:10 PM David Rowley wrote: > Here's a patch which caps the maximum chunk size to 131072. If > someone doubles the page size then that'll be 2GB instead of 1GB. I'm > not personally worried about that. I wonder how this interacts with the sync scan feature. It has a conflicting goal...
Re: Parallel Seq Scan vs kernel read ahead
On Mon, Jun 22, 2020 at 4:33 PM Robert Haas wrote: > Ranier, > > This topic is largely unrelated to the current thread. Also... > Well, I was trying to improve the patch for the current thread. Or perhaps you are referring to something else, which I may not have understood. > > On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela > wrote: > > Questions: > > 1. Why acquire and release lock in retry: loop. > > This is a super-bad idea. Note the coding rule mentioned in spin.h. > There are many discussion on this mailing list about the importance of > keeping the critical section for a spinlock to a few instructions. > Calling another function that *itself acquires an LWLock* is > definitely not OK. > Perhaps I was not clear and it is another misunderstanding. I am not suggesting a function to acquire the lock. By the way, I did the tests with this change and it worked perfectly. But, as it is someone else's patch, I asked in order to learn. By the way, my suggestion has fewer instructions than the patch. The only change I asked about is why acquire and release the lock repeatedly within the goto retry loop, when you already have it. If I can acquire the lock before retry: and release it only at the end when I leave table_block_parallelscan_startblock_init, why not do it? I will attach the suggested excerpt so that I have no doubts about what I am asking. regards, Ranier Vilela void table_block_parallelscan_startblock_init(Relation rel, ParallelBlockTableScanWork workspace, ParallelBlockTableScanDesc pbscan) { BlockNumber sync_startpage = InvalidBlockNumber; /* Reset the state we use for controlling allocation size. */ memset(workspace, 0, sizeof(*workspace)); StaticAssertStmt(MaxBlockNumber <= 0xFFFFFFFE, "pg_nextpower2_32 may be too small for non-standard BlockNumber width"); workspace->phsw_chunk_size = pg_nextpower2_32(Max(pbscan->phs_nblocks / PARALLEL_SEQSCAN_NCHUNKS, 1)); /* Ensure we don't go over the maximum size with larger rels */ workspace->phsw_chunk_size = Min(workspace->phsw_chunk_size, PARALLEL_SEQSCAN_MAX_CHUNK_SIZE); /* Grab the spinlock. */ SpinLockAcquire(&pbscan->phs_mutex); retry: /* * If the scan's startblock has not yet been initialized, we must do so * now. If this is not a synchronized scan, we just start at block 0, but * if it is a synchronized scan, we must get the starting position from * the synchronized scan machinery. We can't hold the spinlock while * doing that, though, so release the spinlock, get the information we * need, and retry. If nobody else has initialized the scan in the * meantime, we'll fill in the value we fetched on the second time * through. */ if (pbscan->phs_startblock == InvalidBlockNumber) { if (!pbscan->base.phs_syncscan) pbscan->phs_startblock = 0; else if (sync_startpage != InvalidBlockNumber) pbscan->phs_startblock = sync_startpage; else { sync_startpage = ss_get_location(rel, pbscan->phs_nblocks); goto retry; } } SpinLockRelease(&pbscan->phs_mutex); }
Re: Parallel Seq Scan vs kernel read ahead
Ranier, This topic is largely unrelated to the current thread. Also... On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela wrote: > Questions: > 1. Why acquire and release lock in retry: loop. This is a super-bad idea. Note the coding rule mentioned in spin.h. There are many discussions on this mailing list about the importance of keeping the critical section for a spinlock to a few instructions. Calling another function that *itself acquires an LWLock* is definitely not OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
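For anyone unfamiliar with the rule being referenced: ss_get_location() acquires an LWLock internally, and spin.h forbids doing anything that heavyweight inside a spinlock critical section, which is why the patch releases the spinlock before the synchronized-scan call and then retries. The fragment below sketches the shape of that pattern; it mirrors the code quoted elsewhere in this thread and is illustrative rather than a verbatim copy.

retry:
	SpinLockAcquire(&pbscan->phs_mutex);
	if (pbscan->phs_startblock == InvalidBlockNumber)
	{
		if (!pbscan->base.phs_syncscan)
			pbscan->phs_startblock = 0;
		else if (sync_startpage != InvalidBlockNumber)
			pbscan->phs_startblock = sync_startpage;
		else
		{
			/* ss_get_location() takes an LWLock, so drop the spinlock first */
			SpinLockRelease(&pbscan->phs_mutex);
			sync_startpage = ss_get_location(rel, pbscan->phs_nblocks);
			goto retry;			/* re-acquire and re-check */
		}
	}
	SpinLockRelease(&pbscan->phs_mutex);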
Re: Parallel Seq Scan vs kernel read ahead
Em seg., 22 de jun. de 2020 às 02:53, David Rowley escreveu: > On Mon, 22 Jun 2020 at 16:54, David Rowley wrote: > > I also tested this an AMD machine running Ubuntu 20.04 on kernel > > version 5.4.0-37. I used the same 100GB table I mentioned in [1], but > > with the query "select * from t where a < 0;", which saves having to > > do any aggregate work. > > I just wanted to add a note here that Thomas and I just discussed this > a bit offline. He recommended I try setting the kernel readhead a bit > higher. > > It was set to 128kB, so I cranked it up to 2MB with: > > sudo blockdev --setra 4096 /dev/nvme0n1p2 > > I didn't want to run the full test again as it took quite a long time, > so I just tried with 32 workers. > > The first two results here are taken from the test results I just > posted 1 hour ago. > > Master readhead=128kB = 89921.283 ms > v2 patch readhead=128kB = 36085.642 ms > master readhead=2MB = 60984.905 ms > v2 patch readhead=2MB = 22611.264 ms > Hi, redoing the tests with v2 here. notebook with i5, 8GB, 256 GB (SSD) Windows 10 64 bits (2004 msvc 2019 64 bits Postgresql head (with v2 patch) Configuration: none Connection local ipv4 (not localhost) create table t (a int, b text); insert into t select x,md5(x::text) from generate_series(1,100*1572.7381809)x; vacuum freeze t; set max_parallel_workers_per_gather = 0; Time: 354211,826 ms (05:54,212) set max_parallel_workers_per_gather = 1; Time: 332805,773 ms (05:32,806) set max_parallel_workers_per_gather = 2; Time: 282566,711 ms (04:42,567) set max_parallel_workers_per_gather = 3; Time: 263383,945 ms (04:23,384) set max_parallel_workers_per_gather = 4; Time: 255728,259 ms (04:15,728) set max_parallel_workers_per_gather = 5; Time: 238288,720 ms (03:58,289) set max_parallel_workers_per_gather = 6; Time: 238647,792 ms (03:58,648) set max_parallel_workers_per_gather = 7; Time: 231295,763 ms (03:51,296) set max_parallel_workers_per_gather = 8; Time: 232502,828 ms (03:52,503) set max_parallel_workers_per_gather = 9; Time: 230970,604 ms (03:50,971) set max_parallel_workers_per_gather = 10; Time: 232104,182 ms (03:52,104) set max_parallel_workers_per_gather = 8; postgres=# explain select count(*) from t; QUERY PLAN --- Finalize Aggregate (cost=15564556.43..15564556.44 rows=1 width=8) -> Gather (cost=15564555.60..15564556.41 rows=8 width=8) Workers Planned: 8 -> Partial Aggregate (cost=15563555.60..15563555.61 rows=1 width=8) -> Parallel Seq Scan on t (cost=0.00..15072074.88 rows=196592288 width=0) (5 rows) Questions: 1. Why acquire and release lock in retry: loop. Wouldn't that be better? /* Grab the spinlock. */ SpinLockAcquire(&pbscan->phs_mutex); retry: /* * If the scan's startblock has not yet been initialized, we must do so * now. If this is not a synchronized scan, we just start at block 0, but * if it is a synchronized scan, we must get the starting position from * the synchronized scan machinery. We can't hold the spinlock while * doing that, though, so release the spinlock, get the information we * need, and retry. If nobody else has initialized the scan in the * meantime, we'll fill in the value we fetched on the second time * through. */ if (pbscan->phs_startblock == InvalidBlockNumber) { if (!pbscan->base.phs_syncscan) pbscan->phs_startblock = 0; else if (sync_startpage != InvalidBlockNumber) pbscan->phs_startblock = sync_startpage; else { sync_startpage = ss_get_location(rel, pbscan->phs_nblocks); goto retry; } } SpinLockRelease(&pbscan->phs_mutex); } Acquire lock once, before retry? 2. 
Is there any configuration to improve performance? regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Sun, Jun 21, 2020 at 6:52 PM David Rowley wrote: > Perhaps that's not a problem though, but then again, perhaps just > keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok. > The benchmarks I did on Windows [1] showed that the returns diminished > once we started making the step size some decent amount so my thoughts > are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large > enough that it'll make no difference to the performance anyway. So > there's probably not much point in giving it too much thought. > > Perhaps pg_nextpower2_32(RELSEG_SIZE) would be okay though. I guess I don't care that much; it was just a thought. Maybe tying it to RELSEG_SIZE is a bad idea anyway. After all, what if we find cases where 1GB is too much? Like, how much benefit do we get from making it 1GB rather than 64MB, say? I don't think we should be making this value big just because we can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Mon, 22 Jun 2020 at 16:54, David Rowley wrote: > I also tested this an AMD machine running Ubuntu 20.04 on kernel > version 5.4.0-37. I used the same 100GB table I mentioned in [1], but > with the query "select * from t where a < 0;", which saves having to > do any aggregate work. I just wanted to add a note here that Thomas and I just discussed this a bit offline. He recommended I try setting the kernel readahead a bit higher. It was set to 128kB, so I cranked it up to 2MB with: sudo blockdev --setra 4096 /dev/nvme0n1p2 I didn't want to run the full test again as it took quite a long time, so I just tried with 32 workers. The first two results here are taken from the test results I just posted 1 hour ago. Master readahead=128kB = 89921.283 ms v2 patch readahead=128kB = 36085.642 ms master readahead=2MB = 60984.905 ms v2 patch readahead=2MB = 22611.264 ms There must be a fairly large element of reading from the page cache there since 22.6 seconds means 4528MB/sec over the 100GB table. The maximum for a PCIe 3.0 x4 slot is 3940MB/s. David
Re: Parallel Seq Scan vs kernel read ahead
On Sat, 20 Jun 2020 at 08:00, Robert Haas wrote: > > On Thu, Jun 18, 2020 at 10:10 PM David Rowley wrote: > > Here's a patch which caps the maximum chunk size to 131072. If > > someone doubles the page size then that'll be 2GB instead of 1GB. I'm > > not personally worried about that. > > Maybe use RELSEG_SIZE? I was hoping to keep the guarantees that the chunk size is always a power of 2. If, for example, someone configured PostgreSQL --with-segsize=3, then RELSEG_SIZE would be 393216 with the standard BLCKSZ. Not having it a power of 2 does mean the ramp-down is more uneven when the sizes become very small: postgres=# select 393216>>x from generate_Series(0,18)x; ?column? -- 393216 196608 98304 49152 24576 12288 6144 3072 1536 768 384 192 96 48 24 12 6 3 1 (19 rows) Perhaps that's not a problem though, but then again, perhaps just keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok. The benchmarks I did on Windows [1] showed that the returns diminished once we started making the step size some decent amount so my thoughts are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large enough that it'll make no difference to the performance anyway. So there's probably not much point in giving it too much thought. Perhaps pg_nextpower2_32(RELSEG_SIZE) would be okay though. David [1] https://www.postgresql.org/message-id/caaphdvoppka+q5y_k_6cuv4u6dphmz771veumuzls3d3mwy...@mail.gmail.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 18, 2020 at 10:10 PM David Rowley wrote: > Here's a patch which caps the maximum chunk size to 131072. If > someone doubles the page size then that'll be 2GB instead of 1GB. I'm > not personally worried about that. Maybe use RELSEG_SIZE? > I tested the performance on a Windows 10 laptop using the test case from [1] > > Master: > > workers=0: Time: 141175.935 ms (02:21.176) > workers=1: Time: 316854.538 ms (05:16.855) > workers=2: Time: 323471.791 ms (05:23.472) > workers=3: Time: 321637.945 ms (05:21.638) > workers=4: Time: 308689.599 ms (05:08.690) > workers=5: Time: 289014.709 ms (04:49.015) > workers=6: Time: 267785.270 ms (04:27.785) > workers=7: Time: 248735.817 ms (04:08.736) > > Patched: > > workers=0: Time: 155985.204 ms (02:35.985) > workers=1: Time: 112238.741 ms (01:52.239) > workers=2: Time: 105861.813 ms (01:45.862) > workers=3: Time: 91874.311 ms (01:31.874) > workers=4: Time: 92538.646 ms (01:32.539) > workers=5: Time: 93012.902 ms (01:33.013) > workers=6: Time: 94269.076 ms (01:34.269) > workers=7: Time: 90858.458 ms (01:30.858) Nice results. I wonder if these stack with the gains Thomas was discussing with his DSM-from-the-main-shmem-segment patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Fri, 19 Jun 2020 at 11:34, David Rowley wrote: > > On Fri, 19 Jun 2020 at 03:26, Robert Haas wrote: > > > > On Thu, Jun 18, 2020 at 6:15 AM David Rowley wrote: > > > With a 32TB relation, the code will make the chunk size 16GB. Perhaps > > > I should change the code to cap that at 1GB. > > > > It seems pretty hard to believe there's any significant advantage to a > > chunk size >1GB, so I would be in favor of that change. > > I could certainly make that change. With the standard page size, 1GB > is 131072 pages and a power of 2. That would change for non-standard > page sizes, so we'd need to decide if we want to keep the chunk size a > power of 2, or just cap it exactly at whatever number of pages 1GB is. > > I'm not sure how much of a difference it'll make, but I also just want > to note that synchronous scans can mean we'll start the scan anywhere > within the table, so capping to 1GB does not mean we read an entire > extent. It's more likely to span 2 extents. Here's a patch which caps the maximum chunk size to 131072. If someone doubles the page size then that'll be 2GB instead of 1GB. I'm not personally worried about that. I tested the performance on a Windows 10 laptop using the test case from [1] Master: workers=0: Time: 141175.935 ms (02:21.176) workers=1: Time: 316854.538 ms (05:16.855) workers=2: Time: 323471.791 ms (05:23.472) workers=3: Time: 321637.945 ms (05:21.638) workers=4: Time: 308689.599 ms (05:08.690) workers=5: Time: 289014.709 ms (04:49.015) workers=6: Time: 267785.270 ms (04:27.785) workers=7: Time: 248735.817 ms (04:08.736) Patched: workers=0: Time: 155985.204 ms (02:35.985) workers=1: Time: 112238.741 ms (01:52.239) workers=2: Time: 105861.813 ms (01:45.862) workers=3: Time: 91874.311 ms (01:31.874) workers=4: Time: 92538.646 ms (01:32.539) workers=5: Time: 93012.902 ms (01:33.013) workers=6: Time: 94269.076 ms (01:34.269) workers=7: Time: 90858.458 ms (01:30.858) David [1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 537913d1bb..5792040bc1 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -520,12 +520,14 @@ heapgettup(HeapScanDesc scan, { ParallelBlockTableScanDesc pbscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel; +ParallelBlockTableScanWork pbscanwork = +(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work; table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); /* Other processes might have already finished the scan. 
*/ if (page == InvalidBlockNumber) @@ -720,9 +722,11 @@ heapgettup(HeapScanDesc scan, { ParallelBlockTableScanDesc pbscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel; + ParallelBlockTableScanWork pbscanwork = + (ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work; page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); finished = (page == InvalidBlockNumber); } else @@ -834,12 +838,14 @@ heapgettup_pagemode(HeapScanDesc scan, { ParallelBlockTableScanDesc pbscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel; +ParallelBlockTableScanWork pbscanwork = +(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work; table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); /* Other processes might have already finished the scan. */ if (page == InvalidBlockNumber) @@ -1019,9 +1025,11 @@ heapgettup_pagemode(HeapScanDesc scan, { ParallelBlockTableScanDesc pbscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel; + ParallelBlockTableScanWork pbscanwork = + (ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work; page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, - pbscan); + pbscanwork, pbscan); finished = (page == InvalidBlockNumber); } else @@ -1155,6 +1163,8 @@ heap_beginscan(Relation relation, Snapshot snapshot, scan->rs_base.rs_nkeys = nkeys; scan->rs_base.rs_flags = flags; scan->rs_base.rs_parallel = parallel_scan; + scan->rs_base.rs_parallel_work = + palloc0(sizeof(ParallelBlockTableScanWorkData)); scan->rs_strategy = NULL; /* set in initscan */ /* diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index c814733b22..c2ee42e795 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -25,10 +2
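To summarize what the (flattened) patch above is doing, here is a toy single-process model of the chunk-wise hand-out it introduces: a worker claims a contiguous run of blocks with one atomic fetch-and-add on the shared counter and then consumes that run locally, so each worker reads consecutive pages and kernel readahead can kick in. The shared counter name phs_nallocated is borrowed from the scan descriptor; NBLOCKS, CHUNK_SIZE, and WorkerState are invented purely for this demo.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS		64		/* toy relation size, in blocks */
#define CHUNK_SIZE	8		/* blocks claimed per atomic operation */

/* Shared state: total blocks handed out so far, across all workers. */
static _Atomic uint64_t phs_nallocated;

/* Per-worker state, mirroring the "workspace" idea in the patch. */
typedef struct
{
	uint64_t	chunk_start;
	uint32_t	chunk_remaining;
} WorkerState;

/* Claim CHUNK_SIZE consecutive blocks with a single fetch-add, then dole
 * them out locally one block at a time. */
static int64_t
next_block(WorkerState *ws)
{
	if (ws->chunk_remaining == 0)
	{
		ws->chunk_start = atomic_fetch_add(&phs_nallocated, CHUNK_SIZE);
		ws->chunk_remaining = CHUNK_SIZE;
	}
	if (ws->chunk_start + (CHUNK_SIZE - ws->chunk_remaining) >= NBLOCKS)
		return -1;				/* past the end: scan finished */
	ws->chunk_remaining--;
	return ws->chunk_start + (CHUNK_SIZE - 1 - ws->chunk_remaining);
}

int
main(void)
{
	WorkerState ws = {0, 0};
	int64_t		b;

	while ((b = next_block(&ws)) >= 0)
		printf("block %lld\n", (long long) b);
	return 0;
}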
Re: Parallel Seq Scan vs kernel read ahead
On Fri, 19 Jun 2020 at 03:26, Robert Haas wrote: > > On Thu, Jun 18, 2020 at 6:15 AM David Rowley wrote: > > With a 32TB relation, the code will make the chunk size 16GB. Perhaps > > I should change the code to cap that at 1GB. > > It seems pretty hard to believe there's any significant advantage to a > chunk size >1GB, so I would be in favor of that change. I could certainly make that change. With the standard page size, 1GB is 131072 pages and a power of 2. That would change for non-standard page sizes, so we'd need to decide if we want to keep the chunk size a power of 2, or just cap it exactly at whatever number of pages 1GB is. I'm not sure how much of a difference it'll make, but I also just want to note that synchronous scans can mean we'll start the scan anywhere within the table, so capping to 1GB does not mean we read an entire extent. It's more likely to span 2 extents. David
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 18, 2020 at 6:15 AM David Rowley wrote: > With a 32TB relation, the code will make the chunk size 16GB. Perhaps > I should change the code to cap that at 1GB. It seems pretty hard to believe there's any significant advantage to a chunk size >1GB, so I would be in favor of that change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 17 Jun 2020 at 03:20, Robert Haas wrote: > > On Mon, Jun 15, 2020 at 5:09 PM David Rowley wrote: > > * Perhaps when there are less than 2 full chunks remaining, workers > > can just take half of what is left. Or more specifically > > Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work > > out allocating an amount of pages proportional to the amount of beer > > each mathematician receives in the "An infinite number of > > mathematicians walk into a bar" joke, obviously with the exception > > that we stop dividing when we get to 1. However, I'm not quite sure > > how well that can be made to work with multiple bartenders working in > > parallel. > > That doesn't sound nearly aggressive enough to me. I mean, let's > suppose that we're concerned about the scenario where one chunk takes > 50x as long as all the other chunks. Well, if we have 1024 chunks > total, and we hit the problem chunk near the beginning, there will be > no problem. In effect, there are 1073 units of work instead of 1024, > and we accidentally assigned one guy 50 units of work when we thought > we were assigning 1 unit of work. If there's enough work left that we > can assign each other worker 49 units more than what we would have > done had that chunk been the same cost as all the others, then there's > no problem. So for instance if there are 4 workers, we can still even > things out if we hit the problematic chunk more than ~150 chunks from > the end. If we're closer to the end than that, there's no way to avoid > the slow chunk delaying the overall completion time, and the problem > gets worse as the problem chunk gets closer to the end. I've got something like that in the attached. Currently, I've set the number of chunks to 2048 and I'm starting the ramp down when 64 chunks remain, which means we'll start the ramp-down when there's about 3.1% of the scan remaining. I didn't see the point of going with the larger number of chunks and having ramp-down code. Attached is the patch and an .sql file with a function which can be used to demonstrate what chunk sizes the patch will choose and demo the ramp-down. e.g. # select show_parallel_scan_chunks(100, 2048, 64); It would be really good if people could test this using the test case mentioned in [1]. We really need to get a good idea of how this behaves on various operating systems. With a 32TB relation, the code will make the chunk size 16GB. Perhaps I should change the code to cap that at 1GB. David [1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com create function nextpower2(a bigint) returns bigint as $$ declare n bigint := 1; begin while n < a loop n := n * 2; end loop; return n; end; $$ language plpgsql; create or replace function show_parallel_scan_chunks(relsize bigint, nchunks int, rampdown_chunks int) returns int as $$ declare chunksize bigint; declare allocated bigint; declare chunk_num bigint; begin chunksize := nextpower2(relsize / nchunks); allocated := 0; chunk_num := 0; while (allocated <= relsize) loop if (chunksize > 1 AND allocated > relsize - chunksize * rampdown_chunks) then chunksize := chunksize / 2; end if; raise notice 'chunk_num = %, chunksize = %, allocated = %', chunk_num, chunksize, allocated; allocated := allocated + chunksize; chunk_num := chunk_num + 1; end loop; return 1; end; $$ language plpgsql; bigger_io_chunks_for_parallel_seqscan.patch Description: Binary data
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jun 16, 2020 at 6:57 AM Amit Kapila wrote: > I agree that won't be a common scenario but apart from that also I am > not sure if we can conclude that the proposed patch won't cause any > regressions. See one of the tests [1] done by Soumyadeep where the > patch has caused regression in one of the cases, now we can either try > to improve the patch and see we didn't cause any regressions or assume > that those are some minority cases which we don't care. Another point > is that this thread has started with a theory that this idea can give > benefits on certain filesystems and AFAICS we have tested it on one > other type of system, so not sure if that is sufficient. Yeah, it seems like those cases might need some more investigation, but they're also not necessarily an argument for a configuration setting. It's not so much that I dislike the idea of being able to configure something here; it's really that I don't want a reloption that feels like magic. For example, we know that work_mem can be really hard to configure because there may be no value that's high enough to make your queries run fast during normal periods but low enough to avoid running out of memory during busy periods. That kind of thing sucks, and we should avoid creating more such cases. One problem here is that the best value might depend not only on the relation but on the individual query. A GUC could be changed per-query, but different tables in the query might need different values. Changing a reloption requires locking, and you wouldn't want to have to keep changing it for each different query. Now if we figure out that something is hardware-dependent -- like we come up with a good formula that adjusts the value automatically most of the time, but say it needs to more more on SSDs than on spinning disks or the other way around, well then that's a good candidate for some kind of setting, maybe a tablespace option. But if it seems to depend on the query, we need a better idea, not a user-configurable setting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Mon, Jun 15, 2020 at 5:09 PM David Rowley wrote: > To summarise what's all been proposed so far: > > 1. Use a constant, (e.g. 64) as the parallel step size > 2. Ramp up the step size over time > 3. Ramp down the step size towards the end of the scan. > 4. Auto-determine a good stepsize based on the size of the relation. > 5. Add GUC to allow users to control or override the step size. > 6. Add relption to allow users to control or override the step size. > > Here are my thoughts on each of those: > > #1 is a bad idea as there are legitimate use-cases for using parallel > query on small tables. e.g calling some expensive parallel safe > function. Small tables are more likely to be cached. I agree. > #2 I don't quite understand why this is useful I was thinking that if the query had a small LIMIT, you'd want to avoid handing out excessively large chunks, but actually that seems like it might just be fuzzy thinking on my part. We're not committing to scanning the entirety of the chunk just because we've assigned it to a worker. > #3 I understand this is to try to make it so workers all complete > around about the same time. > #4 We really should be doing it this way. > #5 Having a global knob to control something that is very specific to > the size of a relation does not make much sense to me. > #6. I imagine someone will have some weird use-case that works better > when parallel workers get 1 page at a time. I'm not convinced that > they're not doing something else wrong. Agree with all of that. > So my vote is for 4 with possibly 3, if we can come up with something > smart enough * that works well in parallel. I think there's less of a > need for this if we divided the relation into more chunks, e.g. 8192 > or 16384. I agree with that too. > * Perhaps when there are less than 2 full chunks remaining, workers > can just take half of what is left. Or more specifically > Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work > out allocating an amount of pages proportional to the amount of beer > each mathematician receives in the "An infinite number of > mathematicians walk into a bar" joke, obviously with the exception > that we stop dividing when we get to 1. However, I'm not quite sure > how well that can be made to work with multiple bartenders working in > parallel. That doesn't sound nearly aggressive enough to me. I mean, let's suppose that we're concerned about the scenario where one chunk takes 50x as long as all the other chunks. Well, if we have 1024 chunks total, and we hit the problem chunk near the beginning, there will be no problem. In effect, there are 1073 units of work instead of 1024, and we accidentally assigned one guy 50 units of work when we thought we were assigning 1 unit of work. If there's enough work left that we can assign each other worker 49 units more than what we would have done had that chunk been the same cost as all the others, then there's no problem. So for instance if there are 4 workers, we can still even things out if we hit the problematic chunk more than ~150 chunks from the end. If we're closer to the end than that, there's no way to avoid the slow chunk delaying the overall completion time, and the problem gets worse as the problem chunk gets closer to the end. How can we improve? Well, if when we're less than 150 chunks from the end, we reduce the chunk size by 2x, then instead of having 1 chunk that is 50x as expensive, hopefully we'll have 2 smaller chunks that are each 25x as expensive.
They'll get assigned to 2 different workers, and the remaining 2 workers now need enough extra work from other chunks to even out the work distribution, which should still be possible. It gets tough though if breaking the one expensive chunk in half produces 1 regular-price half-chunk and one half-chunk that is 50x as expensive as all the others. Because we have <150 chunks left, there's no way to keep everybody else busy until the sole expensive half-chunk completes. In a sufficiently-extreme scenario, assigning even a single full block to a worker is too much, and you really want to hand the tuples out individually. Anyway, if we don't do anything special until we get down to the last 2 chunks, it's only going to make a difference when one of those last 2 chunks happens to be the expensive one. If say the third-to-last chunk is the expensive one, subdividing the last 2 chunks lets all the workers who didn't get the expensive chunk fight over the scraps, but that's not an improvement. If anything it's worse, because there's more communication overhead and you don't gain anything vs. just assigning each chunk to a worker straight up. In a real example we don't know that we have a single expensive chunk -- each chunk just has its own cost, and they could all be the same, or some could be much more expensive. When we have a lot of work left, we can be fairly cavalier in handing out larger chunks of it with the full confidence that even if some o
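To make the arithmetic in the message above easier to follow, here is a throwaway simulation of the model it describes: 1024 unit-cost chunks, one of which costs 50 units, handed out greedily to 4 workers. The greedy hand-out and the specific probe positions are assumptions made purely for illustration; the point is only that the completion time stays near the ideal 269 units until the expensive chunk falls within roughly 150 chunks of the end, after which it climbs toward about 305.

#include <stdio.h>

#define NCHUNKS		1024
#define NWORKERS	4
#define SLOW_COST	50

/* Greedy hand-out: each chunk goes to whichever worker frees up first.
 * Every chunk costs 1 unit except the one at slow_chunk, which costs
 * SLOW_COST.  Returns the time at which the last worker finishes. */
static int
simulate(int slow_chunk)
{
	int			busy_until[NWORKERS] = {0};
	int			makespan = 0;

	for (int chunk = 0; chunk < NCHUNKS; chunk++)
	{
		int			w = 0;

		for (int i = 1; i < NWORKERS; i++)
			if (busy_until[i] < busy_until[w])
				w = i;
		busy_until[w] += (chunk == slow_chunk) ? SLOW_COST : 1;
	}
	for (int i = 0; i < NWORKERS; i++)
		if (busy_until[i] > makespan)
			makespan = busy_until[i];
	return makespan;
}

int
main(void)
{
	/* Total work is 1023 + 50 = 1073 units, so the best possible time
	 * with 4 workers is 269 units. */
	const int	positions[] = {0, 512, 873, 973, 1013, 1023};

	for (int i = 0; i < 6; i++)
		printf("expensive chunk %4d from the end: finishes at %d\n",
			   NCHUNKS - positions[i], simulate(positions[i]));
	return 0;
}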
Re: Parallel Seq Scan vs kernel read ahead
On Mon, Jun 15, 2020 at 8:59 PM Robert Haas wrote: > > On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila wrote: > > The performance can vary based on qualification where some workers > > discard more rows as compared to others, with the current system with > > step-size as one, the probability of unequal work among workers is > > quite low as compared to larger step-sizes. > > It seems like this would require incredibly bad luck, though. > I agree that won't be a common scenario, but apart from that, I am also not sure we can conclude that the proposed patch won't cause any regressions. See one of the tests [1] done by Soumyadeep where the patch caused a regression in one of the cases; now we can either try to improve the patch and verify that we didn't cause any regressions, or assume that those are minority cases that we don't care about. Another point is that this thread started with a theory that this idea can give benefits on certain filesystems, and AFAICS we have tested it on one other type of system, so I'm not sure that is sufficient. Having said that, I just want to clarify that I am positive about this work but just not very sure that it is a universal win based on the testing done till now. [1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Tue, 16 Jun 2020 at 03:29, Robert Haas wrote: > > On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila wrote: > > The performance can vary based on qualification where some workers > > discard more rows as compared to others, with the current system with > > step-size as one, the probability of unequal work among workers is > > quite low as compared to larger step-sizes. > > It seems like this would require incredibly bad luck, though. If the > step size is less than 1/1024 of the relation size, and we ramp down > for, say, the last 5% of the relation, then the worst case is that > chunk 972 of 1024 is super-slow compared to all the other blocks, so > that it takes longer to process chunk 972 only than it does to process > chunks 973-1024 combined. It is not impossible, but that chunk has to > be like 50x worse than all the others, which doesn't seem like > something that is going to happen often enough to be worth worrying > about very much. I'm not saying it will never happen. I'm just > skeptical about the benefit of adding a GUC or reloption for a corner > case like this. I think people will fiddle with it when it isn't > really needed, and won't realize it exists in the scenarios where it > would have helped. I'm trying to think of likely scenarios where "lots of work at the end" is going to be common. I can think of queue processing, but everything I can think about there requires an UPDATE to the processed flag, which won't be using parallel query anyway. There's then processing something based on some period of time like "the last hour", "today". For append-only tables the latest information is likely to be at the end of the heap. For that, anyone that's getting a SeqScan on a large relation should likely have added an index. If a btree is too costly, then BRIN is pretty perfect for that case. FWIW, I'm not really keen on adding a reloption or a GUC. I've also voiced here that I'm not even keen on the ramp-up. To summarise what's all been proposed so far: 1. Use a constant, (e.g. 64) as the parallel step size 2. Ramp up the step size over time 3. Ramp down the step size towards the end of the scan. 4. Auto-determine a good stepsize based on the size of the relation. 5. Add GUC to allow users to control or override the step size. 6. Add relption to allow users to control or override the step size. Here are my thoughts on each of those: #1 is a bad idea as there are legitimate use-cases for using parallel query on small tables. e.g calling some expensive parallel safe function. Small tables are more likely to be cached. #2 I don't quite understand why this is useful #3 I understand this is to try to make it so workers all complete around about the same time. #4 We really should be doing it this way. #5 Having a global knob to control something that is very specific to the size of a relation does not make much sense to me. #6. I imagine someone will have some weird use-case that works better when parallel workers get 1 page at a time. I'm not convinced that they're not doing something else wrong. So my vote is for 4 with possibly 3, if we can come up with something smart enough * that works well in parallel. I think there's less of a need for this if we divided the relation into more chunks, e.g. 8192 or 16384. * Perhaps when there are less than 2 full chunks remaining, workers can just take half of what is left. 
Or more specifically Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work out allocating an amount of pages proportional to the amount of beer each mathematician receives in the "An infinite number of mathematicians walk into a bar" joke, obviously with the exception that we stop dividing when we get to 1. However, I'm not quite sure how well that can be made to work with multiple bartenders working in parallel. David
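As a concrete illustration of the end-of-scan idea in the footnote above, the sketch below repeatedly hands out Max(pg_next_power2(remaining_blocks) / 2, 1) blocks until nothing remains. The starting figure of 1000 remaining blocks and the clamp to avoid overshooting the end of the heap are arbitrary choices for the demo, and next_power2() stands in for pg_nextpower2_32().

#include <stdio.h>

/* Stand-in for pg_nextpower2_32(): smallest power of 2 >= v, for v >= 1. */
static unsigned
next_power2(unsigned v)
{
	unsigned	p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

int
main(void)
{
	unsigned	remaining = 1000;	/* arbitrary number of blocks left */

	while (remaining > 0)
	{
		/* Max(pg_next_power2(remaining_blocks) / 2, 1) */
		unsigned	take = next_power2(remaining) / 2;

		if (take < 1)
			take = 1;
		if (take > remaining)
			take = remaining;	/* don't run past the end of the heap */
		printf("hand out %3u blocks, %3u left\n", take, remaining - take);
		remaining -= take;
	}
	return 0;
}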
Re: Parallel Seq Scan vs kernel read ahead
On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila wrote: > The performance can vary based on qualification where some workers > discard more rows as compared to others, with the current system with > step-size as one, the probability of unequal work among workers is > quite low as compared to larger step-sizes. It seems like this would require incredibly bad luck, though. If the step size is less than 1/1024 of the relation size, and we ramp down for, say, the last 5% of the relation, then the worst case is that chunk 972 of 1024 is super-slow compared to all the other blocks, so that it takes longer to process chunk 972 only than it does to process chunks 973-1024 combined. It is not impossible, but that chunk has to be like 50x worse than all the others, which doesn't seem like something that is going to happen often enough to be worth worrying about very much. I'm not saying it will never happen. I'm just skeptical about the benefit of adding a GUC or reloption for a corner case like this. I think people will fiddle with it when it isn't really needed, and won't realize it exists in the scenarios where it would have helped. And then, because we have the setting, we'll have to keep it around forever, even as we improve the algorithm in other ways, which could become a maintenance burden. I think it's better to treat stuff like this as an implementation detail rather than something we expect users to adjust. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Fri, Jun 12, 2020 at 11:28 PM Robert Haas wrote: > > On Thu, Jun 11, 2020 at 4:54 PM David Rowley wrote: > > I think someone at some point is not going to like the automatic > > choice. So perhaps a reloption to allow users to overwrite it is a > > good idea. -1 should most likely mean use the automatic choice based > > on relation size. I think for parallel seq scans that filter a large > > portion of the records most likely need some sort of index, but there > > are perhaps some genuine cases for not having one. e.g perhaps the > > query is just not run often enough for an index to be worthwhile. In > > that case, the performance is likely less critical, but at least the > > reloption would allow users to get the old behaviour. > > Let me play the devil's advocate here. I feel like if the step size is > limited by the relation size and there is ramp-up and ramp-down, or > maybe even if you don't have all 3 of those but perhaps say 2 of them, > the chances of there being a significant downside from using this seem > quite small. At that point I wonder whether you really need an option. > It's true that someone might not like it, but there are all sorts of > things that at least one person doesn't like and one can't cater to > all of them. > > To put that another way, in what scenario do we suppose that a > reasonable person would wish to use this reloption? > The performance can vary based on qualification where some workers discard more rows as compared to others, with the current system with step-size as one, the probability of unequal work among workers is quite low as compared to larger step-sizes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 11, 2020 at 4:54 PM David Rowley wrote: > I think someone at some point is not going to like the automatic > choice. So perhaps a reloption to allow users to overwrite it is a > good idea. -1 should most likely mean use the automatic choice based > on relation size. I think for parallel seq scans that filter a large > portion of the records most likely need some sort of index, but there > are perhaps some genuine cases for not having one. e.g perhaps the > query is just not run often enough for an index to be worthwhile. In > that case, the performance is likely less critical, but at least the > reloption would allow users to get the old behaviour. Let me play the devil's advocate here. I feel like if the step size is limited by the relation size and there is ramp-up and ramp-down, or maybe even if you don't have all 3 of those but perhaps say 2 of them, the chances of there being a significant downside from using this seem quite small. At that point I wonder whether you really need an option. It's true that someone might not like it, but there are all sorts of things that at least one person doesn't like and one can't cater to all of them. To put that another way, in what scenario do we suppose that a reasonable person would wish to use this reloption? If there's none, we don't need it. If there is one, can we develop a mitigation that solves their problem automatically instead? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Fri, Jun 12, 2020 at 2:24 AM David Rowley wrote: > > On Thu, 11 Jun 2020 at 23:35, Amit Kapila wrote: > > Another point I am thinking is that whatever formula we come up here > > might not be a good fit for every case. For ex. as you mentioned > > above that larger step-size can impact the performance based on > > qualification, similarly there could be other things like having a > > target list or qual having some function which takes more time for > > certain tuples and lesser for others especially if function evaluation > > is based on some column values. So, can we think of providing a > > rel_option for step-size? > > I think someone at some point is not going to like the automatic > choice. So perhaps a reloption to allow users to overwrite it is a > good idea. -1 should most likely mean use the automatic choice based > on relation size. I think for parallel seq scans that filter a large > portion of the records most likely need some sort of index, but there > are perhaps some genuine cases for not having one. e.g perhaps the > query is just not run often enough for an index to be worthwhile. In > that case, the performance is likely less critical, but at least the > reloption would allow users to get the old behaviour. > makes sense to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 11 Jun 2020 at 23:35, Amit Kapila wrote: > Another point I am thinking is that whatever formula we come up here > might not be a good fit for every case. For ex. as you mentioned > above that larger step-size can impact the performance based on > qualification, similarly there could be other things like having a > target list or qual having some function which takes more time for > certain tuples and lesser for others especially if function evaluation > is based on some column values. So, can we think of providing a > rel_option for step-size? I think someone at some point is not going to like the automatic choice. So perhaps a reloption to allow users to overwrite it is a good idea. -1 should most likely mean use the automatic choice based on relation size. I think for parallel seq scans that filter a large portion of the records most likely need some sort of index, but there are perhaps some genuine cases for not having one. e.g perhaps the query is just not run often enough for an index to be worthwhile. In that case, the performance is likely less critical, but at least the reloption would allow users to get the old behaviour. David
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 11, 2020 at 10:13 AM David Rowley wrote: > > On Thu, 11 Jun 2020 at 16:03, Amit Kapila wrote: > > I think something on these lines would be a good idea especially > > keeping step-size proportional to relation size. However, I am not > > completely sure if doubling the step-size with equal increase in > > relation size (ex. what is happening between 16MB~8192MB) is the best > > idea. Why not double the step-size when relation size increases by > > four times? Will some more tests help us to identify this? I also > > don't know what is the right answer here so just trying to brainstorm. > > Brainstorming sounds good. I'm by no means under any illusion that the > formula is correct. > > But, why four times? > Just trying to see if we can optimize such that we use bigger step-size for bigger relations and smaller step-size for smaller relations. > The way I did it tries to keep the number of > chunks roughly the same each time. I think the key is the number of > chunks more than the size of the chunks. Having fewer chunks increases > the chances of an imbalance of work between workers, and with what you > mention, the number of chunks will vary more than what I have proposed > But, I think it will lead to more number of chunks for smaller relations. > The code I showed above will produce something between 512-1024 chunks > for all cases until we 2^20 pages, then we start capping the chunk > size to 1024. I could probably get onboard with making it depend on > the number of parallel workers, but perhaps it would be better just to > divide by, say, 16384 rather than 1024, as I proposed above. That way > we'll be more fine-grained, but we'll still read in larger than 1024 > chunk sizes when the relation gets beyond 128GB. > I think increasing step-size might be okay for very large relations. Another point I am thinking is that whatever formula we come up here might not be a good fit for every case. For ex. as you mentioned above that larger step-size can impact the performance based on qualification, similarly there could be other things like having a target list or qual having some function which takes more time for certain tuples and lesser for others especially if function evaluation is based on some column values. So, can we think of providing a rel_option for step-size? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 11 Jun 2020 at 16:03, Amit Kapila wrote: > I think something on these lines would be a good idea especially > keeping step-size proportional to relation size. However, I am not > completely sure if doubling the step-size with equal increase in > relation size (ex. what is happening between 16MB~8192MB) is the best > idea. Why not double the step-size when relation size increases by > four times? Will some more tests help us to identify this? I also > don't know what is the right answer here so just trying to brainstorm. Brainstorming sounds good. I'm by no means under any illusion that the formula is correct. But, why four times? The way I did it tries to keep the number of chunks roughly the same each time. I think the key is the number of chunks more than the size of the chunks. Having fewer chunks increases the chances of an imbalance of work between workers, and with what you mention, the number of chunks will vary more than what I have proposed The code I showed above will produce something between 512-1024 chunks for all cases until we 2^20 pages, then we start capping the chunk size to 1024. I could probably get onboard with making it depend on the number of parallel workers, but perhaps it would be better just to divide by, say, 16384 rather than 1024, as I proposed above. That way we'll be more fine-grained, but we'll still read in larger than 1024 chunk sizes when the relation gets beyond 128GB. David
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 11, 2020 at 8:35 AM David Rowley wrote: > > On Thu, 11 Jun 2020 at 14:09, Amit Kapila wrote: > > > > On Thu, Jun 11, 2020 at 7:18 AM David Rowley wrote: > > > > > > On Thu, 11 Jun 2020 at 01:24, Amit Kapila wrote: > > > > Can we try the same test with 4, 8, 16 workers as well? I don't > > > > foresee any problem with a higher number of workers but it might be > > > > better to once check that if it is not too much additional work. > > > > > > I ran the tests again with up to 7 workers. The CPU here only has 8 > > > cores (a laptop), so I'm not sure if there's much sense in going > > > higher than that? > > > > > > > I think it proves your point that there is a value in going for step > > size greater than 64. However, I think the difference at higher sizes > > is not significant. For example, the difference between 8192 and > > 16384 doesn't seem much if we leave higher worker count where the data > > could be a bit misleading due to variation. I could see that there is > > a clear and significant difference till 1024 but after that difference > > is not much. > > I guess the danger with going too big is that we have some Seqscan > filter that causes some workers to do very little to nothing with the > rows, despite discarding them and other workers are left with rows > that are not filtered and require some expensive processing. Keeping > the number of blocks on the smaller side would reduce the chances of > someone being hit by that. > Right and good point. > The algorithm I proposed above still can > be capped by doing something like nblocks = Min(1024, > pg_nextpower2_32(pbscan->phs_nblocks / 1024)); That way we'll end up > with: > I think something on these lines would be a good idea especially keeping step-size proportional to relation size. However, I am not completely sure if doubling the step-size with equal increase in relation size (ex. what is happening between 16MB~8192MB) is the best idea. Why not double the step-size when relation size increases by four times? Will some more tests help us to identify this? I also don't know what is the right answer here so just trying to brainstorm. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 11 Jun 2020 at 14:09, Amit Kapila wrote: > > On Thu, Jun 11, 2020 at 7:18 AM David Rowley wrote: > > > > On Thu, 11 Jun 2020 at 01:24, Amit Kapila wrote: > > > Can we try the same test with 4, 8, 16 workers as well? I don't > > > foresee any problem with a higher number of workers but it might be > > > better to once check that if it is not too much additional work. > > > > I ran the tests again with up to 7 workers. The CPU here only has 8 > > cores (a laptop), so I'm not sure if there's much sense in going > > higher than that? > > > > I think it proves your point that there is a value in going for step > size greater than 64. However, I think the difference at higher sizes > is not significant. For example, the difference between 8192 and > 16384 doesn't seem much if we leave higher worker count where the data > could be a bit misleading due to variation. I could see that there is > a clear and significant difference till 1024 but after that difference > is not much. I guess the danger with going too big is that we have some Seqscan filter that causes some workers to do very little to nothing with the rows, despite discarding them and other workers are left with rows that are not filtered and require some expensive processing. Keeping the number of blocks on the smaller side would reduce the chances of someone being hit by that. The algorithm I proposed above still can be capped by doing something like nblocks = Min(1024, pg_nextpower2_32(pbscan->phs_nblocks / 1024)); That way we'll end up with: rel_size | stepsize --+-- 16 kB|1 32 kB|1 64 kB|1 128 kB |1 256 kB |1 512 kB |1 1024 kB |1 2048 kB |1 4096 kB |1 8192 kB |1 16 MB|2 32 MB|4 64 MB|8 128 MB | 16 256 MB | 32 512 MB | 64 1024 MB | 128 2048 MB | 256 4096 MB | 512 8192 MB | 1024 16 GB| 1024 32 GB| 1024 64 GB| 1024 128 GB | 1024 256 GB | 1024 512 GB | 1024 1024 GB | 1024 2048 GB | 1024 4096 GB | 1024 8192 GB | 1024 16 TB| 1024 32 TB| 1024 (32 rows) David
Re: Parallel Seq Scan vs kernel read ahead
On Thu, Jun 11, 2020 at 7:18 AM David Rowley wrote: > > On Thu, 11 Jun 2020 at 01:24, Amit Kapila wrote: > > Can we try the same test with 4, 8, 16 workers as well? I don't > > foresee any problem with a higher number of workers but it might be > > better to once check that if it is not too much additional work. > > I ran the tests again with up to 7 workers. The CPU here only has 8 > cores (a laptop), so I'm not sure if there's much sense in going > higher than that? > I think it proves your point that there is a value in going for step size greater than 64. However, I think the difference at higher sizes is not significant. For example, the difference between 8192 and 16384 doesn't seem much if we leave higher worker count where the data could be a bit misleading due to variation. I could see that there is a clear and significant difference till 1024 but after that difference is not much. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 11 Jun 2020 at 01:24, Amit Kapila wrote: > Can we try the same test with 4, 8, 16 workers as well? I don't > foresee any problem with a higher number of workers but it might be > better to once check that if it is not too much additional work. I ran the tests again with up to 7 workers. The CPU here only has 8 cores (a laptop), so I'm not sure if there's much sense in going higher than that? CPU = Intel i7-8565U. 16GB RAM. Note that I did the power2 ramp-up with each of the patched tests this time. Thomas' version ramps up 1 pages at a time, which is ok when only ramping up to 64 pages, but not for these higher numbers I'm testing with. (Patch attached) Results attached in a graph format, or in text below: Master: workers=0: Time: 141175.935 ms (02:21.176) (742.7MB/sec) workers=1: Time: 316854.538 ms (05:16.855) (330.9MB/sec) workers=2: Time: 323471.791 ms (05:23.472) (324.2MB/sec) workers=3: Time: 321637.945 ms (05:21.638) (326MB/sec) workers=4: Time: 308689.599 ms (05:08.690) (339.7MB/sec) workers=5: Time: 289014.709 ms (04:49.015) (362.8MB/sec) workers=6: Time: 267785.27 ms (04:27.785) (391.6MB/sec) workers=7: Time: 248735.817 ms (04:08.736) (421.6MB/sec) Patched 64: (power 2 ramp-up) workers=0: Time: 152752.558 ms (02:32.753) (686.5MB/sec) workers=1: Time: 149940.841 ms (02:29.941) (699.3MB/sec) workers=2: Time: 136534.043 ms (02:16.534) (768MB/sec) workers=3: Time: 119387.248 ms (01:59.387) (878.3MB/sec) workers=4: Time: 114080.131 ms (01:54.080) (919.2MB/sec) workers=5: Time: 111472.144 ms (01:51.472) (940.7MB/sec) workers=6: Time: 108290.608 ms (01:48.291) (968.3MB/sec) workers=7: Time: 104349.947 ms (01:44.350) (1004.9MB/sec) Patched 1024: (power 2 ramp-up) workers=0: Time: 146106.086 ms (02:26.106) (717.7MB/sec) workers=1: Time: 109832.773 ms (01:49.833) (954.7MB/sec) workers=2: Time: 98921.515 ms (01:38.922) (1060MB/sec) workers=3: Time: 94259.243 ms (01:34.259) (1112.4MB/sec) workers=4: Time: 93275.637 ms (01:33.276) (1124.2MB/sec) workers=5: Time: 93921.452 ms (01:33.921) (1116.4MB/sec) workers=6: Time: 93988.386 ms (01:33.988) (1115.6MB/sec) workers=7: Time: 92096.414 ms (01:32.096) (1138.6MB/sec) Patched 8192: (power 2 ramp-up) workers=0: Time: 143367.057 ms (02:23.367) (731.4MB/sec) workers=1: Time: 103138.918 ms (01:43.139) (1016.7MB/sec) workers=2: Time: 93368.573 ms (01:33.369) (1123.1MB/sec) workers=3: Time: 89464.529 ms (01:29.465) (1172.1MB/sec) workers=4: Time: 89921.393 ms (01:29.921) (1166.1MB/sec) workers=5: Time: 93575.401 ms (01:33.575) (1120.6MB/sec) workers=6: Time: 93636.584 ms (01:33.637) (1119.8MB/sec) workers=7: Time: 93682.21 ms (01:33.682) (1119.3MB/sec) Patched 16384 (power 2 ramp-up) workers=0: Time: 144598.502 ms (02:24.599) (725.2MB/sec) workers=1: Time: 97344.16 ms (01:37.344) (1077.2MB/sec) workers=2: Time: 88025.42 ms (01:28.025) (1191.2MB/sec) workers=3: Time: 97711.521 ms (01:37.712) (1073.1MB/sec) workers=4: Time: 88877.913 ms (01:28.878) (1179.8MB/sec) workers=5: Time: 96985.978 ms (01:36.986) (1081.2MB/sec) workers=6: Time: 92368.543 ms (01:32.369) (1135.2MB/sec) workers=7: Time: 87498.156 ms (01:27.498) (1198.4MB/sec) David parallel_step_size.patch Description: Binary data
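For clarity, here is a sketch of the ramp-up tweak described in the email above: the per-worker step size starts at 1 block (phsw_step_size in the patch) and is doubled until it reaches the chosen target, rather than growing one block at a time. The 16384 target matches the run described above; everything else is illustrative.

#include <stdio.h>

int
main(void)
{
	unsigned	step = 1;			/* initial phsw_step_size */
	const unsigned target = 16384;	/* chosen steady-state step size */
	unsigned	handed_out = 0;

	while (step < target)
	{
		printf("hand out %5u blocks (total so far %5u)\n", step, handed_out);
		handed_out += step;
		step *= 2;
	}
	printf("ramp-up complete after %u blocks; steady-state step is %u\n",
		   handed_out, step);
	return 0;
}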
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jun 10, 2020 at 6:04 PM David Rowley wrote: > > On Wed, 10 Jun 2020 at 17:39, David Rowley wrote: > > > > On Wed, 10 Jun 2020 at 17:21, Thomas Munro wrote: > > > I also heard from Andres that he likes this patch with his AIO > > > prototype, because of the way request merging works. So it seems like > > > there are several reasons to want it. > > > > > > But ... where should we get the maximum step size from? A GUC? > > > > I guess we'd need to determine if other step sizes were better under > > any conditions. I guess one condition would be if there was a LIMIT > > clause. I could check if setting it to 1024 makes any difference, but > > I'm thinking it won't since I got fairly consistent results on all > > worker settings with the patched version. > > I did another round of testing on the same machine trying some step > sizes larger than 64 blocks. I can confirm that it does improve the > situation further going bigger than 64. > Can we try the same test with 4, 8, 16 workers as well? I don't foresee any problem with a higher number of workers but it might be better to once check that if it is not too much additional work. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 10 Jun 2020 at 17:39, David Rowley wrote: > > On Wed, 10 Jun 2020 at 17:21, Thomas Munro wrote: > > I also heard from Andres that he likes this patch with his AIO > > prototype, because of the way request merging works. So it seems like > > there are several reasons to want it. > > > > But ... where should we get the maximum step size from? A GUC? > > I guess we'd need to determine if other step sizes were better under > any conditions. I guess one condition would be if there was a LIMIT > clause. I could check if setting it to 1024 makes any difference, but > I'm thinking it won't since I got fairly consistent results on all > worker settings with the patched version. I did another round of testing on the same machine trying some step sizes larger than 64 blocks. I can confirm that it does improve the situation further going bigger than 64. I got up as far as 16384, but made a couple of additional changes for that run only. Instead of increasing the ramp-up 1 block at a time, I initialised phsw_step_size to 1 and multiplied it by 2 until I reached the chosen step size. With numbers that big, ramping up 1 block at a time was slow enough that I'd never have reached the target step size Here are the results of the testing: Master: max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) (706.2MB/sec) max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) (320.1MB/sec) max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) (318.6MB/sec) Patched stepsize = 64: max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) (741.7MB/sec) max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) (723.2MB/sec) max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) (731.4MB/sec) Patched stepsize = 1024: max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599) (687.1MB/sec) max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227) (1006.04MB/sec) max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149) (1079.3MB/sec) Patched stepsize = 8192: max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524) (730.59MB/sec) max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899) (1019.0MB/sec) max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148) (1150.4MB/sec) Patched stepsize = 16384 (power 2 ramp-up) max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599) (725.16MB/sec) max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344) (1077.1MB/sec) max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025) (1191.2MB/sec) I thought about what you mentioned about a GUC, and I think it's a bad idea to do that. I think it would be better to choose based on the relation size. For smaller relations, we want to keep the step size small. Someone may enable parallel query on such a small relation if they're doing something like calling an expensive function on the results, so we do need to avoid going large for small relations. 
I considered something like:

create function nextpower2(a bigint) returns bigint as $$
declare
  n bigint := 1;
begin
  while n < a loop
    n := n * 2;
  end loop;
  return n;
end;
$$ language plpgsql;

select pg_size_pretty(power(2,p)::numeric * 8192) rel_size,
       nextpower2(power(2,p)::bigint / 1024) as stepsize
from generate_series(1,32) p;

 rel_size | stepsize
----------+----------
 16 kB    |        1
 32 kB    |        1
 64 kB    |        1
 128 kB   |        1
 256 kB   |        1
 512 kB   |        1
 1024 kB  |        1
 2048 kB  |        1
 4096 kB  |        1
 8192 kB  |        1
 16 MB    |        2
 32 MB    |        4
 64 MB    |        8
 128 MB   |       16
 256 MB   |       32
 512 MB   |       64
 1024 MB  |      128
 2048 MB  |      256
 4096 MB  |      512
 8192 MB  |     1024
 16 GB    |     2048
 32 GB    |     4096
 64 GB    |     8192
 128 GB   |    16384
 256 GB   |    32768
 512 GB   |    65536
 1024 GB  |   131072
 2048 GB  |   262144
 4096 GB  |   524288
 8192 GB  |  1048576
 16 TB    |  2097152
 32 TB    |  4194304

So with that algorithm with this 100GB table that I've been using in my test, we'd go with a step size of 16384. I think we'd want to avoid going any more than that. The above code means we'll do between just below 0.1% and 0.2% of the relation per step. If I divided the number of blocks by say 128 instead of 1024, then that would be about 0.78% and 1.56% of the relation each time. It's not unrealistic today that someone might throw that many workers at a job, so, I'd say dividing by 1024 or even 2048 would likely be about right.

David
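To make the sizing rule above concrete, here is a rough standalone C sketch of the same computation plus the power-of-two ramp-up discussed in the thread. It is not the actual patch: the function names, the cap value and the ramp-up helper are made up for illustration.

#include <stdint.h>
#include <stdio.h>

#define MAX_CHUNK_SIZE 16384        /* illustrative cap, in blocks */

static uint64_t
next_power_of_2(uint64_t a)
{
    uint64_t    n = 1;

    while (n < a)
        n *= 2;
    return n;
}

/* Divide the relation's block count by 1024 and round up to a power of
 * two, clamped to the cap; next_power_of_2(0) is 1, so the result is
 * never below 1. */
static uint64_t
choose_chunk_size(uint64_t nblocks)
{
    uint64_t    chunk = next_power_of_2(nblocks / 1024);

    return (chunk > MAX_CHUNK_SIZE) ? MAX_CHUNK_SIZE : chunk;
}

/* Power-of-two ramp-up: double the per-worker step on each allocation
 * until it reaches the chosen chunk size, rather than growing it one
 * block at a time. */
static uint64_t
next_step_size(uint64_t current_step, uint64_t target_chunk)
{
    return (current_step * 2 < target_chunk) ? current_step * 2 : target_chunk;
}

int
main(void)
{
    /* A 100GB table is about 13.1 million 8kB blocks. */
    printf("chunk = %llu\n", (unsigned long long) choose_chunk_size(13107200));
    printf("first steps = %llu, %llu\n",
           (unsigned long long) next_step_size(1, 16384),
           (unsigned long long) next_step_size(2, 16384));
    return 0;
}

For the 100GB table above this gives a chunk size of 16384, matching the last relevant row of the table, and the per-worker step would then climb 1, 2, 4, ... until it reaches that chunk size.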
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 10 Jun 2020 at 17:21, Thomas Munro wrote: > I also heard from Andres that he likes this patch with his AIO > prototype, because of the way request merging works. So it seems like > there are several reasons to want it. > > But ... where should we get the maximum step size from? A GUC? I guess we'd need to determine if other step sizes were better under any conditions. I guess one condition would be if there was a LIMIT clause. I could check if setting it to 1024 makes any difference, but I'm thinking it won't since I got fairly consistent results on all worker settings with the patched version. David
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jun 10, 2020 at 5:06 PM David Rowley wrote: > I repeated this test on an up-to-date Windows 10 machine to see if the > later kernel is any better at the readahead. > > Results for the same test are: > > Master: > > max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) > (706.2MB/sec) > max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) > (320.1MB/sec) > max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) > (318.6MB/sec) > > Patched: > > max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) > (741.7MB/sec) > max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) > (723.2MB/sec) > max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) > (731.4MB/sec) Thanks! I also heard from Andres that he likes this patch with his AIO prototype, because of the way request merging works. So it seems like there are several reasons to want it. But ... where should we get the maximum step size from? A GUC?
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 21 May 2020 at 17:06, David Rowley wrote: > create table t (a int, b text); > -- create a table of 100GB in size. > insert into t select x,md5(x::text) from > generate_series(1,100*1572.7381809)x; -- took 1 hr 18 mins > vacuum freeze t; > > query = select count(*) from t; > Disk = Samsung SSD 850 EVO mSATA 1TB. > > Master: > workers = 0 : Time: 269104.281 ms (04:29.104) 380MB/s > workers = 1 : Time: 741183.646 ms (12:21.184) 138MB/s > workers = 2 : Time: 656963.754 ms (10:56.964) 155MB/s > > Patched: > > workers = 0 : Should be the same as before as the code for this didn't change. > workers = 1 : Time: 300299.364 ms (05:00.299) 340MB/s > workers = 2 : Time: 270213.726 ms (04:30.214) 379MB/s > > (A better query would likely have been just: SELECT * FROM t WHERE a = > 1; but I'd run the test by the time I thought of that.) > > So, this shows that Windows, at least 8.1, does suffer from this too. I repeated this test on an up-to-date Windows 10 machine to see if the later kernel is any better at the readahead. Results for the same test are: Master: max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) (706.2MB/sec) max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) (320.1MB/sec) max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) (318.6MB/sec) Patched: max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) (741.7MB/sec) max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) (723.2MB/sec) max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) (731.4MB/sec) David
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jun 3, 2020 at 3:18 PM Soumyadeep Chakraborty wrote: > Idk if that is a lesser evil than the workers > being idle..probably not? Apologies, I meant that the extra atomic fetches are probably a lesser evil than the workers being idle. Soumyadeep
Re: Parallel Seq Scan vs kernel read ahead
On Sat, May 23, 2020 at 12:00 AM Robert Haas wrote: > I think there's a significant difference. The idea I remember being > discussed at the time was to divide the relation into equal parts at > the very start and give one part to each worker. I think that carries > a lot of risk of some workers finishing much sooner than others. Was the idea of work-stealing considered? Here is what I have been thinking about: Each worker would be assigned a contiguous chunk of blocks at init time. Then if a worker is finished with its work, it can inspect other workers' remaining work and "steal" some of the blocks from the end of the victim worker's allocation. Considerations for such a scheme:
1. Victim selection: Who will be the victim worker? It can be selected at random if nothing else.
2. How many blocks to steal? Stealing half of the victim's remaining blocks seems to be fair.
3. Stealing threshold: We should disallow stealing if the amount of remaining work is not enough in the victim worker.
4. Additional parallel state: Representing the chunk of "work". I guess one variable for the current block and one for the last block in the chunk allocated. The latter would have to be protected with atomic fetches as it would be decremented by the stealing worker.
5. Point 4 implies that there might be more atomic fetch operations as compared to this patch. Idk if that is a lesser evil than the workers being idle..probably not? A way to salvage that a little would be to forego atomic fetches when the amount of work remaining is less than the threshold discussed in 3 as there is no possibility of work stealing then.
Regards, Soumyadeep
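To make the scheme above concrete, here is a standalone C sketch (not PostgreSQL code) of points 1-3: a finished worker picks a random victim and, if the victim has more than a threshold of blocks remaining, steals the upper half of its range. For simplicity it uses a per-worker pthread mutex rather than the atomics discussed in points 4 and 5; the struct and function names are invented for the example.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define STEAL_THRESHOLD 64          /* illustrative: don't steal tiny ranges */

typedef struct WorkerRange
{
    pthread_mutex_t lock;           /* assumed initialised at scan start */
    uint64_t    next;               /* next block this worker will scan */
    uint64_t    last;               /* one past the last block it owns */
} WorkerRange;

/* Claim one block from our own range; false means the range is empty. */
bool
range_next_block(WorkerRange *r, uint64_t *block)
{
    bool        got = false;

    pthread_mutex_lock(&r->lock);
    if (r->next < r->last)
    {
        *block = r->next++;
        got = true;
    }
    pthread_mutex_unlock(&r->lock);
    return got;
}

/* Called only when our own range is empty: steal the upper half of a
 * randomly chosen victim's remaining blocks, if it has enough left. */
bool
try_steal(WorkerRange *workers, int nworkers, int myid)
{
    int         victim = rand() % nworkers;
    uint64_t    steal_from = 0;
    uint64_t    steal_to = 0;
    bool        stole = false;

    if (victim == myid)
        return false;

    pthread_mutex_lock(&workers[victim].lock);
    if (workers[victim].last - workers[victim].next > STEAL_THRESHOLD)
    {
        uint64_t    mid = workers[victim].next +
            (workers[victim].last - workers[victim].next) / 2;

        steal_from = mid;           /* we take the upper half ... */
        steal_to = workers[victim].last;
        workers[victim].last = mid; /* ... the victim keeps the lower half */
        stole = true;
    }
    pthread_mutex_unlock(&workers[victim].lock);

    if (stole)
    {
        pthread_mutex_lock(&workers[myid].lock);
        workers[myid].next = steal_from;
        workers[myid].last = steal_to;
        pthread_mutex_unlock(&workers[myid].lock);
    }
    return stole;
}

A worker's scan loop would call range_next_block() until it returns false, then loop on try_steal() until no victim has more than the threshold remaining, at which point that worker is done.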
Re: Parallel Seq Scan vs kernel read ahead
> It doesn't look like it's using table_block_parallelscan_nextpage() as > a block allocator so it's not affected by the patch. It has its own > thing zs_parallelscan_nextrange(), which does > pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, > ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10. My apologies, I was too hasty. Indeed, you are correct. Zedstore's unit of work is chunks of the logical zstid space. There is a correlation between the zstid and blocks: zstids near each other are likely to lie in the same block or in neighboring blocks. It would be interesting to try something like this patch for Zedstore. Regards, Soumyadeep
Re: Parallel Seq Scan vs kernel read ahead
On Sat, 23 May 2020 at 06:31, Robert Haas wrote: > > On Thu, May 21, 2020 at 6:28 PM Thomas Munro wrote: > > Right, I think it's safe. I think you were probably right that > > ramp-up isn't actually useful though, it's only the end of the scan > > that requires special treatment so we don't get unfair allocation as > > the work runs out, due to course grain. > > The ramp-up seems like it might be useful if the query involves a LIMIT. That's true, but I think the intelligence there would need to go beyond, "if there's a LIMIT clause, do ramp-up", as we might have already fully ramped up well before the LIMIT is reached. David
Re: Parallel Seq Scan vs kernel read ahead
On Sat, May 23, 2020 at 12:00 AM Robert Haas wrote: > > On Tue, May 19, 2020 at 10:23 PM Amit Kapila wrote: > > Good experiment. IIRC, we have discussed a similar idea during the > > development of this feature but we haven't seen any better results by > > allocating in ranges on the systems we have tried. So, we went with > > the current approach which is more granular and seems to allow better > > parallelism. I feel we need to ensure that we don't regress > > parallelism in existing cases, otherwise, the idea sounds promising to > > me. > > I think there's a significant difference. The idea I remember being > discussed at the time was to divide the relation into equal parts at > the very start and give one part to each worker. > I have checked the archives and found that we did some testing by allowing each worker to work on a block-by-block basis and by having a fixed number of chunks for each worker. See the results [1] (the program used is attached in another email [2]). The conclusion was that we didn't find much difference with any of those approaches. The reason could be that we tested on a machine (I think it was hydra (Power-7)) where the chunk size doesn't matter, but I think it could show some difference on the machines on which Thomas and David are testing. At that time there was also a discussion of chunking on the basis of "each worker processes one 1GB-sized segment", which Tom and Stephen seemed to support [3]. The idea of dividing the relation into per-worker segments for a parallel scan has been used by another database (DynamoDB) as well [4], so it is not completely without merit. I understand that larger chunks can lead to unequal work distribution, but they have their own advantages, so we might want to get the best of both worlds: start with larger chunks and then slowly reduce the chunk size towards the end of the scan. I am not sure what the best thing to do here is, but maybe some experiments can shed light on this mystery.
[1] - https://www.postgresql.org/message-id/CAA4eK1JHCmN2X1LjQ4bOmLApt%2BbtOuid5Vqqk5G6dDFV69iyHg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JyVNEBE8KuxKd3bJhkG6tSbpBYX_%2BZtP34ZSTCSucA1A%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/30549.1422459647%40sss.pgh.pa.us
[4] - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
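One concrete shape for the "larger chunks first, smaller chunks near the end" idea is the classic guided-scheduling formula: hand out a fixed fraction of the blocks that are still unallocated, clamped between a minimum and a maximum. The following standalone C sketch only illustrates that idea; the constants and names are arbitrary, not a proposal for the actual patch.

#include <stdint.h>

#define CHUNK_DIVISOR   8           /* fraction of a worker's share per call */
#define MIN_CHUNK       1
#define MAX_CHUNK       8192

/* Hand out roughly 1/(nworkers * CHUNK_DIVISOR) of the remaining blocks,
 * so chunks start large and shrink as the scan nears the end. */
uint64_t
next_chunk_size(uint64_t remaining_blocks, int nworkers)
{
    uint64_t    chunk = remaining_blocks / ((uint64_t) nworkers * CHUNK_DIVISOR);

    if (chunk < MIN_CHUNK)
        chunk = MIN_CHUNK;
    if (chunk > MAX_CHUNK)
        chunk = MAX_CHUNK;
    if (chunk > remaining_blocks)
        chunk = remaining_blocks;
    return chunk;
}

With 4 workers and a 13-million-block relation this starts out clamped at 8192 blocks per chunk and falls back towards single blocks as the remainder shrinks, which keeps the workers finishing at roughly the same time.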
Re: Parallel Seq Scan vs kernel read ahead
On Thu, May 21, 2020 at 6:28 PM Thomas Munro wrote: > Right, I think it's safe. I think you were probably right that > ramp-up isn't actually useful though, it's only the end of the scan > that requires special treatment so we don't get unfair allocation as > the work runs out, due to course grain. The ramp-up seems like it might be useful if the query involves a LIMIT. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Tue, May 19, 2020 at 10:23 PM Amit Kapila wrote: > Good experiment. IIRC, we have discussed a similar idea during the > development of this feature but we haven't seen any better results by > allocating in ranges on the systems we have tried. So, we want with > the current approach which is more granular and seems to allow better > parallelism. I feel we need to ensure that we don't regress > parallelism in existing cases, otherwise, the idea sounds promising to > me. I think there's a significant difference. The idea I remember being discussed at the time was to divide the relation into equal parts at the very start and give one part to each worker. I think that carries a lot of risk of some workers finishing much sooner than others. This idea, AIUI, is to divide the relation into chunks that are small compared to the size of the relation, but larger than 1 block. That carries some risk of an unequal division of work, as has already been noted, but it's much less, especially if we use smaller chunk sizes once we get close to the end, as proposed here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 1:14 PM Soumyadeep Chakraborty wrote: > Some more data points: Thanks! > max_parallel_workers_per_gatherTime(seconds) > 0 29.04s > 1 29.17s > 2 28.78s > 6 291.27s > > I checked with explain analyze to ensure that the number of workers > planned = max_parallel_workers_per_gather > > Apart from the last result (max_parallel_workers_per_gather=6), all > the other results seem favorable. > Could the last result be down to the fact that the number of workers > planned exceeded the number of vCPUs? Interesting. I guess it has to do with patterns emerging from various parameters like that magic number 64 I hard coded into the test patch, and other unknowns in your storage stack. I see a small drop off that I can't explain yet, but not that. > I also wanted to evaluate Zedstore with your patch. > I used the same setup as above. > No discernible difference though, maybe I'm missing something: It doesn't look like it's using table_block_parallelscan_nextpage() as a block allocator so it's not affected by the patch. It has its own thing zs_parallelscan_nextrange(), which does pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10.
Re: Parallel Seq Scan vs kernel read ahead
Hi Thomas, Some more data points: create table t_heap as select generate_series(1, 1) i; Query: select count(*) from t_heap; shared_buffers=32MB (so that I don't have to clear buffers, OS page cache) OS: FreeBSD 12.1 with UFS on GCP 4 vCPUs, 4GB RAM Intel Skylake 22G Google PersistentDisk Time is measured with \timing on. Without your patch: max_parallel_workers_per_gatherTime(seconds) 0 33.88s 1 57.62s 2 62.01s 6 222.94s With your patch: max_parallel_workers_per_gatherTime(seconds) 0 29.04s 1 29.17s 2 28.78s 6 291.27s I checked with explain analyze to ensure that the number of workers planned = max_parallel_workers_per_gather Apart from the last result (max_parallel_workers_per_gather=6), all the other results seem favorable. Could the last result be down to the fact that the number of workers planned exceeded the number of vCPUs? I also wanted to evaluate Zedstore with your patch. I used the same setup as above. No discernible difference though, maybe I'm missing something: Without your patch: max_parallel_workers_per_gatherTime(seconds) 0 25.86s 1 15.70s 2 12.60s 6 12.41s With your patch: max_parallel_workers_per_gatherTime(seconds) 0 26.96s 1 15.73s 2 12.46s 6 12.10s -- Soumyadeep On Thu, May 21, 2020 at 3:28 PM Thomas Munro wrote: > On Fri, May 22, 2020 at 10:00 AM David Rowley > wrote: > > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > > For the patch. I know you just put it together quickly, but I don't > > > think you can do that ramp up the way you have. It looks like there's > > > a risk of torn reads and torn writes and I'm unsure how much that > > > could affect the test results here. > > > > Oops. On closer inspection, I see that memory is per worker, not > > global to the scan. > > Right, I think it's safe. I think you were probably right that > ramp-up isn't actually useful though, it's only the end of the scan > that requires special treatment so we don't get unfair allocation as > the work runs out, due to course grain. I suppose that even if you > have a scheme that falls back to fine grained allocation for the final > N pages, it's still possible that a highly distracted process (most > likely the leader given its double duties) can finish up sitting on a > large range of pages and eventually have to process them all at the > end after the other workers have already knocked off and gone for a > pint. > > >
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 10:00 AM David Rowley wrote: > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > For the patch. I know you just put it together quickly, but I don't > > think you can do that ramp up the way you have. It looks like there's > > a risk of torn reads and torn writes and I'm unsure how much that > > could affect the test results here. > > Oops. On closer inspection, I see that memory is per worker, not > global to the scan. Right, I think it's safe. I think you were probably right that ramp-up isn't actually useful though, it's only the end of the scan that requires special treatment so we don't get unfair allocation as the work runs out, due to coarse grain. I suppose that even if you have a scheme that falls back to fine-grained allocation for the final N pages, it's still possible that a highly distracted process (most likely the leader given its double duties) can finish up sitting on a large range of pages and eventually have to process them all at the end after the other workers have already knocked off and gone for a pint.
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 21 May 2020 at 17:06, David Rowley wrote: > For the patch. I know you just put it together quickly, but I don't > think you can do that ramp up the way you have. It looks like there's > a risk of torn reads and torn writes and I'm unsure how much that > could affect the test results here. Oops. On closer inspection, I see that memory is per worker, not global to the scan.
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 21 May 2020 at 14:32, Thomas Munro wrote: > Thanks. So it seems like Linux, Windows and anything using ZFS are > OK, which probably explains why we hadn't heard complaints about it. I tried out a different test on a Windows 8.1 machine I have here. I was concerned that the test that was used here ends up with tuples that are too narrow and that the executor would spend quite a bit of time going between nodes and performing the actual aggregation. I thought it might be good to add some padding so that there are far fewer tuples on the page. I ended up with: create table t (a int, b text); -- create a table of 100GB in size. insert into t select x,md5(x::text) from generate_series(1,100*1572.7381809)x; -- took 1 hr 18 mins vacuum freeze t; query = select count(*) from t; Disk = Samsung SSD 850 EVO mSATA 1TB. Master: workers = 0 : Time: 269104.281 ms (04:29.104) 380MB/s workers = 1 : Time: 741183.646 ms (12:21.184) 138MB/s workers = 2 : Time: 656963.754 ms (10:56.964) 155MB/s Patched: workers = 0 : Should be the same as before as the code for this didn't change. workers = 1 : Time: 300299.364 ms (05:00.299) 340MB/s workers = 2 : Time: 270213.726 ms (04:30.214) 379MB/s (A better query would likely have been just: SELECT * FROM t WHERE a = 1; but I'd run the test by the time I thought of that.) So, this shows that Windows, at least 8.1, does suffer from this too. For the patch: I know you just put it together quickly, but I don't think you can do that ramp-up the way you have. It looks like there's a risk of torn reads and torn writes and I'm unsure how much that could affect the test results here. It looks like there's a risk that a worker gets some garbage number of pages to read rather than what you think it will. I also don't quite understand the need for a ramp-up in pages per serving. Shouldn't you instantly start at some size and hold that, then only maybe ramp down at the end so that workers all finish at close to the same time? However, I did have other ideas which I'll explain below. From my previous work on that function to add the atomics, I did think that it would be better to dish out more than 1 page at a time. However, there is the risk that the workload is not evenly distributed between the workers. My thoughts were that we could divide the total pages by the number of workers, then again by 100, and dish out blocks based on that. That way workers will get about a 100th of their fair share of pages at once, so assuming there's an even amount of work to do per serving of pages, the last worker should only run for at most 1% longer. Perhaps that 100 should be 1000; then the run-on time for the last worker is just 0.1%. Perhaps the serving size can also be capped at some maximum like 64. We'll certainly need to ensure it's at least 1! I imagine that will eliminate the need for any ramp down of pages per serving near the end of the scan. David
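For illustration, here is a standalone C sketch of the serving idea described above: compute a serving size of roughly a 100th of each worker's fair share, clamped to at least 1 and at most 64 blocks, and have workers claim one serving per atomic fetch-add instead of one block at a time. This is not the actual patch; the type and function names are invented for the example.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SERVING_DIVISOR 100         /* or 1000 for a smaller run-on at the end */
#define MAX_SERVING     64
#define MIN_SERVING     1

typedef struct ScanSketch
{
    uint64_t    nblocks;            /* total blocks in the relation */
    uint64_t    serving;            /* blocks handed out per allocation */
    _Atomic uint64_t next_block;    /* shared allocation cursor */
} ScanSketch;

void
scan_init(ScanSketch *scan, uint64_t nblocks, int nworkers)
{
    uint64_t    serving = nblocks / ((uint64_t) nworkers * SERVING_DIVISOR);

    if (serving < MIN_SERVING)
        serving = MIN_SERVING;
    if (serving > MAX_SERVING)
        serving = MAX_SERVING;

    scan->nblocks = nblocks;
    scan->serving = serving;
    atomic_init(&scan->next_block, 0);
}

/* Each worker calls this for its next contiguous range of blocks: one
 * atomic fetch-add per serving rather than per block.  Returns false
 * once the relation is exhausted. */
bool
scan_next_range(ScanSketch *scan, uint64_t *first, uint64_t *count)
{
    uint64_t    start = atomic_fetch_add(&scan->next_block, scan->serving);

    if (start >= scan->nblocks)
        return false;
    *first = start;
    *count = scan->serving;
    if (start + *count > scan->nblocks)
        *count = scan->nblocks - start;
    return true;
}

With the 100GB table above (about 13.1 million blocks) and 2 workers, a 100th of the fair share is well above the cap, so every serving is simply 64 blocks; the at-least-1 clamp only matters for tiny relations.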
Re: Parallel Seq Scan vs kernel read ahead
On Thu, May 21, 2020 at 1:38 PM Ranier Vilela wrote: >> >> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela >> >> wrote: >> >> > postgres=# set max_parallel_workers_per_gather = 0; >> >> > Time: 227238,445 ms (03:47,238) >> >> > postgres=# set max_parallel_workers_per_gather = 1; >> >> > Time: 138027,351 ms (02:18,027) > Vanila Postgres (latest) > > create table t as select generate_series(1, 8)::int i; > set max_parallel_workers_per_gather = 0; > Time: 210524,317 ms (03:30,524) > set max_parallel_workers_per_gather = 1; > Time: 146982,737 ms (02:26,983) Thanks. So it seems like Linux, Windows and anything using ZFS are OK, which probably explains why we hadn't heard complaints about it.
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 9:03 PM Thomas Munro wrote: > On Thu, May 21, 2020 at 11:51 AM Ranier Vilela > wrote: > > On Wed, May 20, 2020 at 8:48 PM Thomas Munro < > thomas.mu...@gmail.com> wrote: > >> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela > wrote: > >> > postgres=# set max_parallel_workers_per_gather = 0; > >> > Time: 227238,445 ms (03:47,238) > >> > postgres=# set max_parallel_workers_per_gather = 1; > >> > Time: 138027,351 ms (02:18,027) > >> > >> Ok, so it looks like NT/NTFS isn't suffering from this problem. > >> Thanks for testing! > > > > Maybe it wasn’t clear, the tests were done with your patch applied. > > Oh! And how do the times look without it? > Vanilla Postgres (latest) create table t as select generate_series(1, 8)::int i; set max_parallel_workers_per_gather = 0; Time: 210524,317 ms (03:30,524) set max_parallel_workers_per_gather = 1; Time: 146982,737 ms (02:26,983) regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Thu, May 21, 2020 at 11:51 AM Ranier Vilela wrote: > On Wed, May 20, 2020 at 8:48 PM Thomas Munro > wrote: >> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela wrote: >> > postgres=# set max_parallel_workers_per_gather = 0; >> > Time: 227238,445 ms (03:47,238) >> > postgres=# set max_parallel_workers_per_gather = 1; >> > Time: 138027,351 ms (02:18,027) >> >> Ok, so it looks like NT/NTFS isn't suffering from this problem. >> Thanks for testing! > > Maybe it wasn’t clear, the tests were done with your patch applied. Oh! And how do the times look without it?
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 8:48 PM Thomas Munro wrote: > On Thu, May 21, 2020 at 11:15 AM Ranier Vilela > wrote: > > postgres=# set max_parallel_workers_per_gather = 0; > > Time: 227238,445 ms (03:47,238) > > postgres=# set max_parallel_workers_per_gather = 1; > > Time: 138027,351 ms (02:18,027) > > Ok, so it looks like NT/NTFS isn't suffering from this problem. > Thanks for testing! > Maybe it wasn’t clear, the tests were done with your patch applied. regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Thu, May 21, 2020 at 11:15 AM Ranier Vilela wrote: > postgres=# set max_parallel_workers_per_gather = 0; > Time: 227238,445 ms (03:47,238) > postgres=# set max_parallel_workers_per_gather = 1; > Time: 138027,351 ms (02:18,027) Ok, so it looks like NT/NTFS isn't suffering from this problem. Thanks for testing!
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 6:49 PM Thomas Munro wrote: > On Wed, May 20, 2020 at 11:03 PM Ranier Vilela > wrote: > > Time: 47767,916 ms (00:47,768) > > Time: 32645,448 ms (00:32,645) > > Just to make sure kernel caching isn't helping here, maybe try making > the table 2x or 4x bigger? My test was on a virtual machine with only > 4GB RAM, so the table couldn't be entirely cached. > 4x bigger. Postgres default settings. postgres=# create table t as select generate_series(1, 8)::int i; SELECT 8 postgres=# \timing Timing is on. postgres=# set max_parallel_workers_per_gather = 0; SET Time: 8,622 ms postgres=# select count(*) from t; count --- 8 (1 row) Time: 227238,445 ms (03:47,238) postgres=# set max_parallel_workers_per_gather = 1; SET Time: 20,975 ms postgres=# select count(*) from t; count --- 8 (1 row) Time: 138027,351 ms (02:18,027) regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 11:03 PM Ranier Vilela wrote: > Time: 47767,916 ms (00:47,768) > Time: 32645,448 ms (00:32,645) Just to make sure kernel caching isn't helping here, maybe try making the table 2x or 4x bigger? My test was on a virtual machine with only 4GB RAM, so the table couldn't be entirely cached. > How display " -> execution time 5.2s, average read size ="? Execution time is what you showed, and average read size should be inside the Windows performance window somewhere (not sure what it's called).
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 12:09 AM Thomas Munro wrote: > On Wed, May 20, 2020 at 2:23 PM Amit Kapila > wrote: > > Good experiment. IIRC, we have discussed a similar idea during the > > development of this feature but we haven't seen any better results by > > allocating in ranges on the systems we have tried. So, we went with > > the current approach which is more granular and seems to allow better > > parallelism. I feel we need to ensure that we don't regress > > parallelism in existing cases, otherwise, the idea sounds promising to > > me. > > Yeah, Linux seems to do pretty well at least with smallish numbers of > workers, and when you use large numbers you can probably tune your way > out of the problem. ZFS seems to do fine. I wonder how well the > other OSes cope. > Windows 10 (64bits, i5, 8GB, SSD) postgres=# set max_parallel_workers_per_gather = 0; SET Time: 2,537 ms postgres=# select count(*) from t; count --- 2 (1 row) Time: 47767,916 ms (00:47,768) postgres=# set max_parallel_workers_per_gather = 1; SET Time: 4,889 ms postgres=# select count(*) from t; count --- 2 (1 row) Time: 32645,448 ms (00:32,645) How do I display " -> execution time 5.2s, average read size ="? regards, Ranier Vilela
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 2:23 PM Amit Kapila wrote: > Good experiment. IIRC, we have discussed a similar idea during the > development of this feature but we haven't seen any better results by > allocating in ranges on the systems we have tried. So, we want with > the current approach which is more granular and seems to allow better > parallelism. I feel we need to ensure that we don't regress > parallelism in existing cases, otherwise, the idea sounds promising to > me. Yeah, Linux seems to do pretty well at least with smallish numbers of workers, and when you use large numbers you can probably tune your way out of the problem. ZFS seems to do fine. I wonder how well the other OSes cope.
Re: Parallel Seq Scan vs kernel read ahead
On Wed, May 20, 2020 at 7:24 AM Thomas Munro wrote: > > Hello hackers, > > Parallel sequential scan relies on the kernel detecting sequential > access, but we don't make the job easy. The resulting striding > pattern works terribly on strict next-block systems like FreeBSD UFS, > and degrades rapidly when you add too many workers on sliding window > systems like Linux. > > Demonstration using FreeBSD on UFS on a virtual machine, taking > ballpark figures from iostat: > > create table t as select generate_series(1, 2)::int i; > > set max_parallel_workers_per_gather = 0; > select count(*) from t; > -> execution time 13.3s, average read size = ~128kB, ~500MB/s > > set max_parallel_workers_per_gather = 1; > select count(*) from t; > -> execution time 24.9s, average read size = ~32kB, ~250MB/s > > Note the small read size, which means that there was no read > clustering happening at all: that's the logical block size of this > filesystem. > > That explains some complaints I've heard about PostgreSQL performance > on that filesystem: parallel query destroys I/O performance. > > As a quick experiment, I tried teaching the block allocator to > allocate ranges of up to 64 blocks at a time, ramping up incrementally, > and ramping down at the end, and I got: > Good experiment. IIRC, we have discussed a similar idea during the development of this feature but we haven't seen any better results by allocating in ranges on the systems we have tried. So, we went with the current approach which is more granular and seems to allow better parallelism. I feel we need to ensure that we don't regress parallelism in existing cases, otherwise, the idea sounds promising to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com