On Wed, Jan 28, 2015 at 8:59 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> I have tried the tests again and found that I have forgotten to increase
> max_worker_processes due to which the data is so different. Basically
> at higher client count it is just scanning lesser number of blocks in
> fixed chunk approach. So today I again tried with changing
> max_worker_processes and found that there is not much difference in
> performance at higher client count. I will take some more data for
> both block_by_block and fixed_chunk approach and repost the data.
>
I have again taken the data and found that there is not much difference
between the block-by-block and the fixed-chunk approach; the data is at the
end of this mail for your reference. There is some variation in a few cases:
for example, the fixed-chunk approach shows a lower time in the 8-worker
case, although on certain executions it has taken almost the same time as
with the other worker counts.

If we go with the block-by-block approach, we have the advantage that the
work-distribution granularity is smaller and hence better; if we go with the
chunk-by-chunk approach (fixed chunks of 1GB), there is a good chance that
the kernel can better optimize the sequential reads.

Based on the inputs on this thread, one possible execution strategy could be:

a. In the optimizer, based on effective_io_concurrency, the size of the
   relation and parallel_seqscan_degree, decide how many workers can be used
   for executing the plan (a rough sketch of this heuristic is at the end of
   this mail):
   - Choose number_of_workers equal to effective_io_concurrency if it is
     less than parallel_seqscan_degree, else number_of_workers will be equal
     to parallel_seqscan_degree.
   - If the size of the relation is greater than number_of_workers GB (if
     number_of_workers is 8, then we need to compare the size of the
     relation with 8GB), then keep number_of_workers intact and distribute
     the remaining chunks/segments during execution; else reduce
     number_of_workers such that each worker gets 1GB to operate on.
   - If the size of the relation is less than 1GB, then we can either not
     choose parallel_seqscan at all, or use smaller chunks, or use the
     block-by-block approach.
   - Here we also need to consider other parameters like parallel_setup,
     parallel_startup and tuple_communication cost.

b. In the executor, if fewer workers are available than are required for
   statement execution, redistribute the remaining work among the available
   workers.

Performance Data:
Before the first run for each worker count, I have executed drop_caches to
clear the OS cache and restarted the server, so we can assume that except
for Run-1, all other runs have some caching effect.

Fixed-Chunks (time in ms)

No. of workers      0        8       16       32
Run-1          322822   245759   330097   330002
Run-2          275685   275428   301625   286251
Run-3          252129   244167   303494   278604
Run-4          252528   259273   250438   258636
Run-5          250612   242072   235384   265918

Block-By-Block (time in ms)

No. of workers      0        8       16       32
Run-1          323084   341950   338999   334100
Run-2          310968   349366   344272   322643
Run-3          250312   336227   346276   322274
Run-4          262744   314489   351652   325135
Run-5          265987   316260   342924   319200

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
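
To make the heuristic in (a) concrete, below is a minimal, self-contained C
sketch of the decision logic only. The names (choose_parallel_workers,
rel_size_bytes, and the example values in main) are assumptions made purely
for illustration and do not correspond to the actual patch; the sketch also
ignores the parallel_setup, parallel_startup and tuple_communication costs
mentioned above.

#include <stdio.h>
#include <stdint.h>

#define GB ((uint64_t) 1024 * 1024 * 1024)

/*
 * Pick the number of workers for a parallel sequential scan, following the
 * steps listed in (a).  Returns 0 when the relation is smaller than 1GB,
 * i.e. when we would either skip parallel_seqscan or fall back to smaller
 * chunks / block-by-block (not modelled here).
 */
static int
choose_parallel_workers(uint64_t rel_size_bytes,
                        int effective_io_concurrency,
                        int parallel_seqscan_degree)
{
    int nworkers;

    if (rel_size_bytes < GB)
        return 0;

    /* number_of_workers = min(effective_io_concurrency, parallel_seqscan_degree) */
    if (effective_io_concurrency < parallel_seqscan_degree)
        nworkers = effective_io_concurrency;
    else
        nworkers = parallel_seqscan_degree;

    /*
     * If the relation is smaller than number_of_workers GB, shrink the
     * worker count so that each worker gets roughly 1GB to operate on;
     * otherwise keep it intact and distribute the remaining 1GB chunks
     * during execution.
     */
    if (rel_size_bytes < (uint64_t) nworkers * GB)
        nworkers = (int) (rel_size_bytes / GB);

    return nworkers;
}

int
main(void)
{
    /* 12GB relation, effective_io_concurrency = 16, degree = 32 -> 12 workers */
    printf("%d\n", choose_parallel_workers(12 * GB, 16, 32));

    /* 100GB relation, effective_io_concurrency = 8, degree = 32 -> 8 workers,
     * with the remaining 1GB chunks handed out during execution */
    printf("%d\n", choose_parallel_workers(100 * GB, 8, 32));

    return 0;
}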