On 11/10/13 17:08, Satoshi Nagayasu wrote:
> (2013/10/11 7:32), Mark Kirkwood wrote:
>> On 11/10/13 11:09, Mark Kirkwood wrote:
>>> On 16/09/13 16:20, Satoshi Nagayasu wrote:
>>>> (2013/09/15 11:07), Peter Eisentraut wrote:
>>>>> On Sat, 2013-09-14 at 16:18 +0900, Satoshi Nagayasu wrote:
>>>>>> I'm looking forward to seeing more feedback on this approach,
>>>>>> in terms of design and performance improvement.
>>>>>> So, I have submitted this for the next CF.
>>>>> Your patch fails to build:
>>>>>
>>>>> pgstattuple.c: In function ‘pgstat_heap_sample’:
>>>>> pgstattuple.c:737:13: error: ‘SnapshotNow’ undeclared (first use in
>>>>> this function)
>>>>> pgstattuple.c:737:13: note: each undeclared identifier is reported
>>>>> only once for each function it appears in
>>>> Thanks for checking. Fixed to eliminate SnapshotNow.
>>>>
>>> This seems like a cool idea! I took a quick look, and initially
>>> replicated the sort of improvement you saw:
>>>
>>>
>>> bench=# explain analyze select * from pgstattuple('pgbench_accounts');
>>> QUERY PLAN
>>>
>>> --------------------------------------------------------------------------------
>>> Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual
>>> time=786.368..786.369 rows=1 loops=1)
>>> Total runtime: 786.384 ms
>>> (2 rows)
>>>
>>> bench=# explain analyze select * from pgstattuple2('pgbench_accounts');
>>> NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00,
>>> dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00
>>> QUERY PLAN
>>>
>>> --------------------------------------------------------------------------------
>>> Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 width=72) (actual
>>> time=12.004..12.005 rows=1 loops=1)
>>> Total runtime: 12.019 ms
>>> (2 rows)
>>>
>>>
>>>
>>> I wondered what sort of difference eliminating caching would make:
>>>
>>> $ sudo sysctl -w vm.drop_caches=3
>>>
>>> Repeating the above queries:
>>>
>>>
>>> bench=# explain analyze select * from pgstattuple('pgbench_accounts');
>>> QUERY PLAN
>>>
>>> --------------------------------------------------------------------------------
>>> Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual
>>> time=9503.774..9503.776 rows=1 loops=1)
>>> Total runtime: 9504.523 ms
>>> (2 rows)
>>>
>>> bench=# explain analyze select * from pgstattuple2('pgbench_accounts');
>>> NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00,
>>> dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00
>>> QUERY PLAN
>>>
>>> --------------------------------------------------------------------------------
>>> Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 width=72) (actual
>>> time=12330.630..12330.631 rows=1 loops=1)
>>> Total runtime: 12331.353 ms
>>> (2 rows)
>>>
>>>
>>> So the sampling code seems *slower* when the cache is completely cold -
>>> is that expected? (I have not looked at how the code works yet - I'll
>>> dive in later if I get a chance)!
> Thanks for testing that. It will be very helpful for improving the
> performance.
>
>> Quietly replying to myself - looking at the code, the sampler does 3000
>> random page reads... I guess this is slower than 163935 (the number of pages
>> in pgbench_accounts) sequential page reads thanks to OS readahead on my
>> type of disk (WD Velociraptor). Tweaking the number of random reads (i.e.
>> the sample size) down helps - but obviously that can impact estimation
>> accuracy.
>>
>> Thinking about this a bit more, I guess the elapsed runtime is not the
>> *only* thing to consider - the sampling code will cause way less
>> disruption to the OS page cache (3000 pages vs possibly lots more than
>> 3000 for reading an entire relation).
>>
>> Thoughts?
> I think it could be improved by sorting sample block numbers
> *before* physical block reads in order to eliminate random access
> on the disk.
>
> pseudo code:
> --------------------------------------
> for (i=0 ; i<SAMPLE_SIZE ; i++)
> {
>     sample_block[i] = random();
> }
>
> qsort(sample_block);
>
> for (i=0 ; i<SAMPLE_SIZE ; i++)
> {
>     buf = ReadBuffer(rel, sample_block[i]);
>
>     do_some_stats_stuff(buf);
> }
> --------------------------------------
>
> I guess it would help reduce the random-access problem.
>
> Any comments?

Ah yes - that's a good idea (rough patch to your patch attached)!
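
Separately from the attached patch, here is a rough standalone sketch of the
sort-then-read idea (my own throwaway toy, nothing PostgreSQL-specific: the
file argument, the 8kB block size, random()/pread() and the 3000-block sample
are all just stand-ins chosen to mirror the patch), in case it helps
illustrate why ordering the reads matters:

--------------------------------------
/*
 * sorted_sample.c - toy illustration of sort-then-read block sampling.
 * Everything here is stand-in plumbing for this sketch, not pgstattuple code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLOCK_SIZE	8192
#define SAMPLE_SIZE	3000

/* qsort comparator: ascending block numbers */
static int
compare_blocks(const void *a, const void *b)
{
	long ba = *(const long *) a;
	long bb = *(const long *) b;

	return (ba < bb) ? -1 : (ba > bb) ? 1 : 0;
}

int
main(int argc, char **argv)
{
	static char	buf[BLOCK_SIZE];
	long		sample_block[SAMPLE_SIZE];
	struct stat	st;
	long		nblocks;
	int			fd, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
	{
		fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
		return 1;
	}

	fstat(fd, &st);
	nblocks = st.st_size / BLOCK_SIZE;
	if (nblocks == 0)
		return 1;

	/* pick random block numbers ... */
	for (i = 0; i < SAMPLE_SIZE; i++)
		sample_block[i] = random() % nblocks;

	/* ... sort them so the reads walk the file front to back ... */
	qsort(sample_block, SAMPLE_SIZE, sizeof(long), compare_blocks);

	/* ... then read each sampled block in ascending file order */
	for (i = 0; i < SAMPLE_SIZE; i++)
	{
		if (pread(fd, buf, BLOCK_SIZE,
				  (off_t) sample_block[i] * BLOCK_SIZE) != BLOCK_SIZE)
			perror("pread");
		/* a real caller would gather per-block statistics here */
	}

	close(fd);
	return 0;
}
--------------------------------------

Rerunning the same query with the patch applied: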

bench=# explain analyze select * from pgstattuple2('pgbench_accounts');
NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00,
dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00
QUERY PLAN

--------------------------------------------------------------------------------
Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 width=72) (actual
time=9968.318..9968.319 rows=1 loops=1)
Total runtime: 9968.443 ms
(2 rows)

It would appear that the sampling approach is now not any worse than a complete scan...
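
That also lines up with a back-of-envelope check (my own ballpark figures,
not measured here): the full scan reads 163935 x 8kB, roughly 1.3 GB,
sequentially, which at something like 100-150 MB/s works out to around 9-13
seconds; 3000 *unsorted* random reads at a few milliseconds of seek time each
land in the same 10-20 second range; and once the 3000 reads are sorted, each
step is only a short forward seek, so the total falls back toward the
sequential figure.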

Cheers

Mark

*** pgstattuple.c.orig	2013-10-11 17:46:00.592666316 +1300
--- pgstattuple.c	2013-10-11 17:46:08.616450284 +1300
***************
*** 100,105 ****
--- 100,123 ----
  				   uint64 *tuple_count, uint64 *tuple_len,
  				   uint64 *dead_tuple_count, uint64 *dead_tuple_len,
  				   uint64 *free_space);
+ static int	compare_blocknumbers(const void *a, const void *b);
+ 
+ /*
+  * qsort comparator: ascending block number order
+  */
+ static int
+ compare_blocknumbers(const void *a, const void *b)
+ {
+ 	BlockNumber ba = *(const BlockNumber *)a;
+ 	BlockNumber bb = *(const BlockNumber *)b;
+ 
+ 	if (ba < bb)
+ 		return -1;
+ 	if (ba > bb)
+ 		return 1;
+ 	return 0;
+ }
+ 
  
  /*
   * build_pgstattuple_type -- build a pgstattuple_type tuple
***************
*** 686,692 ****
  pgstat_heap_sample(Relation rel, FunctionCallInfo fcinfo)
  {
  	BlockNumber nblocks;
! 	BlockNumber block;
  	pgstattuple_type stat = {0};
  	pgstattuple_block_stats block_stats[SAMPLE_SIZE];
  	int i;
--- 704,710 ----
  pgstat_heap_sample(Relation rel, FunctionCallInfo fcinfo)
  {
  	BlockNumber nblocks;
! 	BlockNumber sample_block[SAMPLE_SIZE];
  	pgstattuple_type stat = {0};
  	pgstattuple_block_stats block_stats[SAMPLE_SIZE];
  	int i;
***************
*** 695,700 ****
--- 713,725 ----
  
  	for (i=0 ; i<SAMPLE_SIZE ; i++)
  	{
+ 		sample_block[i] = (double)random() / RAND_MAX * nblocks;
+ 	}
+ 
+ 	qsort(sample_block, SAMPLE_SIZE, sizeof(BlockNumber), compare_blocknumbers);
+ 
+ 	for (i=0 ; i<SAMPLE_SIZE ; i++)
+ 	{
  		Buffer		buffer;
  		Page		page;
  		OffsetNumber pageoff;
***************
*** 702,713 ****
  		/*
  		 * sample random blocks
  		 */
- 		block = (double)random() / RAND_MAX * nblocks;
- 
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, block);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		page = BufferGetPage(buffer);
  
--- 727,736 ----
  		/*
  		 * sample random blocks
  		 */
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, sample_block[i]);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		page = BufferGetPage(buffer);
  