Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

Greg Stark Fri, 06 Jan 2006 13:13:39 -0800

"Jim C. Nasby" <[EMAIL PROTECTED]> writes:

> Before we start debating merits of proposals based on random reads, can
> someone confirm that the sampling code actually does read randomly? I
> looked at it yesterday; there is a comment that states that blocks to be
> scanned are passed to the analyze function in physical order, and AFAICT
> the function that chooses blocks does so based strictly on applying a
> probability function to block numbers as it increments a counter. It
> seems that any reading is actually sequential and not random, which
> makes all the random_page_cost hand-waving null and void.
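
The block selection described in the quote (walk the block numbers with a counter and apply a probability test to each) resembles Knuth's selection sampling, Algorithm S. The following is a minimal sketch of that technique, not the actual ANALYZE code; the name sample_blocks and its parameters are hypothetical. Whatever the random draws are, the chosen block numbers come out in ascending physical order, so the resulting reads only ever move forward through the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not taken from PostgreSQL): choose nsample of
 * nblocks block numbers uniformly at random using selection sampling
 * (Knuth's Algorithm S).  Chosen block numbers are written to out[] in
 * ascending order.  Returns the number chosen, which is
 * min(nsample, nblocks).
 */
size_t sample_blocks(size_t nblocks, size_t nsample, size_t *out)
{
    size_t chosen = 0;

    assert(out != NULL);

    for (size_t blk = 0; blk < nblocks && chosen < nsample; blk++) {
        size_t remaining = nblocks - blk;    /* blocks not yet considered */
        size_t needed = nsample - chosen;    /* blocks still to be picked */

        /* uniform draw in [0, 1) */
        double u = (double) rand() / ((double) RAND_MAX + 1.0);

        /* pick this block with probability needed / remaining */
        if (u * (double) remaining < (double) needed)
            out[chosen++] = blk;
    }
    return chosen;
}
```

Calling sample_blocks(128000, 1280, out) would fill out[] with 1280 increasing block numbers; seed the generator with srand() first if different samples are wanted on each run.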


Hm. I'm curious how much that actually behaves like a sequential scan. I
think I'll do some experiments.

Reading 1% (1267 read, 126733 skipped):          7748264us
Reading 2% (2609 read, 125391 skipped):         12672025us
Reading 5% (6502 read, 121498 skipped):         19005678us
Reading 5% (6246 read, 121754 skipped):         18509770us
Reading 10% (12975 read, 115025 skipped):       19305446us
Reading 20% (25716 read, 102284 skipped):       18147151us
Reading 50% (63656 read, 64344 skipped):        18089229us
Reading 100% (128000 read, 0 skipped):          18173003us

These numbers don't make much sense to me. It seems like reading 5% is about as
slow as reading the whole file, which is even worse than I expected. I thought
I was being a bit pessimistic in guessing that reading 5% would be as slow as
reading 20% of the table.
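
One way to look at the timings above is to divide each elapsed time by the number of blocks actually read. The sketch below does only that arithmetic on the figures already reported; the struct and function names are hypothetical. It works out to roughly 6.1 ms per block at 1%, about 2.9 ms at 5%, and about 0.14 ms at 100%. Reading the 1% figure as something close to a full seek per block is an assumption, not something these measurements prove.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the timings reported above. */
struct run {
    int  perc;         /* requested percentage */
    long blocks_read;  /* blocks actually read */
    long elapsed_us;   /* elapsed wall-clock time, microseconds */
};

static const struct run runs[] = {
    {   1,   1267,  7748264 },
    {   2,   2609, 12672025 },
    {   5,   6502, 19005678 },
    {   5,   6246, 18509770 },
    {  10,  12975, 19305446 },
    {  20,  25716, 18147151 },
    {  50,  63656, 18089229 },
    { 100, 128000, 18173003 },
};

/* Average elapsed microseconds per block actually read. */
double usec_per_block(long elapsed_us, long blocks_read)
{
    assert(blocks_read > 0);
    return (double) elapsed_us / (double) blocks_read;
}

/* Print the per-block cost of every run, one line per run. */
void print_per_block_cost(FILE *out)
{
    size_t n = sizeof runs / sizeof runs[0];

    for (size_t i = 0; i < n; i++)
        fprintf(out, "%3d%%: %8.1f us per block read\n",
                runs[i].perc,
                usec_per_block(runs[i].elapsed_us, runs[i].blocks_read));
}
```

Calling print_per_block_cost(stderr) prints one line per run, which makes it easier to see where the per-block cost stops falling.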

Anyone see anything wrong with my methodology?

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 8192

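int main(int argc, char *argv[])
{
  char *fn;
  int fd;
  int perc;
  struct stat statbuf;
  struct timeval tv1,tv2;
  off_t size, offset;
  char buf[BLOCKSIZE];  /* one block of bytes, not an array of pointers */
  int b_read=0, b_skipped=0;

  if (argc != 3) {
    fprintf(stderr, "usage: %s file percent\n", argv[0]);
    exit(1);
  }
  fn = argv[1];
  perc = atoi(argv[2]);

  fd = open(fn, O_RDONLY);
  if (fd < 0 || fstat(fd, &statbuf) < 0) {
    perror(fn);
    exit(1);
  }
  size = statbuf.st_size;

  /* round down to a whole number of blocks */
  size = size/BLOCKSIZE*BLOCKSIZE;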
  
  gettimeofday(&tv1, NULL);

  srandom(getpid()^tv1.tv_sec^tv1.tv_usec);

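  /* visit blocks in physical order, reading each with probability perc% */
  for(offset=0;offset<size;offset+=BLOCKSIZE) {
    if (random()%100 < perc) {
      lseek(fd, offset, SEEK_SET);
      if (read(fd, buf, BLOCKSIZE) != BLOCKSIZE) {
        perror("read");
        exit(1);
      }
      b_read++;
    } else {
      b_skipped++;
    }
  }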
  
  gettimeofday(&tv2, NULL);
  
  /* cast explicitly: time_t and suseconds_t are not long on every platform */
  fprintf(stderr,
	  "Reading %d%% (%d read, %d skipped): %ldus\n",
	 perc, b_read, b_skipped,
	 (long)(tv2.tv_sec-tv1.tv_sec)*1000000L + (long)(tv2.tv_usec-tv1.tv_usec)
	 );
  exit(0);
}


-- 
greg
