Hi,

In reference to the seq scans roadmap, I have just submitted a patch that addresses some of the concerns.

The patch does this:

1. for small relations (smaller than 60% of the buffer pool), use the current logic
2. for large relations (sketched below):
        - use a ring buffer in the heap scan
        - pin the first 12 pages when the scan starts
        - after every 4 pages are consumed, read and pin the next 4 pages
        - invalidate pages already used by the scan so they do not force other useful pages out of the buffer pool

4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
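
Roughly, the large-relation path works like the standalone sketch below. This is only an illustration of the policy described above, not the actual patch code; the type and function names, the constants' placement, and the logging stubs are all made up.

/*
 * Standalone sketch of the large-relation scan policy -- not the patch
 * code.  The 12-page initial window and 4-page refill follow the
 * description above; everything else is illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

#define INITIAL_WINDOW 12       /* pages pinned when the scan starts */
#define REFILL_BATCH    4       /* pages read and pinned per refill */

typedef struct ScanRing
{
    int next_to_read;           /* next block to read and pin */
    int next_to_consume;        /* next block handed to the scan */
    int nblocks;                /* relation size in blocks */
} ScanRing;

/* stand-ins for the buffer manager calls; here they only log */
static void read_and_pin(int blkno) { printf("pin  block %d\n", blkno); }
static void invalidate(int blkno)   { printf("drop block %d\n", blkno); }

static void
ring_start(ScanRing *ring, int nblocks)
{
    ring->nblocks = nblocks;
    ring->next_to_consume = 0;
    ring->next_to_read = 0;

    /* pin the first 12 pages up front */
    while (ring->next_to_read < nblocks && ring->next_to_read < INITIAL_WINDOW)
        read_and_pin(ring->next_to_read++);
}

static bool
ring_next(ScanRing *ring, int *blkno)
{
    /*
     * The previously returned page has been fully scanned; invalidate it
     * so it does not force other useful pages out of the buffer pool.
     */
    if (ring->next_to_consume > 0)
        invalidate(ring->next_to_consume - 1);

    if (ring->next_to_consume >= ring->nblocks)
        return false;

    *blkno = ring->next_to_consume++;

    /* after every 4 pages are consumed, read and pin the next 4 */
    if (ring->next_to_consume % REFILL_BATCH == 0)
    {
        int i;

        for (i = 0; i < REFILL_BATCH && ring->next_to_read < ring->nblocks; i++)
            read_and_pin(ring->next_to_read++);
    }

    return true;
}

int
main(void)
{
    ScanRing ring;
    int blkno;

    ring_start(&ring, 20);      /* pretend relation of 20 blocks */
    while (ring_next(&ring, &blkno))
        ;                       /* the heap scan would read tuples on blkno here */
    return 0;
}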

If there is interest, I can submit another scan patch that returns N tuples at a time instead of the current one-tuple-at-a-time interface. This improves code locality and further improves performance by another 10-20%.
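
As a rough illustration of what such a batched interface could look like (the function name, BATCH_SIZE, and the use of plain ints as stand-in "tuples" are all hypothetical, not a proposed API):

/*
 * Hypothetical sketch of an N-tuples-at-a-time scan interface, contrasted
 * with the current one-tuple-per-call loop.
 */
#include <stdio.h>

#define BATCH_SIZE 64           /* tuples returned per call */

typedef struct Scan { int next; int ntuples; } Scan;

/* fill a caller-supplied array with up to maxtuples tuples; return count */
static int
scan_getnext_batch(Scan *scan, int *tuples, int maxtuples)
{
    int n = 0;

    while (n < maxtuples && scan->next < scan->ntuples)
        tuples[n++] = scan->next++;
    return n;
}

int
main(void)
{
    Scan scan = {0, 200};       /* pretend relation of 200 tuples */
    int batch[BATCH_SIZE];
    long count = 0;
    int n, i;

    /*
     * The tight inner loop over the batch stays in cache, which is where
     * the claimed code-locality win over one-tuple-at-a-time comes from.
     */
    while ((n = scan_getnext_batch(&scan, batch, BATCH_SIZE)) > 0)
        for (i = 0; i < n; i++)
            count++;            /* process batch[i] here */

    printf("%ld tuples\n", count);
    return 0;
}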

For the TPC-H 1 GB tables, we are seeing a more than 20% improvement in scan times on the same hardware:

-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
6001215
(1 row)

Time: 2117.025 ms

-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
6001215
(1 row)

Time: 2722.441 ms


Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.

On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:

Luke Lonergan wrote:
>> What do you mean with using readahead inside the heapscan? Starting an async read request?
> Nope - just reading N buffers ahead for seqscans. Subsequent calls use
> previously read pages.  The objective is to issue contiguous reads to
> the OS in sizes greater than the PG page size (which is much smaller
> than what is needed for fast sequential I/O).

Are you filling multiple buffers in the buffer cache with a single read-call? The OS should be doing readahead for us anyway, so I don't see how just issuing multiple ReadBuffers one after each other helps.

> Yes, I think the ring buffer strategy should be used when the table size is > 1 x bufcache and the ring buffer should be of a fixed size smaller
> than L2 cache (32KB - 128KB seems to work well).

I think we want to let the ring grow larger than that for updating transactions and vacuums, though, to avoid the WAL flush problem.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
