Hi,
In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns.
The patch does this:
1. for small relations (smaller than 60% of the buffer pool), use the
current logic
2. for big relations:
- use a ring buffer in the heap scan
- pin the first 12 pages when the scan starts
- on consumption of every 4 pages, read and pin the next 4 pages
- invalidate used pages in the scan so they do not force out
other useful pages
4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
If there is interest, I can submit another scan patch that returns
N tuples at a time, instead of the current one-at-a-time interface. This
improves code locality and further improves performance by another
10-20%.
For the TPC-H 1G tables, we are seeing more than 20% improvement in scans
on the same hardware.
-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 2117.025 ms
-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 2722.441 ms
Suggestions for improvement are welcome.
Regards,
-cktan
Greenplum, Inc.
On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:
Luke Lonergan wrote:
>> What do you mean with using readahead inside the heapscan?
>> Starting an async read request?
>
> Nope - just reading N buffers ahead for seqscans. Subsequent calls use
> previously read pages. The objective is to issue contiguous reads to
> the OS in sizes greater than the PG page size (which is much smaller
> than what is needed for fast sequential I/O).

Are you filling multiple buffers in the buffer cache with a single
read-call? The OS should be doing readahead for us anyway, so I don't
see how just issuing multiple ReadBuffers one after each other helps.

> Yes, I think the ring buffer strategy should be used when the table
> size is > 1 x bufcache and the ring buffer should be of a fixed size
> smaller than L2 cache (32KB - 128KB seems to work well).

I think we want to let the ring grow larger than that for updating
transactions and vacuums, though, to avoid the WAL flush problem.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend