Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)
On Mon, Dec 12, 2016 at 6:14 PM, Andres Freund wrote:

> For Q1 I think the bigger win is JITing the transition function
> invocation in advance_aggregates/transition_function - that's IIRC where
> the biggest bottleneck lies.

Yeah, we bundle the agg core into our expr work... no point otherwise, since we do it for OLAP. As for experience, I think you have found out for yourself. There is a lot that can be done, and heuristics are involved in many places to decide whether to JIT fully, partially, or not at all. But it looks like you have a solid basis now to proceed and explore the beyond :-) Send me private email if you have a particular question.

Regards,
-cktan
Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)
Andres,

> dev (no jiting):
> Time: 30343.532 ms
>
> dev (jiting):
> SET jit_tuple_deforming = on;
> SET jit_expressions = true;
>
> Time: 24439.803 ms

FYI, a ~20% improvement for TPCH Q1 is consistent with what we find when we only JIT expressions.

Cheers,
-cktan
[HACKERS] left, right, full sort merge join plan
Hi Hackers,

I am looking for some help in creating LEFT/RIGHT/FULL sort-merge-join plans. Does anyone have a complete and reproducible script that would generate those plans? Can I find one in the regression test suite? If not, how do you exercise those code paths for QA purposes?

Thanks!
-cktan
Re: [HACKERS] Memory Accounting v11
I am sorry to ask questions unrelated to the subject, but how is tracking memory going to fix the HashAgg blow-up problem? Is there a plan to make HashAgg not blow up (i.e. spill the hash table)?

Thanks,
-cktan

On Thu, Jul 2, 2015 at 4:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

> On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:
>
>> The current state, where HashAgg just blows up the memory, is just not
>> reasonable, and we need to track the memory to fix that problem.
>>
>>> Meh. HashAgg could track its memory usage without loading the entire
>>> system with a penalty.
>>
>> +1 to a solution like that, although I don't think that's doable without
>> digging the info from memory contexts somehow.
>
> Jeff is right, we desperately need a solution and this is the place to
> start. Tom's concern remains valid: we must not load the entire system
> with a penalty.
>
> The only questions I have are:
>
> * If the memory allocations adapt to the usage pattern, then we expect
>   to see few memory chunk allocations. Why are we expecting the entire
>   system to experience a penalty?
>
> * If we do not manage our resources, how are we certain this does not
>   induce a penalty? Not tracking memory could be worse than tracking it.
>
> --
> Simon Riggs                    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
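The "dig the info from memory contexts without a per-allocation penalty" idea under discussion can be sketched as a toy in C. This is only an illustration of the accounting scheme, not PostgreSQL's actual MemoryContext API: each context counts bytes only when it acquires a malloc'd block (rare, compared to per-chunk palloc), and a caller such as HashAgg sums the subtree only when it actually wants a number.

```c
#include <assert.h>
#include <stddef.h>

/* Toy memory-context accounting (illustrative names, not PostgreSQL's):
 * count bytes per block allocation, total the tree on demand. */
typedef struct ToyContext {
    size_t self_bytes;              /* bytes in blocks owned directly */
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
} ToyContext;

static void toy_context_init(ToyContext *ctx)
{
    ctx->self_bytes = 0;
    ctx->first_child = NULL;
    ctx->next_sibling = NULL;
}

static void toy_context_add_child(ToyContext *parent, ToyContext *child)
{
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}

/* Called once per malloc'd block, not once per palloc chunk, so the
 * hot allocation path pays only a single add. */
static void toy_block_alloc(ToyContext *ctx, size_t bytes)
{
    ctx->self_bytes += bytes;
}

/* Walk the subtree only when the caller (e.g. HashAgg deciding whether
 * to spill) wants the current total. */
static size_t toy_context_total(const ToyContext *ctx)
{
    size_t total = ctx->self_bytes;
    for (const ToyContext *c = ctx->first_child; c; c = c->next_sibling)
        total += toy_context_total(c);
    return total;
}
```

The trade-off this models: the system-wide cost stays at one addition per block, while the on-demand walk is paid only by the node that opts into tracking.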
Re: [HACKERS] On columnar storage
http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf

In sketch: there is the concept of a Write-Optimized Store (WOS) and a Read-Optimized Store (ROS), and a TupleMover that moves records from WOS to ROS (somewhat like vacuum), and from ROS to WOS for updates.

It seems to me that heap is naturally a WOS, and only vacuuming for a column-backed heap table would move records from the heap into the column store. Of course, there would need to be a deeper vacuum when the column store itself needs to be vacuumed. When a record in the column store needs to be updated, a top-level transaction moves the record into the heap by marking the row as deleted in the column store and inserting the record into the heap store. The updates could then proceed according to the current heap transactional logic.

I am not sure if this makes sense, but it seems plausible, and:

1/ it retains the heap transactional logic code, which is very hard to get right;
2/ it makes the column store essentially a storage optimization that users do not need to be too concerned with; heap is kept small and old data are moved into the column store automatically;
3/ there is no need to keep 20+ bytes of visibility info on the rows in the column store;
4/ instead of a column store, this could be a heap (without visibility) store if you prefer rows.

I haven't thought about the indexing aspect of this. From a DW angle, I am more interested in a heap store that is backed by multiple column stores via partition keys.

Regards,
-cktan

On Mon, Jun 15, 2015 at 12:02 AM, Amit Kapila amit.kapil...@gmail.com wrote:

> On Fri, Jun 12, 2015 at 10:58 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
>
>> Amit Kapila wrote:
>>
>>> On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
>>>
>>>> One critical detail is what will be used to identify a heap row when
>>>> talking to a CS implementation. There are two main possibilities:
>>>> 1. use CTIDs
>>>> 2. use some logical tuple identifier
>>>> Using CTIDs is simpler. One disadvantage is that every UPDATE of a
>>>> row needs to let the CS know about the new location of the tuple, so
>>>> that the value is known associated with the new tuple location as
>>>> well as the old. This needs to happen even if the value of the column
>>>> itself is not changed.
>>>
>>> Isn't this somewhat similar to index segment?
>>
>> Not sure what you mean with index segment.
>
> The part similar to index segment is the reference to heap for
> visibility information and tuple id (TID). Have I misunderstood
> something?
>
>>> Will the column store obey a snapshot model similar to current heap
>>> tuples, and if so, will it derive the transaction information from the
>>> heap tuple?
>>
>> Yes, visibility will be tied to the heap tuple -- a value is accessed
>> only when its corresponding heap row has already been determined to be
>> visible.
>
> Won't it be possible that all columns of a table belong to the column
> store? I think for such a case heap will just be used to store
> transaction information (visibility info) for a column-store tuple,
> and depending on how the column store is organized, the reference to
> this information needs to be stored in the column store (the same row
> reference might need to be stored for each column value). Also, any
> write operation could lead to much more I/O because of updates at two
> different locations (one in the column store and the other in heap).
>
>> One interesting point that arises from this is about vacuum: when are
>> we able to remove a value from the store?
>
> Yes, that could also be quite tricky to handle; maybe one naive way
> could be to make a list of all TIDs from heap that need to be expired
> and then search for references to all those TIDs in the column store.
>
> I understand your point about re-using the existing transaction
> infrastructure for the column store by keeping that information in heap
> as it is done now, but I think that won't be free either. Another point
> to consider here is whether the column store needs transactional
> consistency; are other commercial/open-source column-store
> implementations transactionally consistent, and if yes, can't we think
> of doing it in a way where data could be present both in heap as well
> as in the column store (I understand that it could lead to duplicate
> data; OTOH, such an implementation anyway eliminates the need for
> indexes, so it may be worth considering)?
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Order of columns in query is important?!
You're right. I misread the problem description.

On Tue, May 26, 2015 at 3:13 AM, Petr Jelinek p...@2ndquadrant.com wrote:

> On 26/05/15 11:59, CK Tan wrote:
>
>> It has to do with the implementation of slot_getattr, which tries to
>> do the deform on-demand lazily. If you do select a,b,c, the execution
>> would do slot_getattr(1) and deform a, then slot_getattr(2), which
>> re-parses the tuple to deform b, and finally slot_getattr(3), which
>> parses the tuple yet again to deform c. Whereas if you do select
>> c,b,a, it would do slot_getattr(3) to deform c, and in the process
>> deform a and b in one pass. Subsequent calls to slot_getattr 1 and 2
>> would find the attribute ready and available, and return it (without
>> parsing the tuple again).
>
> If this were the case, changing column order would lead to a
> performance increase, not a decrease as reported. My guess would be the
> same as Amit's: it's most likely the additional projection step.
>
> --
> Petr Jelinek                  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
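The lazy-deforming behavior described in the quoted explanation can be modeled with a toy in C. This is a sketch only: ToySlot and toy_slot_getattr are illustrative names, not PostgreSQL's actual TupleTableSlot API, and it follows the email's description of walking the raw tuple from the start whenever a not-yet-deformed attribute is requested.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of lazy slot_getattr: fetching attribute k deforms
 * attributes 1..k and caches them, so asking for the last column first
 * fills the cache in a single pass over the raw tuple. */
#define NATTRS 3

typedef struct ToySlot {
    int values[NATTRS];  /* deformed attribute cache */
    int nvalid;          /* attributes deformed so far */
    int parse_passes;    /* how many times we walked the raw tuple */
    const int *raw;      /* the "on-disk" tuple */
} ToySlot;

static void toy_slot_init(ToySlot *slot, const int *raw)
{
    slot->nvalid = 0;
    slot->parse_passes = 0;
    slot->raw = raw;
}

/* attno is 1-based, as in slot_getattr */
static int toy_slot_getattr(ToySlot *slot, int attno)
{
    if (attno > slot->nvalid) {
        /* walk the raw tuple from the start up to attno */
        slot->parse_passes++;
        for (int i = 0; i < attno; i++)
            slot->values[i] = slot->raw[i];
        slot->nvalid = attno;
    }
    return slot->values[attno - 1];
}
```

In this model, requesting attributes in ascending order costs one walk per attribute, while requesting the highest attribute first costs a single walk and all later requests hit the cache.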
Re: [HACKERS] Order of columns in query is important?!
It has to do with the implementation of slot_getattr, which tries to do the deform on-demand lazily. If you do select a,b,c, the execution would do slot_getattr(1) and deform a, then slot_getattr(2), which re-parses the tuple to deform b, and finally slot_getattr(3), which parses the tuple yet again to deform c. Whereas if you do select c,b,a, it would do slot_getattr(3) to deform c, and in the process deform a and b in one pass. Subsequent calls to slot_getattr 1 and 2 would find the attribute ready and available, and return it (without parsing the tuple again).

For Vitesse X, we mark all columns that are required in the query during JIT compile, and deform them in one shot. PG should be able to do the same.

-cktan

On Mon, May 25, 2015 at 2:26 AM, Colin 't Hart co...@sharpheart.org wrote:

> Hi,
>
> I hope this is the best place to report this, or should I be on
> pgsql-general or pgsql-bugs?
>
> It seems that the order of columns in a query can make a difference in
> execution times. In my brief investigation, queries on
> table(a,b,c,d,e,f,g,h) of the form
>
> select * from table order by non-indexed-column limit 25;
> select a,b,c,d,e,f,g,h from table order by non-indexed-column limit 25;
>
> performed the same (approx 1.5 seconds on our customers table --
> rows=514431 width=215), while the query
>
> select h,g,f,e,d,c,b,a from table order by non-indexed-column limit 25;
>
> was about 50% slower (approx 2.2 seconds on our customers table).
>
> I had expected these to perform the same -- to my mind, column ordering
> in a query should be purely presentation -- as far as I'm concerned,
> the DBMS can retrieve the columns in a different order as long as it
> displays them in the order I've asked for them. Although, again, the
> order of columns in a resultset in Java or Python is mostly irrelevant,
> when displayed in psql I'd want the columns in the order I asked for.
>
> Is there really something strange happening here? Or is it perfectly
> explainable and expected?
>
> Regards,
>
> Colin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Josh,

Do you have a list of what needs to be done to keep the MONEY type? What is wrong with it?

Thanks,
-cktan

On Mon, Nov 3, 2014 at 10:30 PM, Feng Tian ft...@vitessedata.com wrote:

> Hi,
>
> This is Feng from Vitesse. The performance difference between Money and
> Numeric is *HUGE*. For TPCH Q1, the difference is 5x for stock Postgres,
> and ~20x for Vitesse. With stock Postgres on my laptop, TPCH 1G Q1 takes
> ~9s using the money type, and ~53s using Numeric(15,2).
>
> Kevin,
>
> test=# do $$ begin perform sum('1.01'::numeric) from generate_series(1,1000); end; $$;
>
> This may not reflect the difference between the two data types. One
> aggregate is not where most of the time is spent; TPCH Q1 has much more
> computation.
>
> On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck michael.ba...@credativ.de wrote:
>
>> Am Sonntag, den 02.11.2014, 12:41 -0500 schrieb Tom Lane:
>>
>>> BTW, after reflecting a bit more I'm less than convinced that this
>>> datatype is completely useless. Even if you prefer to store currency
>>> values in numeric columns, casting to or from money provides a way to
>>> accept or emit values in whatever monetary format the LC_MONETARY
>>> locale setting specifies. That seems like a useful feature, and it's
>>> one you could not easily duplicate using to_char/to_number (not to
>>> mention that those functions aren't without major shortcomings of
>>> their own).
>>
>> As an additional datapoint, Vitesse Data changed the DB schema from
>> NUMERIC to MONEY for their TPCH benchmark for performance reasons:
>> "The modification to data types is easy to understand -- money and
>> double types are faster than Numeric (and no one on this planet has a
>> bank account that overflows the money type, not any time soon)."[1]
>> And "Replaced NUMERIC fields representing currency with MONEY"[2].
>>
>> Not sure whether they modified/optimized PostgreSQL with respect to
>> the MONEY data type and/or how much performance that gained, so CCing
>> CK Tan as well.
>>
>> Michael
>>
>> [1] http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html
>> [2] http://vitessedata.com/benchmark/
>>
>> --
>> Michael Banck
>> Projektleiter / Berater
>> Tel.: +49 (2161) 4643-171
>> Fax:  +49 (2161) 4643-100
>> Email: michael.ba...@credativ.de
>>
>> credativ GmbH, HRB Mönchengladbach 12080
>> USt-ID-Nummer: DE204566209
>> Hohenzollernstr. 133, 41061 Mönchengladbach
>> Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Re: [HACKERS] Vitesse DB call for testing
Hi Mark,

Vitesse DB won't be open-sourced, or it would have been a contrib module to Postgres. We should take further discussions off this list. People should contact me directly if there are any questions.

Thanks,
ck...@vitessedata.com

On Fri, Oct 17, 2014 at 10:55 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

> On 18/10/14 07:13, Josh Berkus wrote:
>
>> CK,
>>
>> Before we go any further on this, how is Vitesse currently licensed?
>> Last time we talked it was still proprietary. If it's not being
>> open-sourced, we likely need to take discussion off this list.
>
> +1
>
> Guys, you need to 'fess up on the licensing!
>
> Regards
>
> Mark
[HACKERS] Vitesse DB call for testing
Hi everyone,

Vitesse DB 9.3.5.S is Postgres 9.3.5 with an LLVM-JIT query executor designed for compute-intensive OLAP workloads. We have gotten it to a reasonable state and would like to open it up to the pg hackers community for testing and suggestions.

Vitesse DB offers:
-- JIT compilation for compute-intensive queries
-- CSV parsing with SSE instructions
-- 100% binary compatibility with PG 9.3.5.

Our results show CSV imports run up to 2X faster, and TPCH Q1 runs 8X faster. Our TPCH 1GB benchmark results are also available at http://vitessedata.com/benchmark/ .

Please direct any questions by email to ck...@vitessedata.com .

Thank you for your help.

--
CK Tan
Vitesse Data, Inc.
Re: [HACKERS] Vitesse DB call for testing
Merlin, glad you tried it.

We take the query plan exactly as given by the planner and decide whether to JIT or to punt depending on the cost. If we punt, it goes back to the pg executor. If we JIT and cannot proceed (usually because of some operators we haven't implemented yet), we again punt. Once we are able to generate the code, there is no going back; we call into LLVM to obtain the function entry point, and run it to completion.

The 3% improvement you see in OLTP tests is definitely noise.

The bigint sum,avg,count case in the example you tried has some optimization. We use int128 to accumulate the bigint instead of numeric as in pg. Hence the big speedup. Try the same query on int4 for the improvement where both pg and Vitesse DB are using int4 in the execution.

The speedup is really noticeable when the data type is non-varlena. In the varlena cases, we still call into pg routines most of the time. Again, try the sum,avg,count query on numeric, and you will see what I mean.

Also, we don't support UDFs at the moment, so all queries involving UDFs get sent to the pg executor.

On your question of the 32k page size, the rationale is that some of our customers could be interested in a data warehouse on pg. A 32k page size is a big win when all you do is seqscan all day long.

We are looking for bug reports at this stage, and for some stress tests done without our own prejudices. Tests on real data in a non-prod setting, on queries that are highly CPU bound, would be ideal.

Thanks,
-cktan

On Oct 17, 2014, at 6:43 AM, Merlin Moncure mmonc...@gmail.com wrote:

> On Fri, Oct 17, 2014 at 8:14 AM, Merlin Moncure mmonc...@gmail.com wrote:
>
>> On Fri, Oct 17, 2014 at 7:32 AM, CK Tan ck...@vitessedata.com wrote:
>>
>>> Hi everyone,
>>>
>>> Vitesse DB 9.3.5.S is Postgres 9.3.5 with an LLVM-JIT query executor
>>> designed for compute-intensive OLAP workloads. We have gotten it to a
>>> reasonable state and would like to open it up to the pg hackers
>>> community for testing and suggestions.
>>>
>>> Vitesse DB offers:
>>> -- JIT compilation for compute-intensive queries
>>> -- CSV parsing with SSE instructions
>>> -- 100% binary compatibility with PG 9.3.5.
>>>
>>> Our results show CSV imports run up to 2X faster, and TPCH Q1 runs 8X
>>> faster. Our TPCH 1GB benchmark results are also available at
>>> http://vitessedata.com/benchmark/ .
>>>
>>> Please direct any questions by email to ck...@vitessedata.com .
>>
>> You offer a binary with 32k block size... what's the rationale for
>> that? (sorry for the double post)
>
> OK, I downloaded the ubuntu binary and ran your benchmarks (after
> making some minor .conf tweaks like disabling SSL). I then ran your
> benchmark (after fixing the typo) of the count/sum/avg test -- *and
> noticed a 95% reduction in runtime performance* which is really quite
> amazing IMNSHO. I also ran a select-only test on small scale factor
> pgbench and didn't see any regression there -- in fact you beat stock
> by ~3% (although this could be measurement noise). So now you've got my
> attention.
>
> So, if you don't mind, quit being coy and explain how the software
> works and all the neat things it does and doesn't do.
>
> merlin
Re: [HACKERS] Vitesse DB call for testing
Happy to contribute to that decision :-)

On Fri, Oct 17, 2014 at 11:35 AM, Tom Lane t...@sss.pgh.pa.us wrote:

> Andres Freund and...@2ndquadrant.com writes:
>
>> On 2014-10-17 13:12:27 -0400, Tom Lane wrote:
>>
>>> Well, that's pretty much cheating: it's too hard to disentangle what's
>>> coming from JIT vs what's coming from using a different accumulator
>>> datatype. If we wanted to depend on having int128 available we could
>>> get that speedup with a couple hours' work.
>>
>> I think doing that when configure detects int128 would make a great
>> deal of sense.
>
> Yeah, I was wondering about that myself: use int128 if available, else
> fall back on existing code path.
>
> regards, tom lane
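The "use int128 if available, else fall back" idea can be sketched as below. This is a hedged illustration, not PostgreSQL's int8 aggregate code: `__SIZEOF_INT128__` is the usual GCC/Clang compile-time probe that a configure test could key on, and `sum_bigints` is a made-up helper standing in for the transition function.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: accumulate bigint sums in a native 128-bit integer instead of
 * allocating numeric per row. The fallback typedef is a placeholder for
 * the existing numeric-based code path. Names are illustrative. */
#ifdef __SIZEOF_INT128__
typedef __int128 sumacc_t;
#else
typedef long long sumacc_t;   /* stands in for the numeric fallback */
#endif

static sumacc_t sum_bigints(const int64_t *vals, int n)
{
    sumacc_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += vals[i];       /* plain integer add: no per-row palloc */
    return acc;
}
```

The win comes from replacing a per-row numeric addition (with its allocation and digit arithmetic) with a single machine add, while 128 bits of headroom makes overflow of summed int64 values a non-issue in practice.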
Re: [HACKERS] Vitesse DB call for testing
Indeed! A big part of our implementation is based on the Neumann paper. There are also a few other papers that influenced our implementation:

- A. Ailamaki, D. DeWitt, M. Hill, D. Wood. DBMSs On A Modern Processor: Where Does Time Go?
- Peter Boncz, Marcin Zukowski, Niels Nes. MonetDB/X100: Hyper-Pipelining Query Execution
- M. Zukowski et al. Super-Scalar RAM-CPU Cache Compression

Of course, we needed to adapt a lot of the design to Postgres to make something that could stand up harmoniously within the Postgres system, and also to take care that we would be able to merge easily with future versions of Postgres -- the implementation needs to be as non-invasive as possible.

Regards,
-cktan

On Fri, Oct 17, 2014 at 8:40 PM, David Gould da...@sonic.net wrote:

> On Fri, 17 Oct 2014 13:12:27 -0400 Tom Lane t...@sss.pgh.pa.us wrote:
>
>> CK Tan ck...@vitessedata.com writes:
>>
>>> The bigint sum,avg,count case in the example you tried has some
>>> optimization. We use int128 to accumulate the bigint instead of
>>> numeric as in pg. Hence the big speedup. Try the same query on int4
>>> for the improvement where both pg and Vitesse DB are using int4 in
>>> the execution.
>>
>> Well, that's pretty much cheating: it's too hard to disentangle what's
>> coming from JIT vs what's coming from using a different accumulator
>> datatype. If we wanted to depend on having int128 available we could
>> get that speedup with a couple hours' work.
>>
>> But what exactly are you compiling here? I trust not the actual data
>> accesses; that seems far too complicated to try to inline.
>>
>> regards, tom lane
>
> I don't have any inside knowledge, but from the presentation given at
> the recent SFPUG followed by a bit of google-fu I think these papers
> are relevant:
>
> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
> http://sites.computer.org/debull/A14mar/p3.pdf
>
> -dg
>
> --
> David Gould              510 282 0869   da...@sonic.net
> If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Seq scans roadmap
Hi All,

COPY/INSERT are also bottlenecked on record-at-a-time insertion into heap, and on checking for pre-insert triggers, post-insert triggers, and constraints. To speed things up, we really need to special-case insertions without triggers and constraints [probably still allowing for unique constraints], and make these insertions go into the heap N tuples at a time. With this change comes the benefit of optimizing the REDO log to log multiple inserts in one WAL record, or even logging a whole new heap page that gets filled in a single WAL record. Those with triggers and other constraints would still have to go in one at a time because of the trigger/constraint semantics.

It seems to me that dirty pages should be written out by the bgwriter instead of circumventing it using the ring buffer. If it is slow, we should change the bgwriter.

-cktan

On May 12, 2007, at 8:42 AM, Luke Lonergan wrote:

> Hi Simon,
>
> On 5/12/07 12:35 AM, Simon Riggs [EMAIL PROTECTED] wrote:
>
>> I'm slightly worried that the results for COPY aren't anywhere near as
>> good as the SELECT and VACUUM results. It isn't clear from those
>> numbers that the benefit really is significant.
>
> COPY is bottlenecked on datum formation and format translation with
> very low performance, so I don't think we should expect the ring buffer
> to make much of a dent.
>
> - Luke

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at
http://www.postgresql.org/about/donate
Re: [HACKERS] Seq scans roadmap
Sorry, I should have been clearer. I meant that because we need to check for trigger firing pre/post insertion, and the trigger definitions expect tuples to be inserted one by one, we cannot insert N tuples at a time into the heap. Checking for triggers itself is not taking up much CPU at all. If we could predetermine that there are not any triggers for a relation, inserts into that relation could then follow a different path that inserts N tuples at a time.

Regards,
-cktan

On May 13, 2007, at 4:54 PM, Tom Lane wrote:

> CK Tan [EMAIL PROTECTED] writes:
>
>> COPY/INSERT are also bottlenecked on record-at-a-time insertion into
>> heap, and on checking for pre-insert triggers, post-insert triggers,
>> and constraints. To speed things up, we really need to special-case
>> insertions without triggers and constraints [probably still allowing
>> for unique constraints],
>
> Do you have any profiling data to back up these assertions? I haven't
> noticed that firing zero triggers takes any visible percentage of COPY
> time.
>
> regards, tom lane
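The proposed fast path can be sketched as a toy dispatch in C. This is an illustration of the idea only: all names (ToyRel, copy_tuples, heap_insert_batch) are hypothetical, not PostgreSQL's heapam API, and the batch size is an arbitrary stand-in for "N tuples at a time".

```c
#include <assert.h>

#define BATCH_SIZE 64  /* illustrative N; not a PostgreSQL constant */

/* Toy relation: when there are no triggers or constraints, tuples are
 * buffered and flushed to the heap N at a time; otherwise each tuple
 * goes in individually so per-row trigger semantics are preserved. */
typedef struct ToyRel {
    int has_triggers;    /* predetermined at COPY start */
    int heap_inserts;    /* heap-insert calls issued (one per WAL record) */
    int tuples_stored;
} ToyRel;

static void heap_insert_batch(ToyRel *rel, int ntuples)
{
    rel->heap_inserts++;          /* one heap call / WAL record per batch */
    rel->tuples_stored += ntuples;
}

static void copy_tuples(ToyRel *rel, int ntuples)
{
    if (rel->has_triggers) {
        /* slow path: one-by-one, so triggers can fire per row */
        for (int i = 0; i < ntuples; i++)
            heap_insert_batch(rel, 1);
    } else {
        /* fast path: buffer and flush BATCH_SIZE tuples at a time */
        int buffered = 0;
        for (int i = 0; i < ntuples; i++) {
            if (++buffered == BATCH_SIZE) {
                heap_insert_batch(rel, buffered);
                buffered = 0;
            }
        }
        if (buffered > 0)
            heap_insert_batch(rel, buffered);
    }
}
```

The point of the sketch is the ratio: a trigger-free COPY of 200 rows issues 4 heap calls instead of 200, which is where the WAL-record batching win would come from.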
Re: [HACKERS] Seq scans roadmap
The patch has no effect on scans that do updates. The KillAndReadBuffer routine does not force out a buffer if the dirty bit is set, so updated pages revert to the current performance characteristics.

-cktan
Greenplum, Inc.

On May 10, 2007, at 5:22 AM, Heikki Linnakangas wrote:

> Zeugswetter Andreas ADI SD wrote:
>
>>> In reference to the seq scans roadmap, I have just submitted a patch
>>> that addresses some of the concerns.
>>>
>>> The patch does this:
>>> 1. for small relations (smaller than 60% of bufferpool), use the
>>>    current logic
>>> 2. for big relations:
>>>    - use a ring buffer in heap scan
>>>    - pin first 12 pages when scan starts
>>>    - on consumption of every 4 pages, read and pin the next 4 pages
>>>    - invalidate used pages of the scan so they do not force out other
>>>      useful pages
>>
>> A few comments regarding the effects: I do not see how this speedup
>> could be caused by readahead, so what are the effects?
>
> I was wondering that as well. We'd really need to test all the changes
> separately to see where the improvements are really coming from. Also,
> that patch doesn't address the VACUUM issue at all. And using a small
> fixed-size ring with scans that do updates can be devastating.
>
> I'm experimenting with different ring sizes for COPY at the moment. Too
> small a ring leads to a lot of WAL flushes; it's basically the same
> problem we have with VACUUM in CVS HEAD.
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Seq scans roadmap
Sorry, a 16 x 8K page ring is too small indeed. The reason we selected 16 is that Greenplum DB runs on a 32K page size, so we are indeed reading 128K at a time. The number of pages in the ring should be made relative to the page size, so you achieve 128K per read.

Also agreed that KillAndReadBuffer could be split into KillPinDontRead() and ReadThesePinnedPages() functions. However, we are thinking of AIO and would rather see a ReadNPagesAsync() function.

-cktan
Greenplum, Inc.

On May 10, 2007, at 3:14 AM, Zeugswetter Andreas ADI SD wrote:

>> In reference to the seq scans roadmap, I have just submitted a patch
>> that addresses some of the concerns.
>>
>> The patch does this:
>> 1. for small relations (smaller than 60% of bufferpool), use the
>>    current logic
>> 2. for big relations:
>>    - use a ring buffer in heap scan
>>    - pin first 12 pages when scan starts
>>    - on consumption of every 4 pages, read and pin the next 4 pages
>>    - invalidate used pages of the scan so they do not force out other
>>      useful pages
>
> A few comments regarding the effects:
>
> I do not see how this speedup could be caused by readahead, so what are
> the effects? (It should make no difference to do the CPU work for
> count(*) in between reading each block when the pages are not dirtied.)
> Is the improvement solely reduced CPU because no search for a free
> buffer is needed, and/or L2 cache locality?
>
> What effect does the advance pinning have -- avoiding vacuum?
>
> A 16 x 8K page ring is too small to allow the needed IO blocksize of
> 256K. The readahead is done 4 x one page at a time (= 32K). What is the
> reasoning behind 1/4 of the ring for readahead (why not 1/2)? Is 3/4
> the trail for followers and bgwriter?
>
> I think, in anticipation of doing a single IO call for more than one
> page, the KillAndReadBuffer function should be split into two parts:
> one that does the killing for n pages, and one that does the reading
> for n pages. Killing n before reading n would also have the positive
> effect of grouping perhaps-needed writes (not interleaving them with
> the reads).
>
> I think the 60% of NBuffers is a very good starting point. I would only
> introduce a GUC when we see evidence that it is needed (I agree with
> Simon's partitioning comments, but I'd still wait and see).
>
> Andreas
Re: [HACKERS] Seq scans roadmap
Hi,

In reference to the seq scans roadmap, I have just submitted a patch that addresses some of the concerns.

The patch does this:
1. for small relations (smaller than 60% of bufferpool), use the current logic
2. for big relations:
   - use a ring buffer in heap scan
   - pin first 12 pages when scan starts
   - on consumption of every 4 pages, read and pin the next 4 pages
   - invalidate used pages of the scan so they do not force out other useful pages

4 files changed: bufmgr.c, bufmgr.h, heapam.c, relscan.h

If there is interest, I can submit another scan patch that returns N tuples at a time, instead of the current one-at-a-time interface. This improves code locality and further improves performance by another 10-20%.

For TPCH 1G tables, we are seeing more than 20% improvement in scans on the same hardware.

- PATCHED VERSION -
gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2117.025 ms

- ORIGINAL CVS HEAD VERSION -
gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2722.441 ms

Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.

On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:

> Luke Lonergan wrote:
>
>>> What do you mean with using readahead inside the heapscan? Starting
>>> an async read request?
>>
>> Nope - just reading N buffers ahead for seqscans. Subsequent calls use
>> previously read pages. The objective is to issue contiguous reads to
>> the OS in sizes greater than the PG page size (which is much smaller
>> than what is needed for fast sequential I/O).
>
> Are you filling multiple buffers in the buffer cache with a single
> read-call? The OS should be doing readahead for us anyway, so I don't
> see how just issuing multiple ReadBuffers one after each other helps.
>
>> Yes, I think the ring buffer strategy should be used when the table
>> size is > 1 x bufcache, and the ring buffer should be of a fixed size
>> smaller than L2 cache (32KB - 128KB seems to work well).
>
> I think we want to let the ring grow larger than that for updating
> transactions and vacuums, though, to avoid the WAL flush problem.
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
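The ring-buffer readahead scheme from the patch description (pin 12 pages up front, read and pin 4 more for every 4 consumed) can be modeled as a toy in C. This is a sketch of the bookkeeping only, with illustrative names rather than the bufmgr.c API; it shows that the number of pinned pages stays bounded by the initial prefetch, so a big seqscan cannot force out other useful pages.

```c
#include <assert.h>

#define RING_PREFETCH 12  /* pages pinned when the scan starts */
#define RING_REFILL    4  /* pages read for every 4 consumed */

typedef struct RingScan {
    int next_to_read;  /* next page number to issue a read for */
    int next_to_use;   /* next page the scan will consume */
    int npages;        /* total pages in the relation */
    int pinned;        /* pages currently held in the ring */
    int max_pinned;    /* high-water mark, to show the bound */
} RingScan;

static void ring_read(RingScan *rs, int n)
{
    while (n-- > 0 && rs->next_to_read < rs->npages) {
        rs->next_to_read++;
        rs->pinned++;
        if (rs->pinned > rs->max_pinned)
            rs->max_pinned = rs->pinned;
    }
}

static void ring_start(RingScan *rs, int npages)
{
    rs->next_to_read = rs->next_to_use = 0;
    rs->npages = npages;
    rs->pinned = rs->max_pinned = 0;
    ring_read(rs, RING_PREFETCH);        /* pin first 12 pages */
}

/* Returns the next page number, or -1 when the scan is done. */
static int ring_next_page(RingScan *rs)
{
    if (rs->next_to_use >= rs->npages)
        return -1;
    int page = rs->next_to_use++;
    rs->pinned--;                        /* used page leaves the ring */
    if (rs->next_to_use % RING_REFILL == 0)
        ring_read(rs, RING_REFILL);      /* refill 4 pages at a time */
    return page;
}
```

However large the relation, the model never holds more than the initial 12 pages pinned, which is the property that keeps the scan from evicting the rest of the buffer pool.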