Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

2016-12-12 Thread CK Tan
On Mon, Dec 12, 2016 at 6:14 PM, Andres Freund  wrote:
>
>
> For Q1 I think the bigger win is JITing the transition function
> invocation in advance_aggregates/transition_function - that's IIRC where
> the biggest bottleneck lies.
>

Yeah, we bundle the agg core into our expr work... no point otherwise, since
we do it for OLAP.

As for experience, I think you have found out for yourself. There is a lot
that can be done, and heuristics are involved in many places to decide whether
to JIT fully, partially, or not at all.  But it looks like you have a solid
basis now to proceed and explore the beyond :-)

Send me private email if you have a particular question.

Regards,
-cktan


Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

2016-12-12 Thread CK Tan
Andres,


> dev (no jiting):
> Time: 30343.532 ms

> dev (jiting):
> SET jit_tuple_deforming = on;
> SET jit_expressions = true;
>
> Time: 24439.803 ms

FYI, a ~20% improvement for TPCH Q1 is consistent with what we find when we
only JIT expressions.

Cheers,
-cktan


[HACKERS] left, right, full sort merge join plan

2016-02-02 Thread CK Tan
Hi Hackers,

I am looking for some help in creating LEFT/RIGHT/FULL sort-merge joins.
Does anyone have a complete and reproducible script that would generate
those plans? Can I find it in the regression test suite? If not, how do you
exercise those code paths for QA purposes?

Thanks!

-cktan


Re: [HACKERS] Memory Accounting v11

2015-07-02 Thread CK Tan
On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.


 Meh. HashAgg could track its memory usage without loading the entire
 system with a penalty.


 +1 to a solution like that, although I don't think that's doable without
 digging the info from memory contexts somehow.


I am sorry to ask questions unrelated to the subject, but how is tracking
memory going to fix the HashAgg blow up problem? Is there a plan to make
HashAgg not blow up (i.e. spill the hash table)?
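
For concreteness, a minimal sketch with invented names (not PostgreSQL code) of
the kind of per-node accounting plus spill decision being discussed here:

#include <stdio.h>

/* The aggregate adds up the bytes it allocates for its hash table and,
 * once past a work_mem-style limit, stops growing the table and spills
 * new groups instead.  Purely illustrative. */
typedef struct
{
    size_t bytes_used;   /* bytes accounted to this HashAgg node */
    size_t bytes_limit;  /* e.g. work_mem expressed in bytes */
} AggMemAccount;

/* Returns 1 when the node should spill new groups to disk instead of growing. */
static int agg_account_alloc(AggMemAccount *acct, size_t nbytes)
{
    acct->bytes_used += nbytes;
    return acct->bytes_used > acct->bytes_limit;
}

int main(void)
{
    AggMemAccount acct = {0, 4 * 1024 * 1024};    /* pretend work_mem = 4MB */
    int spilling = 0;

    for (int group = 0; group < 100000 && !spilling; group++)
        spilling = agg_account_alloc(&acct, 64);  /* ~64 bytes per new group */

    printf("spilling=%d after %zu bytes\n", spilling, acct.bytes_used);
    return 0;
}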

Thanks,
-cktan



On Thu, Jul 2, 2015 at 4:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:


 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.


 Meh. HashAgg could track its memory usage without loading the entire
 system with a penalty.


 +1 to a solution like that, although I don't think that's doable without
 digging the info from memory contexts somehow.


 Jeff is right, we desperately need a solution and this is the place to
 start.

 Tom's concern remains valid: we must not load the entire system with a
 penalty.


 The only questions I have are:

 * If the memory allocations adapt to the usage pattern, then we expect to
 see few memory chunk allocations. Why are we expecting the entire system
 to experience a penalty?

 * If we do not manage our resources, how are we certain this does not
 induce a penalty? Not tracking memory could be worse than tracking it.

 --
 Simon Riggs  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] On columnar storage

2015-06-15 Thread CK Tan
http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf

In sketch:

There is the concept of a Write-Optimized Store (WOS) and a
Read-Optimized Store (ROS), and a TupleMover that moves records from WOS to
ROS (somewhat like vacuum), and from ROS to WOS for updates. It seems to
me that heap is naturally a WOS, and only vacuuming for a column-backed
heap table would move records from the heap into the column store. Of
course, there would need to be a deeper vacuum when the column store itself
needs to be vacuumed.

When a record in column store needs to be updated, a top-level transaction
moves the record into the heap by marking the row as deleted in the column
store and inserting the record into the heap store. The updates could then
proceed according to the current heap transactional logic.
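
To make the flow concrete, here is a toy, self-contained sketch; the type and
function names are invented purely for illustration and are not a proposed API:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the update flow described above -- invented names only. */
typedef struct { int value; bool deleted; } ColStoreRow;  /* ROS entry */
typedef struct { int value; } HeapRow;                    /* WOS entry */

#define NROWS 3
static ColStoreRow colstore[NROWS] = {{10, false}, {20, false}, {30, false}};
static HeapRow     heap[NROWS];
static int         heap_nrows = 0;

/* UPDATE of a row that currently lives in the column store:
 *   1. mark it deleted in the column store (top-level transaction),
 *   2. re-insert the old version into the heap,
 *   3. run the ordinary heap UPDATE against that heap copy, so the
 *      existing heap transactional logic is reused untouched. */
static void update_colstore_row(int rid, int newval)
{
    colstore[rid].deleted = true;                 /* step 1 */
    heap[heap_nrows].value = colstore[rid].value; /* step 2 */
    heap[heap_nrows].value = newval;              /* step 3: heap update */
    heap_nrows++;
}

int main(void)
{
    update_colstore_row(1, 25);
    printf("colstore row 1 deleted=%d, heap rows=%d, heap value=%d\n",
           colstore[1].deleted, heap_nrows, heap[0].value);
    return 0;
}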

I am not sure if this makes sense, but it seems plausible and

1/ retains the heap transactional logic code, which is very hard to get right
2/ makes the column store essentially a storage optimization that users do not
need to be too concerned with; the heap is kept small and old data are moved
into the column store automatically
3/ no need to keep 20+ bytes of visibility info on the rows in the column store
4/ instead of a column store, this could be a heap (without visibility) store
if you prefer rows

I haven't thought about the indexing aspect of this. From a DW angle, I am
more interested in a heap store that is backed by multiple column stores
via partition keys.

Regards,
-cktan



On Mon, Jun 15, 2015 at 12:02 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Fri, Jun 12, 2015 at 10:58 PM, Alvaro Herrera alvhe...@2ndquadrant.com
 wrote:
 
  Amit Kapila wrote:
   On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera 
 alvhe...@2ndquadrant.com
   wrote:
One critical detail is what will be used to identify a heap row when
talking to a CS implementation.  There are two main possibilities:
   
1. use CTIDs
2. use some logical tuple identifier
   
Using CTIDs is simpler.  One disadvantage is that every UPDATE of a
 row
needs to let the CS know about the new location of the tuple, so that
the value is known associated with the new tuple location as well as
 the
old.  This needs to happen even if the value of the column itself is
 not
changed.
  
   Isn't this somewhat similar to index segment?
 
  Not sure what you mean with index segment.

 The part similar to index segment is reference to heap for visibility
 information and tuple id (TID).  Have I misunderstood something?

   Will the column store obey snapshot model similar to current heap
 tuples,
   if so will it derive the transaction information from heap tuple?
 
  Yes, visibility will be tied to the heap tuple -- a value is accessed
  only when its corresponding heap row has already been determined to be
  visible.

 Won't it be possible that all columns of a table belong to the column-store?
 I think for such a case heap will just be used to store transaction
 information
 (visibility info) for a column store tuple and depending on how the
 column-store is organized, the reference to this information needs to be
 stored in column-store (the same row reference might need to be stored for
 each column value).  Also any write operation could lead to much more
 I/O because of updates at 2 different locations (one in the column-store and
 the other in heap).

   One interesting point that arises from this is about vacuum:
  when are we able to remove a value from the store?

 Yes, that could also be quite tricky to handle; maybe one naive way
 could be to make a list of all TIDs from heap that need to be expired
 and then search for references to all those TIDs in the column-store.

 I understand your point for re-using the existing transaction
 infrastructure
 for column-store by keeping that information in heap as it is done now,
 but I think that won't be free either.

 Another point to consider here is whether the column-store needs
 transactional consistency: are other commercial/open-source column-store
 implementations transactionally consistent, and if yes, then can't we
 think of doing it in a way where data could be present both in the heap as well
 as in the column-store (I understand that it could lead to duplicate data;
 OTOH, such an implementation anyway eliminates the need for indexes,
 so it may be worth considering).


 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Order of columns in query is important?!

2015-05-26 Thread CK Tan
You're right. I misread the problem description.

On Tue, May 26, 2015 at 3:13 AM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 26/05/15 11:59, CK Tan wrote:

 It has to do with the implementation of slot_getattr, which tries to do
 the deform on-demand lazily.

 if you do select a,b,c, the execution would do slot_getattr(1) and
 deform a, and then slot_getattr(2) which reparse the tuple to deform b,
 and finally slot_getattr(3), which parse the tuple yet again to deform c.

 Where as if you do select c, b, a, it would do slot_getattr(3) to deform
 c, and in the process deform a and b in one pass. Subsequent calls to
 slot_getattr 1 and 2 would find the attribute ready and available, and
 return it (without parsing the tuple again).


 If this was the case, changing column order would lead to performance
 increase, not decrease as reported.

 My guess would be the same as Amit's: it's most likely the additional
 projection step.

 --
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Order of columns in query is important?!

2015-05-26 Thread CK Tan
It has to do with the implementation of slot_getattr, which tries to do the
deform on demand, lazily.

If you do select a,b,c, the execution would do slot_getattr(1) and deform
a, then slot_getattr(2), which reparses the tuple to deform b, and
finally slot_getattr(3), which parses the tuple yet again to deform c.

Whereas if you do select c, b, a, it would do slot_getattr(3) to deform c,
and in the process deform a and b in one pass. Subsequent calls to
slot_getattr 1 and 2 would find the attribute ready and available, and
return it (without parsing the tuple again).

For Vitesse X, we mark all columns that are required in the query during
JIT compile, and deform them in one shot. PG should be able to do the same.
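
To see why the access order matters under that (as-described) behaviour, here
is a small self-contained toy model -- not PostgreSQL code -- where attribute n
is deformed by walking the tuple from the start up to n:

#include <stdio.h>

#define NATTS 3

/* Toy slot: attributes 1..nvalid are already deformed; 'passes' counts how
 * many times we walked the raw tuple. */
typedef struct
{
    int raw[NATTS];     /* the "on-disk" tuple */
    int values[NATTS];  /* deformed attribute cache */
    int nvalid;
    int passes;
} ToySlot;

/* Like the slot_getattr behaviour described above: deform lazily, walking
 * the tuple from the first attribute up to the requested one. */
static int toy_getattr(ToySlot *slot, int attnum)
{
    if (attnum > slot->nvalid)
    {
        slot->passes++;                    /* one more walk over the tuple */
        for (int i = 0; i < attnum; i++)
            slot->values[i] = slot->raw[i];
        slot->nvalid = attnum;
    }
    return slot->values[attnum - 1];
}

int main(void)
{
    ToySlot abc = {{1, 2, 3}, {0}, 0, 0};
    ToySlot cba = {{1, 2, 3}, {0}, 0, 0};

    /* select a, b, c: three separate deform passes */
    toy_getattr(&abc, 1); toy_getattr(&abc, 2); toy_getattr(&abc, 3);

    /* select c, b, a: attribute 3 deforms 1..3 in one pass */
    toy_getattr(&cba, 3); toy_getattr(&cba, 2); toy_getattr(&cba, 1);

    printf("a,b,c order: %d passes; c,b,a order: %d passes\n",
           abc.passes, cba.passes);        /* prints 3 vs 1 */
    return 0;
}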

-cktan




On Mon, May 25, 2015 at 2:26 AM, Colin 't Hart co...@sharpheart.org wrote:

 Hi,

 I hope this is the best place to report this or should I be on
 pgsql-general or pgsql-bugs?


 It seems that the order of columns in a query can make a difference in
 execution times.

 In my brief investigation, queries on table(a,b,c,d,e,f,g,h) of the form

 select * from table order by non-indexed-column limit 25;
 select a,b,c,d,e,f,g,h from table order by non-indexed-column limit 25;

 performed the same (approx 1.5 seconds on our customers table --
 rows=514431 width=215), while the query

 select h,g,f,e,d,c,b,a from table order by non-indexed-column limit 25;

 was about 50% slower (approx 2.2 seconds on our customers table).


 I had expected these to perform the same -- to my mind column ordering
 in a query should be purely presentation -- as far as I'm concerned,
 the DBMS can retrieve the columns in a different order as long as it
 displays them in the order I've asked for them. Although, again, the
 order of columns in a resultset in a Java or Python program is mostly
 irrelevant, though when displayed in psql I'd want the columns in the
 order I asked for them.


 Is there really something strange happening here? Or perfectly
 explainable and expected?


 Regards,

 Colin





Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices

2014-11-03 Thread CK Tan
Josh,

Do you have a list of what needs to be done to keep the MONEY type?
What is wrong with it?

Thanks,
-cktan

On Mon, Nov 3, 2014 at 10:30 PM, Feng Tian ft...@vitessedata.com wrote:
 Hi,

 This is Feng from Vitesse.   The performance difference between Money and Numeric
 is *HUGE*.   For TPCH Q1, the performance difference is 5x for stock
 postgres, and ~20x for Vitesse.

 Stock postgres, on my laptop, TPCH 1G Q1: using the money type takes ~9s; using Numeric
 (15, 2) it is ~53s.

 Kevin,
  test=# do $$ begin perform sum('1.01'::numeric) from
 generate_series(1,1000); end; $$;

 This may not reflect the difference between the two data types.   One aggregate is
 not where most of the time is spent.  TPCH Q1 does much more computing.

 On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck michael.ba...@credativ.de
 wrote:

 Am Sonntag, den 02.11.2014, 12:41 -0500 schrieb Tom Lane:
  BTW, after reflecting a bit more I'm less than convinced that this
  datatype is completely useless.  Even if you prefer to store currency
  values in numeric columns, casting to or from money provides a way to
  accept or emit values in whatever monetary format the LC_MONETARY locale
  setting specifies.  That seems like a useful feature, and it's one you
  could not easily duplicate using to_char/to_number (not to mention that
  those functions aren't without major shortcomings of their own).

 As an additional datapoint, Vitesse Data changed the DB schema from
 NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The
 modification to data types is easy to understand -- money and double
 types are faster than Numeric (and no one on this planet has a bank
 account that overflows the money type, not any time soon)."[1] And
 "Replaced NUMERIC fields representing currency with MONEY"[2].

 Not sure whether they modified/optimized PostgreSQL with respect to the
 MONEY data type and/or how much performance that gained, so CCing CK Tan
 as well.


 Michael

 [1]
 http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html
 [2] http://vitessedata.com/benchmark/

 --
 Michael Banck
 Projektleiter / Berater
 Tel.: +49 (2161) 4643-171
 Fax:  +49 (2161) 4643-100
 Email: michael.ba...@credativ.de

 credativ GmbH, HRB Mönchengladbach 12080
 USt-ID-Nummer: DE204566209
 Hohenzollernstr. 133, 41061 Mönchengladbach
 Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer









Re: [HACKERS] Vitesse DB call for testing

2014-10-18 Thread CK Tan
Hi Mark,

Vitesse DB won't be open-sourced, or it would have been a contrib
module to postgres. We should take further discussions off this list.
People should contact me directly if there are any questions.

Thanks,
ck...@vitessedata.com



On Fri, Oct 17, 2014 at 10:55 PM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:
 On 18/10/14 07:13, Josh Berkus wrote:

 CK,

 Before we go any further on this, how is Vitesse currently licensed?
 last time we talked it was still proprietary.  If it's not being
 open-sourced, we likely need to take discussion off this list.


 +1

 Guys, you need to 'fess up on the licensing!

 Regards

 Mark




[HACKERS] Vitesse DB call for testing

2014-10-17 Thread CK Tan
Hi everyone,

Vitesse DB 9.3.5.S is Postgres 9.3.5 with an LLVM-JIT query executor
designed for compute-intensive OLAP workloads. We have gotten it to a
reasonable state and would like to open it up to the pg hackers
community for testing and suggestions.

Vitesse DB offers
-- JIT Compilation for compute-intensive queries
-- CSV parsing with SSE instructions
-- 100% binary compatibility with PG9.3.5.

Our results show CSV imports run up to 2X faster, and TPCH Q1 runs 8X faster.

Our TPCH 1GB benchmark results are also available at
http://vitessedata.com/benchmark/ .

Please direct any questions by email to ck...@vitessedata.com .

Thank you for your help.

--
CK Tan
Vitesse Data, Inc.




Re: [HACKERS] Vitesse DB call for testing

2014-10-17 Thread CK Tan
Merlin, glad you tried it. 

We take the query plan exactly as given by the planner, and decide whether to JIT
or to punt depending on the cost. If we punt, it goes back to the pg executor. If
we JIT, and if we could not proceed (usually because of some operators we haven't
implemented yet), we again punt. Once we are able to generate the code, there
is no going back; we call into LLVM to obtain the function entry point, and run
it to completion. The 3% improvement you see in OLTP tests is definitely noise.

The bigint sum,avg,count case in the example you tried has some optimization.
We use int128 to accumulate the bigint instead of numeric as pg does. Hence the big
speed up. Try the same query on int4 for the improvement where both pg and
vitessedb are using int4 in the execution.
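
To show the accumulator trick in isolation, here is a small sketch using the
GCC/Clang __int128 extension; it is only an illustration, not the Vitesse or
PostgreSQL implementation:

#include <stdint.h>
#include <stdio.h>

/* Accumulate a sum of bigint (int64) values in a 128-bit integer, so no
 * numeric arithmetic and no per-row overflow handling is needed. */
static __int128 sum_bigint(const int64_t *vals, size_t n)
{
    __int128 acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += vals[i];
    return acc;
}

int main(void)
{
    int64_t v[] = {INT64_MAX, INT64_MAX, 42};    /* would overflow int64 */
    __int128 s = sum_bigint(v, 3);

    /* printf has no __int128 conversion, so print the two halves */
    printf("high=%lld low=%llu\n",
           (long long) (s >> 64), (unsigned long long) (uint64_t) s);
    return 0;
}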

The speed up is really noticeable when the data type is non-varlena. In the
varlena cases, we still call into pg routines most of the time. Again, try the
sum,avg,count query on numeric, and you will see what I mean.

Also, we don't support UDFs at the moment, so all queries involving UDFs get
sent to the pg executor.

On your question of the 32k page size, the rationale is that some of our customers
could be interested in a data warehouse on pg. A 32k page size is a big win when
all you do is seqscan all day long.

We are looking for bug reports at this stage and some stress tests done
without our own prejudices. Some tests on real data in a non-prod setting on
queries that are highly CPU bound would be ideal.

Thanks,
-cktan

 On Oct 17, 2014, at 6:43 AM, Merlin Moncure mmonc...@gmail.com wrote:
 
 On Fri, Oct 17, 2014 at 8:14 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, Oct 17, 2014 at 7:32 AM, CK Tan ck...@vitessedata.com wrote:
 Hi everyone,
 
 Vitesse DB 9.3.5.S is Postgres 9.3.5 with a LLVM-JIT query executor
 designed for compute intensive OLAP workload. We have gotten it to a
 reasonable state and would like to open it up to the pg hackers
 community for testing and suggestions.
 
 Vitesse DB offers
 -- JIT Compilation for compute-intensive queries
 -- CSV parsing with SSE instructions
 -- 100% binary compatibility with PG9.3.5.
 
 Our results show CSV imports run up to 2X faster, and TPCH Q1 runs 8X 
 faster.
 
 Our TPCH 1GB benchmark results is also available at
 http://vitessedata.com/benchmark/ .
 
 Please direct any questions by email to ck...@vitessedata.com .
 
 You offer a binary with 32k block size...what's the rationale for that?
 
 (sorry for the double post)
 
 OK, I downloaded the ubuntu binary and ran your benchmarks (after
 making some minor .conf tweaks like disabling SSL).  I then ran your
 benchmark (after fixing the typo) of the count/sum/avg test -- *and
 noticed a 95% reduction in runtime performance* which is really quite
 amazing IMNSHO.  I also ran a select only test on small scale factor
 pgbench and didn't see any regression there -- in fact you beat stock
 by ~ 3% (although this could be measurement noise).   So now you've
 got my attention.  So, if you don't mind, quit being coy and explain
 how the software works and all the neat things it does and doesn't do.
 
 merlin




Re: [HACKERS] Vitesse DB call for testing

2014-10-17 Thread CK Tan
Happy to contribute to that decision :-)


On Fri, Oct 17, 2014 at 11:35 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2014-10-17 13:12:27 -0400, Tom Lane wrote:
 Well, that's pretty much cheating: it's too hard to disentangle what's
 coming from JIT vs what's coming from using a different accumulator
 datatype.  If we wanted to depend on having int128 available we could
 get that speedup with a couple hours' work.

 I think doing that when configure detects int128 would make a great deal
 of sense.

 Yeah, I was wondering about that myself: use int128 if available,
 else fall back on existing code path.

 regards, tom lane




Re: [HACKERS] Vitesse DB call for testing

2014-10-17 Thread CK Tan
Indeed! A big part of our implementation is based on the Neumann
paper. There are also a few other papers that influenced our
implementation:

A. Ailamaki, D. DeWitt, M. Hill, D. Wood. DBMSs On A Modern Processor:
Where Does Time Go?

Peter Boncz, Marcin Zukowski, Niels Nes. MonetDB/X100:
Hyper-Pipelining Query Execution

M. Zukowski et al. Super-Scalar RAM-CPU Cache Compression

Of course, we need to adapt a lot of the design to Postgres to make
something that could stand up harmoniously with the Postgres system,
and also to take care that we would be able to merge easily with
future versions of Postgres -- the implementation needs to be as
non-invasive as possible.

Regards,
-cktan

On Fri, Oct 17, 2014 at 8:40 PM, David Gould da...@sonic.net wrote:
 On Fri, 17 Oct 2014 13:12:27 -0400
 Tom Lane t...@sss.pgh.pa.us wrote:

 CK Tan ck...@vitessedata.com writes:
  The bigint sum,avg,count case in the example you tried has some 
  optimization. We use int128 to accumulate the bigint instead of numeric in 
  pg. Hence the big speed up. Try the same query on int4 for the improvement 
  where both pg and vitessedb are using int4 in the execution.

 Well, that's pretty much cheating: it's too hard to disentangle what's
 coming from JIT vs what's coming from using a different accumulator
 datatype.  If we wanted to depend on having int128 available we could
 get that speedup with a couple hours' work.

 But what exactly are you compiling here?  I trust not the actual data
 accesses; that seems far too complicated to try to inline.

   regards, tom lane



 I don't have any inside knowledge, but from the presentation given at the
 recent SFPUG followed by a bit of google-fu I think these papers are
 relevant:

   http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
   http://sites.computer.org/debull/A14mar/p3.pdf

 -dg

 --
 David Gould  510 282 0869 da...@sonic.net
 If simplicity worked, the world would be overrun with insects.




Re: [HACKERS] Seq scans roadmap

2007-05-13 Thread CK Tan

Hi All,

COPY/INSERT are also bottlenecked on record-at-a-time insertion into
heap, and on checking for pre-insert triggers, post-insert triggers and
constraints.

To speed things up, we really need to special-case insertions without
triggers and constraints [probably allow for unique constraints],
and make these insertions go into the heap N tuples at a time. With
this change comes the benefit of optimizing the REDO log to log multiple
inserts, or even logging a whole new heap page that gets filled in a
single WAL record.

Those with triggers and other constraints would still have to go in
one at a time because of the trigger/constraint semantics.
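
A toy illustration (not PostgreSQL code; the numbers are arbitrary) of the
special case described above -- relations with no triggers and no constraints
take a batched path that needs one WAL record per group of tuples rather than
one per tuple:

#include <stdbool.h>
#include <stdio.h>

#define BATCH 100   /* tuples inserted (and WAL-logged) per batch */

typedef struct { bool has_triggers; bool has_constraints; } Rel;

/* Returns how many WAL records the insert of nrows would emit. */
static int insert_rows(const Rel *rel, int nrows)
{
    if (rel->has_triggers || rel->has_constraints)
        return nrows;                      /* per-row path: trigger semantics */
    return (nrows + BATCH - 1) / BATCH;    /* batched path: N tuples at a time */
}

int main(void)
{
    Rel plain = {false, false};
    Rel trig  = {true,  false};
    printf("plain: %d WAL records, with triggers: %d WAL records\n",
           insert_rows(&plain, 100000), insert_rows(&trig, 100000));
    return 0;
}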


It seems to me that dirty pages should be written out by the bg
writer instead of circumventing it with a ring buffer. If the bg writer is
slow, we should change the bg writer.


-cktan

On May 12, 2007, at 8:42 AM, Luke Lonergan wrote:


Hi Simon,

On 5/12/07 12:35 AM, Simon Riggs [EMAIL PROTECTED] wrote:

I'm slightly worried that the results for COPY aren't anywhere near as
good as the SELECT and VACUUM results. It isn't clear from those numbers
that the benefit really is significant.


COPY is bottlenecked on datum formation and format translation with very low
performance, so I don't think we should expect the ring buffer to make much
of a dent.

- Luke







Re: [HACKERS] Seq scans roadmap

2007-05-13 Thread CK Tan
Sorry, I should have been clearer. I meant that because we need to check
for trigger firing pre/post insertion, and the trigger definitions
expect tuples to be inserted one by one, we therefore cannot insert
N tuples at a time into the heap. Checking for triggers itself is not
taking up much CPU at all. If we could predetermine that there are no
triggers for a relation, inserts into that relation could then
follow a different path that inserts N tuples at a time.


Regards,
-cktan

On May 13, 2007, at 4:54 PM, Tom Lane wrote:


CK Tan [EMAIL PROTECTED] writes:

COPY/INSERT are also bottlenecked on record at a time insertion into
heap, and in checking for pre-insert trigger, post-insert trigger and
constraints.



To speed things up, we really need to special case insertions without
triggers and constraints, [probably allow for unique constraints],


Do you have any profiling data to back up these assertions?  I haven't
noticed that firing zero tuples takes any visible percentage of COPY
time.

regards, tom lane







Re: [HACKERS] Seq scans roadmap

2007-05-10 Thread CK Tan
The patch has no effect on scans that do updates. The
KillAndReadBuffer routine does not force out a buffer if the dirty
bit is set, so updated pages revert to the current performance
characteristics.


-cktan
GreenPlum, Inc.

On May 10, 2007, at 5:22 AM, Heikki Linnakangas wrote:


Zeugswetter Andreas ADI SD wrote:
In reference to the seq scans roadmap, I have just submitted a  
patch that addresses some of the concerns.


The patch does this:

1. for small relation (smaller than 60% of bufferpool), use the  
current logic 2. for big relation:

- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4-page, read and pin the next 4-page
	- invalidate used pages of in the scan so they do not force out  
other useful pages

A few comments regarding the effects:
I do not see how this speedup could be caused by readahead, so  
what are

the effects ?


I was wondering that as well. We'd really need to test all the  
changes separately to see where the improvements are really coming  
from.


Also, that patch doesn't address the VACUUM issue at all. And using  
a small fixed size ring with scans that do updates can be  
devastating. I'm experimenting with different ring sizes for COPY  
at the moment. Too small ring leads to a lot of WAL flushes, it's  
basically the same problem we have with VACUUM in CVS HEAD.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com







Re: [HACKERS] Seq scans roadmap

2007-05-10 Thread CK Tan
Sorry, a 16 x 8K page ring is too small indeed. The reason we selected 16
is because Greenplum DB runs on a 32K page size, so we are indeed
reading 128K at a time. The number of pages in the ring should be made
relative to the page size, so you achieve 128K per read.
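
The arithmetic of sizing the read batch relative to the page size, as a tiny
sketch (the 128K target and page sizes are just the figures from this thread):

#include <stdio.h>

#define TARGET_READ_BYTES (128 * 1024)

/* Number of pages to read (and pin) per batch so that each read is ~128K. */
static int pages_per_read(int blcksz)
{
    return TARGET_READ_BYTES / blcksz;
}

int main(void)
{
    printf("8K pages: %d per read, 32K pages: %d per read\n",
           pages_per_read(8 * 1024), pages_per_read(32 * 1024));
    /* prints 16 and 4 */
    return 0;
}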


Also agreed that KillAndReadBuffer could be split into a
KillPinDontRead() and a ReadThesePinnedPages() function. However, we
are thinking of AIO and would rather see a ReadNPagesAsync() function.


-cktan
Greenplum, Inc.

On May 10, 2007, at 3:14 AM, Zeugswetter Andreas ADI SD wrote:




In reference to the seq scans roadmap, I have just submitted
a patch that addresses some of the concerns.

The patch does this:

1. for small relation (smaller than 60% of bufferpool), use
the current logic 2. for big relation:
- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4-page, read and pin the next 4-page
- invalidate used pages of in the scan so they do not
force out other useful pages


A few comments regarding the effects:

I do not see how this speedup could be caused by readahead, so what  
are

the effects ?
(It should make no difference to do the CPU work for count(*)  
inbetween

reading each block when the pages are not dirtied)
Is the improvement solely reduced CPU because no search for a free
buffer is needed and/or L2 cache locality ?

What effect does the advance pinning have, avoid vacuum?

A 16 x 8k page ring is too small to allow the needed IO blocksize of
256k.
The readahead is done 4 x one page at a time (=32k).
What is the reasoning behind 1/4 ring for readahead (why not 1/2), is
3/4 the trail for followers and bgwriter ?

I think in anticipation of doing a single IO call for more that one
page, the KillAndReadBuffer function should be split into two  
parts. One

that does the killing
for n pages, and one that does the reading for n pages.
Killing n before reading n would also have the positive effect of
grouping perhaps needed writes (not interleaving them with the reads).

I think the 60% Nbuffers is a very good starting point. I would only
introduce a GUC when we see evidence that it is needed (I agree with
Simon's partitioning comments, but I'd still wait and see).

Andreas







Re: [HACKERS] Seq scans roadmap

2007-05-09 Thread CK Tan

Hi,

In reference to the seq scans roadmap, I have just submitted a patch  
that addresses some of the concerns.


The patch does this:

1. for small relation (smaller than 60% of bufferpool), use the  
current logic

2. for big relation:
- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4-page, read and pin the next 4-page
	- invalidate used pages in the scan so they do not force out
other useful pages


4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
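
For readers who want to see the ring logic spelled out, here is a toy
walk-through of the scheme described above (an illustration only, not the
patch itself):

#include <stdio.h>

#define PREFETCH_INIT 12   /* pages pinned when the scan starts */
#define PREFETCH_STEP 4    /* after consuming 4 pages, read+pin the next 4 */

static void scan_relation(int npages)
{
    int pinned_up_to = PREFETCH_INIT < npages ? PREFETCH_INIT : npages;

    printf("pin pages 0..%d\n", pinned_up_to - 1);
    for (int page = 0; page < npages; page++)
    {
        /* ... scan tuples on 'page' ... */

        if ((page + 1) % PREFETCH_STEP == 0)
        {
            /* release used pages so they do not push out other buffers */
            printf("release pages %d..%d", page + 1 - PREFETCH_STEP, page);
            if (pinned_up_to < npages)
            {
                int next = pinned_up_to + PREFETCH_STEP;
                if (next > npages)
                    next = npages;
                printf(", read+pin pages %d..%d", pinned_up_to, next - 1);
                pinned_up_to = next;
            }
            printf("\n");
        }
    }
}

int main(void)
{
    scan_relation(24);   /* pretend a 24-page relation */
    return 0;
}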

If there is interest, I can submit another scan patch that returns
N tuples at a time, instead of the current one-at-a-time interface. This
improves code locality and further improves performance by another
10-20%.


For TPCH 1G tables, we are seeing more than 20% improvement in scans  
on the same hardware.


 
-----------------------------------------------------------------
- PATCHED VERSION
-----------------------------------------------------------------

gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2117.025 ms

-----------------------------------------------------------------
- ORIGINAL CVS HEAD VERSION
-----------------------------------------------------------------

gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2722.441 ms


Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.

On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:


Luke Lonergan wrote:
What do you mean with using readahead inside the heapscan?  
Starting an async read request?
Nope - just reading N buffers ahead for seqscans.  Subsequent  
calls use

previously read pages.  The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).


Are you filling multiple buffers in the buffer cache with a single  
read-call? The OS should be doing readahead for us anyway, so I  
don't see how just issuing multiple ReadBuffers one after each  
other helps.


Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).


I think we want to let the ring grow larger than that for updating  
transactions and vacuums, though, to avoid the WAL flush problem.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com





