On Mon, May 9, 2016 at 8:34 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> It's interesting that you mention this. We identified this as a pain
> point during our work on column stores last year. Simply passing
> single tuples around the executor is really unfriendly towards L1
> instruction cache, plus also the points you mention about L3 cache and
> hash tables and tuple stores. I really think that we're likely to see
> significant gains by processing >1 tuple at a time, so this topic very
> much interests me.

Cool.  I hope we can work together on it, and with anyone else who is
interested.

> When we start multiplying those increases with the increases from
> something like parallel query, then we're starting to see very nice
> gains in performance.

Check.

> Alvaro, Tomas and I had been discussing this, and late last year I
> looked into what would be required to allow this to happen in
> Postgres. Basically there are two sub-projects; I'll describe what
> I've managed to learn so far about each, and the rough plan that I
> have to implement them:
>
> 1. Batch Execution:
>
> a. Modify the scan API to allow batch tuple fetching in predefined
> batch sizes.
> b. Modify TupleTableSlot to allow more than one tuple to be stored.
> Add a flag to indicate whether the struct contains a single tuple or
> multiple tuples. Multiple tuples may need to be deformed in a
> non-lazy fashion in order to prevent too many buffers from having to
> be pinned at once. Tuples will be deformed into arrays for each
> column rather than arrays for each tuple (this part is important to
> support the next sub-project).
> c. Modify some nodes (perhaps starting with nodeAgg.c) to allow them
> to process a batch TupleTableSlot. This will require a tight loop to
> aggregate the entire TupleTableSlot at once before returning.
> d. Add a function in execAmi.c which returns true or false depending
> on whether the node supports batch TupleTableSlots.
> e. At executor startup, determine whether the entire plan tree
> supports batch TupleTableSlots; if so, enable batch scan mode.

I'm wondering if we should instead have a whole new kind of node, a
TupleTableVector, say.  Things that want to work one tuple at a time
can continue to do so with no additional overhead.  Things that want
to return batches can do it via this new result type.
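Very roughly, and purely to make the idea concrete, the shape I'm
imagining is something like the sketch below.  Every name here is
invented for the sake of discussion; none of it exists:

#include "postgres.h"

/*
 * Sketch only: a batch-oriented cousin of TupleTableSlot that stores
 * up to "capacity" tuples column-wise, so each attribute's values sit
 * in one contiguous array that a tight loop (or SIMD code) can chew
 * through without deforming tuples one at a time.
 */
typedef struct TupleTableVector
{
    int         natts;      /* attributes per tuple */
    int         ntuples;    /* tuples currently stored in this batch */
    int         capacity;   /* maximum tuples per batch */
    Datum     **values;     /* values[att][i] = attribute att of tuple i */
    bool      **isnull;     /* isnull[att][i] parallels values[att][i] */
} TupleTableVector;

Nodes that know nothing about vectors keep exchanging ordinary slots
exactly as they do today; nodes that do know about them can produce
and consume these instead.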
> [...] node types, which *may* not be all that difficult. I'm also
> assuming that batch mode (in all cases apart from queries with LIMIT
> or cursors) will always be faster than tuple-at-a-time, so it
> requires no costing from the planner.

I definitely agree that we need to consider cases with and without
LIMIT separately, but there's more than one way to get a LIMIT.  For
example, a subquery returns only one row; a semijoin returns only one
row.  I don't think you are going to escape planner considerations.

Nested Loop Semi Join
-> Seq Scan
-> Index Scan on dont_batch_here

> 2. Vector processing
>
> (I admit that I've given this part much less thought so far, but
> here's what I have in mind.)
>
> This depends on batch execution, and is intended to allow the
> executor to perform function calls on an entire batch at once,
> rather than tuple-at-a-time. For example, let's take the following
> query:
>
> SELECT a+b FROM t;
>
> Here (as of now) we'd scan "t" one row at a time and perform a+b
> after having deformed enough of the tuple to do that. We'd then go
> and get another tuple from the scan node and repeat until the scan
> gave us no more tuples.
>
> With batch execution we'd fetch multiple tuples from the scan and
> then perform the call to, say, int4_pl() multiple times, which still
> kinda sucks as it means calling int4_pl() possibly millions of times
> (once per tuple). The vector mode here would require that we modify
> pg_operator to add a vector function for each operator, so that we
> can call the function passing in an array of Datums and a length and
> have SIMD operations perform the addition. We'd call something like
> int4_pl_vector() only once per batch of tuples, allowing the CPU to
> perform SIMD operations on those Datum arrays. This could be done in
> an incremental way, as the code could just call back into the
> standard function in cases where a vectorised version is not
> available. Thought is needed here about when exactly this decision
> is made, as the user may not have permission to execute the vector
> function, so it can't simply be a run-time check. These functions
> would simply return another vector of the results. Aggregates could
> be given a vector transition function, where something like
> COUNT(*)'s vector_transfn would simply do current_count +=
> vector_length.
>
> This project does appear to require that we bloat the code with
> hundreds of vector versions of each function. I'm not quite sure if
> there's a better way to handle this. The problem is that the fmgr is
> pretty much a barrier to SIMD operations, and this was the only idea
> that I've had so far about breaking through that barrier. So further
> ideas here are very welcome.

I don't have any at the moment, but I'm not keen on hundreds of new
vector functions that can all have bugs or behavior differences
versus the unvectorized versions of the same code.  That's a
substantial tax on future development.  I think it's important to
understand what sorts of queries we are targeting here.  KaiGai's
GPU-acceleration stuff does great on queries with complex WHERE
clauses, but most people don't care, not only because it's
out-of-core but because who actually looks for the records where
(a + b) % c > (d + e) * f / g?  This seems like it has the same
issue.  If we can speed up common queries people are actually likely
to run, OK, that's interesting.

By the way, I think KaiGai's GPU-acceleration stuff points to another
pitfall here.  There's other stuff somebody might legitimately want
to do that requires another copy of each function.  For example,
run-time code generation likely needs that (a function to tell the
code generator what to generate for each of our functions), and
GPU-acceleration probably does, too.  If fixing a bug in numeric_lt
requires changing not only the regular version and the vectorized
version but also the GPU-accelerated version and the codegen version,
Tom and Dean are going to kill us.  And justifiably so!  Granted,
nobody is proposing those other features in core right now, but
they're totally reasonable things to want to do.

I suspect the number of queries that are being hurt by fmgr overhead
is really large, and I think it would be nice to attack that problem
more directly.  It's a bit hard to discuss what's worthwhile in the
abstract, without performance numbers, but when you vectorize, how
much of the benefit comes from using SIMD instructions and how much
comes from simply not going through the fmgr for every tuple?
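To put that question in concrete terms, compare the two loops below.
This is only an illustration: int4pl_batch and int4pl_per_tuple are
invented names, and the overflow check that the real int4pl performs
is omitted.

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"     /* int4pl() */

/*
 * Invented for illustration: the vectorized path is one call per
 * batch and one tight loop that the compiler is free to compile
 * down to SIMD instructions.  (int4pl's overflow check is omitted
 * for brevity.)
 */
static void
int4pl_batch(const int32 *a, const int32 *b, int32 *result, int n)
{
    int         i;

    for (i = 0; i < n; i++)
        result[i] = a[i] + b[i];
}

/*
 * Roughly the status quo: one fmgr round trip per tuple, each of
 * which builds up function-call info and makes an indirect call.
 * That setup cost is paid n times instead of once per batch.
 */
static void
int4pl_per_tuple(const int32 *a, const int32 *b, int32 *result, int n)
{
    int         i;

    for (i = 0; i < n; i++)
        result[i] = DatumGetInt32(DirectFunctionCall2(int4pl,
                                      Int32GetDatum(a[i]),
                                      Int32GetDatum(b[i])));
}

If it turns out that most of the win comes from eliminating the
second loop's per-call overhead rather than from SIMD itself, that's
an argument for attacking fmgr overhead directly instead of adding
hundreds of *_vector functions.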
> The idea here is that these 2 projects help pave the way to bring
> columnar storage into PostgreSQL. Without these we're unlikely to
> get much benefit from columnar storage, as we'd still be stuck
> processing rows one at a time. Adding columnar storage on top of the
> above should further increase performance, as we can skip the
> tuple-deform step and pull columnar arrays/vectors directly into a
> TupleTableSlot, although some trickery would be involved here when
> the scan has keys.

I'm a bit mystified by this.  It seems to me that you could push down
the optimizable quals into the AM, just like what index AMs do for
index quals and what postgres_fdw does for pushdown-safe quals.  Then
those quals get executed on the optimized representation, and you
only have to fill TupleTableSlots for the surviving tuples.
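In interface terms, that might look roughly like the following.  This
is a purely hypothetical sketch, invented here to show the shape of
the idea; none of these types or functions exist:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/relcache.h"
#include "executor/tuptable.h"

/*
 * Hypothetical columnar scan interface, for illustration only.  As
 * with index quals and postgres_fdw's pushdown-safe quals, the
 * executor hands the AM whatever quals it can evaluate against its
 * native columnar representation; only tuples surviving those quals
 * are ever materialized into a TupleTableSlot.
 */
typedef struct ColumnarScanDesc ColumnarScanDesc;   /* opaque to callers */

extern ColumnarScanDesc *columnar_beginscan(Relation rel,
                                            List *pushdown_quals);
extern bool columnar_getnextslot(ColumnarScanDesc *scan,
                                 TupleTableSlot *slot);
extern void columnar_endscan(ColumnarScanDesc *scan);

Quals the AM doesn't claim it can handle would be evaluated in the
executor as usual.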
AFAICS, vectorizing the core executor only helps if you want to keep
the data in vectorized form for longer, e.g. to somehow optimize
joins or aggs, or if the data starts out in row-oriented form and we
convert it to columnar form before doing vector ops.  Evidently I'm
confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company