On Mon, 5 Dec 2022 at 02:34, Ankit Kumar Pandey <itsanki...@gmail.com> wrote:
> Interesting problem. Hashtables created in normal aggregates (AGG_HASHED
> mode) may provide some reference as they have hashagg_spill_tuple, but I
> am not sure of any prior implementation of a hashtable with counter and
> spill.
I'm not sure there's much guidance to be gleaned from studying nodeAgg.c. IIRC, that works by deciding up-front how many partitions the hash key space will be split into and then writing tuples out to files based on "hashkey MOD number-of-partitions". At the end of that, you can just aggregate the tuples one partition at a time, since all tuples belonging to a given group are in the same file/partition. The reason this does not seem useful to your case is that you need to be able to quickly look up a given Datum or set of Datums to check whether they are unique. For that, you'd need to reload the hash table every time your lookup lands on a different partition of the hashkey space. I fail to see how that could ever be fast unless there happened to be only 1 partition.

To make that worse, when a tuple goes out of the frame and the counter tracking how many times the Datum(s) appeared reaches 0, you need to write the entire file out again minus that tuple.

Let's say your window function is on a column which is distinct, or *very* close to it, and the window frame moves forward 1 tuple per input tuple. If each subsequent Datum hashes to a different partition, then you're going to need to load the file for that part of the hash key space to check whether the Datum has already been seen, and then evict the out-of-frame tuple from that file, which means reading and rewriting the entire file per input tuple consumed. That'll perform very poorly!

It's possible that you could maybe speed it up a bit with some lossy hash table that sits atop all of this and can only tell you that a given key definitely does *not* exist. You'd then be able to just write that tuple out to the partition without having to read or rewrite the file. It's going to slow down to a crawl when the lossy table contains too many false positives, though.
> Major concern is, if we go through tuplesort route (without order
> by case), would we get handicapped in future if we want order by or more
> features?

Yeah, deciding that before you go down one of the paths is going to be important. I imagine the reason you've not found another database that supports DISTINCT window functions in a window with an ORDER BY clause is that it's very hard to make it perform well in all cases.

Maybe another way to go about it, which would give you less lock-in if we decide to make ORDER BY work later, would be to design some new tuplestore-like data structure that can be defined with a lookup key, so you could ask it whether a given key is stored and get the answer quickly without having to trawl through all stored tuples. It would also need to support the same positional lookups as tuplestore does today so that all the evicting-tuples-from-the-window-frame stuff keeps working as it does now. If you made something like that, then the changes required in nodeWindowAgg.c would be significantly reduced. You'd also have just 1 work_mem limit to abide by instead of having to consider sharing it between a tuplestore and a hashtable/tuplesort.

Maybe as step 1, you could invent keyedtuplestore.c and consume tuplestore's functions but layer on the lossy hashtable idea that I mentioned above. That'd have to be more than just a Bloom filter, as you need a payload with the count of tuples matching the given hashkey MOD nbuckets. If you then had a function like keyedtuplestore_key_definitely_does_not_exist() (can't think of a better name right now), then you can just look up the lossy table, and if there are 0 tuples at that lossy bucket, then you can keyedtuplestore_puttupleslot() from nodeWindowAgg.c. keyedtuplestore_key_definitely_does_not_exist() would have to work much harder if there were >0 tuples with the same lossy hashkey: you'd need to trawl through those tuples and check each one.
Perhaps that could be tuned a bit so that if you get too many collisions, the lossy table is rehashed to a larger size. It's going to fall flat on its face, performance-wise, when the table can't be made any larger due to work_mem constraints.

Anyway, that's a lot of only partially thought-through ideas above. If you're working on a patch like this, you should expect to have to rewrite it a dozen or two times as new ideas arrive. If you're good at using the fact that the new patch is better than the old one as motivation to continue, then you're onto an attitude that is PostgreSQL-community-proof :-) (Thankfully) we're often not good at "let's just commit it now and make it better later".

David