My name is Beth Jen and I'm a Summer of Code student currently adding a hash-based implementation of DISTINCT to PostgreSQL. My prototype is near completion, and the current design is as follows:
I've created a new exec node that uses hash table functions in execGrouping. The node simply sticks distinct values into the hash table and then returns them, else it discards the value. The idea was to then have the optimizer consider using this node instead of the current sort -> unique combination.
However, I've run into an issue where Jonah and Bruce suggested I post the question to the hackers list.
Right now, the distinct clause adds its targets to the sort clause list when it is parsed. This causes an automatic insertion of the sort node into the query plan before the application of the unique node. The hash-based implementation however is meant to bypass the need to sort. I could just remove this action, but the optimizer should only consider using the hash-based implementation, it should still use the sort -> unique implementation when necessary. (Not to mention, this sort -> unique implementation is used in other cases where unique values are needed, such as in unions.) Therefore, I need to be able to somehow either move this component from the parser into the optimizer, or prevent the creation of a sort node when the query planner chooses to use the hash-based implementation.
What are your suggestions for going about this? Are these approaches feasible without a significant restructuring of the code? Are there any other approaches I should consider?
Thank you for your assistance.
-Beth