On Tue, 9 Jan 2024, 09:23 Peter Eisentraut, <pe...@eisentraut.org> wrote: > > On 04.01.24 00:23, Matthias van de Meent wrote: > > Something like the attached? It splits out into the following > > 0001: basic 'omit default values' > > /* Write an integer field (anything written as ":fldname %d") */ > -#define WRITE_INT_FIELD(fldname) \ > +#define WRITE_INT_FIELD_DIRECT(fldname) \ > appendStringInfo(str, " :" CppAsString(fldname) " %d", > node->fldname) > +#define WRITE_INT_FIELD_DEFAULT(fldname, default) \ > + ((node->fldname == default) ? (0) : WRITE_INT_FIELD_DIRECT(fldname)) > +#define WRITE_INT_FIELD(fldname) \ > + WRITE_INT_FIELD_DEFAULT(fldname, 0) > > Do we need the _DIRECT macros at all? Could we not combine that into > the _DEFAULT ones?
I was planning on using them to reduce the size of generated code for select fields that we know we will always serialize, but then later decided against doing that in this patch as it'd add even more arbitrary annotations to nodes. This is a leftover from that. > I think the way the conditional operator (?:) is written is not > technically correct C, > [...] > I think it would be better to write this > in a more straightforward way like > > #define WRITE_INT_FIELD_DEFAULT(fldname, default) \ > do { \ > [...] > while (0) > > Relatedly, this > > +/* a scaffold function to read an optionally-omitted field */ > [...] > would need to be written with a do { } while (0) wrapper around it. I'll fix that. > > 0002: reset location and other querystring-related node fields for all > > catalogs of type pg_node_tree. > > This goal makes sense, but I think it can be done in a better way. If > you look into the area of stringToNode(), stringToNodeWithLocations(), > and stringToNodeInternal(), there already is support for selectively > resetting or omitting location fields. Notably, this works with the > existing automated knowledge of where the location fields are and > doesn't require a new hand-maintained table. I think the way forward > here would be to apply a similar toggle to nodeToString() (the reverse). I'll try to work something out for that. > > 0003: add default marking on typmod fields. > > 0004 & 0006: various node fields marked with default() based on > > observed common or initial values of those fields > > I think we could get about half the benefit here more automatically, by > creating a separate type for typmods, like > > typedef int32 TypMod; > > and then having the node support automatically generate the > serialization support with a -1 default. Hm, I suspect that the code churn for that would be significant. I'd also be confused when the type in storage (pg_attribute, pg_type's typtypmod) is still int32 when it would be TypMod only in nodes. > (A similar thing could be applied to the location fields, which would > allow us to get rid of the current hack of parsing out the name.) I suppose so. > Most of the other defaults I'm doubtful about. First, we are colliding > here between the goals of minimizing the storage size and making the > debug output more readable. I've never really wanted to make the output "more readable". The current one is too verbose, yes. > If a Query dump would now omit the > commandType field if it is CMD_SELECT, I think that would be widely > confusing, and one would need to check the source code to identify the > reason. AFAIK, SELECT is the only command type you can possibly store in a view (insert/delete/update/utility are all invalid there, and while I'm not fully certain about MERGE, I'd say it's certainly a niche). > Also, what if we later decide to change a "default" for a > field. Then output between version would differ. Of course, node > output does change between versions in general, but these kinds of > differences would be confusing. I've not heard of anyone trying to read and compare the contents of pg_node_tree manually where they're not trying to debug some deep-nested issue. Note > Second, this relies on hand-maintained > annotations that were created by you presumably through a combination of > intuition and testing, based on what is in the template database. Do we > know whether this matches real-world queries created by users later? No, or at least I don't know this for certain. But I think it's a good start. > Also, my experience dealing with the node support over the last little > while is that these manually maintained exceptions get ossified and > outdated and create a maintenance headache for the future. I'm not sure what headache this would become. nodeToString is a fairly straightforward API with (AFAIK) no external dependencies, where only nodes go in and out. The metadata on top of that will indeed require some maintenance, but AFAIK only in the areas that read and utilize said metadata. While it certainly wouldn't be great if we didn't have this metadata, it'd be no worse than not having compression. > > 0005: truncate trailing 0s from outDatum > > Does this significantly affect anything other than the "name" type? > User views don't usually use the "name" type, so this would have limited > impact outside of system views. It saves a few bytes each on byval types like bool, oid, and int on little-endian systems, as they don't utilize the latter bytes of the 4- or 8-byte Datum. At least in the default catalog this shaves some bytes off. > > 0007 (new): do run-length + gap coding for bitmapset and the various > > integer list types. This saves a surprising amount of bytes. > > Can you show examples of this? How would this affects the ability to > manually interpret the output? The ability to interpret the results manually is somewhat reduced for complex cases (bitmaps), but things like RangeTableEntries are significantly reduced in size because of this. A good amount of IntegerLists is reduced to (i 1 +10) instead of (i 1 2 3 4 5 ... 11). Specifically notable are the joinleftcols/joinrightcols fields, as they will often contain large lists of joined columns when many tables are joined together. While bitmaps are less prevalent/large, they also benefit from this optimization. As for bitmapsets, the use of differential coding saves bytes when the set is large or otherwise has structure: the bitmapset of uneven numbers (b 1 3 5 7 ... 23 25 27 ... 101 103 ...) takes up more space (and is less compressible than) the equivalent differential coded (b 1 2 2 2 2 ...). This is at the cost of direct readability, but I think that's worth it. Kind regards, Matthias van de Meent Neon (https://neon.tech)