Improving on MAX_CONVERSION_GROWTH

Tom Lane Tue, 24 Sep 2019 14:16:01 -0700

Thinking about the nearby thread[1] about overrunning MaxAllocSize
during encoding conversion, it struck me that another thing
we could usefully do to improve that situation is to be smarter
about what's the growth factor --- the existing one-size-fits-all
choice of MAX_CONVERSION_GROWTH = 4 is leaving a lot on the table.

In particular, it seems like we could usefully frame things as
having a constant max growth factor associated with each target
encoding, stored as a new field in pg_wchar_table[].  By definition,
the max growth factor cannot be more than the maximum character
length in the target encoding.  So this approach immediately gives
us a growth factor of 1 with any single-byte output encoding,
and even many of the multibyte encodings would have max growth 2
or 3 without having to think any harder than that.
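
To make the per-target-encoding idea concrete, here is a minimal sketch in
standalone C.  It does not use the real pg_wchar_table[] definition: the
enc_info struct, the max_growth field, and the helper names are all
hypothetical, and the first-cut growth values are simply set equal to each
encoding's maximum character length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a pg_wchar_table[] entry.  The max_growth
 * field is the proposed addition; all names here are illustrative.
 */
typedef struct
{
	const char *name;
	int			maxmblen;		/* longest character, in bytes */
	int			max_growth;		/* worst-case output bytes per input byte */
} enc_info;

/* First cut: max_growth is simply bounded by maxmblen. */
static const enc_info enc_table[] = {
	{"LATIN1", 1, 1},
	{"SJIS", 2, 2},
	{"EUC_JP", 3, 3},
	{"UTF8", 4, 4},
};

/* Look up an entry by name; returns NULL if not found. */
static const enc_info *
find_encoding(const char *name)
{
	size_t		n = sizeof(enc_table) / sizeof(enc_table[0]);

	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(enc_table[i].name, name) == 0)
			return &enc_table[i];
	}
	return NULL;
}

/* Worst-case output buffer size (including trailing NUL) for srclen bytes. */
static size_t
conversion_buf_size(size_t srclen, const enc_info *dest)
{
	assert(dest != NULL && dest->max_growth >= 1);
	return srclen * (size_t) dest->max_growth + 1;
}
```

With this, conversion_buf_size(100, find_encoding("LATIN1")) asks for 101
bytes rather than the 401 that a fixed multiplier of 4 would require.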

But we can do better, I think, recognizing that all the supported
encodings are ASCII extensions.  The only possible way to expend
4 output bytes per input byte is if there is some 1-byte character
that translates to a 4-byte character, and I think this is not the
case for converting any of our encodings to UTF8.  If you need at
least a 2-byte character to produce a 3-byte or 4-byte UTF8 character,
then a 1-byte character yields at most 2 output bytes and a 2-byte
character at most 4, so UTF8 has max growth 2.  I'm not quite sure if
that's true
for every source encoding, but I'm pretty certain it couldn't be
worse than 3.
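
The arithmetic behind that argument can be sketched as follows.  Given, for
each source character length, the longest output any such character can
produce, the growth factor is the largest per-character ratio, rounded up.
The function name and the sample arrays are hypothetical; the arrays
illustrate the cases discussed above and are not measured from any real
conversion table.

```c
#include <assert.h>

/*
 * max_out_len[i] is the longest output (in bytes) that any source
 * character of i + 1 bytes can produce.  Returns the smallest integer g
 * such that output bytes <= g * input bytes for every character, and
 * hence for every string built from such characters.
 */
static int
max_growth_factor(const int *max_out_len, int nlens)
{
	int			g = 1;

	assert(nlens >= 1);
	for (int i = 0; i < nlens; i++)
	{
		int			in_len = i + 1;
		int			ratio = (max_out_len[i] + in_len - 1) / in_len;	/* ceiling */

		if (ratio > g)
			g = ratio;
	}
	return g;
}

/* Illustrative: 1-byte chars yield <= 2 bytes, 2-byte chars <= 4 bytes. */
static const int two_byte_needed[] = {2, 4};

/* Illustrative: some 1-byte char yields a 3-byte output character. */
static const int one_byte_to_three[] = {3};

/* Illustrative worst case: some 1-byte char yields a 4-byte character. */
static const int one_byte_to_four[] = {4};
```

Here max_growth_factor(two_byte_needed, 2) is 2, while the other two sample
arrays give 3 and 4 respectively.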

It might be worth getting a bit more complex and having a 2-D
array indexed by both source and destination encodings to determine
the max growth factor.  I haven't run tests to verify empirically
what the max growth factors actually are.
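
A 2-D lookup might look roughly like the sketch below.  The enum, table,
and function names are hypothetical, only two encodings are shown, and any
pair without an entry falls back to the existing multiplier of 4.  The
LATIN1/UTF8 entries follow from LATIN1's high-half characters mapping to
2-byte UTF8 sequences.

```c
#include <assert.h>

/* Hypothetical subset of encoding IDs, for illustration only. */
typedef enum
{
	ENC_LATIN1,
	ENC_UTF8,
	ENC_COUNT
} enc_id;

/* Fallback matching the existing MAX_CONVERSION_GROWTH value. */
#define DEFAULT_GROWTH 4

/* growth_table[src][dst]; a zero entry means "not known, use fallback". */
static const int growth_table[ENC_COUNT][ENC_COUNT] = {
	[ENC_LATIN1] = {[ENC_LATIN1] = 1, [ENC_UTF8] = 2},
	[ENC_UTF8] = {[ENC_LATIN1] = 1, [ENC_UTF8] = 1},
};

/* Return the max growth factor for converting src to dst. */
static int
conversion_growth(int src, int dst)
{
	int			g;

	if (src < 0 || src >= ENC_COUNT || dst < 0 || dst >= ENC_COUNT)
		return DEFAULT_GROWTH;
	g = growth_table[src][dst];
	return (g > 0) ? g : DEFAULT_GROWTH;
}
```

Keeping the conservative fallback means an incomplete table can only waste
space, never under-allocate.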

A fly in this ointment is: could a custom encoding conversion
function violate our conclusions about what's the max growth
factor?  Maybe it would be worth treating the growth factor
as a property of a particular conversion (i.e., add a column
to pg_conversion) rather than making it a hard-wired property.

In any case, it seems likely that we could end up with a
multiplier of 1, 2, or 3 rather than 4 in just about every
case of practical interest.  That sure seems like a win
when converting long strings.
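
To put rough numbers on that, the sketch below computes the longest input
whose worst-case output buffer (plus a trailing NUL) still fits under an
allocation limit.  ALLOC_LIMIT is assumed to mirror MaxAllocSize (1 GB - 1),
and max_convertible_len is a hypothetical helper; the exact boundary check
in the server may differ slightly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror MaxAllocSize: 1 gigabyte - 1. */
#define ALLOC_LIMIT ((size_t) 0x3fffffff)

/*
 * Largest srclen such that srclen * growth + 1 <= ALLOC_LIMIT,
 * i.e. the longest input we could convert with the given multiplier.
 */
static size_t
max_convertible_len(int growth)
{
	assert(growth >= 1);
	return (ALLOC_LIMIT - 1) / (size_t) growth;
}
```

Under these assumptions a multiplier of 4 caps the input at about 256 MB,
while a multiplier of 2 allows about 512 MB and a multiplier of 1 nearly
the full 1 GB.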

Thoughts?

                        regards, tom lane

[1] https://www.postgresql.org/message-id/flat/[email protected]
