Re: NAMEDATALEN increase because of non-latin languages

2022-08-10 Thread John Naylor
I wrote:

> The syscache use of GETSTRUCT still uses a simple cast of the tuple (for 
> pg_cast those calls live in parse_coerce.c, which is unchanged from master in 
> v3). Next step I think is to see about the syscache piece -- teaching a 
> syscache miss to deform the entire tuple into a struct and store it in the 
> syscache.

v4 is a hackish attempt at getting the syscache to deform the tuple
and store the struct. Using only pg_cast again for ease, it seems to
work for that. It's now about as far as I can get without thinking
about byref types.

0001 just adds copies of some syscache / catcache functions for
private use by pg_cast, as scaffolding.
0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call
heap_deform_tuple(). This then calls a for-now handwritten function
(similar to the one generated in v3) which palloc's space for the
struct and copies the fields over. Only this function knows the
catalog struct type, so a future design could call the per-cache
function via a pointer in the catcache control array. Since we already
have the isnull/values array, it's also trivial to copy the datums for
the cache keys. WIP: CatCTup->tuple is still declared a tuple, but the
t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW
that knows it's looking directly at the struct.

Getting to varlen attributes: For one, I think it was mentioned above
that we will need a way to perform a deep copy of structs that contain
pointers to varlen fields. I imagine keeping track of the attributes
length will come up for some types and might be tricky. And before
that, the catalog machinery will need some preprocessor tricks to
declare pointers in the structs.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 20fba44412e8ef1bb4cd5b051b9d7e82618a6d93 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 10 Aug 2022 17:19:24 +0700
Subject: [PATCH v4 2/2] Teach catcache to store structs for pg_cast

---
 src/backend/parser/parse_coerce.c  |  6 +--
 src/backend/utils/cache/catcache.c | 73 --
 src/include/access/htup_details.h  |  2 +
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 39b7e5707b..07a1b047e3 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -3056,7 +3056,7 @@ IsBinaryCoercible(Oid srctype, Oid targettype)
 			ObjectIdGetDatum(targettype));
 	if (!HeapTupleIsValid(tuple))
 		return false;			/* no cast */
-	castForm = (Form_pg_cast) GETSTRUCT(tuple);
+	castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 	result = (castForm->castmethod == COERCION_METHOD_BINARY &&
 			  castForm->castcontext == COERCION_CODE_IMPLICIT);
@@ -3120,7 +3120,7 @@ find_coercion_pathway(Oid targetTypeId, Oid sourceTypeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 		CoercionContext castcontext;
 
 		/* convert char value for castcontext to CoercionContext enum */
@@ -3287,7 +3287,7 @@ find_typmod_coercion_function(Oid typeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 		*funcid = castForm->castfunc;
 		ReleaseSysCache(tuple);
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b1287bb6a0..8ddc109052 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -21,6 +21,7 @@
 #include "access/table.h"
 #include "access/valid.h"
 #include "access/xact.h"
+#include "catalog/pg_cast.h" // fixme
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -2158,6 +2159,42 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	return ct;
 }
 
+// WIP: generated functions would look like this and be called through a pointer
+// FIXME: ct->tuple is no longer a real tuple
+// XXX: for now assume the caller has switched to the right memory context
+static CatCTup *
+CatCArrayGetStruct_pg_cast(HeapTuple pg_cast_tuple, CatCTup *ct, Datum *values, bool *isnull)
+{
+	Form_pg_cast pg_cast_struct;
+
+	/* Allocate memory for CatCTup and the cached struct in one go */
+	ct = (CatCTup *) palloc(sizeof(CatCTup) +
+			MAXIMUM_ALIGNOF + sizeof(FormData_pg_cast));
+
+	/* copy the identification info */
+	// WIP: for caches we only need t_self, can we just have that as a
+	// separate field in CatCTup?
+	ct->tuple.t_len = pg_cast_tuple->t_len;
+	ct->tuple.t_self = pg_cast_tuple->t_self;
+	ct->tuple.t_tableOid = pg_cast_tuple->t_tableOid;
+
+	// WIP: treat t_data as a pointer to the struct
+	ct->tuple.t_data = (HeapTupleHeader)
+		MAXALIGN(((char *) ct) + sizeof(CatCTup));
+	pg_cast_struct = (Form_pg_cast) ct->tuple.t_data;
+
+	/* copy tuple contents */
+	// WIP: we can just assign because there are no varlen 

Re: NAMEDATALEN increase because of non-latin languages

2022-07-22 Thread John Naylor
On Tue, Jul 19, 2022 at 10:57 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-19 14:30:34 +0700, John Naylor wrote:
> > I'm thinking where the first few attributes are fixed length, not null,
and
> > (because of AIX) not double-aligned, we can do a single memcpy on
multiple
> > columns at once. That will still be a common pattern after namedata is
> > varlen. Otherwise, use helper functions/macros similar to the above but
> > instead of passing a tuple descriptor, use info we have at compile time.
>
> I think that might be over-optimizing things. I don't think we do these
> conversions at a rate that's high enough to warrant it - the common stuff
> should be in relcache etc.  It's possible that we might want to optimize
the
> catcache case specifically - but that'd be more optimizing memory usage
than
> "conversion" imo.

Okay, here is a hackish experiment that applies on top of v2 but also
invalidates some of that earlier work. Since there is already a pg_cast.c,
I demoed a new function there which looks like this:

void
Deform_pg_cast_tuple(Form_pg_cast pg_cast_struct, HeapTuple pg_cast_tuple,
TupleDesc pg_cast_desc)
{
Datum values[Natts_pg_cast];
bool isnull[Natts_pg_cast];

heap_deform_tuple(pg_cast_tuple, pg_cast_desc, values, isnull);

pg_cast_struct->oid = DatumGetObjectId(values[Anum_pg_cast_oid - 1]);
pg_cast_struct->castsource =
DatumGetObjectId(values[Anum_pg_cast_castsource - 1]);
pg_cast_struct->casttarget =
DatumGetObjectId(values[Anum_pg_cast_casttarget - 1]);
pg_cast_struct->castfunc =
DatumGetObjectId(values[Anum_pg_cast_castfunc - 1]);
pg_cast_struct->castcontext =
DatumGetChar(values[Anum_pg_cast_castcontext - 1]);
pg_cast_struct->castmethod =
DatumGetChar(values[Anum_pg_cast_castmethod - 1]);
}

For the general case we can use pg_*_deform.c or something like that, with
extern declarations in the main headers. To get this to work, I had to add
a couple pointless table open/close calls to get the tuple descriptor,
since currently the whole tuple is stored in the syscache, but that's not
good even as a temporary measure. Storing the full struct in the syscache
is a good future step, as noted upthread, but to get there without a bunch
more churn, maybe the above function can copy the tuple descriptor into a
local stack variable from an expanded version of schemapg.h. Once the
deformed structs are stored in caches, I imagine most of the times we want
to deform are when we have the table open, and we can pass the descriptor
as above without additional code.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 66e98daad1..cdc26b1118 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -3028,7 +3028,7 @@ getObjectDescription(const ObjectAddress *object, bool missing_ok)
 	break;
 }
 
-GETSTRUCT_MEMCPY(pg_cast, , tup);
+GETSTRUCT_MEMCPY(pg_cast, , tup, RelationGetDescr(castDesc));
 
 appendStringInfo(, _("cast from %s to %s"),
  format_type_be(castForm.castsource),
@@ -4870,7 +4870,7 @@ getObjectIdentityParts(const ObjectAddress *object,
 	break;
 }
 
-GETSTRUCT_MEMCPY(pg_cast, , tup);
+GETSTRUCT_MEMCPY(pg_cast, , tup, RelationGetDescr(castRel));
 
 appendStringInfo(, "(%s AS %s)",
  format_type_be_qualified(castForm.castsource),
diff --git a/src/backend/catalog/pg_cast.c b/src/backend/catalog/pg_cast.c
index 1812bb7fcc..83379f3505 100644
--- a/src/backend/catalog/pg_cast.c
+++ b/src/backend/catalog/pg_cast.c
@@ -27,6 +27,23 @@
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
+// WIP: #include from generated .c file?
+void
+Deform_pg_cast_tuple(Form_pg_cast pg_cast_struct, HeapTuple pg_cast_tuple, TupleDesc pg_cast_desc)
+{
+	Datum			values[Natts_pg_cast];
+	bool			isnull[Natts_pg_cast];
+
+	heap_deform_tuple(pg_cast_tuple, pg_cast_desc, values, isnull);
+
+	pg_cast_struct->oid = DatumGetObjectId(values[Anum_pg_cast_oid - 1]);
+	pg_cast_struct->castsource = DatumGetObjectId(values[Anum_pg_cast_castsource - 1]);
+	pg_cast_struct->casttarget = DatumGetObjectId(values[Anum_pg_cast_casttarget - 1]);
+	pg_cast_struct->castfunc = DatumGetObjectId(values[Anum_pg_cast_castfunc - 1]);
+	pg_cast_struct->castcontext = DatumGetChar(values[Anum_pg_cast_castcontext - 1]);
+	pg_cast_struct->castmethod = DatumGetChar(values[Anum_pg_cast_castmethod - 1]);
+}
+
 /*
  * 
  *		CastCreate
diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 0bd90fd5e8..f4bac9d171 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -30,6 +30,9 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
+// WIP
+#include "access/table.h"
+#include "utils/rel.h"
 
 
 static Node *coerce_type_typmod(Node *node,
@@ -2997,6 +3000,7 @@ 

Re: NAMEDATALEN increase because of non-latin languages

2022-07-19 Thread Andres Freund
Hi,

On 2022-07-19 14:30:34 +0700, John Naylor wrote:
> I wrote:
> 
> > On Mon, Jul 18, 2022 at 9:58 AM Andres Freund  wrote:
> > > Hm. Wouldn't it make sense to just use the normal tuple deforming
> routines and
> > > then map the results to the structs?
> >
> > I wasn't sure if they'd be suitable for this, but if they are, that'd
> make this easier and more maintainable. I'll look into it.
> 
> This would seem to have its own problems: heap_deform_tuple writes to
> passed arrays of datums and bools. The lower level parts like fetchatt and
> nocachegetattr return datums, so still need some generated boilerplate.
> Some of these also assume they can write cached offsets on a passed tuple
> descriptor.

Sure. But that'll just be a form of conversion we do all over, rather than
encoding low-level data layout details. Basically
struct->member1 = DatumGetInt32(values[0]);
struct->member2 = DatumGetChar(values[1]);

etc.


> I'm thinking where the first few attributes are fixed length, not null, and
> (because of AIX) not double-aligned, we can do a single memcpy on multiple
> columns at once. That will still be a common pattern after namedata is
> varlen. Otherwise, use helper functions/macros similar to the above but
> instead of passing a tuple descriptor, use info we have at compile time.

I think that might be over-optimizing things. I don't think we do these
conversions at a rate that's high enough to warrant it - the common stuff
should be in relcache etc.  It's possible that we might want to optimize the
catcache case specifically - but that'd be more optimizing memory usage than
"conversion" imo.


Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2022-07-19 Thread John Naylor
I wrote:

> On Mon, Jul 18, 2022 at 9:58 AM Andres Freund  wrote:
> > Hm. Wouldn't it make sense to just use the normal tuple deforming
routines and
> > then map the results to the structs?
>
> I wasn't sure if they'd be suitable for this, but if they are, that'd
make this easier and more maintainable. I'll look into it.

This would seem to have its own problems: heap_deform_tuple writes to
passed arrays of datums and bools. The lower level parts like fetchatt and
nocachegetattr return datums, so still need some generated boilerplate.
Some of these also assume they can write cached offsets on a passed tuple
descriptor.

I'm thinking where the first few attributes are fixed length, not null, and
(because of AIX) not double-aligned, we can do a single memcpy on multiple
columns at once. That will still be a common pattern after namedata is
varlen. Otherwise, use helper functions/macros similar to the above but
instead of passing a tuple descriptor, use info we have at compile time.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: NAMEDATALEN increase because of non-latin languages

2022-07-18 Thread John Naylor
On Mon, Jul 18, 2022 at 9:58 AM Andres Freund  wrote:

> > 0001 is just boilerplate, same as v1
>
> If we were to go for this, I wonder if we should backpatch the cast
containing
> version of GESTRUCT for less pain backpatching bugfixes. That'd likely
require
> using a different name for the cast containing one.

The new version in this series was meant to be temporary scaffolding, but
in the back of my mind I wondered if we should go ahead and keep the simple
cast for catalogs that have no varlenas or alignment issues. It sounds like
you'd be in favor of that.

> > 0003 generates static inline functions that work the same as the current
> > GETSTRUCT macro, i.e. just cast to the right pointer and return it.
>
> It seems likely that inline functions are going to be too large for
> this. There's a lot of GESTRUCTs in a lot of files, emitting a copy of the
> function every time doesn't seem great.

Ok.

> > current offset is the previous offset plus the previous type length,
plus
> > any alignment padding suitable for the current type (there is none
here, so
> > the alignment aspect is not tested). I'm hoping something like this
will be
> > sufficient for what's in the current structs, but of course will need
> > additional work when expanding those to include pointers to varlen
> > attributes. I've not yet inspected the emitted assembly language, but
> > regression tests pass.
>
> Hm. Wouldn't it make sense to just use the normal tuple deforming
routines and
> then map the results to the structs?

I wasn't sure if they'd be suitable for this, but if they are, that'd make
this easier and more maintainable. I'll look into it.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: NAMEDATALEN increase because of non-latin languages

2022-07-17 Thread Andres Freund
Hi,

On 2022-07-18 09:46:44 +0700, John Naylor wrote:
> I've made a small step in this direction.

Thanks for working on this!


> 0001 is just boilerplate, same as v1

If we were to go for this, I wonder if we should backpatch the cast containing
version of GESTRUCT for less pain backpatching bugfixes. That'd likely require
using a different name for the cast containing one.


> 0002 teaches Catalog.pm to export both C attr name and SQL attr name, so we
> can use "sizeof".
> 0003 generates static inline functions that work the same as the current
> GETSTRUCT macro, i.e. just cast to the right pointer and return it.

It seems likely that inline functions are going to be too large for
this. There's a lot of GESTRUCTs in a lot of files, emitting a copy of the
function every time doesn't seem great.


> current offset is the previous offset plus the previous type length, plus
> any alignment padding suitable for the current type (there is none here, so
> the alignment aspect is not tested). I'm hoping something like this will be
> sufficient for what's in the current structs, but of course will need
> additional work when expanding those to include pointers to varlen
> attributes. I've not yet inspected the emitted assembly language, but
> regression tests pass.

Hm. Wouldn't it make sense to just use the normal tuple deforming routines and
then map the results to the structs?

Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2022-06-27 Thread Julien Rouhaud
On Sat, Jun 25, 2022 at 08:00:04PM -0700, Andres Freund wrote:
>
> On 2022-06-26 10:48:24 +0800, Julien Rouhaud wrote:
> > Anyway, per the nearby discussions I don't see much interest, especially 
> > not in
> > the context of varlena identifiers (I should have started a different 
> > thread,
> > sorry about that), so I don't think it's worth investing more efforts into
> > it.
>
> FWIW, I still think it's a worthwhile feature

Oh, ok, I will start a new thread then.

> - just, to me, unrelated to varlena identifiers.

Yeah I get that and I agree.

> I've seen tremendous amounts of space wasted in tables
> just because of alignment issues...

Same here unfortunately.




Re: NAMEDATALEN increase because of non-latin languages

2022-06-25 Thread Andres Freund
Hi,

On 2022-06-26 10:48:24 +0800, Julien Rouhaud wrote:
> Anyway, per the nearby discussions I don't see much interest, especially not 
> in
> the context of varlena identifiers (I should have started a different thread,
> sorry about that), so I don't think it's worth investing more efforts into
> it.

FWIW, I still think it's a worthwhile feature - just, to me, unrelated to
varlena identifiers. I've seen tremendous amounts of space wasted in tables
just because of alignment issues...

- Andres




Re: NAMEDATALEN increase because of non-latin languages

2022-06-25 Thread Julien Rouhaud
On Thu, Jun 23, 2022 at 10:19:44AM -0400, Robert Haas wrote:
> On Thu, Jun 23, 2022 at 6:13 AM Julien Rouhaud  wrote:
>
> > And should record_in / record_out use the logical position, as in:
> > SELECT ab::text FROM ab / SELECT (a, b)::ab;
> >
> > I would think not, as relying on a possibly dynamic order could break 
> > things if
> > you store the results somewhere, but YMMV.
>
> I think here the answer is yes again. I mean, consider that you could
> also ALTER TABLE DROP COLUMN and then ALTER TABLE ADD COLUMN with the
> same name. That is surely going to affect the meaning of such things.
> I don't think we want to have one meaning if you reorder things that
> way and a different meaning if you reorder things using whatever
> commands we create for changing the display column positions.

It indeed would, but ALTER TABLE DROP COLUMN is a destructive operation, and
I'm assuming that anyone doing that is aware that it will have an impact on
stored data and such.  I initially thought that changing the display order of
columns shouldn't have the same impact with the stability of otherwise
unchanged record definition, as it would make such reorder much more impacting.
But I agree that having different behaviors seems worse.

> > Then, what about joinrels expansion?  I learned that the column ordering 
> > rules
> > are far from being obvious, and I didn't find those in the documentation 
> > (note
> > that I don't know if that something actually described in the SQL standard).
> > So for instance, if a join is using an explicit USING clause rather than an 
> > ON
> > clause, the merged columns are expanded first, so:
> >
> > SELECT * FROM ab ab1 JOIN ab ab2 USING (b)
> >
> > should unexpectedly expand to (b, a, a).  Is this order a strict 
> > requirement?
>
> I dunno, but I can't see why it creates a problem for this patch to
> maintain the current behavior. I mean, just use the logical column
> position instead of the physical one here and forget about the details
> of how it works beyond that.

I'm not that familiar with this part of the code so I may have missed
something, but I didn't see any place where I could just simply do that.

To be clear, the approach I used is to change the expansion ordering but
otherwise keep the current behavior, to try to minimize the changes.  This is
done by keeping the attribute in the physical ordering pretty much everywhere,
including in the nsitems, and just logically reorder them during the expansion.
In other words all the code still knows that the 1st column is the first
physical column and so on.

So in that query, the ordering is supposed to happen when handling the "SELECT
*", which makes it impossible to retain that order.

I'm assuming that what you meant is to change the ordering when processing the
JOIN and retain the old "SELECT *" behavior, which is to emit items in the
order they're found.  But IIUC the only way to do that would be to change the
order when building the nsitems themselves, and have the code believe that the
attributes are physically stored in the logical order.  That's probably doable,
but that looks like a way more impacting change.  Or did you mean to keep the
approach I used, and just have some special case for "SELECT *" when referring
to a joinrel and instead try to handle the logical expansion in the join?
AFAICS it would require to add some extra info in the parsing structures, as it
doesn't really really store any position, just relies on array offset / list
position and maps things that way.

> > Another problem (that probably wouldn't be a problem for system catalogs) is
> > that defaults are evaluated in the physical position.  This example from the
> > regression test will clearly have a different behavior if the columns are 
> > in a
> > different physical order:
> >
> >  CREATE TABLE INSERT_TBL (
> > x INT DEFAULT nextval('insert_seq'),
> > y TEXT DEFAULT '-NULL-',
> > z INT DEFAULT -1 * currval('insert_seq'),
> > CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND 
> > x < 8),
> > CHECK (x + z = 0));
> >
> > But changing the behavior to rely on the logical position seems quite
> > dangerous.
>
> Why?

It feels to me like a POLA violation, and probably people wouldn't expect it to
behave this way (even if this is clearly some corner case problem).  Even if
you argue that this is not simply a default display order but something more
like real column order, the physical position being some implementation detail,
it still doesn't really feels right.

The main reason for having the possibility to change the logical position is to
have "better looking", easier to work with, relations even if you have some
requirements with the real physical order like trying to optimize things as
much as possible (reordering columns to avoid padding space, put non-nullable
columns first...).  The order in which defaults are evaluated looks like the
same kind of requirements.  How useful would 

Re: NAMEDATALEN increase because of non-latin languages

2022-06-24 Thread Tom Lane
Robert Haas  writes:
> I don't know whether we can or should move all the "name" columns to
> the end of the catalog. It would be user-visible and probably not
> user-desirable,

I'm a strong -1 on changing that if we're not absolutely forced to.

> but it would save something in terms of tuple
> deforming cost. I'm just not sure how much.

I'd guess nothing, if we are deforming all the fields.

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2022-06-24 Thread Robert Haas
On Thu, Jun 23, 2022 at 11:11 PM John Naylor
 wrote:
> Hmm, I must have misunderstood this aspect. In my mind I was thinking
> that if a varlen attribute were at the end, these functions would make
> it easier to access them quickly. But from this and the follow-on
> responses, these would be used to access varlen attributes wherever
> they may be. I'll take a look at the current uses of GETSTRUCT().

I don't know whether we can or should move all the "name" columns to
the end of the catalog. It would be user-visible and probably not
user-desirable, but it would save something in terms of tuple
deforming cost. I'm just not sure how much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-24 Thread Robert Haas
On Thu, Jun 23, 2022 at 6:43 PM Tom Lane  wrote:
> Nonetheless, the presence of GETSTRUCT calls should be a good guide
> to where we need to do something.

Indubitably.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread John Naylor
On Thu, Jun 23, 2022 at 9:17 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-06-03 13:28:16 +0700, John Naylor wrote:
> > 1. That would require putting the name physically closer to the end of
> > the column list. To make this less annoying for users, we'd need to
> > separate physical order from display order (at least/at first only for
> > system catalogs).
>
> FWIW, it's not at all clear to me that this would be required. If I were to
> tackle this I'd look at breaking up the mirroring of C structs to catalog
> struct layout, by having genbki (or such) generate functions to map to/from
> tuple layout. It'll be an annoying amount of changes, but feasible, I think.

Hmm, I must have misunderstood this aspect. In my mind I was thinking
that if a varlen attribute were at the end, these functions would make
it easier to access them quickly. But from this and the follow-on
responses, these would be used to access varlen attributes wherever
they may be. I'll take a look at the current uses of GETSTRUCT().

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 23, 2022 at 5:49 PM Andres Freund  wrote:
>> I was thinking we'd basically do it wherever we do a GETSTRUCT() today.

> That seems a little fraught, because you'd be turning what's now
> basically a trivial operation into a non-trivial operation involving
> memory allocation.

Nonetheless, the presence of GETSTRUCT calls should be a good guide
to where we need to do something.

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Robert Haas
On Thu, Jun 23, 2022 at 5:49 PM Andres Freund  wrote:
> I was thinking we'd basically do it wherever we do a GETSTRUCT() today.
>
> A first step could be to transform code like
> (Form_pg_attribute) GETSTRUCT(tuple)
> into
>GETSTRUCT(pg_attribute, tuple)
>
> then, in a subsequent step, we'd redefine GETSTRUCT as something
> #define GESTRUCT(catalog, tuple) tuple_to_struct_##catalog(tuple)

That seems a little fraught, because you'd be turning what's now
basically a trivial operation into a non-trivial operation involving
memory allocation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Andres Freund
Hi,

On 2022-06-23 14:42:17 -0400, Robert Haas wrote:
> On Thu, Jun 23, 2022 at 2:07 PM Tom Lane  wrote:
> > The extra cost of the deforming step could also be repaid, in some
> > cases, by not having to use SysCacheGetAttr etc later on to fetch
> > variable-length fields.  That is, I'm imagining that the deformer
> > would extract all the fields, even varlena ones, and drop pointers
> > or whatever into fields of the C struct.

I was also thinking we'd translate all attributes. Not entirely sure whether
we'd want to use "plain" pointers though - there are some places where we rely
on being able to copy such structs around. That'd be a bit easier with
relative pointers, pointing to the end of the struct. But likely the
notational overhead of dealing with relative pointers would be far higher than
the notational overhead of having to invoke a generated "tuple struct" copy
function. Which we'd likely need anyway, because some previously statically
sized allocations would end up being variable sized?


> Yeah, if we were going to do something like this, I can't see why we
> wouldn't do it this way. It wouldn't make sense to do it for only some
> of the attributes.

Agreed.


> I'm not sure exactly where we would put this translation step, though.
> I think for the syscaches and relcache we'd want to translate when
> populating the cache so that when you do a cache lookup you get the
> data already translated. It's hard to be sure without testing, but
> that seems like it would make this cheap enough that we wouldn't have
> to be too worried, since the number of times we build new cache
> entries should be small compared to the number of times we access
> existing ones. The trickier thing might be code that uses
> systable_beginscan() et. al. directly.

I was thinking we'd basically do it wherever we do a GETSTRUCT() today.

A first step could be to transform code like
(Form_pg_attribute) GETSTRUCT(tuple)
into
   GETSTRUCT(pg_attribute, tuple)

then, in a subsequent step, we'd redefine GETSTRUCT as something
#define GESTRUCT(catalog, tuple) tuple_to_struct_##catalog(tuple)

Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Robert Haas
On Thu, Jun 23, 2022 at 2:07 PM Tom Lane  wrote:
> Sounds worth investigating, anyway.  It'd also get us out from under
> C-struct-related problems such as the nearby AIX alignment issue.

Yeah.

> The extra cost of the deforming step could also be repaid, in some
> cases, by not having to use SysCacheGetAttr etc later on to fetch
> variable-length fields.  That is, I'm imagining that the deformer
> would extract all the fields, even varlena ones, and drop pointers
> or whatever into fields of the C struct.

Yeah, if we were going to do something like this, I can't see why we
wouldn't do it this way. It wouldn't make sense to do it for only some
of the attributes.

I'm not sure exactly where we would put this translation step, though.
I think for the syscaches and relcache we'd want to translate when
populating the cache so that when you do a cache lookup you get the
data already translated. It's hard to be sure without testing, but
that seems like it would make this cheap enough that we wouldn't have
to be too worried, since the number of times we build new cache
entries should be small compared to the number of times we access
existing ones. The trickier thing might be code that uses
systable_beginscan() et. al. directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 23, 2022 at 10:17 AM Andres Freund  wrote:
>> FWIW, I don't agree that this is a reasonable way to tackle changing
>> NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
>> fraction of the problem of making Names variable length. You'll still have 
>> all
>> the problems of name fields being accessed all over, but now not being
>> included in the part of the struct visible to C etc.

> Yeah, I agree. I think that being able to reorder columns logically
> without changing anything physically would be cool, but I think we
> could find some way of making the catalog columns variable-length
> without introducing that feature.

Agreed.  Bearing in mind that multiple smart people have tackled that idea
and gotten nowhere, I think setting it as a prerequisite for this project
is a good way of ensuring this won't happen.

> I'm not sure whether your idea of creating translator functions is a
> good one or not, but it doesn't seem too crazy. It'd be like a special
> purpose tuple deformer.

Sounds worth investigating, anyway.  It'd also get us out from under
C-struct-related problems such as the nearby AIX alignment issue.

The extra cost of the deforming step could also be repaid, in some
cases, by not having to use SysCacheGetAttr etc later on to fetch
variable-length fields.  That is, I'm imagining that the deformer
would extract all the fields, even varlena ones, and drop pointers
or whatever into fields of the C struct.

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Robert Haas
On Thu, Jun 23, 2022 at 10:17 AM Andres Freund  wrote:
> > This would require:
> >
> > - changing star expansion in SELECTs (expandRTE etc)
> > - adjusting pg_dump, \d, etc
> >
> > That much seems clear and agreed upon.
>
> FWIW, I don't agree that this is a reasonable way to tackle changing
> NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
> fraction of the problem of making Names variable length. You'll still have all
> the problems of name fields being accessed all over, but now not being
> included in the part of the struct visible to C etc.

Yeah, I agree. I think that being able to reorder columns logically
without changing anything physically would be cool, but I think we
could find some way of making the catalog columns variable-length
without introducing that feature.

I'm not sure whether your idea of creating translator functions is a
good one or not, but it doesn't seem too crazy. It'd be like a special
purpose tuple deformer.

deform_pg_class_tuple(_class_struct, pg_class_tuple);

The code in there could be automatically generated statements that
maybe look like this:

memcpy(, tuple + Aoffset_pg_class_relnamespace,
sizeof(Oid));

Once you hit the first variable-length field you'd need something more
clever, but because the code would be specialized for each catalog, it
would probably be quite fast anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Robert Haas
On Thu, Jun 23, 2022 at 6:13 AM Julien Rouhaud  wrote:
> While some problem wouldn't happen if we restricted the feature to system
> catalogs only (e.g. with renamed / dropped attributes, inheritance...), a lot
> would still exist and would have to be dealt with initially.  However I'm not
> sure what behavior would be wanted or acceptable, especially if we want
> something that can eventually be used for user relations too, with possibly
> dynamic positions.

I am not very convinced that this would make the project a whole lot
easier, but it does seem like the result would be less nice.

> I'll describe some of those problems, and just to make things easier I will 
> use
> a normal table "ab" with 2 attributes, a and b, with their physical / logical
> position reversed.
>
> Obviously
>
> SELECT * FROM ab
>
> should return a and b in that order.  The aliases should also probably match
> the logical position, as in:
>
> SELECT * FROM ab ab(aa, bb)
>
> should probably map aa to a and bb to b.

Check.

> But should COPY FROM / COPY TO / INSERT INTO use the logical position too if
> not column list is provided?  I'd think so, but it goes further from the
> original "only handle star expansion" idea.

I think yes.

> And should record_in / record_out use the logical position, as in:
> SELECT ab::text FROM ab / SELECT (a, b)::ab;
>
> I would think not, as relying on a possibly dynamic order could break things 
> if
> you store the results somewhere, but YMMV.

I think here the answer is yes again. I mean, consider that you could
also ALTER TABLE DROP COLUMN and then ALTER TABLE ADD COLUMN with the
same name. That is surely going to affect the meaning of such things.
I don't think we want to have one meaning if you reorder things that
way and a different meaning if you reorder things using whatever
commands we create for changing the display column positions.

> Note that there a variations of that, like
> SELECT ab::ab FROM (SELECT * FROM ab) ab;
>
> But those should probably work as intended (for now it doesn't) as you can't
> store a bare record, and storing a plain "ab" type wouldn't be problematic 
> with
> dynamic changes.

If the sub-SELECT and the cast both use the display order, which I
think they should, then there's no problem here, I believe.

> Then, what about joinrels expansion?  I learned that the column ordering rules
> are far from being obvious, and I didn't find those in the documentation (note
> that I don't know if that something actually described in the SQL standard).
> So for instance, if a join is using an explicit USING clause rather than an ON
> clause, the merged columns are expanded first, so:
>
> SELECT * FROM ab ab1 JOIN ab ab2 USING (b)
>
> should unexpectedly expand to (b, a, a).  Is this order a strict requirement?

I dunno, but I can't see why it creates a problem for this patch to
maintain the current behavior. I mean, just use the logical column
position instead of the physical one here and forget about the details
of how it works beyond that.

> Another problem (that probably wouldn't be a problem for system catalogs) is
> that defaults are evaluated in the physical position.  This example from the
> regression test will clearly have a different behavior if the columns are in a
> different physical order:
>
>  CREATE TABLE INSERT_TBL (
> x INT DEFAULT nextval('insert_seq'),
> y TEXT DEFAULT '-NULL-',
> z INT DEFAULT -1 * currval('insert_seq'),
> CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND x 
> < 8),
> CHECK (x + z = 0));
>
> But changing the behavior to rely on the logical position seems quite
> dangerous.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-06-23 Thread Andres Freund
Hi,

On 2022-06-03 13:28:16 +0700, John Naylor wrote:
> 1. That would require putting the name physically closer to the end of
> the column list. To make this less annoying for users, we'd need to
> separate physical order from display order (at least/at first only for
> system catalogs).

FWIW, it's not at all clear to me that this would be required. If I were to
tackle this I'd look at breaking up the mirroring of C structs to catalog
struct layout, by having genbki (or such) generate functions to map to/from
tuple layout. It'll be an annoying amount of changes, but feasible, I think.


> This would require:
> 
> - changing star expansion in SELECTs (expandRTE etc)
> - adjusting pg_dump, \d, etc
> 
> That much seems clear and agreed upon.

FWIW, I don't agree that this is a reasonable way to tackle changing
NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
fraction of the problem of making Names variable length. You'll still have all
the problems of name fields being accessed all over, but now not being
included in the part of the struct visible to C etc.

Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2022-06-03 Thread John Naylor
Hi,

I wanted to revive this thread to summarize what was discussed and get
a sense of next steps we could take.

The idea that gained the most traction is to make identifiers
variable-length in the catalogs, which has the added benefit of
reducing memory in syscaches in the common case. That presented two
challenges:

1. That would require putting the name physically closer to the end of
the column list. To make this less annoying for users, we'd need to
separate physical order from display order (at least/at first only for
system catalogs). This would require:

- changing star expansion in SELECTs (expandRTE etc)
- adjusting pg_dump, \d, etc

That much seems clear and agreed upon.

2. We'd need to change how TupleDescs are stored and accessed.

For this point, Matthias shared a prototype patch for a 'compact
attribute access descriptor' and Andres wondered if we should "give up
on the struct / ondisk layout mirroring for catalog tables, and
generate explicit transformation routines via Catalog.pm".

After this discussion dropped off, and it's not quite clear to me what
the first logical step is to make progress on the thread subject, and
what's nice to have for other reasons. Is Matthias's patch or
something like it a good next step?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2021-08-20 Thread Matthias van de Meent
On Thu, 19 Aug 2021 at 14:58, Andres Freund  wrote:
>
> Hi,
>
> On 2021-08-19 14:47:42 +0200, Matthias van de Meent wrote:
> > I tried to implement this 'compact attribute access descriptor' a few
> > months ago in my effort to improve btree index performance.
>
> cool
>
>
> > The patch allocates an array of 'TupleAttrAlignData'-structs at the
> > end of the attrs-array, fills it with the correct data upon
> > TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
> > and destructing tuples.
>
> > One main difference from what you described was that I used a union
> > for storing attbyval and attstorage, as the latter is only applicable
> > to attlen < 0, and the first only for attlen >= 0. This keeps the
> > whole structure in 8 bytes, whilst also being useable in both tuple
> > forming and deforming.
>
> That's why I just talked about the naive way - it's clearly possible to
> do better... ;)
>
>
> > I hope this can is useful, otherwise sorry for the noise.
>
> It is!

Great!

> I haven't looked at your patch in detail, but I suspect that one reason
> that you didn't see performance benefits is that you added overhead as
> well. The computation of the "compact" memory location now will need a
> few more instructions than before, and I suspect the compiler may not
> even be able to optimize out some of the redundant accesses in loops.
>
> It'd be interesting to see what you'd get if you stored the compact
> array as the flexible-array and stored a pointer to the "full" attrs
> array (while still keeping it allocated together).

Yes, I remember testing swapping the order of the compact array with the
FormData_pg_attribute array as well, with no clear results.

I think this can partially be attributed to the split access methods of the
data in the attribute descriptor: some of it is 'give me the name', some of
it is 'does this attribute exist, what type description does it have?'
(atttypid, attnum, atttypmod, , and others are only interested in the
physical representation information. Prioritizing some over the other might
work, but I think to make full use of that it'd need a lot of work.

> Another reason is that it looks like you didn't touch
> slot_deform_heap_tuple(), which is I think the hottest of the deforming
> routines...

That might be for normal operations, but I'm not certain that code is in
the hot path for (btree) indexing workloads, due to the relatively high
number of operations on each tuple whilst sorting, or finding an insertion
point or scan start point.

Anyway, after some digging I found the final state of this patch before I
stopped working on it, and after polishing it up a bit with your
suggestions it now passes check-world on the latest head (8d2d6ec7).


Kind regards,

Matthias van de Meent
From 9624034b6d23f7c5812b87bed705cbf2ae781d36 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Sun, 20 Jun 2021 21:16:33 +0200
Subject: [PATCH] Compact TupleDesc->attr access descriptors

Evolved from previous iterations, and includes some further work based
on comments from Andres Freund [0]

[0] 
https://www.postgresql.org/message-id/20210819125756.lhcp5wlppfv2r47d%40alap3.anarazel.de
---
 src/backend/access/common/heaptuple.c   | 95 +
 src/backend/access/common/indextuple.c  | 52 --
 src/backend/access/common/relation.c|  2 +
 src/backend/access/common/tupdesc.c | 73 ++-
 src/backend/access/heap/heapam.c|  6 +-
 src/backend/access/heap/heaptoast.c | 12 ++--
 src/backend/access/nbtree/nbtutils.c|  6 +-
 src/backend/access/spgist/spgdoinsert.c |  2 +-
 src/backend/access/spgist/spgutils.c|  3 +
 src/backend/access/table/toast_helper.c |  6 +-
 src/backend/catalog/heap.c  | 10 +--
 src/backend/catalog/index.c |  2 +
 src/backend/catalog/toasting.c  |  4 ++
 src/backend/executor/execTuples.c   | 10 +--
 src/backend/executor/nodeValuesscan.c   |  4 +-
 src/backend/utils/adt/expandedrecord.c  | 14 ++--
 src/backend/utils/adt/ri_triggers.c |  2 +-
 src/backend/utils/cache/catcache.c  |  6 +-
 src/backend/utils/cache/relcache.c  | 18 -
 src/include/access/htup_details.h   | 54 +++---
 src/include/access/itup.h   |  9 +--
 src/include/access/tupdesc.h| 42 +--
 src/include/access/tupmacs.h|  2 +-
 src/include/catalog/pg_attribute.h  |  5 ++
 24 files changed, 297 insertions(+), 142 deletions(-)

diff --git a/src/backend/access/common/heaptuple.c 
b/src/backend/access/common/heaptuple.c
index 0b56b0fa5a..3600f88b25 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -123,19 +123,18 @@ heap_compute_data_size(TupleDesc tupleDesc,
Sizedata_length = 0;
int i;
int numberOfAttributes = tupleDesc->natts;
+   TupleAttrAlign att;
 
-   for (i = 0; i < 

Re: NAMEDATALEN increase because of non-latin languages

2021-08-19 Thread Andres Freund
Hi,

On 2021-08-19 14:47:42 +0200, Matthias van de Meent wrote:
> I tried to implement this 'compact attribute access descriptor' a few
> months ago in my effort to improve btree index performance.

cool


> The patch allocates an array of 'TupleAttrAlignData'-structs at the
> end of the attrs-array, fills it with the correct data upon
> TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
> and destructing tuples.

> One main difference from what you described was that I used a union
> for storing attbyval and attstorage, as the latter is only applicable
> to attlen < 0, and the first only for attlen >= 0. This keeps the
> whole structure in 8 bytes, whilst also being useable in both tuple
> forming and deforming.

That's why I just talked about the naive way - it's clearly possible to
do better... ;)


> I hope this can is useful, otherwise sorry for the noise.

It is!


I haven't looked at your patch in detail, but I suspect that one reason
that you didn't see performance benefits is that you added overhead as
well. The computation of the "compact" memory location now will need a
few more instructions than before, and I suspect the compiler may not
even be able to optimize out some of the redundant accesses in loops.

It'd be interesting to see what you'd get if you stored the compact
array as the flexible-array and stored a pointer to the "full" attrs
array (while still keeping it allocated together).


Another reason is that it looks like you didn't touch
slot_deform_heap_tuple(), which is I think the hottest of the deforming
routines...

Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2021-08-19 Thread Matthias van de Meent
On Thu, 19 Aug 2021 at 13:44, Andres Freund  wrote:
>
> > Another fun thing --- and, I fear, another good argument against just
> > raising NAMEDATALEN --- is what about TupleDescs, which last I checked
> > used an array of fixed-width pg_attribute images.  But maybe we could
> > replace that with an array of pointers.  Andres already did a lot of
> > the heavy code churn required to hide that data structure behind
> > TupleDescAttr() macros, so changing the representation should be much
> > less painful than it would once have been.
>
> I was recently wondering if we shouldn't go to a completely bespoke
> datastructure for TupleDesc->attrs, rather than reusing FormData_pg_attribute.
>
> Right now every attribute uses nearly two cachelines (112 bytes). Given how
> frequent a task tuple [de]forming is, and how often it's a bottleneck,
> increasing the cache efficiency of tupledescs would worth quite a bit of
> effort - I do see tupledesc attr cache misses in profiles. A secondary benefit
> would be that we do create a lot of short-lived descs in the executor,
> slimming those down obviously would be good on its own. A third benefit would
> be that we could get rid of attcacheoff in pg_attribute, that always smelled
> funny to me.
>
> One possible way to structure such future tupledescs would be to have multiple
> arrays in struct TupleDescData. With an array of just the data necessary for
> [de]forming at the place ->attrs is, and other stuff in one or more separate
> arrays. The other option could perhaps be omitted for some tupledescs or
> computed lazily.
>
> For deforming we just need attlen (2byte), attbyval (1 byte), attalign (1byte)
> and optionally attcacheoff (4 byte), for forming we also need attstorage (1
> byte). Naively that ends up being 12 bytes - 5 attrs / cacheline is a heck of
> a lot better than ~0.5.

I tried to implement this 'compact attribute access descriptor' a few
months ago in my effort to improve btree index performance.

I abandoned the idea at the time as I didn't find any measurable
difference for the (limited!) tests I ran, where the workload was
mainly re-indexing, select * into, and similar items while
benchmarking reindexing in the 'pp-complete' dataset. But, seeing that
there might be interest outside this effort on a basis seperate from
just plain performance, I'll share the results.

Attached is the latest version of my patch that I could find; it might
be incorrect or fail, as this is something I sent to myself between 2
of my systems during development of the patch. Also, attached as .txt,
as I don't want any CFBot coverage on this (this is not proposed for
inclusion, it is just a show of work, and might be basis for future
work).

The patch allocates an array of 'TupleAttrAlignData'-structs at the
end of the attrs-array, fills it with the correct data upon
TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
and destructing tuples.

One main difference from what you described was that I used a union
for storing attbyval and attstorage, as the latter is only applicable
to attlen < 0, and the first only for attlen >= 0. This keeps the
whole structure in 8 bytes, whilst also being useable in both tuple
forming and deforming.

I hope this can is useful, otherwise sorry for the noise.


Kind regards,

Matthias van de Meent
From 3dd3f470ab78b8811015c9f374d0bdc44ffef531 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Wed, 23 Jun 2021 20:38:47 +0200
Subject: [PATCH] Some work on storing attribute access fields more compactly

This would increase the amount of useful data in each cache line, potentially 
increasing performance. It _does_ use more memory (8 bytes / attribute extra 
overhead)
---
 src/backend/access/common/heaptuple.c  | 95 +-
 src/backend/access/common/indextuple.c | 52 --
 src/backend/access/common/relation.c   |  2 +
 src/backend/access/common/tupdesc.c| 74 +++-
 src/backend/access/spgist/spgutils.c   |  3 +
 src/backend/catalog/index.c|  2 +
 src/backend/catalog/toasting.c |  4 ++
 src/backend/utils/cache/relcache.c | 18 -
 src/include/access/htup_details.h  |  6 +-
 src/include/access/itup.h  |  9 +--
 src/include/access/tupdesc.h   | 35 +-
 src/include/access/tupmacs.h   |  2 +-
 src/include/catalog/pg_attribute.h |  5 ++
 13 files changed, 226 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/common/heaptuple.c 
b/src/backend/access/common/heaptuple.c
index 0b56b0fa5a..3600f88b25 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -123,19 +123,18 @@ heap_compute_data_size(TupleDesc tupleDesc,
Sizedata_length = 0;
int i;
int numberOfAttributes = tupleDesc->natts;
+   TupleAttrAlign att;
 
-   for (i = 0; i < numberOfAttributes; i++)
+   for (i = 0, att = 

Re: NAMEDATALEN increase because of non-latin languages

2021-08-19 Thread Andres Freund
Hi,

On 2021-08-18 10:21:03 -0400, Tom Lane wrote:
> Anyway, this whole argument could be rendered moot if we could convert
> name to a variable-length type.  That would satisfy *both* sides of
> the argument, since those who need long names could have them, while
> those who don't would see net reduction instead of growth in space usage.

Yes, I think that's the best direction to go. We're loosing a fair bit
of in-memory efficiency with current NAMEDATALEN already, it'd imo be
beneficial to go for variable length encoding from that angle alone.


> Of course, this is far far easier said than done; else we would have
> done it years ago.  But maybe it's not entirely out of reach.
> I do not think it'd be hard to change "name" to have the same on-disk
> storage representation as cstring; the hard part is what about its
> usage in fixed-width catalog structures.

Indeed. ISTM that the hardest part of that is dealing with copying around
Form* structs?

I wonder if we're getting closer to the time where we should just give up on
the struct / ondisk layout mirroring for catalog tables, and generate explicit
transformation routines via Catalog.pm. If we have to touch places handling
Form_* structs anyway, we could make that more worthwhile by doing away with
CATALOG_VARLEN etc - the transformation routines would just make those fields
pointers which then can be accessed normally.


> Maybe we could finesse that
> by decreeing that the name column always has to be the last
> non-CATALOG_VARLEN field.  (This would require fixing up the couple of
> places where we let some other var-width field have that distinction;
> but I suspect that would be small in comparison to the other work this
> implies.  If there are any catalogs having two name columns, one of them
> would become more difficult to reach from C code.)

I wish we could find a way to make "relative pointers" (i.e. a variable offset
from some base struct) work on the C level in a transparent and portable
manner. Just about any instruction set has them natively anyway, and for a lot
of tasks like variable length members it can be considerably more
efficient... It'd also make DSM using code simpler and faster. Oh well, one
can dream.



> Another fun thing --- and, I fear, another good argument against just
> raising NAMEDATALEN --- is what about TupleDescs, which last I checked
> used an array of fixed-width pg_attribute images.  But maybe we could
> replace that with an array of pointers.  Andres already did a lot of
> the heavy code churn required to hide that data structure behind
> TupleDescAttr() macros, so changing the representation should be much
> less painful than it would once have been.

I was recently wondering if we shouldn't go to a completely bespoke
datastructure for TupleDesc->attrs, rather than reusing FormData_pg_attribute.

Right now every attribute uses nearly two cachelines (112 bytes). Given how
frequent a task tuple [de]forming is, and how often it's a bottleneck,
increasing the cache efficiency of tupledescs would worth quite a bit of
effort - I do see tupledesc attr cache misses in profiles. A secondary benefit
would be that we do create a lot of short-lived descs in the executor,
slimming those down obviously would be good on its own. A third benefit would
be that we could get rid of attcacheoff in pg_attribute, that always smelled
funny to me.

One possible way to structure such future tupledescs would be to have multiple
arrays in struct TupleDescData. With an array of just the data necessary for
[de]forming at the place ->attrs is, and other stuff in one or more separate
arrays. The other option could perhaps be omitted for some tupledescs or
computed lazily.

For deforming we just need attlen (2byte), attbyval (1 byte), attalign (1byte)
and optionally attcacheoff (4 byte), for forming we also need attstorage (1
byte). Naively that ends up being 12 bytes - 5 attrs / cacheline is a heck of
a lot better than ~0.5.

Greetings,

Andres Freund




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Thu, Aug 19, 2021 at 12:12 AM Tom Lane  wrote:
>
> Yeah, exactly: conceptually that's simple, but flushing all the bugs
> out would be a years-long nightmare.  It'd make all the fun we had
> with missed attisdropped checks look like a walk in the park.  Unless
> somebody can figure out a way to mechanically check for mistakes,
> I don't think I want to go there.

Maybe a silly idea, but we could have some shared_preload_libraries
module with a command_utility_hook that intercept table creation and
randomize the logical column order.  If we also have some GUC (assert
only if needed) to switch star expansion to physical-order rather than
logical-order, we could then use the regression test as a big hammer
to see if anything breaks.  That's clearly not ideal (it would
obviously need some other hacks to avoid other breakage like \d and
other things) and not 100% coverage, but it should give some
confidence in any patch completeness.




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Andrew Dunstan


On 8/18/21 12:39 PM, Alvaro Herrera wrote:
> On 2021-Aug-18, Tom Lane wrote:
>
>> I wonder though if we could fix the immediate problem with something
>> less ambitious.  The hard part of the full proposal, I think, is
>> separating permanent identity from physical position.  If we were to
>> split out *only* the display order from that, the patch footprint
>> ought to be far far smaller --- basically, I think, we'd need to fix
>> star-expansion and not a lot more in the backend.  Of course,
>> client-side code like psql's \d and pg_dump would need to be upgraded
>> too, but any missed things there would be cosmetic that's-not-the-
>> expected-column-order bugs, not core dumps.  Also, at least in the v1
>> patch we could use this just for system catalogs without exposing it
>> as a feature for user tables, which would greatly restrict the set of
>> client-side places that really need fixed.
>>
>> (I think Alvaro was the last person to mess with this issue, so I
>> wonder what his take is on the feasibility of such a restricted
>> solution.)
> Yeah, my impression is that the project of just changing star expansion
> is much, much easier than splitting out attribute identity from physical
> location.  The former, as I recall, is a localized change in expandRTE
> and friends, and you don't have to touch anything else.  The other part
> of the change is much more invasive and got me into territory that I
> wasn't able to navigate successfully.
>
> (I think we should consider keeping 'attnum' as the display-order
> attribute, and have the physical-and-identity attribute get a new name,
> say attphysnum.  That's so that you don't have to change psql and the
> whole world -- the change is transparent to them.  This means we need a
> first step that's very invasive, because every single current use of
> 'attnum' has to be changed to attphysnum first, followed by a functional
> patch that changes a few of those back to attnum.  It'd be a large
> inconvenience to backend developers to ease the lives of client-side
> developers.)


Can we call it attid just in case we decide some day to break the nexus
on the other side?


I like the idea of keeping it to the catalog to start with.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Alvaro Herrera
On 2021-Aug-18, Tom Lane wrote:

> I wonder though if we could fix the immediate problem with something
> less ambitious.  The hard part of the full proposal, I think, is
> separating permanent identity from physical position.  If we were to
> split out *only* the display order from that, the patch footprint
> ought to be far far smaller --- basically, I think, we'd need to fix
> star-expansion and not a lot more in the backend.  Of course,
> client-side code like psql's \d and pg_dump would need to be upgraded
> too, but any missed things there would be cosmetic that's-not-the-
> expected-column-order bugs, not core dumps.  Also, at least in the v1
> patch we could use this just for system catalogs without exposing it
> as a feature for user tables, which would greatly restrict the set of
> client-side places that really need fixed.
> 
> (I think Alvaro was the last person to mess with this issue, so I
> wonder what his take is on the feasibility of such a restricted
> solution.)

Yeah, my impression is that the project of just changing star expansion
is much, much easier than splitting out attribute identity from physical
location.  The former, as I recall, is a localized change in expandRTE
and friends, and you don't have to touch anything else.  The other part
of the change is much more invasive and got me into territory that I
wasn't able to navigate successfully.

(I think we should consider keeping 'attnum' as the display-order
attribute, and have the physical-and-identity attribute get a new name,
say attphysnum.  That's so that you don't have to change psql and the
whole world -- the change is transparent to them.  This means we need a
first step that's very invasive, because every single current use of
'attnum' has to be changed to attphysnum first, followed by a functional
patch that changes a few of those back to attnum.  It'd be a large
inconvenience to backend developers to ease the lives of client-side
developers.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 8/18/21 10:53 AM, Tom Lane wrote:
>> Yeah, it would annoy the heck out of me too.  Again there's a potential
>> technical solution, which is to decouple the user-visible column order
>> from the storage order.  However, multiple people have tilted at that
>> windmill without much success, so making it a prerequisite for improving
>> the name-length situation doesn't seem like a smart plan.

> There might be other benefits, though. IIRC what we discussed years ago
> was having for each attribute an immutable number (presumably attnum as
> now) and a mutable display order and storage order. Previous patches
> didn't implement this and so were rejected. I think part of the trouble
> is that we'd have to go through roughly 1700 mentions of attnum in the
> source and decide if it's really attnum they want or if it's
> attnum_display_order or attnum_storage_order. So this has the potential
> to be extraordinarily invasive and potentially bug-prone. And then
> there's the world of extensions to consider.

Yeah, exactly: conceptually that's simple, but flushing all the bugs
out would be a years-long nightmare.  It'd make all the fun we had
with missed attisdropped checks look like a walk in the park.  Unless
somebody can figure out a way to mechanically check for mistakes,
I don't think I want to go there.

I wonder though if we could fix the immediate problem with something
less ambitious.  The hard part of the full proposal, I think, is
separating permanent identity from physical position.  If we were to
split out *only* the display order from that, the patch footprint
ought to be far far smaller --- basically, I think, we'd need to fix
star-expansion and not a lot more in the backend.  Of course,
client-side code like psql's \d and pg_dump would need to be upgraded
too, but any missed things there would be cosmetic that's-not-the-
expected-column-order bugs, not core dumps.  Also, at least in the v1
patch we could use this just for system catalogs without exposing it
as a feature for user tables, which would greatly restrict the set of
client-side places that really need fixed.

(I think Alvaro was the last person to mess with this issue, so I
wonder what his take is on the feasibility of such a restricted
solution.)

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Andrew Dunstan


On 8/18/21 10:53 AM, Tom Lane wrote:
> Julien Rouhaud  writes:
>> On Wed, Aug 18, 2021 at 10:21 PM Tom Lane  wrote:
>>> I wonder if we'd get complaints from changing the catalog column layouts
>>> that much.  People are used to the name at the front, I think.  OTOH,
>>> I expected a lot of bleating about the OID column becoming frontmost,
>>> but there hasn't been much.
>> I don't think that would be comparable.  Having an extra oid in the
>> 1st column doesn't really make a raw SELECT * harder to read.  But
>> having the XXXname column way behind, and not even at the end, means
>> that most people will have to type an extra "xxxname," for each
>> throwaway query run to quickly verify something.  I know that I often
>> do that, and while I could live with it I'd rather not have to do it.
> Yeah, it would annoy the heck out of me too.  Again there's a potential
> technical solution, which is to decouple the user-visible column order
> from the storage order.  However, multiple people have tilted at that
> windmill without much success, so making it a prerequisite for improving
> the name-length situation doesn't seem like a smart plan.
>
>   


There might be other benefits, though. IIRC what we discussed years ago
was having for each attribute an immutable number (presumably attnum as
now) and a mutable display order and storage order. Previous patches
didn't implement this and so were rejected. I think part of the trouble
is that we'd have to go through roughly 1700 mentions of attnum in the
source and decide if it's really attnum they want or if it's
attnum_display_order or attnum_storage_order. So this has the potential
to be extraordinarily invasive and potentially bug-prone. And then
there's the world of extensions to consider.

I have a bit less sympathy for the argument that just moving it will
break things that use 'select *' on the catalogs. In general, if you
care about the order of columns you should name the columns you want in
the order you want. I've seen 'select *' break for people on other
changes, like adding or dropping a column. It might cause Postgres
developers a bit of pain, but it should be manageable, so I kind of like
your suggestion.


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Tom Lane
Julien Rouhaud  writes:
> On Wed, Aug 18, 2021 at 10:21 PM Tom Lane  wrote:
>> I wonder if we'd get complaints from changing the catalog column layouts
>> that much.  People are used to the name at the front, I think.  OTOH,
>> I expected a lot of bleating about the OID column becoming frontmost,
>> but there hasn't been much.

> I don't think that would be comparable.  Having an extra oid in the
> 1st column doesn't really make a raw SELECT * harder to read.  But
> having the XXXname column way behind, and not even at the end, means
> that most people will have to type an extra "xxxname," for each
> throwaway query run to quickly verify something.  I know that I often
> do that, and while I could live with it I'd rather not have to do it.

Yeah, it would annoy the heck out of me too.  Again there's a potential
technical solution, which is to decouple the user-visible column order
from the storage order.  However, multiple people have tilted at that
windmill without much success, so making it a prerequisite for improving
the name-length situation doesn't seem like a smart plan.

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Peter Eisentraut

On 18.08.21 13:33, Julien Rouhaud wrote:

Agreed, but I don't have access to such hardware.  However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning,


On the flip side, with partitioning you need room for longer table 
names, since you need room for the real name plus some partition identifier.





Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Wed, Aug 18, 2021 at 10:21 PM Tom Lane  wrote:
>
> Anyway, this whole argument could be rendered moot if we could convert
> name to a variable-length type.  That would satisfy *both* sides of
> the argument, since those who need long names could have them, while
> those who don't would see net reduction instead of growth in space usage.

Yeah it seems like the best way forward.

> Of course, this is far far easier said than done; else we would have
> done it years ago.  But maybe it's not entirely out of reach.
> I do not think it'd be hard to change "name" to have the same on-disk
> storage representation as cstring; the hard part is what about its
> usage in fixed-width catalog structures.  Maybe we could finesse that
> by decreeing that the name column always has to be the last
> non-CATALOG_VARLEN field.  (This would require fixing up the couple of
> places where we let some other var-width field have that distinction;
> but I suspect that would be small in comparison to the other work this
> implies.  If there are any catalogs having two name columns, one of them
> would become more difficult to reach from C code.)

Here is the list on some recent build (as of 17707c059c):

 relname  |array_agg
--+--
 pg_collation | {collname,collctype,collcollate}
 pg_database  | {datctype,datcollate,datname}
 pg_event_trigger | {evtevent,evtname}
 pg_subscription  | {subname,subslotname}
 pg_trigger   | {tgname,tgnewtable,tgoldtable}
(5 rows)

> I wonder if we'd get complaints from changing the catalog column layouts
> that much.  People are used to the name at the front, I think.  OTOH,
> I expected a lot of bleating about the OID column becoming frontmost,
> but there hasn't been much.

I don't think that would be comparable.  Having an extra oid in the
1st column doesn't really make a raw SELECT * harder to read.  But
having the XXXname column way behind, and not even at the end, means
that most people will have to type an extra "xxxname," for each
throwaway query run to quickly verify something.  I know that I often
do that, and while I could live with it I'd rather not have to do it.




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Tom Lane
John Naylor  writes:
> The main thing I'm worried about is the fact that a name would no longer
> fit in a Datum. The rest I think we can mitigate in some way.

Not sure what you mean by that?  name is a pass-by-ref data type.

Anyway, this whole argument could be rendered moot if we could convert
name to a variable-length type.  That would satisfy *both* sides of
the argument, since those who need long names could have them, while
those who don't would see net reduction instead of growth in space usage.

Of course, this is far far easier said than done; else we would have
done it years ago.  But maybe it's not entirely out of reach.
I do not think it'd be hard to change "name" to have the same on-disk
storage representation as cstring; the hard part is what about its
usage in fixed-width catalog structures.  Maybe we could finesse that
by decreeing that the name column always has to be the last
non-CATALOG_VARLEN field.  (This would require fixing up the couple of
places where we let some other var-width field have that distinction;
but I suspect that would be small in comparison to the other work this
implies.  If there are any catalogs having two name columns, one of them
would become more difficult to reach from C code.)

Another fun thing --- and, I fear, another good argument against just
raising NAMEDATALEN --- is what about TupleDescs, which last I checked
used an array of fixed-width pg_attribute images.  But maybe we could
replace that with an array of pointers.  Andres already did a lot of
the heavy code churn required to hide that data structure behind
TupleDescAttr() macros, so changing the representation should be much
less painful than it would once have been.

I wonder if we'd get complaints from changing the catalog column layouts
that much.  People are used to the name at the front, I think.  OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.

Anyway, I have little desire to work on this myself, but I recommend that
somebody who is more affected by the name length restriction look into it.

regards, tom lane




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Ranier Vilela
Em qua., 18 de ago. de 2021 às 09:33, Laurenz Albe 
escreveu:

> On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
> > Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <
> deromane...@gmail.com> escreveu:
> > > Hello dear hackers. I understand the position of the developers
> community about
> > >  NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
> only if we
> > >  speak about latin languages.
> > >
> > > Postgresql has wonderful support for unicode in table and column
> names. And it
> > >  looks like very good idea to create table with names on native
> language for
> > >  databases across the world. But when I want to create, for example,
> table with
> > >  name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in
> Russian for
> > >  catalog of counteragent contacts) it will be auto-shrinked to
> > >  "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional
> problem -
> > >  many words in Russian are just longer than it's English counterparts
> and I
> > >  have many examples like this.
> > >
> > > Although recompiling the source is not so hard, updating is hard. I
> know that
> > >  is not free for disk space because of storing table names and field
> names but,
> > >  from my point of view, in 2021 year convenience is more important
> than disk space.
> > >
> > > I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in
> future releases.
>
> My stance here is that you should always use ASCII only for database
> identifiers,
> not only because of this, but also to avoid unpleasant encoding problems if
> you want to do something like
>
>  pg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb
>
> on a shell with an encoding different from the database encoding.
>
> So I am not too excited about this.
>
> > +1 once that Oracle Database 12.2 and higher, has support for 128 bytes
> names.
> > What possibly, in the future, could impact some migration from Oracle to
> Postgres.
>
> That seems to be a better argument from my point of view.
>
> I have no idea as to how bad the additional memory impact for the catalog
> caches would be...
>
It seems to me that this is a case for macro:
HAS_SUPPORT_NAME_128_BYTES
Деnis Романенко would like and would pay the price for regression in
exchange for the convenience.
What impacts him now is the difficulty of maintaining a private tree, with
this support.

regards,
Ranier Vilela


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Laurenz Albe
On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
> Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко  
> escreveu:
> > Hello dear hackers. I understand the position of the developers community 
> > about
> >  NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only 
> > if we
> >  speak about latin languages. 
> > 
> > Postgresql has wonderful support for unicode in table and column names. And 
> > it
> >  looks like very good idea to create table with names on native language for
> >  databases across the world. But when I want to create, for example, table 
> > with
> >  name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for
> >  catalog of counteragent contacts) it will be auto-shrinked to
> >  "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional 
> > problem -
> >  many words in Russian are just longer than it's English counterparts and I
> >  have many examples like this.
> > 
> > Although recompiling the source is not so hard, updating is hard. I know 
> > that
> >  is not free for disk space because of storing table names and field names 
> > but,
> >  from my point of view, in 2021 year convenience is more important than 
> > disk space. 
> > 
> > I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future 
> > releases. 

My stance here is that you should always use ASCII only for database 
identifiers,
not only because of this, but also to avoid unpleasant encoding problems if
you want to do something like

 pg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb

on a shell with an encoding different from the database encoding.

So I am not too excited about this.

> +1 once that Oracle Database 12.2 and higher, has support for 128 bytes names.
> What possibly, in the future, could impact some migration from Oracle to 
> Postgres.

That seems to be a better argument from my point of view.

I have no idea as to how bad the additional memory impact for the catalog
caches would be...

Yours,
Laurenz Albe





Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Wed, Aug 18, 2021 at 8:03 PM Hannu Krosing  wrote:
>
> Also - have we checked that at least the truncation does not cut utf-8
> characters in half ?

Yes, same for all other places that can truncate text (like the query
text in pg_stat_activity and similar).  See usage of pg_mbcliplen() in
truncate_identifier.




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Wed, Aug 18, 2021 at 8:04 PM John Naylor
 wrote:
>
> >
> > Agreed, but I don't have access to such hardware.  However this won't
>
> Well, by "recent" I had in mind something more recent than 2002, which is the 
> time where I see a lot of hits in the archives if you search for this topic.

Yeah, but my current laptop has a tendency to crash after a few minute
if I stress it too much, so I'm still out.

> If you mean the thread "Protect syscache from bloating with negative cache 
> entries", it had activity earlier this year, so I wouldn't give up hope just 
> yet.

Yes that's the thread I was thinking about.  I'm not giving up hope
either, but I also don't see it being solved for v15.

> The main thing I'm worried about is the fact that a name would no longer fit 
> in a Datum. The rest I think we can mitigate in some way.

Agreed.




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread John Naylor
On Wed, Aug 18, 2021 at 8:03 AM Hannu Krosing  wrote:
>
> Could we just make the limitation to be 64 (or 128) _characters_ not
_bytes_ ?

That couldn't work because characters are variable length. The limit has to
be a fixed length in bytes so we can quickly compute offsets in the
attribute tuple.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread John Naylor
On Wed, Aug 18, 2021 at 7:33 AM Julien Rouhaud  wrote:
> > Some actual numbers on recent hardware would show what kind of tradeoff
is involved. No one has done that for a long time that I recall.
>
> Agreed, but I don't have access to such hardware.  However this won't

Well, by "recent" I had in mind something more recent than 2002, which is
the time where I see a lot of hits in the archives if you search for this
topic.

> influence the memory overhead part, and there is already frequent
> problems with that, especially since declarative partitioning, so I

That's a fair point.

> don't see how we could afford that without some kind of cache TTL or
> similar.  AFAIR the last discussion about it a few years ago didn't
> lead anywhere :(

If you mean the thread "Protect syscache from bloating with negative cache
entries", it had activity earlier this year, so I wouldn't give up hope
just yet. Progress has been slow, so I'll see about putting some effort
into that after concluding my attempt to speed up the syscaches first [1].

The main thing I'm worried about is the fact that a name would no longer
fit in a Datum. The rest I think we can mitigate in some way.

[1]
https://www.postgresql.org/message-id/cafbsxse35vlj3hhkjjarb3qwqj0zwedw-jzqrfzkzmpud_c...@mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Hannu Krosing
Could we just make the limitation to be 64 (or 128) _characters_ not _bytes_ ?

Memory sizes and processor speeds have grown by order(s) of magnitude
since the 64 byte limit was decided and supporting non-ASCII charsets
properly seems like a prudent thing to do.

Also - have we checked that at least the truncation does not cut utf-8
characters in half ?

-
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.

On Wed, Aug 18, 2021 at 1:33 PM Julien Rouhaud  wrote:
>
> On Wed, Aug 18, 2021 at 7:27 PM John Naylor
>  wrote:
> >
> > On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud  wrote:
> > >
> > > Unfortunately, the problem isn't really the additional disk space it
> > > would require.  The problem is the additional performance hit and
> > > memory overhead, as the catalog names are part of the internal
> > > syscache.
> >
> > Some actual numbers on recent hardware would show what kind of tradeoff is 
> > involved. No one has done that for a long time that I recall.
>
> Agreed, but I don't have access to such hardware.  However this won't
> influence the memory overhead part, and there is already frequent
> problems with that, especially since declarative partitioning, so I
> don't see how we could afford that without some kind of cache TTL or
> similar.  AFAIR the last discussion about it a few years ago didn't
> lead anywhere :(
>
>




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Денис Романенко
I don't very close with PG testing methodology, but I can pay for a server
(virtual or dedicated, DO maybe) and give access to it, if anyone has time
for that.

Or if someone describes to me steps and shows where to look - I can do it
by myself.


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Wed, Aug 18, 2021 at 7:27 PM John Naylor
 wrote:
>
> On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud  wrote:
> >
> > Unfortunately, the problem isn't really the additional disk space it
> > would require.  The problem is the additional performance hit and
> > memory overhead, as the catalog names are part of the internal
> > syscache.
>
> Some actual numbers on recent hardware would show what kind of tradeoff is 
> involved. No one has done that for a long time that I recall.

Agreed, but I don't have access to such hardware.  However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning, so I
don't see how we could afford that without some kind of cache TTL or
similar.  AFAIR the last discussion about it a few years ago didn't
lead anywhere :(




Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread John Naylor
On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud  wrote:
>
> Unfortunately, the problem isn't really the additional disk space it
> would require.  The problem is the additional performance hit and
> memory overhead, as the catalog names are part of the internal
> syscache.

Some actual numbers on recent hardware would show what kind of tradeoff is
involved. No one has done that for a long time that I recall.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Ranier Vilela
Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко 
escreveu:

> Hello dear hackers. I understand the position of the developers community
> about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
> only if we speak about latin languages.
>
> Postgresql has wonderful support for unicode in table and column names.
> And it looks like very good idea to create table with names on native
> language for databases across the world. But when I want to create, for
> example, table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that
> stands in Russian for catalog of counteragent contacts) it will be
> auto-shrinked to "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a
> fictional problem - many words in Russian are just longer than it's English
> counterparts and I have many examples like this.
>
> Although recompiling the source is not so hard, updating is hard. I know
> that is not free for disk space because of storing table names and field
> names but, from my point of view, in 2021 year convenience is more
> important than disk space.
>
> I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future
> releases.
>
+1 once that Oracle Database 12.2 and higher, has support for 128 bytes
names.
What possibly, in the future, could impact some migration from Oracle to
Postgres.

regards,
Ranier Vilela


Re: NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Julien Rouhaud
On Wed, Aug 18, 2021 at 7:08 PM Денис Романенко  wrote:
>
> Hello dear hackers. I understand the position of the developers community 
> about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but 
> only if we speak about latin languages.
>
> Postgresql has wonderful support for unicode in table and column names. And 
> it looks like very good idea to create table with names on native language 
> for databases across the world. But when I want to create, for example, table 
> with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian 
> for catalog of counteragent contacts) it will be auto-shrinked to 
> "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem - 
> many words in Russian are just longer than it's English counterparts and I 
> have many examples like this.
>
> Although recompiling the source is not so hard, updating is hard. I know that 
> is not free for disk space because of storing table names and field names 
> but, from my point of view, in 2021 year convenience is more important than 
> disk space.

Unfortunately, the problem isn't really the additional disk space it
would require.  The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.

I understand your frustration, but given those problems I don't think
that postgres will increase the default NAMEDATALEN value any time
soon, even though it's in contradiction with the SQL standard.




NAMEDATALEN increase because of non-latin languages

2021-08-18 Thread Денис Романенко
Hello dear hackers. I understand the position of the developers community
about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
only if we speak about latin languages.

Postgresql has wonderful support for unicode in table and column names. And
it looks like very good idea to create table with names on native language
for databases across the world. But when I want to create, for example,
table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in
Russian for catalog of counteragent contacts) it will be auto-shrinked to
"Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem
- many words in Russian are just longer than it's English counterparts and
I have many examples like this.

Although recompiling the source is not so hard, updating is hard. I know
that is not free for disk space because of storing table names and field
names but, from my point of view, in 2021 year convenience is more
important than disk space.

I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future
releases.

Sorry for wasted time for this message if this topic is not match with
direction of postgresql development (and thank you for your hard work)