Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-19 Thread Aleksander Alekseev
Hi,

> > The overall trend in machine learning embedding sizes has been growing 
> > rapidly over the last few years from 128 up to 4K dimensions yielding 
> > additional value and quality improvements. It's not clear when this trend 
> > in growth will ease. The vectors that the leading text embedding models
> > now generate exceed the tuple size that IndexTupleData.t_info can record.
> >
> > The current index tuple size is stored in 13 bits of IndexTupleData.t_info,
> > which limits an index tuple to less than 2^13 = 8192 bytes. Vectors
> > implemented by pgvector currently use a 32 bit float for elements, which
> > limits vector size to 2K dimensions, which is no longer state of the art.
> >
> > I've attached a patch that increases IndexTupleData.t_info from 16 bits to
> > 32 bits, allowing for significantly larger index tuple sizes. I would guess
> > this patch is not a complete implementation that allows for migration from 
> > previous versions, but it does compile and initdb succeeds. I'd be happy to 
> > continue work if the core team is receptive to an update in this area, and 
> > I'd appreciate any feedback the community has on the approach.

If I read this correctly, basically the patch adds 16 useless bits for
all applications except for ML ones...

Perhaps implementing an alternative storage specifically for ML using
the TAM interface would be a better approach?

-- 
Best regards,
Aleksander Alekseev




Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Matthias van de Meent
On Thu, 18 Jan 2024 at 13:41, Montana Low wrote:
>
> The overall trend in machine learning embedding sizes has been growing 
> rapidly over the last few years from 128 up to 4K dimensions yielding 
> additional value and quality improvements. It's not clear when this trend in 
> growth will ease. The vectors that the leading text embedding models now
> generate exceed the tuple size that IndexTupleData.t_info can record.
>
> The current index tuple size is stored in 13 bits of IndexTupleData.t_info,
> which limits an index tuple to less than 2^13 = 8192 bytes. Vectors
> implemented by pgvector currently use a 32 bit float for elements, which
> limits vector size to 2K dimensions, which is no longer state of the art.
>
> I've attached a patch that increases IndexTupleData.t_info from 16 bits to
> 32 bits, allowing for significantly larger index tuple sizes. I would guess
> this patch is not a complete implementation that allows for migration from 
> previous versions, but it does compile and initdb succeeds. I'd be happy to 
> continue work if the core team is receptive to an update in this area, and 
> I'd appreciate any feedback the community has on the approach.

I'm not sure why this is needed.
Vector data indexing generally requires bespoke index methods, which
are not currently available in the core PostgreSQL repository, and
indexes are not at all required to utilize the IndexTupleData format
for their data tuples (one example of this being BRIN).
The only hard requirement in AMs which use Postgres' relfile format is
that they follow the Page layout and optionally the pd_linp/ItemId
array, which limit the size of Page tuples to 2^15-1 bytes (see
ItemIdData.lp_len) and ~2^16 bytes
(PageHeaderData.pd_pagesize_version).
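To make the line pointer limit concrete, here is a minimal sketch that is not part of the original message. The bit widths follow ItemIdData in storage/itemid.h and the 13 size bits of IndexTupleData.t_info; every name prefixed with Sketch or sketch_ is hypothetical, and the struct layout assumes a typical compiler that packs the three bit-fields into one 32-bit word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit widths as ItemIdData; the Sketch name is hypothetical. */
typedef struct
{
    unsigned    lp_off:15,      /* offset of the tuple from the page start */
                lp_flags:2,     /* state of the line pointer */
                lp_len:15;      /* byte length of the tuple */
} SketchItemId;

/* Assumption: the three bit-fields share a single 32-bit word. */
static_assert(sizeof(SketchItemId) == 4, "line pointer is expected to be 4 bytes");

#define SKETCH_MAX_LP_LEN     ((1u << 15) - 1)  /* 32767, from the 15-bit lp_len */
#define SKETCH_MAX_ITUP_SIZE  0x1FFFu           /* 8191, from 13 bits of t_info */

/* Can a tuple of this length be described by a line pointer? */
static bool
sketch_fits_line_pointer(uint32_t len)
{
    return len <= SKETCH_MAX_LP_LEN;
}

/* Can a tuple of this length be described by IndexTupleData.t_info? */
static bool
sketch_fits_t_info(uint32_t len)
{
    return len <= SKETCH_MAX_ITUP_SIZE;
}
```

Under these assumptions, a 16384-byte tuple passes sketch_fits_line_pointer() but fails sketch_fits_t_info(), so an AM with its own tuple format is bounded by the page layout rather than by t_info (and, in practice, by the configured block size, which defaults to 8 kB).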

Next, the only non-internal use of IndexTuple is in IndexOnlyScans.
However, here the index may fill the scandesc->xs_hitup with a heap
tuple instead, which has a length stored in uint32, too. So, I don't
quite see why this would be required for all indexes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Tom Lane
I wrote:
> On a micro level, this makes sizeof(IndexTupleData) be not maxaligned,
> which is likely to cause problems on alignment-picky hardware, or else
> result in space wastage if we were careful to MAXALIGN() everywhere.
> (Which we should have been, but I don't care to bet on it.)  A lot of
> people would be sad if their indexes got noticeably bigger when they
> weren't getting anything out of that.

After thinking about that a bit more, there might be a way out that
both avoids bloating index tuples that don't need it, and avoids
the compatibility problem.  How about defining that if the
INDEX_SIZE_MASK bits aren't zero, they are the tuple size as now;
but if they are zero, then the size appears in a separate uint16
field following the existing IndexTupleData fields.  We could perhaps
also rethink how the nulls bitmap storage works in this "v2"
index tuple header layout?  In any case, I'd expect to end up in
a place where (on 64-bit hardware) you pay an extra MAXALIGN quantum
for either an oversize tuple or a nulls bitmap, but only one quantum
when you have both, and nothing above today when the tuple is not
oversize.
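
A minimal sketch of how the size encoding described above might look, added for illustration and not taken from any posted patch. The mask value mirrors INDEX_SIZE_MASK in access/itup.h; the SketchSizeFields struct, the ext_size field, and both function names are hypothetical. Zero can serve as the "look elsewhere" marker because a real index tuple is never zero bytes long (the header alone takes 8 bytes).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INDEX_SIZE_MASK 0x1FFF   /* mirrors INDEX_SIZE_MASK: low 13 bits */

/* Hypothetical view of the two places a tuple size could live. */
typedef struct
{
    uint16_t    t_info;     /* flag bits plus 13 size bits, as today */
    uint16_t    ext_size;   /* hypothetical trailing field; only meaningful
                             * when the size bits of t_info are zero */
} SketchSizeFields;

/* Store a size in t_info when it fits, otherwise in the trailing field. */
static SketchSizeFields
sketch_encode_size(uint16_t flags, uint16_t size)
{
    SketchSizeFields f;

    assert(size > 0);                               /* zero is the marker */
    assert((flags & SKETCH_INDEX_SIZE_MASK) == 0);  /* flags use high bits only */

    if (size <= SKETCH_INDEX_SIZE_MASK)
    {
        f.t_info = (uint16_t) (flags | size);       /* existing layout, unchanged */
        f.ext_size = 0;
    }
    else
    {
        f.t_info = flags;                           /* size bits zero: oversize */
        f.ext_size = size;
    }
    return f;
}

/* Nonzero size bits mean "size as now"; zero means "read the trailing field". */
static uint16_t
sketch_decode_size(SketchSizeFields f)
{
    uint16_t    small = f.t_info & SKETCH_INDEX_SIZE_MASK;

    return (small != 0) ? small : f.ext_size;
}
```

With this scheme, tuples of up to 8191 bytes keep exactly the bits they have today, which is what preserves on-disk compatibility for existing indexes; only oversize tuples would pay for the extra field.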

This'd complicate tuple construction and inspection a bit, but
it would avoid building an enormous lot of infrastructure to deal
with transitioning to a not-upward-compatible definition.

regards, tom lane




Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Tom Lane
Montana Low writes:
> I've attached a patch that increases IndexTupleData.t_info from 16 bits to
> 32 bits, allowing for significantly larger index tuple sizes.

I fear this idea is a non-starter because it'd break on-disk
compatibility.  Certainly, if we were to try to pursue it, there'd
need to be an enormous amount of effort spent on dealing with existing
indexes and transition mechanisms.  I don't think you've made the case
why that would be time well spent.

On a micro level, this makes sizeof(IndexTupleData) be not maxaligned,
which is likely to cause problems on alignment-picky hardware, or else
result in space wastage if we were careful to MAXALIGN() everywhere.
(Which we should have been, but I don't care to bet on it.)  A lot of
people would be sad if their indexes got noticeably bigger when they
weren't getting anything out of that.
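
The alignment concern can be illustrated with a small sketch, added here and not part of the original message. The structs imitate the field layout of ItemPointerData and IndexTupleData with hypothetical Sketch names, SKETCH_MAXALIGN assumes a maximum alignment of 8 bytes (as on common 64-bit platforms), and the 12-byte result assumes an ABI where uint32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round up to a multiple of 8, assuming MAXIMUM_ALIGNOF is 8. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Imitates ItemPointerData: three 16-bit fields, 6 bytes in total. */
typedef struct
{
    uint16_t    bi_hi;
    uint16_t    bi_lo;
    uint16_t    ip_posid;
} SketchItemPointer;

/* Header with a 16-bit t_info, as it is today. */
typedef struct
{
    SketchItemPointer t_tid;
    uint16_t    t_info;
} SketchIndexTuple16;

/* Header with t_info widened to 32 bits, as the patch proposes. */
typedef struct
{
    SketchItemPointer t_tid;
    uint32_t    t_info;     /* needs 4-byte alignment: 2 padding bytes precede it */
} SketchIndexTuple32;

static_assert(sizeof(SketchIndexTuple16) == 8, "current header is one MAXALIGN quantum");

/* Bytes a header occupies once the data after it is maxaligned. */
static size_t
sketch_aligned_header_size(size_t raw_header_size)
{
    return SKETCH_MAXALIGN(raw_header_size);
}
```

Under those assumptions the widened header is 12 bytes, which is not a multiple of 8, so maxaligning the data that follows rounds it up to 16 bytes: every index tuple would carry 8 more bytes of header than it does today.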

regards, tom lane




Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Montana Low
The overall trend in machine learning embedding sizes has been growing
rapidly over the last few years from 128 up to 4K dimensions yielding
additional value and quality improvements. It's not clear when this trend
in growth will ease. The vectors that the leading text embedding models
now generate exceed the tuple size that IndexTupleData.t_info can record.

The current index tuple size is stored in 13 bits of IndexTupleData.t_info,
which limits an index tuple to less than 2^13 = 8192 bytes. Vectors
implemented by pgvector currently use a 32 bit float for elements, which
limits vector size to 2K dimensions, which is no longer state of the art.
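
As a rough sketch of the arithmetic above (an illustration, not code from the attached patch): the mask mirrors INDEX_SIZE_MASK in access/itup.h, the function names are hypothetical, and the dimension count is an upper bound that ignores the tuple header, the vector's own header, and per-page overhead, so the practical limit is somewhat lower.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors INDEX_SIZE_MASK: the low 13 bits of t_info hold the tuple size. */
#define SKETCH_INDEX_SIZE_MASK 0x1FFF

static_assert(SKETCH_INDEX_SIZE_MASK == (1 << 13) - 1, "mask covers exactly 13 bits");

/* Largest size that can be represented in the given number of bits. */
static size_t
sketch_max_size_for_bits(unsigned size_bits)
{
    return ((size_t) 1 << size_bits) - 1;
}

/*
 * Upper bound on the number of 32 bit float elements that fit in a tuple
 * of the given size, ignoring all header and page overhead.
 */
static size_t
sketch_max_float4_dims(size_t max_tuple_bytes)
{
    return max_tuple_bytes / sizeof(float);
}
```

With 13 size bits the largest representable tuple is 8191 bytes, which leaves room for at most 2047 four-byte elements, hence the roughly 2K dimension ceiling.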

I've attached a patch that increases IndexTupleData.t_info from 16 bits to
32 bits, allowing for significantly larger index tuple sizes. I would guess
this patch is not a complete implementation that allows for migration from
previous versions, but it does compile and initdb succeeds. I'd be happy to
continue work if the core team is receptive to an update in this area, and
I'd appreciate any feedback the community has on the approach.

I imagine it might be worth hiding this change behind a compile time
configuration parameter similar to blocksize. I'm sure there are
implications I'm unaware of with this change, but I wanted to start the
discussion around a bit of code to see how much would actually need to
change.

Also, I believe this is my first mailing list post in a decade or 2, so let
me know if I've missed something important. BTW, thanks for all your work
over the decades!


32bit_index_info.patch
Description: Binary data