Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-11 Thread Tomas Vondra
On 09/11/2017 01:54 PM, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
>  wrote:
>> Moreover, RUM index
>> stores positions + lexemes, so it doesn't need tsvectors for ranked
>> search. As a result, tsvector becomes a storage for
>> building indexes (indexable type), not something that should be used at
>> runtime. And the change of the format doesn't affect index creation
>> time.
> 
> RUM indexes, though, are not in core.
> 

Yeah, but I think Ildus has a point that this should not really matter
on indexed tsvectors. So the question is how realistic that benchmark
actually is. How likely are we to do queries on fts directly, not
through a GIN/GiST index? Particularly in performance-sensitive cases?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-11 Thread Robert Haas
On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
 wrote:
> Moreover, RUM index
> stores positions + lexemes, so it doesn't need tsvectors for ranked
> search. As a result, tsvector becomes a storage for
> building indexes (indexable type), not something that should be used at
> runtime. And the change of the format doesn't affect index creation
> time.

RUM indexes, though, are not in core.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-11 Thread Ildus Kurbangaliev
On Thu, 7 Sep 2017 23:08:14 +0200
Tomas Vondra  wrote:

> Hi,
> 
> On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> > In my benchmarks when database fits into buffers (so it's
> > measurement of the time required for the tsvectors conversion) it
> > gives me these results:
> > 
> > Without conversion:
> > 
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:04:44 Number of connections:  4
> > 2017/08/17 12:04:44 Database:  test1
> > 2017/08/17 12:09:44 Processed: 51419
> > 
> > With conversion:
> > 
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:14:31 Number of connections:  4
> > 2017/08/17 12:14:31 Database:  test1
> > 2017/08/17 12:19:31 Processed: 43607
> > 
> > I ran a bunch of these tests, and these results are stable on my
> > machine. So in these specific tests performance regression about
> > 15%.
> > 
> > Same time I think this could be the worst case, because usually data
> > is on disk and conversion will not affect so much to performance.
> >   
> 
> That seems like a fairly significant regression, TBH. I don't quite
> agree we can simply assume in-memory workloads don't matter, plenty of
> databases have 99% cache hit ratio (particularly when considering not
> just shared buffers, but also page cache).

I think part of this regression is caused by better compression of new
format. I can't say exact percent here, need to check with perf.

If you care about performace, you create indexes, which means that
tsvector will no longer be used for text search (except for ORDER BY
rank). Index machinery will only peek into tsquery. Moreover, RUM index
stores positions + lexemes, so it doesn't need tsvectors for ranked
search. As a result, tsvector becomes a storage for
building indexes (indexable type), not something that should be used at
runtime. And the change of the format doesn't affect index creation
time.

> 
> Can you share the benchmarks, so that others can retry running them?

Benchmarks are published at github:
https://github.com/ildus/tsbench . I'm not sure that they are easy to
use.

Best regards,
Ildus Kurbangaliev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-07 Thread Tomas Vondra
Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> In my benchmarks when database fits into buffers (so it's measurement of
> the time required for the tsvectors conversion) it gives me these
> results:
> 
> Without conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:04:44 Number of connections:  4
> 2017/08/17 12:04:44 Database:  test1
> 2017/08/17 12:09:44 Processed: 51419
> 
> With conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:14:31 Number of connections:  4
> 2017/08/17 12:14:31 Database:  test1
> 2017/08/17 12:19:31 Processed: 43607
> 
> I ran a bunch of these tests, and these results are stable on my
> machine. So in these specific tests performance regression about 15%.
> 
> Same time I think this could be the worst case, because usually data
> is on disk and conversion will not affect so much to performance.
> 

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter, plenty of
databases have 99% cache hit ratio (particularly when considering not
just shared buffers, but also page cache).

Can you share the benchmarks, so that others can retry running them?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-17 Thread Ildus Kurbangaliev
On Thu, 10 Aug 2017 18:06:17 +0300
Alexander Korotkov  wrote:

> On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas 
> wrote:
> 
> > On Tue, Aug 1, 2017 at 4:00 PM, Ildus K
> >  wrote:  
> > > It's a workaround. DatumGetTSVector and
> > > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > > has old format.  
> >
> > Hmm, that seems like a real fix, not just a workaround.  If you can
> > transparently read the old format, there's no problem.  Not sure
> > about performance, though.
> >  
> 
> +1
> Ildus, I think we need to benchmark reading of the old format.  There
> would be tradeoff between performance of old format reading and
> amount of extra code needed.  Once we will have benchmarks we can
> consider whether this is the solution we would like to buy.

In my benchmarks when database fits into buffers (so it's measurement of
the time required for the tsvectors conversion) it gives me these
results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections:  4
2017/08/17 12:04:44 Database:  test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections:  4
2017/08/17 12:14:31 Database:  test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and these results are stable on my
machine. So in these specific tests performance regression about 15%.

Same time I think this could be the worst case, because usually data
is on disk and conversion will not affect so much to performance.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-16 Thread Ildus Kurbangaliev
On Thu, 10 Aug 2017 11:46:55 -0400
Tom Lane  wrote:

> Alexander Korotkov  writes:
> > ...
> > You have random mix of tabs and spaces here.  
> 
> It's worth running pgindent over your code before submitting.  It
> should be pretty easy to set that up nowadays, see
> src/tools/pgindent/README. (If you find any portability problems
> while trying to install pgindent, please let me know.)

Attached a new version of the patch. It mostly contains cosmetic
changes. I rebased it to current master, ran pgindent and fixed
formatting errors.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile
index 34fe4c5b3c..9585a25003 100644
--- a/src/backend/tsearch/Makefile
+++ b/src/backend/tsearch/Makefile
@@ -26,7 +26,7 @@ DICTFILES_PATH=$(addprefix dicts/,$(DICTFILES))
 OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
 	dict_simple.o dict_synonym.o dict_thesaurus.o \
 	dict_ispell.o regis.o spell.o \
-	to_tsany.o ts_selfuncs.o ts_typanalyze.o ts_utils.o
+	to_tsany.o ts_selfuncs.o ts_typanalyze.o ts_utils.o ts_compat.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 35d9ab276c..aa87fd8a04 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -156,13 +156,10 @@ TSVector
 make_tsvector(ParsedText *prs)
 {
 	int			i,
-j,
 lenstr = 0,
-totallen;
+totallen,
+stroff = 0;
 	TSVector	in;
-	WordEntry  *ptr;
-	char	   *str;
-	int			stroff;
 
 	/* Merge duplicate words */
 	if (prs->curwords > 0)
@@ -171,12 +168,9 @@ make_tsvector(ParsedText *prs)
 	/* Determine space needed */
 	for (i = 0; i < prs->curwords; i++)
 	{
-		lenstr += prs->words[i].len;
-		if (prs->words[i].alen)
-		{
-			lenstr = SHORTALIGN(lenstr);
-			lenstr += sizeof(uint16) + prs->words[i].pos.apos[0] * sizeof(WordEntryPos);
-		}
+		int			npos = prs->words[i].alen ? prs->words[i].pos.apos[0] : 0;
+
+		INCRSIZE(lenstr, i, prs->words[i].len, npos);
 	}
 
 	if (lenstr > MAXSTRPOS)
@@ -187,41 +181,21 @@ make_tsvector(ParsedText *prs)
 	totallen = CALCDATASIZE(prs->curwords, lenstr);
 	in = (TSVector) palloc0(totallen);
 	SET_VARSIZE(in, totallen);
-	in->size = prs->curwords;
+	TS_SETCOUNT(in, prs->curwords);
 
-	ptr = ARRPTR(in);
-	str = STRPTR(in);
-	stroff = 0;
 	for (i = 0; i < prs->curwords; i++)
 	{
-		ptr->len = prs->words[i].len;
-		ptr->pos = stroff;
-		memcpy(str + stroff, prs->words[i].word, prs->words[i].len);
-		stroff += prs->words[i].len;
-		pfree(prs->words[i].word);
+		int			npos = 0;
+
 		if (prs->words[i].alen)
-		{
-			int			k = prs->words[i].pos.apos[0];
-			WordEntryPos *wptr;
+			npos = prs->words[i].pos.apos[0];
 
-			if (k > 0x)
-elog(ERROR, "positions array too long");
+		tsvector_addlexeme(in, i, , prs->words[i].word, prs->words[i].len,
+		   prs->words[i].pos.apos + 1, npos);
 
-			ptr->haspos = 1;
-			stroff = SHORTALIGN(stroff);
-			*(uint16 *) (str + stroff) = (uint16) k;
-			wptr = POSDATAPTR(in, ptr);
-			for (j = 0; j < k; j++)
-			{
-WEP_SETWEIGHT(wptr[j], 0);
-WEP_SETPOS(wptr[j], prs->words[i].pos.apos[j + 1]);
-			}
-			stroff += sizeof(uint16) + k * sizeof(WordEntryPos);
+		pfree(prs->words[i].word);
+		if (prs->words[i].alen)
 			pfree(prs->words[i].pos.apos);
-		}
-		else
-			ptr->haspos = 0;
-		ptr++;
 	}
 
 	if (prs->words)
@@ -251,7 +225,6 @@ to_tsvector_byid(PG_FUNCTION_ARGS)
 	PG_FREE_IF_COPY(in, 1);
 
 	out = make_tsvector();
-
 	PG_RETURN_TSVECTOR(out);
 }
 
diff --git a/src/backend/tsearch/ts_compat.c b/src/backend/tsearch/ts_compat.c
new file mode 100644
index 00..bc45109241
--- /dev/null
+++ b/src/backend/tsearch/ts_compat.c
@@ -0,0 +1,84 @@
+#include "postgres.h"
+#include "tsearch/ts_type.h"
+
+/*
+ * Definition of old WordEntry struct in TSVector. Because of limitations
+ * in size (max 1MB for lexemes), the format has changed
+ */
+typedef struct
+{
+	uint32
+haspos:1,
+len:11,
+pos:20;
+}			OldWordEntry;
+
+typedef struct
+{
+	uint16		npos;
+	WordEntryPos pos[FLEXIBLE_ARRAY_MEMBER];
+}			OldWordEntryPosVector;
+
+#define OLDSTRPTR(x)	( (char *) &(x)->entries[x->size_] )
+#define _OLDPOSVECPTR(x, e)	\
+	((OldWordEntryPosVector *)(STRPTR(x) + SHORTALIGN((e)->pos + (e)->len)))
+#define OLDPOSDATALEN(x,e) ( ( (e)->haspos ) ? (_OLDPOSVECPTR(x,e)->npos) : 0 )
+#define OLDPOSDATAPTR(x,e) (_OLDPOSVECPTR(x,e)->pos)
+
+/*
+ * Converts tsvector with the old structure to current.
+ * Can return copy of tsvector, but it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+	int			i,
+dataoff = 0,
+datalen = 0,
+totallen;
+	TSVector	in,
+out;
+
+	in = (TSVector) PG_DETOAST_DATUM(orig);
+
+	/* If already in new format, return as is */
+	if (in->size_ & TS_FLAG_STRETCHED)
+	{
+		TSVector	out;
+
+	

Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Tom Lane
Alexander Korotkov  writes:
> ...
> You have random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting.  It should
be pretty easy to set that up nowadays, see src/tools/pgindent/README.
(If you find any portability problems while trying to install pgindent,
please let me know.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Alexander Korotkov
On Thu, Aug 10, 2017 at 7:37 AM, Michael Paquier 
wrote:

> On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas  wrote:
> > The patch doesn't really conform to our coding standards, though, so
> > you need to clean it up (or, if you're not sure what you need to do,
> > you need to have someone who knows how PostgreSQL code needs to look
> > review it for you).
>
> The documentation has a couple of rules for coding conventions:
> https://www.postgresql.org/docs/9.6/static/source.html


+1

Ildus, from the first glance I see at least following violations of
PostgreSQL coding standards in your code.

+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
This comment will be reflowed by pgindent.  Also we don't use '@' for
parameters description in comments.
https://www.postgresql.org/docs/9.6/static/source-format.html

+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+ int   i,
+   dataoff = 0,
+   datalen = 0,
+   totallen;
+ TSVector   in,
+   out;

You have random mix of tabs and spaces here.

+ {
+ stroff = SHORTALIGN(stroff); \
+ entry->hasoff = 0;
+ entry->len = lexeme_len;
+ entry->npos = npos;
+ }

What this backslash is doing here?
There are other similar (and probably different) violations of coding
standard over the code.  Ildus, please check you patches carefully before
publishing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Alexander Korotkov
On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas  wrote:

> On Tue, Aug 1, 2017 at 4:00 PM, Ildus K 
> wrote:
> > It's a workaround. DatumGetTSVector and
> > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > has old format.
>
> Hmm, that seems like a real fix, not just a workaround.  If you can
> transparently read the old format, there's no problem.  Not sure about
> performance, though.
>

+1
Ildus, I think we need to benchmark reading of the old format.  There would
be tradeoff between performance of old format reading and amount of extra
code needed.  Once we will have benchmarks we can consider whether this is
the solution we would like to buy.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-09 Thread Michael Paquier
On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas  wrote:
> The patch doesn't really conform to our coding standards, though, so
> you need to clean it up (or, if you're not sure what you need to do,
> you need to have someone who knows how PostgreSQL code needs to look
> review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-09 Thread Robert Haas
On Tue, Aug 1, 2017 at 4:00 PM, Ildus K  wrote:
> It's a workaround. DatumGetTSVector and
> DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> has old format.

Hmm, that seems like a real fix, not just a workaround.  If you can
transparently read the old format, there's no problem.  Not sure about
performance, though.

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-09 Thread Ildus Kurbangaliev
On Wed, 9 Aug 2017 09:01:44 +0200
Torsten Zuehlsdorff  wrote:

> On 01.08.2017 22:00, Ildus K wrote:
> > On Tue, 1 Aug 2017 15:33:08 -0400
> > Robert Haas  wrote:
> >   
> >> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
> >>  wrote:  
>  So this would break pg_upgrade for tsvector columns?  
> >>>
> >>> I added a function that will convert old tsvectors on the fly.
> >>> It's the approach used in hstore before.  
> >>
> >> Does that mean the answer to the question that I asked is "yes,
> >> but I have a workaround" or does it mean that the answer is "no"?
> >>  
> > 
> > It's a workaround. DatumGetTSVector and
> > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > has old format.  
> 
> I'm not familiar with pg_upgrade, but want to ask: should this 
> workaround be part of pg_upgrade?
> 
> Greetings,
> Torsten

I chose the way when the data remains the same, until the user decides
to update it. I'm not so familiar with pg_upgrade myself and I don't
see now how the data will be converted with it, but it will anyway
increase downtime which is the shorter the better.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-09 Thread Torsten Zuehlsdorff



On 01.08.2017 22:00, Ildus K wrote:

On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas  wrote:


On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
 wrote:

So this would break pg_upgrade for tsvector columns?


I added a function that will convert old tsvectors on the fly. It's
the approach used in hstore before.


Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?



It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.


I'm not familiar with pg_upgrade, but want to ask: should this 
workaround be part of pg_upgrade?


Greetings,
Torsten


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-01 Thread Ildus K
On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas  wrote:

> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
>  wrote:
> >> So this would break pg_upgrade for tsvector columns?  
> >
> > I added a function that will convert old tsvectors on the fly. It's
> > the approach used in hstore before.  
> 
> Does that mean the answer to the question that I asked is "yes, but I
> have a workaround" or does it mean that the answer is "no"?
> 

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

Regards,
Ildus Kurbangaliev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-01 Thread Robert Haas
On Tue, Aug 1, 2017 at 3:10 PM, Ildus K  wrote:
>> So this would break pg_upgrade for tsvector columns?
>
> I added a function that will convert old tsvectors on the fly. It's the
> approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-01 Thread Ildus K
On Tue, 1 Aug 2017 14:56:54 -0400
Robert Haas  wrote:

> On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
>  wrote:
> > Historically tsvector type can't hold more than 1MB data.
> > I want to propose a patch that removes that limit.
> >
> > That limit is created by 'pos' field from WordEntry, which have only
> > 20 bits for storage.
> >
> > In the proposed patch I removed this field and instead of it I keep
> > offsets only at each Nth item in WordEntry's array.  
> 
> So this would break pg_upgrade for tsvector columns?
> 

I added a function that will convert old tsvectors on the fly. It's the
approach used in hstore before.

Regards,
Ildus Kurbangaliev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-01 Thread Robert Haas
On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
 wrote:
> Historically tsvector type can't hold more than 1MB data.
> I want to propose a patch that removes that limit.
>
> That limit is created by 'pos' field from WordEntry, which have only
> 20 bits for storage.
>
> In the proposed patch I removed this field and instead of it I keep
> offsets only at each Nth item in WordEntry's array.

So this would break pg_upgrade for tsvector columns?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Remove 1MB size limit in tsvector

2017-08-01 Thread Ildus Kurbangaliev
Hello, hackers!

Historically tsvector type can't hold more than 1MB data.
I want to propose a patch that removes that limit.

That limit is created by 'pos' field from WordEntry, which have only
20 bits for storage.

In the proposed patch I removed this field and instead of it I keep
offsets only at each Nth item in WordEntry's array. Now I set N as 4,
because it gave best results in my benchmarks. It can be increased in
the future without affecting already saved data in database. Also
removing the field improves compression of tsvectors.

I simplified the code by creating functions that can be used to
build tsvectors. There were duplicated code fragments in places where
tsvector was built.

Also new patch frees some space in WordEntry that can be used to
save some additional information about saved words.

- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile
index 34fe4c5b3c..9585a25003 100644
--- a/src/backend/tsearch/Makefile
+++ b/src/backend/tsearch/Makefile
@@ -26,7 +26,7 @@ DICTFILES_PATH=$(addprefix dicts/,$(DICTFILES))
 OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
 	dict_simple.o dict_synonym.o dict_thesaurus.o \
 	dict_ispell.o regis.o spell.o \
-	to_tsany.o ts_selfuncs.o ts_typanalyze.o ts_utils.o
+	to_tsany.o ts_selfuncs.o ts_typanalyze.o ts_utils.o ts_compat.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 35d9ab276c..d66b69baea 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -156,13 +156,10 @@ TSVector
 make_tsvector(ParsedText *prs)
 {
 	int			i,
-j,
 lenstr = 0,
-totallen;
+totallen,
+stroff = 0;
 	TSVector	in;
-	WordEntry  *ptr;
-	char	   *str;
-	int			stroff;
 
 	/* Merge duplicate words */
 	if (prs->curwords > 0)
@@ -171,12 +168,8 @@ make_tsvector(ParsedText *prs)
 	/* Determine space needed */
 	for (i = 0; i < prs->curwords; i++)
 	{
-		lenstr += prs->words[i].len;
-		if (prs->words[i].alen)
-		{
-			lenstr = SHORTALIGN(lenstr);
-			lenstr += sizeof(uint16) + prs->words[i].pos.apos[0] * sizeof(WordEntryPos);
-		}
+		int npos = prs->words[i].alen ? prs->words[i].pos.apos[0] : 0;
+		INCRSIZE(lenstr, i, prs->words[i].len, npos);
 	}
 
 	if (lenstr > MAXSTRPOS)
@@ -187,41 +180,21 @@ make_tsvector(ParsedText *prs)
 	totallen = CALCDATASIZE(prs->curwords, lenstr);
 	in = (TSVector) palloc0(totallen);
 	SET_VARSIZE(in, totallen);
-	in->size = prs->curwords;
+	TS_SETCOUNT(in, prs->curwords);
 
-	ptr = ARRPTR(in);
-	str = STRPTR(in);
-	stroff = 0;
 	for (i = 0; i < prs->curwords; i++)
 	{
-		ptr->len = prs->words[i].len;
-		ptr->pos = stroff;
-		memcpy(str + stroff, prs->words[i].word, prs->words[i].len);
-		stroff += prs->words[i].len;
-		pfree(prs->words[i].word);
+		int npos = 0;
 		if (prs->words[i].alen)
-		{
-			int			k = prs->words[i].pos.apos[0];
-			WordEntryPos *wptr;
+			npos = prs->words[i].pos.apos[0];
 
-			if (k > 0x)
-elog(ERROR, "positions array too long");
+		tsvector_addlexeme(in, i, , prs->words[i].word, prs->words[i].len,
+			prs->words[i].pos.apos + 1, npos);
 
-			ptr->haspos = 1;
-			stroff = SHORTALIGN(stroff);
-			*(uint16 *) (str + stroff) = (uint16) k;
-			wptr = POSDATAPTR(in, ptr);
-			for (j = 0; j < k; j++)
-			{
-WEP_SETWEIGHT(wptr[j], 0);
-WEP_SETPOS(wptr[j], prs->words[i].pos.apos[j + 1]);
-			}
-			stroff += sizeof(uint16) + k * sizeof(WordEntryPos);
+		pfree(prs->words[i].word);
+		if (prs->words[i].alen)
 			pfree(prs->words[i].pos.apos);
-		}
-		else
-			ptr->haspos = 0;
-		ptr++;
+
 	}
 
 	if (prs->words)
@@ -251,7 +224,6 @@ to_tsvector_byid(PG_FUNCTION_ARGS)
 	PG_FREE_IF_COPY(in, 1);
 
 	out = make_tsvector();
-
 	PG_RETURN_TSVECTOR(out);
 }
 
diff --git a/src/backend/tsearch/ts_compat.c b/src/backend/tsearch/ts_compat.c
new file mode 100644
index 00..bb2a62eaf7
--- /dev/null
+++ b/src/backend/tsearch/ts_compat.c
@@ -0,0 +1,83 @@
+#include "postgres.h"
+#include "tsearch/ts_type.h"
+
+/*
+ * Definition of old WordEntry struct in TSVector. Because of limitations
+ * in size (max 1MB for lexemes), the format has changed
+ */
+typedef struct
+{
+	uint32
+haspos:1,
+len:11,
+pos:20;
+} OldWordEntry;
+
+typedef struct
+{
+	uint16		npos;
+	WordEntryPos pos[FLEXIBLE_ARRAY_MEMBER];
+} OldWordEntryPosVector;
+
+#define OLDSTRPTR(x)	( (char *) &(x)->entries[x->size_] )
+#define _OLDPOSVECPTR(x, e)	\
+	((OldWordEntryPosVector *)(STRPTR(x) + SHORTALIGN((e)->pos + (e)->len)))
+#define OLDPOSDATALEN(x,e) ( ( (e)->haspos ) ? (_OLDPOSVECPTR(x,e)->npos) : 0 )
+#define OLDPOSDATAPTR(x,e) (_OLDPOSVECPTR(x,e)->pos)
+
+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
+TSVector
+tsvector_upgrade(Datum orig, bool