Re: Rethinking the implementation of ts_headline()

2023-04-06 Thread Tom Lane
Alexander Lakhin  writes:
> I've found that starting from commit 5a617d75 this query:
> SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', 
> 'or'));
> invokes a valgrind-detected error:
> ==00:00:00:03.950 3241424== Invalid read of size 1

On my machine, I also see PG-reported errors such as "unrecognized
operator: 0".  It's a live bug all right.  We need to be more careful
about empty tsqueries.

> But the less-verbose call:
> SELECT ts_headline('', '');

> discovers a different error even on 5a617d75~1:
> ==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised 
> value(s)
> ==00:00:00:04.113 3139158==    at 0x77B44F: mark_fragment (wparser_def.c:2100)

Yeah, this one seems to be ancient sloppiness.  I don't think it has
any bad effect beyond annoying valgrind, but I fixed it anyway.

Thanks for the report!

regards, tom lane




Re: Rethinking the implementation of ts_headline()

2023-04-06 Thread Alexander Lakhin

Hi,

19.01.2023 19:13, Tom Lane wrote:

Alvaro Herrera  writes:


Anyway, I don't think this needs to stop your current patch.

Many thanks for looking at it!


I've found that starting from commit 5a617d75 this query:
SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', 
'or'));

invokes a valgrind-detected error:
==00:00:00:03.950 3241424== Invalid read of size 1
==00:00:00:03.950 3241424==    at 0x8EE2A9: TS_execute_locations_recurse 
(tsvector_op.c:2046)
==00:00:00:03.950 3241424==    by 0x8EE220: TS_execute_locations 
(tsvector_op.c:2017)
==00:00:00:03.950 3241424==    by 0x77EAC4: prsd_headline (wparser_def.c:2677)
==00:00:00:03.950 3241424==    by 0x94536C: FunctionCall3Coll (fmgr.c:1173)
==00:00:00:03.950 3241424==    by 0x778648: ts_headline_byid_opt (wparser.c:322)
==00:00:00:03.950 3241424==    by 0x9440F5: DirectFunctionCall3Coll (fmgr.c:836)
==00:00:00:03.950 3241424==    by 0x778763: ts_headline_byid (wparser.c:343)
==00:00:00:03.950 3241424==    by 0x4AC9F2: ExecInterpExpr 
(execExprInterp.c:752)
==00:00:00:03.950 3241424==    by 0x4AEDFE: ExecInterpExprStillValid 
(execExprInterp.c:1838)
==00:00:00:03.950 3241424==    by 0x636A7E: ExecEvalExprSwitchContext 
(executor.h:344)
==00:00:00:03.950 3241424==    by 0x63E92D: evaluate_expr (clauses.c:4843)
==00:00:00:03.950 3241424==    by 0x63DA53: evaluate_function (clauses.c:4345)
...
(Initially I had encountered an asan-detected heap-buffer-overflow with a
more informative document.)

But the less-verbose call:
SELECT ts_headline('', '');

discovers a different error even on 5a617d75~1:
==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised 
value(s)
==00:00:00:04.113 3139158==    at 0x77B44F: mark_fragment (wparser_def.c:2100)
==00:00:00:04.113 3139158==    by 0x77E2F2: mark_hl_words (wparser_def.c:2519)
==00:00:00:04.113 3139158==    by 0x77E891: prsd_headline (wparser_def.c:2610)
==00:00:00:04.113 3139158==    by 0x944B68: FunctionCall3Coll (fmgr.c:1173)
==00:00:00:04.113 3139158==    by 0x778648: ts_headline_byid_opt (wparser.c:322)
==00:00:00:04.113 3139158==    by 0x9438F1: DirectFunctionCall3Coll (fmgr.c:836)
==00:00:00:04.113 3139158==    by 0x7787B6: ts_headline (wparser.c:352)
==00:00:00:04.113 3139158==    by 0x4AC9F2: ExecInterpExpr 
(execExprInterp.c:752)
==00:00:00:04.113 3139158==    by 0x4AEDFE: ExecInterpExprStillValid 
(execExprInterp.c:1838)
==00:00:00:04.113 3139158==    by 0x50BD0C: ExecEvalExprSwitchContext 
(executor.h:344)
==00:00:00:04.113 3139158==    by 0x50BD84: ExecProject (executor.h:378)
==00:00:00:04.113 3139158==    by 0x50BFBB: ExecResult (nodeResult.c:136)
==00:00:00:04.113 3139158==

I've reproduced it on REL9_4_STABLE (REL9_4_15) and it looks like the defect
in mark_hl_words():
    int bestb = -1,
    beste = -1;
    int bestlen = -1;
    int pose = 0,
...
    if (highlight == 0)
    {
    while (hlCover(prs, query, , ))
    {
...
    if (bestlen < 0)
    {
    curlen = 0;
    for (i = 0; i < prs->curwords && curlen < min_words; 
i++)
    {
    if (!NONWORDTOKEN(prs->words[i].type))
    curlen++;
    pose = i;
    }
    bestb = 0;
    beste = pose;
    }
...
// here we have bestb == 0, beste == 0, but no prs->words in this case
    for (i = bestb; i <= beste; i++)
    {
    if (prs->words[i].item)
    prs->words[i].selected = 1;

exists since 2a0083ede.
(Sorry for the distraction.)

Best regards,
Alexander

Re: Rethinking the implementation of ts_headline()

2023-01-19 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Jan-18, Tom Lane wrote:
>> It's including hits for "day" into the cover despite the lack of any
>> nearby match to "drink".

> I suppose it would be possible to put 'day' and 'drink' in two different
> fragments: since the query has a & operator for them, the words don't
> necessarily have to be nearby.  But OK, your argument for this being the
> shortest result is sensible.

> I do wonder, though, if it's effectively usable for somebody building a
> search interface on top.  If I'm ranking the results from several
> documents, and this document comes on top of others that only match one
> arm of the OR query, then I would like to be able to show the matches
> for both arms of the OR.

The fundamental problem with the case you're posing is that MaxWords
is too small to allow the 'day & drink' match to be shown as a whole.
If you make MaxWords large enough then you do find it including
(and highlighting) 'drink', but I'm not sure we should stress out
about what happens otherwise.

> Oh, I see the problem, and it is my misunderstanding: the <-> operator
> is counting the words in between, even if they are stop words.

Yeah.  AFAICS this is a very deliberate, longstanding decision,
so I'm hesitant to change it.  Your test case with 'simple'
proves little, because there are no stop words in 'simple':

regression=# select to_tsvector('simple', 'As idle as a painted Ship');
 to_tsvector  
--
 'a':4 'as':1,3 'idle':2 'painted':5 'ship':6
(1 row)

However, when I switch to 'english':

regression=# select to_tsvector('english', 'As idle as a painted Ship');
to_tsvector 

 'idl':2 'paint':5 'ship':6
(1 row)

the stop words are gone, but the recorded positions remain the same.
So this is really a matter of how to_tsvector chooses to count word
positions, it's not the fault of the <-> construct in particular.

I'm not convinced that this particular behavior is wrong, anyway.
The user of text search isn't supposed to have to think about
which words are stopwords or not, so I think that it's entirely
sensible for 'idle as a painted' to match 'idle <3> painted'.
Maybe the docs need some adjustment?  But in any case that's
material for a different thread.

> I again have to question how valuable in practice is a  operator
> that's so strict that I have to know exactly how many stop words I want
> there to be in between the phrase search.  For some reason, in my mind I
> had it as "at most N words, ignoring stop words", but that's not what it
> is.

Yeah, I recall discussing "up to N words" semantics for this in the
past, but nobody has made that happen.

> Anyway, I don't think this needs to stop your current patch.

Many thanks for looking at it!

regards, tom lane




Re: Rethinking the implementation of ts_headline()

2023-01-19 Thread Alvaro Herrera
On 2023-Jan-18, Tom Lane wrote:

> Alvaro Herrera  writes:

> > and was surprised that the match for the 'day & drink' arm of the OR
> > disappears from the reported headline.
> 
> I'd argue that that's exactly what should happen.  It's supposed to
> find as-short-as-possible cover strings that satisfy the query.

OK, that makes sense.

> I don't find anything particularly principled about the old behavior:
> 
> >  Day after day, day after day,↵
> >We stuck ... motion,   ↵
> >  As idle as a painted Ship  ↵
> >Upon
> 
> It's including hits for "day" into the cover despite the lack of any
> nearby match to "drink".

I suppose it would be possible to put 'day' and 'drink' in two different
fragments: since the query has a & operator for them, the words don't
necessarily have to be nearby.  But OK, your argument for this being the
shortest result is sensible.

I do wonder, though, if it's effectively usable for somebody building a
search interface on top.  If I'm ranking the results from several
documents, and this document comes on top of others that only match one
arm of the OR query, then I would like to be able to show the matches
for both arms of the OR.

> > Another thing I think might be a regression is the way fragments are
> > selected.  Consider what happens if I change the "idle & painted" in the
> > earlier query to "idle <-> painted", and MaxWords is kept low:
> 
> Of course, "idle <-> painted" is satisfied nowhere in this text
> (the words are there, but not adjacent).

Oh, I see the problem, and it is my misunderstanding: the <-> operator
is counting the words in between, even if they are stop words.  I
understood from the docs that those words were ignored, but that is not
so.  I misread the phraseto_tsquery doc as though they were explaining
the <-> operator.

Another experiment shows that the headline becomes "complete" only if I
specify the exact distance in the  operator:

SELECT dist, ts_headline('simple', 'As idle as a painted Ship', 
to_tsquery('simple', format('(idle <%s> painted)', dist)), 'MaxFragments=5, 
MaxWords=8, MinWords=4') from generate_series(1, 4) dist;
 dist │ ts_headline  
──┼──
1 │ As idle as a
2 │ As idle as a
3 │ idle as a painted Ship
4 │ As idle as a
(4 filas)

I again have to question how valuable in practice is a  operator
that's so strict that I have to know exactly how many stop words I want
there to be in between the phrase search.  For some reason, in my mind I
had it as "at most N words, ignoring stop words", but that's not what it
is.

Anyway, I don't think this needs to stop your current patch.

> So the cover has to run from the last 'day' to the 'drink'.  I think
> v15 is deciding that it runs from the first 'day' to the 'drink',
> which while not exactly wrong is not the shortest cover.

Sounds fair.

> The rest of this is just minor variations in what mark_hl_fragments()
> decides to do based on the precise cover string it's given.  I don't
> dispute that mark_hl_fragments() could be improved, but this patch
> doesn't touch its algorithm and I think that doing so should be
> material for a different patch.

Understood and agreed.

> > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> > verse, which perhaps they shouldn't do, since it's not preceded by
> > 'idle'.)
> 
> Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
> highlight all the query words within that string, whether they
> constitute part of the match or not.  I can see arguments on both
> sides of doing it that way; it was probably more sensible before
> we had phrase match than it is now.  But again, changing that phase
> of the processing is outside the scope of this patch.  I'm just
> trying to undo the damage I did to the cover-string-selection phase.

All clear then.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)




Re: Rethinking the implementation of ts_headline()

2023-01-18 Thread Tom Lane
Alvaro Herrera  writes:
> I tried this other test, based on looking at the new regression tests
> you added,

> SELECT ts_headline('english', '
> Day after day, day after day,
>   We stuck, nor breath nor motion,
> As idle as a painted Ship
>   Upon a painted Ocean.
> Water, water, every where
>   And all the boards did shrink;
> Water, water, every where,
>   Nor any drop to drink.
> S. T. Coleridge (1772-1834)
> ', to_tsquery('english', '(day & drink) | (idle & painted)'), 
> 'MaxFragments=5, MaxWords=9, MinWords=4');
>ts_headline   
> ─
>  motion,↵
>  As idle as a painted Ship↵
>Upon
> (1 fila)

> and was surprised that the match for the 'day & drink' arm of the OR
> disappears from the reported headline.

I'd argue that that's exactly what should happen.  It's supposed to
find as-short-as-possible cover strings that satisfy the query.
In this case, satisfying the 'day & drink' condition would require
nearly the entire input text, whereas 'idle & painted' can be
satisfied in just the third line.  So what you get is the third line,
slightly expanded because of some later rules that like to add
context if the cover is shorter than MaxWords.  I don't find anything
particularly principled about the old behavior:

>  Day after day, day after day,↵
>We stuck ... motion,   ↵
>  As idle as a painted Ship  ↵
>Upon

It's including hits for "day" into the cover despite the lack of any
nearby match to "drink".

> Another thing I think might be a regression is the way fragments are
> selected.  Consider what happens if I change the "idle & painted" in the
> earlier query to "idle <-> painted", and MaxWords is kept low:

Of course, "idle <-> painted" is satisfied nowhere in this text
(the words are there, but not adjacent).  So the cover has to
run from the last 'day' to the 'drink'.  I think v15 is deciding
that it runs from the first 'day' to the 'drink', which while not
exactly wrong is not the shortest cover.  The rest of this is just
minor variations in what mark_hl_fragments() decides to do based
on the precise cover string it's given.  I don't dispute that
mark_hl_fragments() could be improved, but this patch doesn't touch
its algorithm and I think that doing so should be material for a
different patch.  (I have no immediate ideas about what would be
a better algorithm for it, anyway.)

> (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> verse, which perhaps they shouldn't do, since it's not preceded by
> 'idle'.)

Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
highlight all the query words within that string, whether they
constitute part of the match or not.  I can see arguments on both
sides of doing it that way; it was probably more sensible before
we had phrase match than it is now.  But again, changing that phase
of the processing is outside the scope of this patch.  I'm just
trying to undo the damage I did to the cover-string-selection phase.

regards, tom lane




Re: Rethinking the implementation of ts_headline()

2023-01-18 Thread Alvaro Herrera
I tried this other test, based on looking at the new regression tests
you added,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, 
MaxWords=9, MinWords=4');
   ts_headline   
─
 motion,↵
 As idle as a painted Ship↵
   Upon
(1 fila)

and was surprised that the match for the 'day & drink' arm of the OR
disappears from the reported headline.

This is what 15 reports for the same query:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, 
MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, day after day,↵
   We stuck ... motion,   ↵
 As idle as a painted Ship  ↵
   Upon
(1 fila)

I think this was better.

15 seems to fail in other ways; for instance, 'drink' is not highlighted in the
headline when the OR matches, but if the other arm of the OR doesn't match, it
is; for example both 15 and master return the same for this one:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, day after day,↵
   We stuck ... drop to drink. ↵
 S. T. Coleridge
(1 fila)



Another thing I think might be a regression is the way fragments are
selected.  Consider what happens if I change the "idle & painted" in the
earlier query to "idle <-> painted", and MaxWords is kept low:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
  ts_headline  
───
 day,  ↵
   We stuck, nor breath nor motion,   ↵
 As idle ... painted Ship   ↵
   Upon a painted Ocean.   ↵
 Water, water, every ... drop to drink.↵
 S. T. Coleridge
(1 fila)

Note that it chose to put a fragment delimiter exactly in the middle of the
phrase match, where the stop words are.  If I raise MaxWords, it is of course
much better, I suppose because the word limit doesn't force a new fragment,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=25, MinWords=4');
   ts_headline
──
 after day, day after day,  ↵
   We stuck, nor breath nor motion,  ↵
 As idle as a painted Ship ↵
   Upon a painted Ocean.  ↵
 Water, water, every where ... boards did shrink;↵
 Water, water, every where,  ↵
   Nor any drop to drink. ↵
 S. T. Coleridge
(1 fila)

But in 15, the query with low MaxWords does this instead, where the
fragment delimiter occurs just *before* the phrasal match.

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, 

Re: Rethinking the implementation of ts_headline()

2023-01-17 Thread Alvaro Herrera
On 2023-Jan-16, Tom Lane wrote:

> I get this with the patch:
> 
>  ts_headline | ts_headline 
> -+-
>  {ipsum} ... {labor} | {ipsum} ... {labor}
> (1 row)
> 
> which is what I'd expect, because it removes the artificial limit on
> cover length that I added in 78e73e875.  So I'm wondering why you got a
> different result.

Indeed, that's what I get now, too, after re-applying the patches.
I find no way to reproduce the bogus result I got yesterday.

Sorry for the noise.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801




Re: Rethinking the implementation of ts_headline()

2023-01-16 Thread Tom Lane
Alvaro Herrera  writes:
> I came across #17556 which contains a different test for this, and I'm
> not sure that this patch changes things completely for the better.

Thanks for looking at my patch.  However ...

> That is, once past the 5000 words of distance, it fails to find a good
> cover, but before that it returns an acceptable headline.  However,
> after your proposed patch, we get this:

>  ts_headline │ ts_headline 
> ─┼─
>  {ipsum} │ {ipsum}
> (1 fila)

I get this with the patch:

 ts_headline | ts_headline 
-+-
 {ipsum} ... {labor} | {ipsum} ... {labor}
(1 row)

which is what I'd expect, because it removes the artificial limit on
cover length that I added in 78e73e875.  So I'm wondering why you got a
different result.  Maybe something to do with locale?  I tried it in
C and en_US.utf8.

regards, tom lane




Re: Rethinking the implementation of ts_headline()

2023-01-16 Thread Alvaro Herrera
On 2022-Nov-25, Tom Lane wrote:

> After further contemplation of bug #17691 [1], I've concluded that
> what I did in commit c9b0c678d was largely misguided.  For one
> thing, the new hlCover() algorithm no longer finds shortest-possible
> cover strings: if your query is "x & y" and the text is like
> "... x ... x ... y ...", then the selected cover string will run
> from the first occurrence of x to the y, whereas the old algorithm
> would have correctly selected "x ... y".  For another thing, the
> maximum-cover-length hack that I added in 78e73e875 to band-aid
> over the performance issues of the original c9b0c678d patch means
> that various scenarios no longer work as well as they used to,
> which is the proximate cause of the complaints in bug #17691.

I came across #17556 which contains a different test for this, and I'm
not sure that this patch changes things completely for the better.  In
that bug report, Alex Malek presents this example

select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || 'labor',
   $$'ipsum' & 'labor'$$::tsquery,
   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'),
ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || 'labor',
   $$'ipsum' & 'labor'$$::tsquery,
   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3');

which returns, in the current HEAD, the following
 ts_headline │ ts_headline 
─┼─
 {ipsum} ... {labor} │ baz baz baz
(1 fila)

That is, once past the 5000 words of distance, it fails to find a good
cover, but before that it returns an acceptable headline.  However,
after your proposed patch, we get this:

 ts_headline │ ts_headline 
─┼─
 {ipsum} │ {ipsum}
(1 fila)

which is an improvement in the second case, though perhaps not as much
as we would like, and definitely not an improvement in the first case.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Rethinking the implementation of ts_headline()

2022-11-25 Thread Tom Lane
After further contemplation of bug #17691 [1], I've concluded that
what I did in commit c9b0c678d was largely misguided.  For one
thing, the new hlCover() algorithm no longer finds shortest-possible
cover strings: if your query is "x & y" and the text is like
"... x ... x ... y ...", then the selected cover string will run
from the first occurrence of x to the y, whereas the old algorithm
would have correctly selected "x ... y".  For another thing, the
maximum-cover-length hack that I added in 78e73e875 to band-aid
over the performance issues of the original c9b0c678d patch means
that various scenarios no longer work as well as they used to,
which is the proximate cause of the complaints in bug #17691.

What I'm now thinking is that the original hlCover() algorithm was
fine (if underdocumented) *as long as it's only asked to deal with
AND-like semantics*.  Given that restriction, its approach of
finding the latest first occurrence of any query term, and then
backing up to the earliest last occurrence, is visibly correct;
and we weren't hearing of any performance issues with it either.
The problem is that this approach fails miserably for tsqueries
that aren't pure ANDs, because it will insist on including query
terms that aren't required for a match and indeed could prevent a
match.

So what we need is to find a way to fold the other sorts of queries
into an AND context.  After a couple of false starts, I came up
with the attached patch.  It builds on the existing TS_phrase_execute
infrastructure, which already produces an ExecPhraseData structure
containing an exact list of the match locations for a phrase subquery.
It's easy to get an ExecPhraseData structure for a single-lexeme
match, too.  We can handle plain ANDs by forming lists of
ExecPhraseData structs (with implicit AND semantics across the list).
We can handle ORs by union'ing the component ExecPhraseData structs.
And we can handle NOTs by just dropping them, because we don't want
ts_headline to prioritize matches to such words.  There are some fine
points but it all seems to work pretty well.

Hence I propose the attached patchset.  0001 adds some test cases that
I felt necessary after examining code coverage reports; I split this
out mainly so that 0002 clearly shows the behavioral changes from
current code.  0002 adds TS_execute_locations() which does what I just
described, and rewrites hlCover() into something that's a spiritual
descendant of the original algorithm.  It can't be quite the same
as before because the location data that TS_execute_locations() returns
is measured in lexeme indexes not word indexes, so some translation is
needed.

Notable here is that a couple of regression test results revert
to what they were before c9b0c678d.  They're both instances of
preferring "x ... x ... y" to "x ... y", which I argued was okay
at the time, but I now see the error in that.

Although we back-patched c9b0c678d, I'm inclined to propose this
for HEAD only.  The misbehaviors it's fixing are less bad than what
we were dealing with in that patch.

BTW, while experimenting with this I realized that tsvector's limitation
of lexeme indexes to be at most 16383 is really quite disastrous for
phrase searches.  That limitation was arguably okay before we had phrase
searching, but now it seems untenable.  For example, tsvector entries like
"foo:16383 bar:16383" will not match "foo <-> bar" because the phrase
match code wants their lexeme positions to differ by one.  This basically
means that phrase searches do not work beyond ~20K words into a document.
I'm not sure what to do about that exactly, but I think we need to do
something.

Whatever we might do about that would be largely orthogonal to this
patch, in any case.  I'll stick this in the January CF.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/17691-93cef39a14663963%40postgresql.org

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index dc03f15499..e6a01cf985 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1804,6 +1804,123 @@ S. T. Coleridge (1772-1834)
  Water, water, every where
 (1 row)
 
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day & drink'));
+ts_headline
+---
+ Day after day, day after day,+
+   We stuck, nor breath nor motion,   +
+ As idle as a painted Ship+
+   Upon a painted Ocean.  +
+ Water, water, every where+
+   And all the boards did shrink;