Re: [PATCH] fix GIN index search sometimes losing results

2020-07-24 Thread Tom Lane
Pavel Borisov  writes:
> ср, 22 июл. 2020 г. в 19:10, Tom Lane :
>> The other issue we have to agree on is whether we want to sneak this
>> fix into v13, or wait another year for it.  I feel like it's pretty
>> late to be making potentially API-breaking changes, but on the other
>> hand this is undoubtedly a bug fix.

> I am convinced patch 0001 is necessary and enough to fix a bug, so I think
> it's very much worth adding it to v13.

Agreed, and done.

> As for 0002 I see the beauty of this change but I also see the value of
> leaving defaults as they were before.
> The change of CALC_NOT behavior doesn't seem to be a source of big changes,
> though. I'm just not convinced it is very much needed.
> The best way I think is to leave 0002 until the next version and add
> commentary in the code that this default behavior of NOT
> doing TS_EXEC_CALC_NOT is inherited from past, so basically any caller
> should set this flag (see patch 0003-add-comments-on-calc-not.

I don't think it's a great plan to make these two changes in two
successive versions.  They're going to be affecting basically the
same set of outside callers, at least if you assume that every
TS_execute caller will be supplying its own callback function.
So we might as well force people to make both updates at once.
Also, if there is anyone who thinks they need "skip NOT" behavior,
this'd be a great time to reconsider.

I revised 0002 to still define a flag for skipping NOTs, but
it's not the default and is indeed unused in the core code now.

regards, tom lane




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-22 Thread Pavel Borisov
ср, 22 июл. 2020 г. в 19:10, Tom Lane :

> Pavel Borisov  writes:
> > For 0002-remove-calc-not-flag.patch
> > The patch changes the behavior which is now considered default. This is
> true in RUM module and maybe in some other tsearch side modules. Applying
> the patch can make code more beautiful but possibly will not give some
> performance gain and bug is anyway fixed by patch 0001.
>
> I'd be willing to compromise on just adding TS_EXEC_CALC_NOT to the
> calls that are missing it today.  But I don't see why that's really
> a great idea --- it still leaves a risk-of-omission hazard for future
> callers.  Calculating NOTs correctly really ought to be the default
> behavior.
>
> What do you think of replacing TS_EXEC_CALC_NOT with a different
> flag having the opposite sense, maybe called TS_EXEC_SKIP_NOT?
> If anyone really does need that behavior, they could still get it,
> but they'd have to be explicit.
>
> > Overall I'd recommend patch 0001-make-callbacks-ternary.patch and close
> the issue.
>
> The other issue we have to agree on is whether we want to sneak this
> fix into v13, or wait another year for it.  I feel like it's pretty
> late to be making potentially API-breaking changes, but on the other
> hand this is undoubtedly a bug fix.
>
> regards, tom lane
>

I am convinced patch 0001 is necessary and enough to fix a bug, so I think
it's very much worth adding it to v13.

As for 0002 I see the beauty of this change but I also see the value of
leaving defaults as they were before.
The change of CALC_NOT behavior doesn't seem to be a source of big changes,
though. I'm just not convinced it is very much needed.
The best way I think is to leave 0002 until the next version and add
commentary in the code that this default behavior of NOT
doing TS_EXEC_CALC_NOT is inherited from past, so basically any caller
should set this flag (see patch 0003-add-comments-on-calc-not.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com


0003-add-comments-on-calc-not.diff
Description: Binary data


Re: [PATCH] fix GIN index search sometimes losing results

2020-07-22 Thread Tom Lane
Pavel Borisov  writes:
> For 0002-remove-calc-not-flag.patch
> The patch changes the behavior which is now considered default. This is true 
> in RUM module and maybe in some other tsearch side modules. Applying the 
> patch can make code more beautiful but possibly will not give some 
> performance gain and bug is anyway fixed by patch 0001.

I'd be willing to compromise on just adding TS_EXEC_CALC_NOT to the
calls that are missing it today.  But I don't see why that's really
a great idea --- it still leaves a risk-of-omission hazard for future
callers.  Calculating NOTs correctly really ought to be the default
behavior.

What do you think of replacing TS_EXEC_CALC_NOT with a different
flag having the opposite sense, maybe called TS_EXEC_SKIP_NOT?
If anyone really does need that behavior, they could still get it,
but they'd have to be explicit.

> Overall I'd recommend patch 0001-make-callbacks-ternary.patch and close the 
> issue.

The other issue we have to agree on is whether we want to sneak this
fix into v13, or wait another year for it.  I feel like it's pretty
late to be making potentially API-breaking changes, but on the other
hand this is undoubtedly a bug fix.

regards, tom lane




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-13 Thread Pavel Borisov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hi, all!

It seems that as of now we have two sets of patches for this bug:
1. Tom Lane's: 0001-make-callbacks-ternary.patch and 
0002-remove-calc-not-flag.patch
2. My: gin-gist-weight-patch-v4.diff

There was a quite long discussion above and I suppose that despite the 
difference both of them suit and will do the necessary fix. 
So I decided to make a review of both Tom Lane's patches.

Both of them apply clean. Checks are sucessful. There are regression tests 
included and they cover the bug. Also I made checks on my PgList database and I 
suppose the bug is indeed fixed.

For 0001-make-callbacks-ternary.patch
As it was mentioned in discussion, the issue was that in certain cases compare 
function of a single operand in a query should give undefined meaning "MAYBE" 
which should remain towards to the root of a tree. So the patch in my opinion 
adresses the problem in a right way.

Possible dangers of changed callback from binary to ternary is that any side 
modules which still use binary interface will get warnings on compile and will 
need minor modifications of code to comply with new interface. I checked it 
with RUM index and indeed get warnings on compile. In discussion above it was 
noted that anyway there is no way to get right results in tsearch with NOT 
without modification of this so I'd recommend committing patch 0001.

For 0002-remove-calc-not-flag.patch
The patch changes the behavior which is now considered default. This is true in 
RUM module and maybe in some other tsearch side modules. Applying the patch can 
make code more beautiful but possibly will not give some performance gain and 
bug is anyway fixed by patch 0001.

Overall I'd recommend patch 0001-make-callbacks-ternary.patch and close the 
issue.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

The new status of this patch is: Ready for Committer


Re: [PATCH] fix GIN index search sometimes losing results

2020-07-02 Thread Tom Lane
Pavel Borisov  writes:
> чт, 2 июл. 2020 г. в 19:38, Artur Zakirov :
>> So it seems we are losing some results with RUM as well.

> For me it is 100% predictable that unmodified RUM is still losing results
> as it is still using binary callback.

Right, that's in line with what I expected as well.

> The main my goal of saving binary legacy callback is that side callers like
> RUM will not break immediately but remain in
> existing state (i.e. losing results in some queries).

I don't really see why that should be a goal here.  I think a forced
compile error, calling attention to the fact that there's something
to fix, is a good thing as long as we do it in a major release.

regards, tom lane




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-02 Thread Pavel Borisov
чт, 2 июл. 2020 г. в 19:38, Artur Zakirov :

> Hello,
>
> On Thu, Jul 2, 2020 at 8:23 PM Pavel Borisov 
> wrote:
> >
> > ср, 1 июл. 2020 г. в 23:16, Tom Lane :
> >>
> >> Pavel Borisov  writes:
> >> > Below is my variant how to patch Gin-Gist weights issue:
> >>
> >> I looked at this patch, but I'm unimpressed, because it's buggy.
> >
> >
> > Thank you, i'd noticed and made minor corrections in the patch. Now it
> should work
> > correctly,
> >
> > As for preserving the option to use legacy bool-style calls, personally
> I see much
> > value of not changing API ad hoc to fix something. This may not harm
> vanilla reseases
> > but can break many possible side things like RUM index etc which I think
> are abundant
> > around there. Furthermore if we leave legacy bool callback along with
> newly proposed and
> > recommended for further use it will cost nothing.
> >
> > So I've attached a corrected patch. Also I wrote some comments to the
> code and added
> > your test as a part of apatch. Again thank you for sharing your thoughts
> and advice.
> >
> > As always I'd appreciate everyone's opinion on the bugfix.
>
> I haven't looked at any of the patches carefully yet. But I tried both of
> them.
>
> I tried Tom's patch. To compile the RUM extension I've made few
> changes to use new
> TS_execute(). Speaking about backward compatibility. I also think that
> it is not so important
> here. And RUM alreadyhas a number of "#if PG_VERSION_NUM" directives. API
> breaks
> from time to time and it seems inevitable.
>
> I also tried "gin-gist-weight-patch-v4.diff". And it didn't require
> changes into RUM. But as
> Tom said above TS_execute() is broken already. Here is the example with
> "gin-gist-weight-patch-v4.diff" and RUM:
>
> =# create extension rum;
> =# create table test (a tsvector);
> =# insert into test values ('wd:1A wr:2D'), ('wd:1A wr:2D');
> =# create index on test using rum (a);
> =# select a from test where a @@ '!wd:D';
>a
> 
>  'wd':1A 'wr':2
>  'wd':1A 'wr':2
> (2 rows)
> =# set enable_seqscan to off;
> =# select a from test where a @@ '!wd:D';
>  a
> ---
> (0 rows)
>
> So it seems we are losing some results with RUM as well.
>
> --
> Artur
>
For me it is 100% predictable that unmodified RUM is still losing results
as it is still using binary callback.
The main my goal of saving binary legacy callback is that side callers like
RUM will not break immediately but remain in
existing state (i.e. losing results in some queries). To fix the issue
completely it is needed to make ternary logic in
Postgres Tsearch AND engage this ternary logic in RUM and other side
modules.

Thank you for your consideration!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: [PATCH] fix GIN index search sometimes losing results

2020-07-02 Thread Artur Zakirov
Hello,

On Thu, Jul 2, 2020 at 8:23 PM Pavel Borisov  wrote:
>
> ср, 1 июл. 2020 г. в 23:16, Tom Lane :
>>
>> Pavel Borisov  writes:
>> > Below is my variant how to patch Gin-Gist weights issue:
>>
>> I looked at this patch, but I'm unimpressed, because it's buggy.
>
>
> Thank you, i'd noticed and made minor corrections in the patch. Now it should 
> work
> correctly,
>
> As for preserving the option to use legacy bool-style calls, personally I see 
> much
> value of not changing API ad hoc to fix something. This may not harm vanilla 
> reseases
> but can break many possible side things like RUM index etc which I think are 
> abundant
> around there. Furthermore if we leave legacy bool callback along with newly 
> proposed and
> recommended for further use it will cost nothing.
>
> So I've attached a corrected patch. Also I wrote some comments to the code 
> and added
> your test as a part of apatch. Again thank you for sharing your thoughts and 
> advice.
>
> As always I'd appreciate everyone's opinion on the bugfix.

I haven't looked at any of the patches carefully yet. But I tried both of them.

I tried Tom's patch. To compile the RUM extension I've made few
changes to use new
TS_execute(). Speaking about backward compatibility. I also think that
it is not so important
here. And RUM alreadyhas a number of "#if PG_VERSION_NUM" directives. API breaks
from time to time and it seems inevitable.

I also tried "gin-gist-weight-patch-v4.diff". And it didn't require
changes into RUM. But as
Tom said above TS_execute() is broken already. Here is the example with
"gin-gist-weight-patch-v4.diff" and RUM:

=# create extension rum;
=# create table test (a tsvector);
=# insert into test values ('wd:1A wr:2D'), ('wd:1A wr:2D');
=# create index on test using rum (a);
=# select a from test where a @@ '!wd:D';
   a

 'wd':1A 'wr':2
 'wd':1A 'wr':2
(2 rows)
=# set enable_seqscan to off;
=# select a from test where a @@ '!wd:D';
 a
---
(0 rows)

So it seems we are losing some results with RUM as well.

-- 
Artur




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-02 Thread Pavel Borisov
ср, 1 июл. 2020 г. в 23:16, Tom Lane :

> Pavel Borisov  writes:
> > Below is my variant how to patch Gin-Gist weights issue:
>
> I looked at this patch, but I'm unimpressed, because it's buggy.
>

Thank you, i'd noticed and made minor corrections in the patch. Now it
should work
correctly,

As for preserving the option to use legacy bool-style calls, personally I
see much
value of not changing API ad hoc to fix something. This may not harm
vanilla reseases
but can break many possible side things like RUM index etc which I think
are abundant
around there. Furthermore if we leave legacy bool callback along with
newly proposed and
recommended for further use it will cost nothing.

So I've attached a corrected patch. Also I wrote some comments to the code
and added
your test as a part of apatch. Again thank you for sharing your thoughts
and advice.

As always I'd appreciate everyone's opinion on the bugfix.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


gin-gist-weight-patch-v4.diff
Description: Binary data


Re: [PATCH] fix GIN index search sometimes losing results

2020-07-01 Thread Tom Lane
Pavel Borisov  writes:
> Below is my variant how to patch Gin-Gist weights issue:

I looked at this patch, but I'm unimpressed, because it's buggy.
You would have noticed if you'd included the test cases I wrote:

--- /home/postgres/pgsql/src/test/regress/expected/tsearch.out  2020-07-01 14:58
:56.637627628 -0400
+++ /home/postgres/pgsql/src/test/regress/results/tsearch.out   2020-07-01 14:59
:10.996990037 -0400
@@ -1008,13 +1008,13 @@
 SELECT count(*) FROM test_tsvector WHERE a @@ '!wd:A';
  count 
 ---
-   452
+ 2
 (1 row)
 
 SELECT count(*) FROM test_tsvector WHERE a @@ '!wd:D';
  count 
 ---
-   450
+ 0
 (1 row)
 
 -- Test optimization of non-empty GIN_SEARCH_MODE_ALL queries



In general, I'm not very convinced by your arguments about preserving the
option for external TS_execute callers to still use bool flags/results.
Given what we've seen so far, it seems almost certain that any such code
is buggy and needs to be rewritten anyway.  Converting to ternary logic
is far more likely to produce non-buggy code than if we continue to
try to put band-aids on the wounds.

Also, at this point I feel like it's a bit late to consider putting
anything API-breaking in v13.  But if this is a HEAD-only patch then
the argument for preserving API is even weaker.

regards, tom lane




Re: [PATCH] fix GIN index search sometimes losing results

2020-05-21 Thread Pavel Borisov
Hi All!
1. Generally the difference of my patch in comparison to Tom's patch 0001
is that I tried to move previous logic of GIN's own TS_execute_ternary() to
the general logic of TS_execute_recurse and in case we have index without
positions to avoid diving into phrase operator replacing (only in this
case) in by an AND operator. The reason for this I suppose is speed and
I've done testing of some corner cases like phrase operator with big number
of OR comparisons inside it.

-
BEFORE ANY PATCH:
 Bitmap Heap Scan on pglist  (cost=1715.72..160233.31 rows=114545
width=1234) (actual time=652.294..2719.961 rows=4904 loops=1)
   Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' |
''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow''
| ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' |
''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' |
''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' |
''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
   Rows Removed by Index Recheck: 108191
   Heap Blocks: exact=73789
   ->  Bitmap Index Scan on pglist_fts_idx  (cost=0.00..1687.09 rows=114545
width=0) (actual time=636.883..636.883 rows=113095 loops=1)
 Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' |
''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' |
''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' |
''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl''
| ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' |
''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
 Planning Time: 3.016 ms
 Execution Time: *2721.002 ms*
---
AFTER TOM's PATCH (0001)
Bitmap Heap Scan on pglist  (cost=1715.72..160233.31 rows=114545
width=1234) (actual time=916.640..2960.571 rows=4904 loops=1)
   Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' |
''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow''
| ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' |
''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' |
''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' |
''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
   Rows Removed by Index Recheck: 108191
   Heap Blocks: exact=73789
   ->  Bitmap Index Scan on pglist_fts_idx  (cost=0.00..1687.09 rows=114545
width=0) (actual time=900.472..900.472 rows=113095 loops=1)
 Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' |
''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' |
''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' |
''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl''
| ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' |
''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
 Planning Time: 2.688 ms
 Execution Time: *2961.704 ms*

AFTER MY PATCH (gin-gist-weight-patch-v3)
Bitmap Heap Scan on pglist  (cost=1715.72..160233.31 rows=114545
width=1234) (actual time=616.982..2710.571 rows=4904 loops=1)
   Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' |
''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow''
| ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' |
''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' |
''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' |
''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
   Rows Removed by Index Recheck: 108191
   Heap Blocks: exact=73789
   ->  Bitmap Index Scan on pglist_fts_idx  (cost=0.00..1687.09 rows=114545
width=0) (actual time=601.586..601.586 rows=113095 loops=1)
 Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' |
''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' |
''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' |
''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl''
| ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' |
''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <->
''start'''::tsquery)
 Planning Time: 3.115 ms
 Execution Time: *2711.533 ms*

I've done the test several times and seems that difference is real effect,
though not very big (around 7%). So maybe there is some reason to save
PHRASE_AS_AND behavior for GIN-GIST indexes despite migration from GIN's
own TS_execute_ternary() to general TS_execute_recurse.

2. As for shifting from bool to ternary callback I am not quite sure
whether it can be useful to save bool callbacks via bool-ternary wrapper.
We can include this for compatibility with old callers and can drop. Any

Re: [PATCH] fix GIN index search sometimes losing results

2020-05-17 Thread Pavel Borisov
Hi, all!
Below is my variant how to patch Gin-Gist weights issue:
1. First of all I propose to shift from previously Gin's own TS_execute
variant and leave only two: TS_execute with bool result and bool type
callback and ternary TS_execute_recurse with ternary callback. I suppose
all legacy consistent callers can still use bool via provided wrapper.
2. I integrated logic for indexes which do not support weights and
positions inside (which gives MAYBE in certain cases on negation) inside
previous TS_execute_recurse function called with additional flag for this
class of indexes.
3. Check function for GIST and GIN now gives ternary result and is called
with ternary type callback. I think in future nothing prevents smoothly
shifting callback functions, check functions and even TS_execute result to
ternary.

So I also send my variant patch for review and discussion.

Regards,
Pavel Borisov

вс, 17 мая 2020 г. в 03:14, Tom Lane :

> I wrote:
> > I think the root of the problem is that if we have a query using
> > weights, and we are testing tsvector data that lacks positions/weights,
> > we can never say there's definitely a match.  I don't see any decently
> > clean way to fix this without redefining the TSExecuteCallback API
> > to return a tri-state YES/NO/MAYBE result, because really we need to
> > decide that it's MAYBE at the level of processing the QI_VAL node,
> > not later on.  I'd tried to avoid that in e81e5741a, but maybe we
> > should just bite that bullet, and not worry about whether there's
> > any third-party code providing its own TSExecuteCallback routine.
> > codesearch.debian.net suggests that there are no external callers
> > of TS_execute, so maybe we can get away with that.
>
> 0001 attached is a proposed patch that does it that way.  Given the
> API break involved, it's not quite clear what to do with this.
> ISTM we have three options:
>
> 1. Ignore the API issue and back-patch.  Given the apparent lack of
> external callers of TS_execute, maybe we can get away with that;
> but I wonder if we'd get pushback from distros that have automatic
> ABI-break detectors in place.
>
> 2. Assume we can't backpatch, but it's still OK to slip this into
> v13.  (This option clearly has a limited shelf life, but I think
> we could get away with it until late beta.)
>
> 3. Assume we'd better hold this till v14.
>
> I find #3 unduly conservative, seeing that this is clearly a bug
> fix, but on the other hand #1 is a bit scary.  Aside from the API
> issue, it's not impossible that this has introduced some corner
> case behavioral changes that we'd consider to be new bugs rather
> than bug fixes.
>
> Anyway, some notes for reviewers:
>
> * The core idea of the patch is to make the TS_execute callbacks
> have ternary results and to insist they return TS_MAYBE in any
> case where the correct result is uncertain.
>
> * That fixes the bug at hand, and it also allows getting rid of
> some kluges at higher levels.  The GIN code no longer needs its
> own TS_execute_ternary implementation, and the GIST code no longer
> needs to suppose that it can't trust NOT results.
>
> * I put some effort into not leaking memory within tsvector_op.c's
> checkclass_str and checkcondition_str.  (The final output array
> can still get leaked, I believe.  Fixing that seems like material
> for a different patch, and it might not be worth any trouble.)
>
> * The new test cases in tstypes.sql are to verify that we didn't
> change behavior of the basic tsvector @@ tsquery code.  There wasn't
> any coverage of these cases before, and the logic for checkclass_str
> without position info had to be tweaked to preserve this behavior.
>
> * The new cases in tsearch verify that the GIN and GIST code gives
> the same results as the basic operator.
>
> Now, as for the 0002 patch attached: after 0001, the only TS_execute()
> callers that are not specifying TS_EXEC_CALC_NOT are hlCover(),
> which I'd already complained is probably a bug, and the first of
> the two calls in tsrank.c's Cover().  It seems difficult to me to
> argue that it's not a bug for Cover() to process NOT in one call
> but not the other --- moreover, if there was any argument for that
> once upon a time, it probably falls to the ground now that (a) we
> have a less buggy implementation of NOT and (b) the presence of
> phrase queries significantly raises the importance of not taking
> short-cuts.  Therefore, 0002 attached rips out the TS_EXEC_CALC_NOT
> flag and has TS_execute compute NOT expressions accurately all the
> time.
>
> As it stands, 0002 changes no regression test results, which I'm
> afraid speaks more to our crummy test coverage than anything else;
> tests that exercise those two functions with NOT-using queries
> would easily show that there is a difference.
>
> Even if we decide to back-patch 0001, I would not suggest
> back-patching 0002, as it's more nearly a definitional change
> than a bug fix.  But I think it's a good idea anyway.
>
> I'll stick this in 

Re: [PATCH] fix GIN index search sometimes losing results

2020-05-16 Thread Tom Lane
I wrote:
> I think the root of the problem is that if we have a query using
> weights, and we are testing tsvector data that lacks positions/weights,
> we can never say there's definitely a match.  I don't see any decently
> clean way to fix this without redefining the TSExecuteCallback API
> to return a tri-state YES/NO/MAYBE result, because really we need to
> decide that it's MAYBE at the level of processing the QI_VAL node,
> not later on.  I'd tried to avoid that in e81e5741a, but maybe we
> should just bite that bullet, and not worry about whether there's
> any third-party code providing its own TSExecuteCallback routine.
> codesearch.debian.net suggests that there are no external callers
> of TS_execute, so maybe we can get away with that.

0001 attached is a proposed patch that does it that way.  Given the
API break involved, it's not quite clear what to do with this.
ISTM we have three options:

1. Ignore the API issue and back-patch.  Given the apparent lack of
external callers of TS_execute, maybe we can get away with that;
but I wonder if we'd get pushback from distros that have automatic
ABI-break detectors in place.

2. Assume we can't backpatch, but it's still OK to slip this into
v13.  (This option clearly has a limited shelf life, but I think
we could get away with it until late beta.)

3. Assume we'd better hold this till v14.

I find #3 unduly conservative, seeing that this is clearly a bug
fix, but on the other hand #1 is a bit scary.  Aside from the API
issue, it's not impossible that this has introduced some corner
case behavioral changes that we'd consider to be new bugs rather
than bug fixes.

Anyway, some notes for reviewers:

* The core idea of the patch is to make the TS_execute callbacks
have ternary results and to insist they return TS_MAYBE in any
case where the correct result is uncertain.

* That fixes the bug at hand, and it also allows getting rid of
some kluges at higher levels.  The GIN code no longer needs its
own TS_execute_ternary implementation, and the GIST code no longer
needs to suppose that it can't trust NOT results.

* I put some effort into not leaking memory within tsvector_op.c's
checkclass_str and checkcondition_str.  (The final output array
can still get leaked, I believe.  Fixing that seems like material
for a different patch, and it might not be worth any trouble.)

* The new test cases in tstypes.sql are to verify that we didn't
change behavior of the basic tsvector @@ tsquery code.  There wasn't
any coverage of these cases before, and the logic for checkclass_str
without position info had to be tweaked to preserve this behavior.

* The new cases in tsearch verify that the GIN and GIST code gives
the same results as the basic operator.

Now, as for the 0002 patch attached: after 0001, the only TS_execute()
callers that are not specifying TS_EXEC_CALC_NOT are hlCover(),
which I'd already complained is probably a bug, and the first of
the two calls in tsrank.c's Cover().  It seems difficult to me to
argue that it's not a bug for Cover() to process NOT in one call
but not the other --- moreover, if there was any argument for that
once upon a time, it probably falls to the ground now that (a) we
have a less buggy implementation of NOT and (b) the presence of
phrase queries significantly raises the importance of not taking
short-cuts.  Therefore, 0002 attached rips out the TS_EXEC_CALC_NOT
flag and has TS_execute compute NOT expressions accurately all the
time.

As it stands, 0002 changes no regression test results, which I'm
afraid speaks more to our crummy test coverage than anything else;
tests that exercise those two functions with NOT-using queries
would easily show that there is a difference.

Even if we decide to back-patch 0001, I would not suggest
back-patching 0002, as it's more nearly a definitional change
than a bug fix.  But I think it's a good idea anyway.

I'll stick this in the queue for the July commitfest, in case
anybody wants to review it.

regards, tom lane

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 48e55e1..bfbc8d5 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -1969,7 +1969,7 @@ typedef struct
 /*
  * TS_execute callback for matching a tsquery operand to headline words
  */
-static bool
+static TSTernaryValue
 checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
 {
 	hlCheck*checkval = (hlCheck *) opaque;
@@ -1982,7 +1982,7 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
 		{
 			/* if data == NULL, don't need to report positions */
 			if (!data)
-return true;
+return TS_YES;
 
 			if (!data->pos)
 			{
@@ -1999,9 +1999,9 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
 	}
 
 	if (data && data->npos > 0)
-		return true;
+		return TS_YES;
 
-	return false;
+	return TS_NO;
 }
 
 /*
diff --git a/src/backend/utils/adt/tsginidx.c 

Re: [PATCH] fix GIN index search sometimes losing results

2020-05-07 Thread Tom Lane
Pavel Borisov  writes:
> It appeared than GIN index sometimes lose results if simultaneously:
> 1 if query operand contains weight marks
> 2 if weight-marked operand is negated by ! operator
> 3 if there are only logical (not phrase) operators from this negation
> towards the root of query tree.

Nice catch ... but if you try it with a GIST index, that fails too.

Even if it were only GIN indexes, this patch is an utter hack.
It might accidentally work for the specific case of NOT with
a single QI_VAL node as argument, but not for anything more
complicated.

I think the root of the problem is that if we have a query using
weights, and we are testing tsvector data that lacks positions/weights,
we can never say there's definitely a match.  I don't see any decently
clean way to fix this without redefining the TSExecuteCallback API
to return a tri-state YES/NO/MAYBE result, because really we need to
decide that it's MAYBE at the level of processing the QI_VAL node,
not later on.  I'd tried to avoid that in e81e5741a, but maybe we
should just bite that bullet, and not worry about whether there's
any third-party code providing its own TSExecuteCallback routine.
codesearch.debian.net suggests that there are no external callers
of TS_execute, so maybe we can get away with that.

regards, tom lane