Re: Rethinking the implementation of ts_headline()
Alexander Lakhin writes: > I've found that starting from commit 5a617d75 this query: > SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', > 'or')); > invokes a valgrind-detected error: > ==00:00:00:03.950 3241424== Invalid read of size 1 On my machine, I also see PG-reported errors such as "unrecognized operator: 0". It's a live bug all right. We need to be more careful about empty tsqueries. > But the less-verbose call: > SELECT ts_headline('', ''); > discovers a different error even on 5a617d75~1: > ==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised > value(s) > ==00:00:00:04.113 3139158== at 0x77B44F: mark_fragment (wparser_def.c:2100) Yeah, this one seems to be ancient sloppiness. I don't think it has any bad effect beyond annoying valgrind, but I fixed it anyway. Thanks for the report! regards, tom lane
Re: Rethinking the implementation of ts_headline()
Hi, 19.01.2023 19:13, Tom Lane wrote: Alvaro Herrera writes: Anyway, I don't think this needs to stop your current patch. Many thanks for looking at it! I've found that starting from commit 5a617d75 this query: SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', 'or')); invokes a valgrind-detected error: ==00:00:00:03.950 3241424== Invalid read of size 1 ==00:00:00:03.950 3241424== at 0x8EE2A9: TS_execute_locations_recurse (tsvector_op.c:2046) ==00:00:00:03.950 3241424== by 0x8EE220: TS_execute_locations (tsvector_op.c:2017) ==00:00:00:03.950 3241424== by 0x77EAC4: prsd_headline (wparser_def.c:2677) ==00:00:00:03.950 3241424== by 0x94536C: FunctionCall3Coll (fmgr.c:1173) ==00:00:00:03.950 3241424== by 0x778648: ts_headline_byid_opt (wparser.c:322) ==00:00:00:03.950 3241424== by 0x9440F5: DirectFunctionCall3Coll (fmgr.c:836) ==00:00:00:03.950 3241424== by 0x778763: ts_headline_byid (wparser.c:343) ==00:00:00:03.950 3241424== by 0x4AC9F2: ExecInterpExpr (execExprInterp.c:752) ==00:00:00:03.950 3241424== by 0x4AEDFE: ExecInterpExprStillValid (execExprInterp.c:1838) ==00:00:00:03.950 3241424== by 0x636A7E: ExecEvalExprSwitchContext (executor.h:344) ==00:00:00:03.950 3241424== by 0x63E92D: evaluate_expr (clauses.c:4843) ==00:00:00:03.950 3241424== by 0x63DA53: evaluate_function (clauses.c:4345) ... (Initially I had encountered an asan-detected heap-buffer-overflow with a more informative document.) But the less-verbose call: SELECT ts_headline('', ''); discovers a different error even on 5a617d75~1: ==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised value(s) ==00:00:00:04.113 3139158== at 0x77B44F: mark_fragment (wparser_def.c:2100) ==00:00:00:04.113 3139158== by 0x77E2F2: mark_hl_words (wparser_def.c:2519) ==00:00:00:04.113 3139158== by 0x77E891: prsd_headline (wparser_def.c:2610) ==00:00:00:04.113 3139158== by 0x944B68: FunctionCall3Coll (fmgr.c:1173) ==00:00:00:04.113 3139158== by 0x778648: ts_headline_byid_opt (wparser.c:322) ==00:00:00:04.113 3139158== by 0x9438F1: DirectFunctionCall3Coll (fmgr.c:836) ==00:00:00:04.113 3139158== by 0x7787B6: ts_headline (wparser.c:352) ==00:00:00:04.113 3139158== by 0x4AC9F2: ExecInterpExpr (execExprInterp.c:752) ==00:00:00:04.113 3139158== by 0x4AEDFE: ExecInterpExprStillValid (execExprInterp.c:1838) ==00:00:00:04.113 3139158== by 0x50BD0C: ExecEvalExprSwitchContext (executor.h:344) ==00:00:00:04.113 3139158== by 0x50BD84: ExecProject (executor.h:378) ==00:00:00:04.113 3139158== by 0x50BFBB: ExecResult (nodeResult.c:136) ==00:00:00:04.113 3139158== I've reproduced it on REL9_4_STABLE (REL9_4_15) and it looks like the defect in mark_hl_words(): int bestb = -1, beste = -1; int bestlen = -1; int pose = 0, ... if (highlight == 0) { while (hlCover(prs, query, , )) { ... if (bestlen < 0) { curlen = 0; for (i = 0; i < prs->curwords && curlen < min_words; i++) { if (!NONWORDTOKEN(prs->words[i].type)) curlen++; pose = i; } bestb = 0; beste = pose; } ... // here we have bestb == 0, beste == 0, but no prs->words in this case for (i = bestb; i <= beste; i++) { if (prs->words[i].item) prs->words[i].selected = 1; exists since 2a0083ede. (Sorry for the distraction.) Best regards, Alexander
Re: Rethinking the implementation of ts_headline()
Alvaro Herrera writes: > On 2023-Jan-18, Tom Lane wrote: >> It's including hits for "day" into the cover despite the lack of any >> nearby match to "drink". > I suppose it would be possible to put 'day' and 'drink' in two different > fragments: since the query has a & operator for them, the words don't > necessarily have to be nearby. But OK, your argument for this being the > shortest result is sensible. > I do wonder, though, if it's effectively usable for somebody building a > search interface on top. If I'm ranking the results from several > documents, and this document comes on top of others that only match one > arm of the OR query, then I would like to be able to show the matches > for both arms of the OR. The fundamental problem with the case you're posing is that MaxWords is too small to allow the 'day & drink' match to be shown as a whole. If you make MaxWords large enough then you do find it including (and highlighting) 'drink', but I'm not sure we should stress out about what happens otherwise. > Oh, I see the problem, and it is my misunderstanding: the <-> operator > is counting the words in between, even if they are stop words. Yeah. AFAICS this is a very deliberate, longstanding decision, so I'm hesitant to change it. Your test case with 'simple' proves little, because there are no stop words in 'simple': regression=# select to_tsvector('simple', 'As idle as a painted Ship'); to_tsvector -- 'a':4 'as':1,3 'idle':2 'painted':5 'ship':6 (1 row) However, when I switch to 'english': regression=# select to_tsvector('english', 'As idle as a painted Ship'); to_tsvector 'idl':2 'paint':5 'ship':6 (1 row) the stop words are gone, but the recorded positions remain the same. So this is really a matter of how to_tsvector chooses to count word positions, it's not the fault of the <-> construct in particular. I'm not convinced that this particular behavior is wrong, anyway. The user of text search isn't supposed to have to think about which words are stopwords or not, so I think that it's entirely sensible for 'idle as a painted' to match 'idle <3> painted'. Maybe the docs need some adjustment? But in any case that's material for a different thread. > I again have to question how valuable in practice is a operator > that's so strict that I have to know exactly how many stop words I want > there to be in between the phrase search. For some reason, in my mind I > had it as "at most N words, ignoring stop words", but that's not what it > is. Yeah, I recall discussing "up to N words" semantics for this in the past, but nobody has made that happen. > Anyway, I don't think this needs to stop your current patch. Many thanks for looking at it! regards, tom lane
Re: Rethinking the implementation of ts_headline()
On 2023-Jan-18, Tom Lane wrote: > Alvaro Herrera writes: > > and was surprised that the match for the 'day & drink' arm of the OR > > disappears from the reported headline. > > I'd argue that that's exactly what should happen. It's supposed to > find as-short-as-possible cover strings that satisfy the query. OK, that makes sense. > I don't find anything particularly principled about the old behavior: > > > Day after day, day after day,↵ > >We stuck ... motion, ↵ > > As idle as a painted Ship ↵ > >Upon > > It's including hits for "day" into the cover despite the lack of any > nearby match to "drink". I suppose it would be possible to put 'day' and 'drink' in two different fragments: since the query has a & operator for them, the words don't necessarily have to be nearby. But OK, your argument for this being the shortest result is sensible. I do wonder, though, if it's effectively usable for somebody building a search interface on top. If I'm ranking the results from several documents, and this document comes on top of others that only match one arm of the OR query, then I would like to be able to show the matches for both arms of the OR. > > Another thing I think might be a regression is the way fragments are > > selected. Consider what happens if I change the "idle & painted" in the > > earlier query to "idle <-> painted", and MaxWords is kept low: > > Of course, "idle <-> painted" is satisfied nowhere in this text > (the words are there, but not adjacent). Oh, I see the problem, and it is my misunderstanding: the <-> operator is counting the words in between, even if they are stop words. I understood from the docs that those words were ignored, but that is not so. I misread the phraseto_tsquery doc as though they were explaining the <-> operator. Another experiment shows that the headline becomes "complete" only if I specify the exact distance in the operator: SELECT dist, ts_headline('simple', 'As idle as a painted Ship', to_tsquery('simple', format('(idle <%s> painted)', dist)), 'MaxFragments=5, MaxWords=8, MinWords=4') from generate_series(1, 4) dist; dist │ ts_headline ──┼── 1 │ As idle as a 2 │ As idle as a 3 │ idle as a painted Ship 4 │ As idle as a (4 filas) I again have to question how valuable in practice is a operator that's so strict that I have to know exactly how many stop words I want there to be in between the phrase search. For some reason, in my mind I had it as "at most N words, ignoring stop words", but that's not what it is. Anyway, I don't think this needs to stop your current patch. > So the cover has to run from the last 'day' to the 'drink'. I think > v15 is deciding that it runs from the first 'day' to the 'drink', > which while not exactly wrong is not the shortest cover. Sounds fair. > The rest of this is just minor variations in what mark_hl_fragments() > decides to do based on the precise cover string it's given. I don't > dispute that mark_hl_fragments() could be improved, but this patch > doesn't touch its algorithm and I think that doing so should be > material for a different patch. Understood and agreed. > > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean" > > verse, which perhaps they shouldn't do, since it's not preceded by > > 'idle'.) > > Yeah, and 'idle' too. Once it's chosen a string to show, it'll > highlight all the query words within that string, whether they > constitute part of the match or not. I can see arguments on both > sides of doing it that way; it was probably more sensible before > we had phrase match than it is now. But again, changing that phase > of the processing is outside the scope of this patch. I'm just > trying to undo the damage I did to the cover-string-selection phase. All clear then. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "The eagle never lost so much time, as when he submitted to learn of the crow." (William Blake)
Re: Rethinking the implementation of ts_headline()
Alvaro Herrera writes: > I tried this other test, based on looking at the new regression tests > you added, > SELECT ts_headline('english', ' > Day after day, day after day, > We stuck, nor breath nor motion, > As idle as a painted Ship > Upon a painted Ocean. > Water, water, every where > And all the boards did shrink; > Water, water, every where, > Nor any drop to drink. > S. T. Coleridge (1772-1834) > ', to_tsquery('english', '(day & drink) | (idle & painted)'), > 'MaxFragments=5, MaxWords=9, MinWords=4'); >ts_headline > ─ > motion,↵ > As idle as a painted Ship↵ >Upon > (1 fila) > and was surprised that the match for the 'day & drink' arm of the OR > disappears from the reported headline. I'd argue that that's exactly what should happen. It's supposed to find as-short-as-possible cover strings that satisfy the query. In this case, satisfying the 'day & drink' condition would require nearly the entire input text, whereas 'idle & painted' can be satisfied in just the third line. So what you get is the third line, slightly expanded because of some later rules that like to add context if the cover is shorter than MaxWords. I don't find anything particularly principled about the old behavior: > Day after day, day after day,↵ >We stuck ... motion, ↵ > As idle as a painted Ship ↵ >Upon It's including hits for "day" into the cover despite the lack of any nearby match to "drink". > Another thing I think might be a regression is the way fragments are > selected. Consider what happens if I change the "idle & painted" in the > earlier query to "idle <-> painted", and MaxWords is kept low: Of course, "idle <-> painted" is satisfied nowhere in this text (the words are there, but not adjacent). So the cover has to run from the last 'day' to the 'drink'. I think v15 is deciding that it runs from the first 'day' to the 'drink', which while not exactly wrong is not the shortest cover. The rest of this is just minor variations in what mark_hl_fragments() decides to do based on the precise cover string it's given. I don't dispute that mark_hl_fragments() could be improved, but this patch doesn't touch its algorithm and I think that doing so should be material for a different patch. (I have no immediate ideas about what would be a better algorithm for it, anyway.) > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean" > verse, which perhaps they shouldn't do, since it's not preceded by > 'idle'.) Yeah, and 'idle' too. Once it's chosen a string to show, it'll highlight all the query words within that string, whether they constitute part of the match or not. I can see arguments on both sides of doing it that way; it was probably more sensible before we had phrase match than it is now. But again, changing that phase of the processing is outside the scope of this patch. I'm just trying to undo the damage I did to the cover-string-selection phase. regards, tom lane
Re: Rethinking the implementation of ts_headline()
I tried this other test, based on looking at the new regression tests you added, SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); ts_headline ─ motion,↵ As idle as a painted Ship↵ Upon (1 fila) and was surprised that the match for the 'day & drink' arm of the OR disappears from the reported headline. This is what 15 reports for the same query: SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); ts_headline ─── Day after day, day after day,↵ We stuck ... motion, ↵ As idle as a painted Ship ↵ Upon (1 fila) I think this was better. 15 seems to fail in other ways; for instance, 'drink' is not highlighted in the headline when the OR matches, but if the other arm of the OR doesn't match, it is; for example both 15 and master return the same for this one: SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); ts_headline ─── Day after day, day after day,↵ We stuck ... drop to drink. ↵ S. T. Coleridge (1 fila) Another thing I think might be a regression is the way fragments are selected. Consider what happens if I change the "idle & painted" in the earlier query to "idle <-> painted", and MaxWords is kept low: SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); ts_headline ─── day, ↵ We stuck, nor breath nor motion, ↵ As idle ... painted Ship ↵ Upon a painted Ocean. ↵ Water, water, every ... drop to drink.↵ S. T. Coleridge (1 fila) Note that it chose to put a fragment delimiter exactly in the middle of the phrase match, where the stop words are. If I raise MaxWords, it is of course much better, I suppose because the word limit doesn't force a new fragment, SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=25, MinWords=4'); ts_headline ── after day, day after day, ↵ We stuck, nor breath nor motion, ↵ As idle as a painted Ship ↵ Upon a painted Ocean. ↵ Water, water, every where ... boards did shrink;↵ Water, water, every where, ↵ Nor any drop to drink. ↵ S. T. Coleridge (1 fila) But in 15, the query with low MaxWords does this instead, where the fragment delimiter occurs just *before* the phrasal match. SELECT ts_headline('english', ' Day after day, day after day, We stuck, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where, Nor any drop to drink. S. T. Coleridge (1772-1834) ', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); ts_headline ─── Day after day,
Re: Rethinking the implementation of ts_headline()
On 2023-Jan-16, Tom Lane wrote: > I get this with the patch: > > ts_headline | ts_headline > -+- > {ipsum} ... {labor} | {ipsum} ... {labor} > (1 row) > > which is what I'd expect, because it removes the artificial limit on > cover length that I added in 78e73e875. So I'm wondering why you got a > different result. Indeed, that's what I get now, too, after re-applying the patches. I find no way to reproduce the bogus result I got yesterday. Sorry for the noise. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Every machine is a smoke machine if you operate it wrong enough." https://twitter.com/libseybieda/status/1541673325781196801
Re: Rethinking the implementation of ts_headline()
Alvaro Herrera writes: > I came across #17556 which contains a different test for this, and I'm > not sure that this patch changes things completely for the better. Thanks for looking at my patch. However ... > That is, once past the 5000 words of distance, it fails to find a good > cover, but before that it returns an acceptable headline. However, > after your proposed patch, we get this: > ts_headline │ ts_headline > ─┼─ > {ipsum} │ {ipsum} > (1 fila) I get this with the patch: ts_headline | ts_headline -+- {ipsum} ... {labor} | {ipsum} ... {labor} (1 row) which is what I'd expect, because it removes the artificial limit on cover length that I added in 78e73e875. So I'm wondering why you got a different result. Maybe something to do with locale? I tried it in C and en_US.utf8. regards, tom lane
Re: Rethinking the implementation of ts_headline()
On 2022-Nov-25, Tom Lane wrote: > After further contemplation of bug #17691 [1], I've concluded that > what I did in commit c9b0c678d was largely misguided. For one > thing, the new hlCover() algorithm no longer finds shortest-possible > cover strings: if your query is "x & y" and the text is like > "... x ... x ... y ...", then the selected cover string will run > from the first occurrence of x to the y, whereas the old algorithm > would have correctly selected "x ... y". For another thing, the > maximum-cover-length hack that I added in 78e73e875 to band-aid > over the performance issues of the original c9b0c678d patch means > that various scenarios no longer work as well as they used to, > which is the proximate cause of the complaints in bug #17691. I came across #17556 which contains a different test for this, and I'm not sure that this patch changes things completely for the better. In that bug report, Alex Malek presents this example select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || 'labor', $$'ipsum' & 'labor'$$::tsquery, 'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'), ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || 'labor', $$'ipsum' & 'labor'$$::tsquery, 'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'); which returns, in the current HEAD, the following ts_headline │ ts_headline ─┼─ {ipsum} ... {labor} │ baz baz baz (1 fila) That is, once past the 5000 words of distance, it fails to find a good cover, but before that it returns an acceptable headline. However, after your proposed patch, we get this: ts_headline │ ts_headline ─┼─ {ipsum} │ {ipsum} (1 fila) which is an improvement in the second case, though perhaps not as much as we would like, and definitely not an improvement in the first case. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
Rethinking the implementation of ts_headline()
After further contemplation of bug #17691 [1], I've concluded that what I did in commit c9b0c678d was largely misguided. For one thing, the new hlCover() algorithm no longer finds shortest-possible cover strings: if your query is "x & y" and the text is like "... x ... x ... y ...", then the selected cover string will run from the first occurrence of x to the y, whereas the old algorithm would have correctly selected "x ... y". For another thing, the maximum-cover-length hack that I added in 78e73e875 to band-aid over the performance issues of the original c9b0c678d patch means that various scenarios no longer work as well as they used to, which is the proximate cause of the complaints in bug #17691. What I'm now thinking is that the original hlCover() algorithm was fine (if underdocumented) *as long as it's only asked to deal with AND-like semantics*. Given that restriction, its approach of finding the latest first occurrence of any query term, and then backing up to the earliest last occurrence, is visibly correct; and we weren't hearing of any performance issues with it either. The problem is that this approach fails miserably for tsqueries that aren't pure ANDs, because it will insist on including query terms that aren't required for a match and indeed could prevent a match. So what we need is to find a way to fold the other sorts of queries into an AND context. After a couple of false starts, I came up with the attached patch. It builds on the existing TS_phrase_execute infrastructure, which already produces an ExecPhraseData structure containing an exact list of the match locations for a phrase subquery. It's easy to get an ExecPhraseData structure for a single-lexeme match, too. We can handle plain ANDs by forming lists of ExecPhraseData structs (with implicit AND semantics across the list). We can handle ORs by union'ing the component ExecPhraseData structs. And we can handle NOTs by just dropping them, because we don't want ts_headline to prioritize matches to such words. There are some fine points but it all seems to work pretty well. Hence I propose the attached patchset. 0001 adds some test cases that I felt necessary after examining code coverage reports; I split this out mainly so that 0002 clearly shows the behavioral changes from current code. 0002 adds TS_execute_locations() which does what I just described, and rewrites hlCover() into something that's a spiritual descendant of the original algorithm. It can't be quite the same as before because the location data that TS_execute_locations() returns is measured in lexeme indexes not word indexes, so some translation is needed. Notable here is that a couple of regression test results revert to what they were before c9b0c678d. They're both instances of preferring "x ... x ... y" to "x ... y", which I argued was okay at the time, but I now see the error in that. Although we back-patched c9b0c678d, I'm inclined to propose this for HEAD only. The misbehaviors it's fixing are less bad than what we were dealing with in that patch. BTW, while experimenting with this I realized that tsvector's limitation of lexeme indexes to be at most 16383 is really quite disastrous for phrase searches. That limitation was arguably okay before we had phrase searching, but now it seems untenable. For example, tsvector entries like "foo:16383 bar:16383" will not match "foo <-> bar" because the phrase match code wants their lexeme positions to differ by one. This basically means that phrase searches do not work beyond ~20K words into a document. I'm not sure what to do about that exactly, but I think we need to do something. Whatever we might do about that would be largely orthogonal to this patch, in any case. I'll stick this in the January CF. regards, tom lane [1] https://www.postgresql.org/message-id/flat/17691-93cef39a14663963%40postgresql.org diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out index dc03f15499..e6a01cf985 100644 --- a/src/test/regress/expected/tsearch.out +++ b/src/test/regress/expected/tsearch.out @@ -1804,6 +1804,123 @@ S. T. Coleridge (1772-1834) Water, water, every where (1 row) +SELECT ts_headline('english', ' +Day after day, day after day, + We stuck, nor breath nor motion, +As idle as a painted Ship + Upon a painted Ocean. +Water, water, every where + And all the boards did shrink; +Water, water, every where, + Nor any drop to drink. +S. T. Coleridge (1772-1834) +', to_tsquery('english', 'day & drink')); +ts_headline +--- + Day after day, day after day,+ + We stuck, nor breath nor motion, + + As idle as a painted Ship+ + Upon a painted Ocean. + + Water, water, every where+ + And all the boards did shrink;