[Wikidata-bugs] [Maniphest] [Commented On] T198341: Remove all references to the rev_text_id and ar_text_id fields

Anomie Thu, 21 Mar 2019 14:16:40 -0700

Anomie added a comment.

  In T198341#5044442 <https://phabricator.wikimedia.org/T198341#5044442>, 
@BPirkle wrote:

  > To make sure I'm heading in the right direction:
  >
  > - Postgres currently uses columns in specific tables 
(pagecontent.textvector and page.titlevector, both of type tsvector) to store 
search index information
  > - Postgres also currently uses triggers/procedures to maintain these tables
  > - And directly related to this task, searchPostgres.php currently and 
undesirably references fields rev_text_id and old_id

  All correct.

  > I should:
  > 
  > - add a searchindex table, similar to the MySQL one, but adapted to 
Postgres data types

  Yes. Basically `page.titlevector` becomes `searchindex.si_title` and 
`pagecontent.textvector` becomes `searchindex.si_text`. `searchindex.si_page` 
works exactly like it does in MySQL.

  > - add related indexes/triggers/procedures to maintain this table

  Per T164898 <https://phabricator.wikimedia.org/T164898>, we don't want to 
keep using triggers for this. `SearchPostgres::update()` should do it, similar 
to what `SearchMySQL::update()` does. Same for `::updateTitle()`.

  > - add necessary documentation and install/update support for these db 
changes
  > - modify searchPostgres.php to use the new db changes

  Yes to both.

  > Questions:
  > 
  > - I'm assuming we are using triggers/procedures to maintain the existing 
tsvector search columns because that works well in Postgres, and we want to 
retain that technique with the searchindex table. Am I right, or do we want to 
do more of this at the PHP level per SearchMySQL.php?

  My guess is that whoever originally wrote the SearchPostgres code just 
thought triggers were nicer/"cleaner" than doing it from PHP. Unfortunately 
there's no explanation on r15335 
<https://www.mediawiki.org/w/index.php?title=Special:Code/MediaWiki/15335&path=>
 or  r15336 <https://www.mediawiki.org/wiki/Special:Code/MediaWiki/15336> as to 
why they did it.

  > - should I remove/refactor the existing columns/triggers/procedures and 
have all Postgres searches work off (only) the new searchindex table? Or should 
I leave that alone for now and use $wgMultiContentRevisionSchemaMigrationStage 
gating in searchPostgres.php to continue to use the existing columns if we're 
reading the old schema? Leaving the old and maintaining two sets of indexes at 
the db level sounds inefficient. But maybe there's a reason I'm missing to 
retain it.

  Just transition straight to the new searchindex table, no need for a 
migration flag.

  > - there are a. bunch of "ts2_" prefixes floating around in 
postgres/tables.sql (ex. "CREATE TRIGGER ts2_page_text..."). Is that a 
meaningful naming convention I should be aware of and retain, or is that just 
"text search 2" to indicate it was a "new thing" at some point?

  I think it's just namespacing. It's not anything that you'd have to preserve, 
just name the new things whatever makes sense.

  > - is there developer documentation on 
update.php/DatabaseUpdater.php/PostgresUpdater.php? I've looked through that 
code (and postgres/tables.sql). I have a formative understanding, but I need to 
be more solid on how all this works.

  Not that I'm aware of. Feel free to ask any questions you have.

  In T198341#5045616 <https://phabricator.wikimedia.org/T198341#5045616>, 
@daniel wrote:

  > That sounds like "completely change how postgres does search". Now, my 
initial impression is that the way postgres does search is broken by design, 
since a database trigger has no way to correctly interpret the contents of the 
text table, which may be a serialized object, or compressed, or JSON, or a 
reference to an external storage system, or all of these. This has been the 
case for over ten years now.

  Your initial impression is correct.

  > So, my (possibly completely wrong) understanding of how this currently 
works is: pagecontent.textvector  contains the searchable version of the page 
text. We can keep that as it is, the trigger just needs to use the slots and 
content tables to connect rows in pagecontent to rows in revision, instead of 
using rev_text_id. From my current understanding, that should be it. But 
perhaps Brad has different ideas.

  Yes, we //could// do something like that. Probably we wouldn't even need to 
change the triggers, just the SeachPostgres code to join 
revision↔slots↔content↔text (and that last join would probably be really hacky).

  But IMO this is a situation where it would be better to cut out the bad code 
now instead of trying to hack around it and leave tech debt for the future. 
That way when we get to the part about wanting SearchMySQL and so on to handle 
more than just the main slot the changes to SearchPostgres will be similar 
rather than entirely different (and probably punted on again).

TASK DETAIL
  https://phabricator.wikimedia.org/T198341

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: BPirkle, Anomie
Cc: tstarling, gerritbot, Tgr, Jdforrester-WMF, Anomie, Addshore, aude, 
Aklapper, daniel, alaa_wmde, EvanProdromou, CucyNoiD, Nandana, NebulousIris, 
Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, 
Lahi, Gq86, Baloch007, Ramsey-WMF, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, Lewizho99, JJMC89, Maathavan, _jensen, rosalieper, Agabi10, 
Wikidata-bugs, Mbch331, Ltrlg

_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T198341: Remove all references to the rev_text_id and ar_text_id fields

Reply via email to