Anomie added a comment.
In T198341#5044442 <https://phabricator.wikimedia.org/T198341#5044442>, @BPirkle wrote: > To make sure I'm heading in the right direction: > > - Postgres currently uses columns in specific tables (pagecontent.textvector and page.titlevector, both of type tsvector) to store search index information > - Postgres also currently uses triggers/procedures to maintain these tables > - And directly related to this task, searchPostgres.php currently and undesirably references fields rev_text_id and old_id All correct. > I should: > > - add a searchindex table, similar to the MySQL one, but adapted to Postgres data types Yes. Basically `page.titlevector` becomes `searchindex.si_title` and `pagecontent.textvector` becomes `searchindex.si_text`. `searchindex.si_page` works exactly like it does in MySQL. > - add related indexes/triggers/procedures to maintain this table Per T164898 <https://phabricator.wikimedia.org/T164898>, we don't want to keep using triggers for this. `SearchPostgres::update()` should do it, similar to what `SearchMySQL::update()` does. Same for `::updateTitle()`. > - add necessary documentation and install/update support for these db changes > - modify searchPostgres.php to use the new db changes Yes to both. > Questions: > > - I'm assuming we are using triggers/procedures to maintain the existing tsvector search columns because that works well in Postgres, and we want to retain that technique with the searchindex table. Am I right, or do we want to do more of this at the PHP level per SearchMySQL.php? My guess is that whoever originally wrote the SearchPostgres code just thought triggers were nicer/"cleaner" than doing it from PHP. Unfortunately there's no explanation on r15335 <https://www.mediawiki.org/w/index.php?title=Special:Code/MediaWiki/15335&path=> or r15336 <https://www.mediawiki.org/wiki/Special:Code/MediaWiki/15336> as to why they did it. > - should I remove/refactor the existing columns/triggers/procedures and have all Postgres searches work off (only) the new searchindex table? Or should I leave that alone for now and use $wgMultiContentRevisionSchemaMigrationStage gating in searchPostgres.php to continue to use the existing columns if we're reading the old schema? Leaving the old and maintaining two sets of indexes at the db level sounds inefficient. But maybe there's a reason I'm missing to retain it. Just transition straight to the new searchindex table, no need for a migration flag. > - there are a. bunch of "ts2_" prefixes floating around in postgres/tables.sql (ex. "CREATE TRIGGER ts2_page_text..."). Is that a meaningful naming convention I should be aware of and retain, or is that just "text search 2" to indicate it was a "new thing" at some point? I think it's just namespacing. It's not anything that you'd have to preserve, just name the new things whatever makes sense. > - is there developer documentation on update.php/DatabaseUpdater.php/PostgresUpdater.php? I've looked through that code (and postgres/tables.sql). I have a formative understanding, but I need to be more solid on how all this works. Not that I'm aware of. Feel free to ask any questions you have. In T198341#5045616 <https://phabricator.wikimedia.org/T198341#5045616>, @daniel wrote: > That sounds like "completely change how postgres does search". Now, my initial impression is that the way postgres does search is broken by design, since a database trigger has no way to correctly interpret the contents of the text table, which may be a serialized object, or compressed, or JSON, or a reference to an external storage system, or all of these. This has been the case for over ten years now. Your initial impression is correct. > So, my (possibly completely wrong) understanding of how this currently works is: pagecontent.textvector contains the searchable version of the page text. We can keep that as it is, the trigger just needs to use the slots and content tables to connect rows in pagecontent to rows in revision, instead of using rev_text_id. From my current understanding, that should be it. But perhaps Brad has different ideas. Yes, we //could// do something like that. Probably we wouldn't even need to change the triggers, just the SeachPostgres code to join revision↔slots↔content↔text (and that last join would probably be really hacky). But IMO this is a situation where it would be better to cut out the bad code now instead of trying to hack around it and leave tech debt for the future. That way when we get to the part about wanting SearchMySQL and so on to handle more than just the main slot the changes to SearchPostgres will be similar rather than entirely different (and probably punted on again). TASK DETAIL https://phabricator.wikimedia.org/T198341 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: BPirkle, Anomie Cc: tstarling, gerritbot, Tgr, Jdforrester-WMF, Anomie, Addshore, aude, Aklapper, daniel, alaa_wmde, EvanProdromou, CucyNoiD, Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Ramsey-WMF, Darkminds3113, Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, LawExplorer, Lewizho99, JJMC89, Maathavan, _jensen, rosalieper, Agabi10, Wikidata-bugs, Mbch331, Ltrlg
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs