After several weeks of delay, I finally had a chance to submit three Phab
tasks around this topic. I have at least 10 more queries that I could add
to the list, but I don't want to overwhelm people, so I will wait for those
three to be resolved first.

Thanks!

Huji
On Tue, Apr 14, 2020 at 8:53 AM Huji Lee <huji.h...@gmail.com> wrote:

> I completely appreciate the points you are making, Bryan and Jaime. And I
> would very much enjoy "dealing with you" if we end up going the "Cloud VPS
> project" route! If anything, I keep learning new things from you all.
>
> Let's start where you suggested. I will create Phab tickets on which I
> will seek advice about how to optimize those super-slow queries (if at
> all possible).
>
> Thank you for your attention!
>
> On Tue, Apr 14, 2020 at 8:49 AM Huji Lee <huji.h...@gmail.com> wrote:
>
>> Yes. If you go to the source of all those pages, there is a hidden HTML
>> element (the <!-- --> kind) that contains the SQL code for that report.
>>
>> Here is one example: [1]
>>
>> [1]
>> https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3/%D8%AD%D8%B3%D8%A7%D8%A8%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%D8%B2_%D8%A2%D8%A8%E2%80%8C%D9%86%D9%85%DA%A9_%D8%AF%D8%B1%D8%A2%D9%85%D8%AF%D9%87&action=edit
>>
>> On Mon, Apr 13, 2020 at 7:19 PM MusikAnimal <musikani...@gmail.com>
>> wrote:
>>
>>> Is the source code public? Maybe the queries could be improved. I ran
>>> into many such issues too after the actor migration, but after taking
>>> advantage of specialized views[0] and join decomposition (get just the
>>> actor IDs, i.e. rev_actor, then the actor_names in a separate query),
>>> my tools are seemingly as fast as they were before.
>>>
>>> ~ MA
>>>
>>> [0]
>>> https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#Advanced_use_cases_of_specialized_views
>>>
>>> On Mon, Apr 13, 2020 at 5:03 PM Huji Lee <huji.h...@gmail.com> wrote:
>>>
>>>> I understand. However, I think that the use case we are looking at is
>>>> relatively unique. I also think that the indexes we need may not be
>>>> desirable for all the Wiki Replicas (they would often be multi-column
>>>> indexes geared towards a specific set of queries), and I honestly
>>>> don't want to go through the several weeks (months?) of discussion
>>>> needed to justify them.
>>>>
>>>> Note that if we open the can of "more indexes on Wiki Replicas" worms,
>>>> this would all of a sudden become an all-wiki discussion. I'm not sure
>>>> there are more than a handful of wikis that do the level of page-level
>>>> and user-level analytics that fawiki does, which means that for most
>>>> wikis (and for most Wiki Replica databases) those additional indexes
>>>> may not even be justified.
>>>>
>>>> Even if we were to generalize parts of this approach and bring it to
>>>> the Wiki Replicas, I would still argue that doing it on a smaller
>>>> scale (one wiki DB for now) would be a reasonable starting point, no?
>>>>
>>>> On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.h...@gmail.com> wrote:
>>>>> >
>>>>> > One possible solution is to create a script which is scheduled to
>>>>> > run once a month; the script would download the latest dump of the
>>>>> > wiki database,[3] load it into MySQL/MariaDB, create some
>>>>> > additional indexes that would make our desired queries run faster,
>>>>> > and generate the reports using this database. A separate script
>>>>> > can then purge the data a few days later.
>>>>>
>>>>> If I am understanding your proposal here, I think the main difference
>>>>> from the current Wiki Replicas would be "create some additional
>>>>> indexes that would make our desired queries run faster". We do have
>>>>> some indexes and views in the Wiki Replicas which are specifically
>>>>> designed to make common things faster today. If possible, adding to
>>>>> these rather than building a one-off process that moves lots of data
>>>>> around for your tool would be nice.
>>>>>
>>>>> I say this not because what you are proposing is a ridiculous
>>>>> solution, but because it is a unique solution to your current problem
>>>>> that will not help others who are having similar problems. Having 1
>>>>> tool use ToolsDB or a custom Cloud VPS project like this is possible,
>>>>> but having 100 tools try to follow that pattern themselves is not.
>>>>>
>>>>> > Out of an abundance of caution, I thought I should ask for
>>>>> > permission now, rather than forgiveness later. Do we have a process
>>>>> > for getting approval for projects that require gigabytes of storage
>>>>> > and hours of computation, or is what I proposed not even remotely
>>>>> > considered a "large" project, meaning I am being overly cautious?
>>>>>
>>>>> <https://phabricator.wikimedia.org/project/view/2875/>
>>>>>
>>>>> Bryan
>>>>> --
>>>>> Bryan Davis              Technical Engagement      Wikimedia Foundation
>>>>> Principal Software Engineer                               Boise, ID USA
>>>>> [[m:User:BDavis_(WMF)]]                                      irc: bd808
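To make MusikAnimal's join decomposition suggestion above concrete, a
minimal sketch follows. It assumes the standard MediaWiki schema
(revision.rev_actor, actor.actor_id, actor.actor_name) and the
revision-specific actor view described in [0]; the page ID and actor IDs
are placeholders, not real values:

    -- Step 1: fetch only the actor IDs (no join against actor),
    -- so the query can be served from the revision indexes alone.
    SELECT rev_actor, COUNT(*) AS edit_count
    FROM revision
    WHERE rev_page = 12345          -- placeholder page ID
    GROUP BY rev_actor;

    -- Step 2: resolve the collected IDs to names in a separate
    -- query, against the revision-specific actor view.
    SELECT actor_id, actor_name
    FROM actor_revision
    WHERE actor_id IN (1, 2, 3);    -- IDs gathered in step 1

Splitting the lookup this way keeps the expensive aggregation off the
actor join entirely, which is what makes the decomposition pay off.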
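Similarly, the "additional indexes" step in the proposal quoted above
might look roughly like this in SQL. This is purely illustrative, not the
actual implementation: the dump file name, index name, and column choice
are hypothetical and would be tailored to each report:

    -- After loading the monthly dump into a local MariaDB instance
    -- (e.g. `mysql fawiki < fawiki-latest.sql` from the shell),
    -- add a multi-column index geared towards a specific report:
    CREATE INDEX rev_page_actor_ts
        ON revision (rev_page, rev_actor, rev_timestamp);

    -- A separate cleanup script can purge the data a few days later:
    DROP DATABASE fawiki;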
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud