jcrespo added subscribers: Manuel, aaron.
jcrespo added a comment.
@Manuel, @daniel Actually it is a problem, because masters have a limit of the CPU count or 32 active threads on the pool of connections, which means half of the connections are reserved but doing nothing, so you are limiting the master.
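For reference, the effective limit can be inspected on a MariaDB master like this (a sketch; these are the standard thread pool variables, actual production values not shown here):

  SHOW GLOBAL VARIABLES LIKE 'thread_pool%';  -- thread_pool_size, thread_pool_max_threads, ...
  SHOW GLOBAL STATUS LIKE 'Threads_running';  -- how many of those slots are actually busy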
jcrespo removed projects: netops, Operations.
jcrespo added a comment.
max_connections is 5000; a maximum of 32 active threads is enforced on the connection pool. No idle connections should be left open, and a typical connection should take less than 1 second, otherwise it risks getting killed by the watchdog looking for idle connections.
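A minimal sketch of how such idle connections can be spotted (not necessarily how the actual watchdog is implemented):

  SELECT ID, USER, HOST, DB, TIME
  FROM information_schema.PROCESSLIST
  WHERE COMMAND = 'Sleep' AND TIME > 1;  -- sessions idle for more than 1 second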
jcrespo added a comment.
Another example of why long running connections are a problem: I am depooling es1017 for important maintenance. I have depooled it, so I expect connections to finish within a few seconds, with the exception of wikiadmin's known long running queries, but I just
jcrespo added a comment.
I also do not want to make you work more than necessary. If you only need 1000 rows, and it contains no private data, I can give you access to a misc server shared with other resources; no need to have a dedicated server.
jcrespo added a comment.
Hm, these are both job runners, jobs (probably) shouldn't run for so long. I wonder what's causing this.
Separate issue then, but heads up for it.
jcrespo added a comment.
Storage is not a problem. I wonder what the impact on IO activity (write QPS) would be. Could we separate usage tracking to a different set of servers? These table(s) are probably very dynamic, but also probably not 100% in sync with the content edits (handled on asynchronous
jcrespo added a comment.
> how would we generate the kind of estimates you would need in order to sign off on this type of change?
Measure the write QPS / rows written / percentage of write IOPS you have now, evaluate the increase with the new method, and scale with a worst-case scenario
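A minimal sketch of the measurement part, using standard MySQL/MariaDB status counters (sample twice, N seconds apart, and divide the difference by N):

  SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Com_insert', 'Com_update', 'Com_delete',
     'Innodb_rows_inserted', 'Innodb_rows_updated', 'Innodb_rows_deleted');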
jcrespo added a comment.
To clarify, I am not saying it should be one way or another; what I am asking is:
- measure the write load impact
- take both options into account, and be aware of them (e.g. maybe it is not worth it now, but we can prepare things so that if it is needed in the future, we do not
jcrespo closed subtask T148988: Cognate DB review as "Resolved".
jcrespo closed this task as "Resolved".
jcrespo claimed this task.
jcrespo added a comment.
Yes, no major problem in the current state.
jcrespo created this task.
jcrespo added projects: Wikimedia-log-errors, Wikidata.
Herald added a subscriber: Aklapper.
TASK DESCRIPTION
I am not sure if this is wikidata or deployment related: https://logstash.wikimedia.org/goto/29bb2ebcf0ccd9e90e4f7773fdc666dd
They are caused by a job, but no
jcrespo added subscribers: ArielGlenn, Zppix.
jcrespo merged a task: T139636: Wikidata Database contention under high edit rate.
jcrespo closed this task as a duplicate of T111535: Wikibase\Repo\Store\SQL\EntityPerPageTable::{closure} creating high number of deadlocks.
jcrespo added a comment.
See merged ticket, this happened again when 300-400 new pages per minute were being created, with 35 parallel threads.
jcrespo added a comment.
@Marostegui I think they do not want it done yet, but an OK from us/a review. But they should probably clarify that; "feasibility" is an ambiguous term.
jcrespo added a comment.
> Evaluate if it is feasible to add such an "empty" column without making Wikidata readonly.
> we can probably do it
How certain are you? In my experience, the biggest blocker on production is not the size, but how busy the table is. That would create metadata lock
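For example, sessions queued behind such a metadata lock show up in the processlist like this (a sketch):

  SELECT ID, USER, TIME, STATE, INFO
  FROM information_schema.PROCESSLIST
  WHERE STATE = 'Waiting for table metadata lock';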
jcrespo added a comment.
> If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchover to alter the masters we'd also get rid of that issue?
Hey, don't tell me, tell @WMDE-leszek, and see if he is ok with that schedule. :-)
jcrespo added a comment.
> Depending on the answer to this, we will plan further steps.
I think you should add the full plan here ASAP, even if it is not 100% clear or decided; otherwise we may be adding steps to the process and making it unnecessarily long. E.g. if you plan to add an index later, it
jcrespo added a comment.
Ok, I now have some comments against that method. Logistically, I am in a meeting; let me finish it and I will have some time to properly explain myself (nothing against the spirit of the changes, I would just do it in a different way, if the code can handle it).
jcrespo added a comment.
So the comments:
- Do not defer the creation of the indexes (see the sketch after this list). Those are extra alter tables and do not make things easier in any way; just create the indexes from the start, assuming they will be used.
- Renaming columns is a big no, especially to an already existing name
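On the first point, a hypothetical illustration (table and column names invented, not the actual proposed schema): define the secondary indexes in the initial CREATE TABLE rather than deferring them to later alters.

  CREATE TABLE example_usage (
    eu_row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    eu_entity_id VARBINARY(255) NOT NULL,
    eu_page_id INT UNSIGNED NOT NULL,
    KEY eu_entity_id (eu_entity_id),                       -- created from the start
    UNIQUE KEY eu_entity_page (eu_entity_id, eu_page_id)   -- not added via a later ALTER
  ) ENGINE=InnoDB;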
jcrespo created this task.
jcrespo added projects: Wikidata, DBA.
Herald added a subscriber: Aklapper.
TASK DESCRIPTION
For example, I found on db1070 2 long running queries:
Server Connection User Client Database Time
db1070 51133099 wikiuser mw1256 wikidatawiki 19h
SELECT /* Wikibase\Repo
jcrespo added a comment.
An extra reason to avoid using a db master just occurred to me: master failover is a relatively frequent operation; it will happen every time the master mysql is upgraded, or when there is a datacenter failover (2 of those will happen in April/May)- probably there wasn
jcrespo added a comment.
This is ongoing right now, for example:
db1082 307726822 wikiuser mw1248 wikidatawiki 4m
SELECT /* Wikibase\Repo\Store\Sql\SqlEntitiesWithoutTermFinder::getEntitiesWithoutTerm */ page_title AS `entity_id_serialization` FROM `page` LEFT JOIN `wb_terms` ON
jcrespo added a comment.
Sorry about this- you are not the only "sufferers" of beta not being a reliable place for testing in a truly distributed fashion- we were just discussing this on IRC. I also support a test on test, and offer my help if I can provide it. Thanks again for
jcrespo added a comment.
Thank you very much for working on this- do you have an estimate of when this will be fully deployed?
jcrespo added a comment.
Thank you again!
jcrespo added a comment.
As a small side note- that can also happen on mysql. Despite locks being released on session disconnection, there have been some occasions where the mysql session is not killed (it continues), but the thread on mediawiki has been. There are several known bugs about that
jcrespo added a comment.
So that we are not a blocker- creation of tables in production, especially if we have already given the OK to the plans, is not considered a schema change, so anyone with production rights can do it- you just need to mark it on the deployments calendar. With this, we (DBAs
jcrespo added a comment.
To clarify- it may be blocked on us right now, because we have to create the database and because labs filtering is not well managed, but the general idea stays for normal table creations.
jcrespo added a comment.
> This sentence actually confused me
x1 == extension1 :-)
jcrespo added a comment.
@Addshore I hope labs access is not a blocker for this, that can be done at a later date.
jcrespo added a comment.
> production hosts
I was thinking of dbstore (backup) hosts, which were problematic (remember you were the ones to set up those last time) + private table filtering. x1 has traditionally not been replicated to labs; this can be challenging (I would start by not replicating
jcrespo added a comment.
I would ask Addshore to confirm by running SELECTs on the empty tables from terbium/tin, etc., using mediawiki scripts.
jcrespo added a comment.
I did not understand your last comment; is the previous patch invalid? Do you have another patch to show me?
jcrespo added a comment.
To clarify, I have to be especially strict in this particular case because in the past, wbc_entity_usage (with the exception of the linksupdate job) was a large point of contention and a major cause of lag, and this ticket starts by saying: we'd write a lot (?) more ro
jcrespo added a comment.
> We want to collect additional information on one of these wikis for a while
If that doesn't involve a schema change, sure.
jcrespo created this task.
jcrespo added a project: Wikidata.
TASK DESCRIPTION
These 2 queries were among the most expensive on datacenter failover or after it- while it is normal to have lower performance than usual due to colder caches, most likely they are surfacing issues with improvements. It
jcrespo edited the task description. (Show Details)
EDIT DETAILS
These 2 queries were among the most expensive on datacenter failover or after it- while it is normal to have lower performance than usual due to colder caches, most likely they are surfacing existing issues with improvements, which
jcrespo added a comment.
I was thinking of https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_eq_range_index_dive_limit but that is probably more wishful thinking than practical given we are on MariaDB 10.
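For reference, on MySQL 5.6 it would be checked and raised like this; as said, it does not exist on the MariaDB 10 we run, so this is only a sketch:

  SHOW GLOBAL VARIABLES LIKE 'eq_range_index_dive_limit';
  SET GLOBAL eq_range_index_dive_limit = 200;  -- value illustrative only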
jcrespo edited projects, added MediaWiki-Database; removed DBA.
jcrespo added a comment.
That is the max lag, and it is normal on the slaves that mediawiki does not wait for. This issue has nothing to do with databases; mediawiki does what it is programmed to do: if it detects lag even if a few
jcrespo added a comment.
This is a bit offtopic for T163551, but with the latest schema changes, wb_terms has become the largest table on a wiki (with the exception of revision on enwiki and image on commons)- and I think it will get bigger once the new column (I assume) gets populated with actual
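One way to confirm the relative sizes from information_schema (a sketch; InnoDB figures are approximate):

  SELECT table_name,
         ROUND((data_length + index_length) / POW(1024, 3), 1) AS size_gb
  FROM information_schema.TABLES
  WHERE table_schema = 'wikidatawiki'
  ORDER BY (data_length + index_length) DESC
  LIMIT 5;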
jcrespo added a comment.
I may have made this comment on the wrong ticket: T163551#3221748
jcrespo added a comment.
So do you think this had something to do with reports like T123867 T164191? This is highly surprising- I was expecting low to no master or replication performance impact, but zero is highly suspicious. Was this expected? Couldn't this be related to a bug in monitoring
jcrespo added a comment.
> With the current state, we still have the same amount of connections to the master DBs, but we don't use GET_LOCK etc. on them anymore.
And that for me is a huge win alone.
jcrespo added a comment.
@aude: don't run update.php on s3 for altering a table- you will create lag
on 900 wikis unless connections are cleared and the table is pre-warmed-up (and
appropriately tested).
Also, the only wiki mentioned here was wikidatawiki; you have to create a
separate request
jcrespo added a comment.
The problem usually is not the alter size, but the metadata locking, which creates way more contention.
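A common mitigation, sketched with a hypothetical table: bound how long the ALTER itself waits for the metadata lock, so it fails fast instead of queueing every other query behind it.

  SET SESSION lock_wait_timeout = 5;  -- seconds; the server default is much higher
  ALTER TABLE example_table ADD COLUMN example_col INT NULL;  -- hypothetical table/column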
jcrespo added a comment.
To clarify- reads on a slave are not a big concern for MySQL- of course, if you end up getting better latency, that is cool (and I normally ping because it means there is an inefficiency that could be solved); but reads are easy to scale in the large order of things ("
jcrespo added a comment.
While contention is bad in general, it is the opposite of lag- more contention would create less lag. Of course, they could have a common source: large updates causing contention on the master, and then lag because the transaction size is large.
I wonder if instead of fixing
jcrespo created this task.
jcrespo added projects: Wikidata, Performance, DBA.
Herald added a subscriber: Aklapper.
TASK DESCRIPTION
The following query was detected running on a master database:
Host User Schema Client Source Thread Transaction Runtime Stamp
db1063 wikiuser wikidatawiki
jcrespo added a comment.
This is still ongoing, with 15-minute queries. I am going to set up a task to kill all related queries on the s5 master to prevent a potential outage of dewiki and wikidata writes
jcrespo added a comment.
I've set up a temporary watchdog on the s5 master:
pt-kill F=/dev/null --socket=/tmp/mysql.sock --print --kill --victims=all --match-info="EntityUsageTable" --match-db=wikidatawiki --match-user=wikiuser --busy-time=1
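# (What the flags above do: --match-info targets any query whose text contains
# "EntityUsageTable", --busy-time=1 restricts the kill to queries already running
# for more than one second, and --victims=all kills every match, not just the oldest.)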
This will mitigate for now the clo
jcrespo added a comment.
Probably related: T169884
jcrespo added a comment.
I also wonder why some of those log warnings come from close() and others have the proper commitMasterChanges() bit in the stack trace. Normally, there should be nothing to commit by close(); its commits are just for sanity.
We were theorizing the other day on IRC that
jcrespo added a comment.
The database crashed; it should be ok to edit now.
jcrespo claimed this task.
jcrespo added a comment.
Investigation is not over; here is what we have found out so far about the causes:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only
jcrespo renamed this task from "Wikidata database locked" to "Wikidata and dewiki databases locked".
jcrespo added a comment.
I've almost finished the above incident documentation. However, I am unsure about which are the right actionables and their priorities (last section).
Let's use this ticket to agree on what would be the best followup: a) making puppet change the read-only state
jcrespo added a comment.
I have started working on more complete monitoring, useful if we go down the route of human monitoring rather than automation; here is one example:
$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16295390s, read_only
jcrespo added a comment.
> Wikidata goes into read-only the subscriptions mentioned
Yes, definitely some extensions in the past did not behave perfectly and did not respect mediawiki's read-only mode- I do not know what the state of Wikidata is, but from what you say, a ticket should be filed s
jcrespo added a comment.
$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad
{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, "connection": "ok", "connection_latency": 0.07626748085021973, "ssl": true, "to
jcrespo added a subtask: T172489: Monitor read_only variable and/or uptime on database masters, make it page.
jcrespo added a subtask: T172490: Monitor swap/memory usage on databases.
jcrespo closed this task as "Resolved".
jcrespo added a comment.
I have created all actionables on both the incident documentation ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only ) and phabricator- consequently, I have closed this
jcrespo added a comment.
I've been told that several thousand Title::invalidateCache UPDATEs per second had caused trouble on s7 over the night; not sure if this is related:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&var-dc=eqiad%20prometheus%2Fops&
jcrespo added a comment.
To avoid the continuous lagging on non-directly pooled hosts (passive dc codfw, labs, other hosts replicating on a second tier), I have forced a slowdown of writes to the pace of the slowest slaves of eqiad with semisync replication, automatically adding a pause of up
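The semisync state can be inspected with the standard plugin variables (a sketch; the actual timeout and tuning values are not shown here):

  SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%';     -- enabled, timeout, ...
  SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients';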
jcrespo added a comment.
At 16:02-16:06, which would fit with the deployment of
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/725019/, the logstash
baseline number of messages increased by around 50%:
https://grafana.wikimedia.org/d/00561/logstash?orgId=1&viewPan
jcrespo added a subscriber: Kormat.
jcrespo added a comment.
> the DBAs to approve this
That should be @Kormat and/or @Marostegui (he is on vacation right now).
jcrespo added a project: Data-Persistence (Consultation).
jcrespo added projects: bacula, Data-Persistence-Backup, Data-Persistence.
jcrespo added a comment.
The number of files is (within reason) a non-blocker for bacula, as files are
packaged into volumes. It is true that each file is stored as a mysql record,
but that should be able to scale until
jcrespo added a comment.
One more question, to finally decide whether to set up weekly full backups or
daily incremental ones- do all files mostly change completely, or only a subset
of them? Incrementals can only be done with file granularity (it will
fully back up files as long as its
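If only a subset of files changes, the weekly-full/daily-incremental option could look like this hypothetical bacula-dir.conf Schedule resource (name and times invented):

  Schedule {
    Name = "WeeklyFullDailyIncr"
    Run = Full sun at 02:05
    Run = Incremental mon-sat at 02:05
  }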
jcrespo added a comment.
I don't have the answer to that question, but whenever any of you have the
servers and path(s), you can follow the instructions at
https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client to send a
preliminary backup proposal to Puppet, and I will assist
jcrespo added a comment.
Let me give it a deeper look; while the patch by itself looks good as is, I
want to check if a different (non-default) backup policy would be more
advantageous in frequency and space. :-)
jcrespo added a comment.
Running Jobs:
Console connected using TLS at 13-Dec-21 09:20
 JobId  Type  Level  Files  Bytes    Name          Status
======
396417  Back  Full   4,568  412.9 M  graphite1004
jcrespo added a comment.
Terminated Jobs:
 JobId  Level  Files    Bytes    Status  Finished         Name
396417  Full   108,320  11.70 G  OK      13-Dec-21 09:34  graphite1004.eqiad.wmnet-Weekly-Mon
jcrespo added a comment.
In addition to the fix/rollback- could some integration test or heuristic
production monitoring also be implemented, for faster detection in the
future?
jcrespo updated the task description.
jcrespo added a project: User-notice.
jcrespo added a comment.
Thanks for such a quick reaction, BTW.
jcrespo added a comment.
In T307586#7908045 <https://phabricator.wikimedia.org/T307586#7908045>,
@Quiddity wrote:
> For Tech News purposes, how should this entry be described? IIUC from the
> description, something like this?
>
>> There was a problem with
jcrespo added a comment.
Sorry if it is the wrong ticket, but several services on wdqs2010, wdqs2011
and wdqs2012 are alerting. The service is returning HTTP 400 errors. My guess is
this is due to the ongoing data reload (no issue). If that is the case, could
the alerts "WDQS SPARQL"
jcrespo added a comment.
I was told by @Gehel that it was unrelated to this, but related to T301167
<https://phabricator.wikimedia.org/T301167>. Sorry for the confusion.
jcrespo added a comment.
One tip to avoid having people on call (like me) worrying about services
pending implementation is to add the hiera key
`profile::monitoring::notifications_enabled: false`. This is not promoted much
because most people handle stateless services that are easy and
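For example, in the host's hieradata (path hypothetical):

  # hieradata/hosts/wdqs2010.yaml (hypothetical location)
  profile::monitoring::notifications_enabled: false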
jcrespo created this task.
jcrespo added projects: Wikidata, MediaWiki-extensions-PropertySuggester,
Wikimedia-production-error.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
We got an alert on `#wikimedia-databases` IRC saying:
PROBLEM - MariaDB sustained
jcrespo added a subscriber: Marostegui.
jcrespo added a comment.
Handing this over to @Marostegui for him to comment, as he will be the person
who knows if this continues happening or not.
jcrespo created this task.
jcrespo added projects: Wikidata, Wikibase.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
This weekend, while an ongoing incident was being handled, I checked and saw
several badly performing queries running. These didn't have (I believe
jcrespo updated the task description.
jcrespo added a comment.
Offtopic- and feel free to PM me in private. What is a good way to report
database-related issues to the wikidata development team? I am a bit intimidated by
the number of tags and dashboards (which probably reflect your internal
organization, but I am not too
jcrespo triaged this task as "Unbreak Now!" priority.
jcrespo added a comment.
This should be a blocker- es traffic has grown almost 100x since 14
April, and correlates strongly with the 19h deploy:
F34434387: es_issue.png <https://phabricator.wikimedia.org/F34434387&